pep-0001 PEP Purpose and Guidelines

PEP: 1
Title: PEP Purpose and Guidelines
Version: $Revision$
Last-Modified: $Date$
Author: Barry Warsaw, Jeremy Hylton, David Goodger, Nick Coghlan
Status: Active
Type: Process
Content-Type: text/x-rst
Created: 13-Jun-2000
Post-History: 21-Mar-2001, 29-Jul-2002, 03-May-2003, 05-May-2012, 07-Apr-2013

What is a PEP?

PEP stands for Python Enhancement Proposal. A PEP is a design document providing information to the Python community, or describing a new feature for Python or its processes or environment. The PEP should provide a concise technical specification of the feature and a rationale for the feature.

We intend PEPs to be the primary mechanisms for proposing major new features, for collecting community input on an issue, and for documenting the design decisions that have gone into Python. The PEP author is responsible for building consensus within the community and documenting dissenting opinions.

Because the PEPs are maintained as text files in a versioned repository, their revision history is the historical record of the feature proposal [1].

PEP Types

There are three kinds of PEP:

  1. A Standards Track PEP describes a new feature or implementation for Python. It may also describe an interoperability standard that will be supported outside the standard library for current Python versions before a subsequent PEP adds standard library support in a future version.
  2. An Informational PEP describes a Python design issue, or provides general guidelines or information to the Python community, but does not propose a new feature. Informational PEPs do not necessarily represent a Python community consensus or recommendation, so users and implementers are free to ignore Informational PEPs or follow their advice.
  3. A Process PEP describes a process surrounding Python, or proposes a change to (or an event in) a process. Process PEPs are like Standards Track PEPs but apply to areas other than the Python language itself. They may propose an implementation, but not to Python's codebase; they often require community consensus; unlike Informational PEPs, they are more than recommendations, and users are typically not free to ignore them. Examples include procedures, guidelines, changes to the decision-making process, and changes to the tools or environment used in Python development. Any meta-PEP is also considered a Process PEP.

PEP Workflow

Python's BDFL

There are several references in this PEP to the "BDFL". This acronym stands for "Benevolent Dictator for Life" and refers to Guido van Rossum, the original creator of, and the final design authority for, the Python programming language.

PEP Editors

The PEP editors are individuals responsible for managing the administrative and editorial aspects of the PEP workflow (e.g. assigning PEP numbers and changing their status). See PEP Editor Responsibilities & Workflow for details. The current editors are:

  • Chris Angelico
  • Anthony Baxter
  • Georg Brandl
  • Brett Cannon
  • David Goodger
  • Jesse Noller
  • Berker Peksag
  • Guido van Rossum
  • Barry Warsaw

PEP editorship is by invitation of the current editors. The address <peps@python.org> is a mailing list for contacting the PEP editors. All email related to PEP administration (such as requesting a PEP number or providing an updated version of a PEP for posting) should be sent to this address (no cross-posting please).

Submitting a PEP

The PEP process begins with a new idea for Python. It is highly recommended that a single PEP contain a single key proposal or new idea. Small enhancements or patches often don't need a PEP and can be injected into the Python development workflow with a patch submission to the Python issue tracker [6]. The more focused the PEP, the more successful it tends to be. The PEP editors reserve the right to reject PEP proposals if they appear too unfocused or too broad. If in doubt, split your PEP into several well-focused ones.

Each PEP must have a champion -- someone who writes the PEP using the style and format described below, shepherds the discussions in the appropriate forums, and attempts to build community consensus around the idea. The PEP champion (a.k.a. Author) should first attempt to ascertain whether the idea is PEP-able. Posting to the comp.lang.python newsgroup (a.k.a. python-list@python.org mailing list) or the python-ideas mailing list is the best way to go about this.

Vetting an idea publicly before going as far as writing a PEP is meant to save the potential author time. Many ideas have been brought forward for changing Python that have been rejected for various reasons. Asking the Python community first if an idea is original helps prevent too much time being spent on something that is guaranteed to be rejected based on prior discussions (searching the internet does not always do the trick). It also helps to make sure the idea is applicable to the entire community and not just the author. Just because an idea sounds good to the author does not mean it will work for most people in most areas where Python is used.

Once the champion has asked the Python community as to whether an idea has any chance of acceptance, a draft PEP should be presented to python-ideas. This gives the author a chance to flesh out the draft PEP to make it properly formatted, of high quality, and to address initial concerns about the proposal.

Following a discussion on python-ideas, the proposal should be sent as a draft PEP to the PEP editors <peps@python.org>. The draft must be written in PEP style as described below, else it will be sent back without further regard until proper formatting rules are followed (although minor errors will be corrected by the editors).

If the PEP editors approve, they will assign the PEP a number, label it as Standards Track, Informational, or Process, give it status "Draft", and create and check-in the initial draft of the PEP. The PEP editors will not unreasonably deny a PEP. Reasons for denying PEP status include duplication of effort, being technically unsound, not providing proper motivation or addressing backwards compatibility, or not being in keeping with the Python philosophy. The BDFL can be consulted during the approval phase, and is the final arbiter of the draft's PEP-ability.

Developers with hg push privileges for the PEP repository [10] may claim PEP numbers directly by creating and committing a new PEP. When doing so, the developer must handle the tasks that would normally be taken care of by the PEP editors (see PEP Editor Responsibilities & Workflow). This includes ensuring the initial version meets the expected standards for submitting a PEP. Alternately, even developers may choose to submit PEPs through the PEP editors. When doing so, let the PEP editors know you have hg push privileges and they can guide you through the process of updating the PEP repository directly.

As updates are necessary, the PEP author can check in new versions if they (or a collaborating developer) have hg push privileges, or else they can email new PEP versions to the PEP editors for publication.

After a PEP number has been assigned, a draft PEP may be discussed further on python-ideas (getting a PEP number assigned early can be useful for ease of reference, especially when multiple draft PEPs are being considered at the same time). Eventually, all Standards Track PEPs must be sent to the python-dev list for review as described in the next section.

Standards Track PEPs consist of two parts, a design document and a reference implementation. It is generally recommended that at least a prototype implementation be co-developed with the PEP, as ideas that sound good in principle sometimes turn out to be impractical when subjected to the test of implementation.

PEP authors are responsible for collecting community feedback on a PEP before submitting it for review. However, wherever possible, long open-ended discussions on public mailing lists should be avoided. Strategies to keep the discussions efficient include: setting up a separate SIG mailing list for the topic, having the PEP author accept private comments in the early design phases, setting up a wiki page, etc. PEP authors should use their discretion here.

PEP Review & Resolution

Once the authors have completed a PEP, they may request a review for style and consistency from the PEP editors. However, the content and final acceptance of the PEP must be requested of the BDFL, usually via an email to the python-dev mailing list. PEPs are reviewed by the BDFL and his chosen consultants, who may accept or reject a PEP or send it back to the author(s) for revision. For a PEP that is predetermined to be acceptable (e.g., it is an obvious win as-is and/or its implementation has already been checked in) the BDFL may also initiate a PEP review, first notifying the PEP author(s) and giving them a chance to make revisions.

The final authority for PEP approval is the BDFL. However, whenever a new PEP is put forward, any core developer that believes they are suitably experienced to make the final decision on that PEP may offer to serve as the BDFL's delegate (or "PEP czar") for that PEP. If their self-nomination is accepted by the other core developers and the BDFL, then they will have the authority to approve (or reject) that PEP. This process happens most frequently with PEPs where the BDFL has granted in principle approval for something to be done, but there are details that need to be worked out before the PEP can be accepted.

If the final decision on a PEP is to be made by a delegate rather than directly by the BDFL, this will be recorded by including the "BDFL-Delegate" header in the PEP.

PEP review and resolution may also occur on a list other than python-dev (for example, distutils-sig for packaging related PEPs that don't immediately affect the standard library). In this case, the "Discussions-To" heading in the PEP will identify the appropriate alternative list where discussion, review and pronouncement on the PEP will occur.

For a PEP to be accepted it must meet certain minimum criteria. It must be a clear and complete description of the proposed enhancement. The enhancement must represent a net improvement. The proposed implementation, if applicable, must be solid and must not complicate the interpreter unduly. Finally, a proposed enhancement must be "pythonic" in order to be accepted by the BDFL. (However, "pythonic" is an imprecise term; it may be defined as whatever is acceptable to the BDFL. This logic is intentionally circular.) See PEP 2 [2] for standard library module acceptance criteria.

Once a PEP has been accepted, the reference implementation must be completed. When the reference implementation is complete and incorporated into the main source code repository, the status will be changed to "Final".

A PEP can also be assigned status "Deferred". The PEP author or an editor can assign the PEP this status when no progress is being made on the PEP. Once a PEP is deferred, a PEP editor can re-assign it to draft status.

A PEP can also be "Rejected". Perhaps after all is said and done it was not a good idea. It is still important to have a record of this fact. The "Withdrawn" status is similar - it means that the PEP author themselves has decided that the PEP is actually a bad idea, or has accepted that a competing proposal is a better alternative.

When a PEP is Accepted, Rejected or Withdrawn, the PEP should be updated accordingly. In addition to updating the status field, at the very least the Resolution header should be added with a link to the relevant post in the python-dev mailing list archives.

PEPs can also be superseded by a different PEP, rendering the original obsolete. This is intended for Informational PEPs, where version 2 of an API can replace version 1.

The possible paths of the status of PEPs are as follows:

[Diagram: pep-0001-1.png -- the possible PEP status transitions]

Some Informational and Process PEPs may also have a status of "Active" if they are never meant to be completed. E.g. PEP 1 (this PEP).
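
As a rough sketch, the status paths described above can be modelled as a small transition table. The edge set below is one reading of this section (not a normative list), and the helper function is hypothetical, not part of any official PEP tooling:

```python
# Hypothetical model of PEP status transitions, inferred from this
# document; the exact set of allowed edges is not normative.
STATUS_TRANSITIONS = {
    "Draft": {"Accepted", "Rejected", "Withdrawn", "Deferred", "Active"},
    "Accepted": {"Final"},
    "Deferred": {"Draft"},       # a PEP editor may re-assign to Draft
    "Final": {"Superseded"},     # a later PEP may render it obsolete
    "Active": {"Superseded"},
    "Rejected": set(),
    "Withdrawn": set(),
    "Superseded": set(),
}

def can_transition(old: str, new: str) -> bool:
    """Return True if a PEP may move from status `old` to `new`."""
    return new in STATUS_TRANSITIONS.get(old, set())
```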

PEP Maintenance

In general, Standards track PEPs are no longer modified after they have reached the Final state. Once a PEP has been completed, the Language and Standard Library References become the formal documentation of the expected behavior.

Informational and Process PEPs may be updated over time to reflect changes to development practices and other details. The precise process followed in these cases will depend on the nature and purpose of the PEP being updated.

What belongs in a successful PEP?

Each PEP should have the following parts:

  1. Preamble -- RFC 822 style headers containing meta-data about the PEP, including the PEP number, a short descriptive title (limited to a maximum of 44 characters), the names, and optionally the contact info for each author, etc.

  2. Abstract -- a short (~200 word) description of the technical issue being addressed.

  3. Copyright/public domain -- Each PEP must either be explicitly labeled as placed in the public domain (see this PEP as an example) or licensed under the Open Publication License [7].

  4. Specification -- The technical specification should describe the syntax and semantics of any new language feature. The specification should be detailed enough to allow competing, interoperable implementations for at least the current major Python platforms (CPython, Jython, IronPython, PyPy).

  5. Motivation -- The motivation is critical for PEPs that want to change the Python language. It should clearly explain why the existing language specification is inadequate to address the problem that the PEP solves. PEP submissions without sufficient motivation may be rejected outright.

  6. Rationale -- The rationale fleshes out the specification by describing what motivated the design and why particular design decisions were made. It should describe alternate designs that were considered and related work, e.g. how the feature is supported in other languages.

    The rationale should provide evidence of consensus within the community and discuss important objections or concerns raised during discussion.

  7. Backwards Compatibility -- All PEPs that introduce backwards incompatibilities must include a section describing these incompatibilities and their severity. The PEP must explain how the author proposes to deal with these incompatibilities. PEP submissions without a sufficient backwards compatibility treatise may be rejected outright.

  8. Reference Implementation -- The reference implementation must be completed before any PEP is given status "Final", but it need not be completed before the PEP is accepted. While there is merit to the approach of reaching consensus on the specification and rationale before writing code, the principle of "rough consensus and running code" is still useful when it comes to resolving many discussions of API details.

    The final implementation must include test code and documentation appropriate for either the Python language reference or the standard library reference.

PEP Formats and Templates

There are two PEP formats available to authors: plaintext and reStructuredText [8]. Both are UTF-8-encoded text files.

Plaintext PEPs are written with minimal structural markup that adheres to a rigid style. PEP 9 contains instructions and a template [3] you can use to get started writing your plaintext PEP.

ReStructuredText [8] PEPs allow for rich markup that is still quite easy to read, but results in much better-looking and more functional HTML. PEP 12 contains instructions and a template [4] for reStructuredText PEPs.

There is a Python script that converts both styles of PEPs to HTML for viewing on the web [5]. Parsing and conversion of plaintext PEPs is self-contained within the script. reStructuredText PEPs are parsed and converted by Docutils [9] code called from the script.

PEP Header Preamble

Each PEP must begin with an RFC 822 style header preamble. The headers must appear in the following order. Headers marked with "*" are optional and are described below. All other headers are required.

  PEP: <pep number>
  Title: <pep title>
  Version: <version string>
  Last-Modified: <date string>
  Author: <list of authors' real names and optionally, email addrs>
* BDFL-Delegate: <PEP czar's real name>
* Discussions-To: <email address>
  Status: <Draft | Active | Accepted | Deferred | Rejected |
           Withdrawn | Final | Superseded>
  Type: <Standards Track | Informational | Process>
* Content-Type: <text/plain | text/x-rst>
* Requires: <pep numbers>
  Created: <date created on, in dd-mmm-yyyy format>
* Python-Version: <version number>
  Post-History: <dates of postings to python-list and python-dev>
* Replaces: <pep number>
* Superseded-By: <pep number>
* Resolution: <url>
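
Because the preamble is RFC 822 style, the standard library's email parser can read it. The following is an illustrative sketch only (the `REQUIRED` set mirrors the non-optional headers listed above; the helper is not part of any official tooling):

```python
from email.parser import Parser

# Headers not marked "*" in the list above.
REQUIRED = {"PEP", "Title", "Version", "Last-Modified", "Author",
            "Status", "Type", "Created", "Post-History"}

def parse_preamble(text):
    """Parse an RFC 822 style PEP preamble; return (msg, missing headers)."""
    msg = Parser().parsestr(text, headersonly=True)
    missing = REQUIRED - set(msg.keys())
    return msg, sorted(missing)

preamble = """\
PEP: 9999
Title: An Example PEP
Version: $Revision$
Last-Modified: $Date$
Author: Random J. User <address@dom.ain>
Status: Draft
Type: Process
Created: 14-Aug-2001
Post-History: 14-Aug-2001
"""

msg, missing = parse_preamble(preamble)
```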

The Author header lists the names, and optionally the email addresses of all the authors/owners of the PEP. The format of the Author header value must be

Random J. User <address@dom.ain>

if the email address is included, and just

Random J. User

if the address is not given. For historical reasons the format "address@dom.ain (Random J. User)" may appear in a PEP, however new PEPs must use the mandated format above, and it is acceptable to change to this format when PEPs are updated.

If there are multiple authors, each should be on a separate line following RFC 2822 continuation line conventions. Note that personal email addresses in PEPs will be obscured as a defense against spam harvesters.
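
The mandated and legacy Author formats above can be told apart mechanically; the regexes and helper below are hypothetical, shown only to illustrate the rule:

```python
import re

# Mandated formats: "Real Name <address@dom.ain>" or just "Real Name".
AUTHOR_RE = re.compile(r'^(?P<name>[^<(]+?)(?:\s*<(?P<email>[^>]+)>)?\s*$')
# Legacy format "address@dom.ain (Real Name)", no longer allowed in new PEPs.
LEGACY_RE = re.compile(r'^(?P<email>\S+@\S+)\s*\((?P<name>[^)]+)\)\s*$')

def parse_author(value):
    """Return (name, email-or-None); reject the legacy format."""
    if LEGACY_RE.match(value):
        raise ValueError("legacy format; use 'Real Name <addr>' instead")
    m = AUTHOR_RE.match(value)
    if m is None:
        raise ValueError("unrecognised Author value: %r" % value)
    return m.group("name").strip(), m.group("email")
```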

The BDFL-Delegate field is used to record cases where the final decision to approve or reject a PEP rests with someone other than the BDFL. (The delegate's email address is currently omitted due to a limitation in the email address masking for reStructuredText PEPs)

Note: The Resolution header is required for Standards Track PEPs only. It contains a URL that should point to an email message or other web resource where the pronouncement about the PEP is made.

For a PEP where final pronouncement will be made on a list other than python-dev, a Discussions-To header will indicate the mailing list or URL where the pronouncement will occur. A temporary Discussions-To header may also be used when a draft PEP is being discussed prior to submission for pronouncement. No Discussions-To header is necessary if the PEP is being discussed privately with the author, or on the python-list, python-ideas or python-dev mailing lists. Note that email addresses in the Discussions-To header will not be obscured.

The Type header specifies the type of PEP: Standards Track, Informational, or Process.

The format of a PEP is specified with a Content-Type header. The acceptable values are "text/plain" for plaintext PEPs (see PEP 9 [3]) and "text/x-rst" for reStructuredText PEPs (see PEP 12 [4]). Plaintext ("text/plain") is the default if no Content-Type header is present.

The Created header records the date that the PEP was assigned a number, while Post-History is used to record the dates of when new versions of the PEP are posted to python-list and/or python-dev. Both headers should be in dd-mmm-yyyy format, e.g. 14-Aug-2001.
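
The dd-mmm-yyyy form corresponds to the `%d-%b-%Y` format of `datetime.strptime`; a minimal sketch (assuming an English locale, which is what the month abbreviations in PEP headers use):

```python
from datetime import datetime

def parse_pep_date(value):
    """Parse a PEP header date such as '14-Aug-2001' (dd-mmm-yyyy).

    Note: %b is locale-dependent; PEP headers use English month
    abbreviations, so this assumes an English (e.g. C) locale.
    """
    return datetime.strptime(value, "%d-%b-%Y")
```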

Standards Track PEPs will typically have a Python-Version header which indicates the version of Python that the feature will be released with. Standards Track PEPs without a Python-Version header indicate interoperability standards that will initially be supported through external libraries and tools, and then supplemented by a later PEP to add support to the standard library. Informational and Process PEPs do not need a Python-Version header.

PEPs may have a Requires header, indicating the PEP numbers that this PEP depends on.

PEPs may also have a Superseded-By header indicating that a PEP has been rendered obsolete by a later document; the value is the number of the PEP that replaces the current document. The newer PEP must have a Replaces header containing the number of the PEP that it rendered obsolete.

Auxiliary Files

PEPs may include auxiliary files such as diagrams. Such files must be named pep-XXXX-Y.ext, where "XXXX" is the PEP number, "Y" is a serial number (starting at 1), and "ext" is replaced by the actual file extension (e.g. "png").
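
The naming rule can be checked mechanically; the regex and helper below are a hypothetical validator, not part of any official tooling:

```python
import re

# pep-XXXX-Y.ext: four-digit PEP number, serial number starting at 1,
# and a file extension such as "png".
AUX_RE = re.compile(r'^pep-(?P<pep>\d{4})-(?P<serial>[1-9]\d*)\.(?P<ext>\w+)$')

def parse_aux_filename(name):
    """Return (pep_number, serial, ext) for a valid auxiliary file name."""
    m = AUX_RE.match(name)
    if m is None:
        raise ValueError("not a valid auxiliary file name: %r" % name)
    return int(m.group("pep")), int(m.group("serial")), m.group("ext")
```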

Reporting PEP Bugs, or Submitting PEP Updates

How you report a bug or submit a PEP update depends on several factors, such as the maturity of the PEP, the preferences of the PEP author, and the nature of your comments. For the early draft stages of the PEP, it's probably best to send your comments and changes directly to the PEP author. For more mature, or finished PEPs you may want to submit corrections to the Python issue tracker [6] so that your changes don't get lost. If the PEP author is a Python developer, assign the bug/patch to them, otherwise assign it to a PEP editor.

When in doubt about where to send your changes, please check first with the PEP author and/or a PEP editor.

PEP authors with hg push privileges for the PEP repository can update the PEPs themselves by using "hg push" to submit their changes.

Transferring PEP Ownership

It occasionally becomes necessary to transfer ownership of PEPs to a new champion. In general, it is preferable to retain the original author as a co-author of the transferred PEP, but that's really up to the original author. A good reason to transfer ownership is because the original author no longer has the time or interest in updating it or following through with the PEP process, or has fallen off the face of the 'net (i.e. is unreachable or not responding to email). A bad reason to transfer ownership is because the author doesn't agree with the direction of the PEP. One aim of the PEP process is to try to build consensus around a PEP, but if that's not possible, an author can always submit a competing PEP.

If you are interested in assuming ownership of a PEP, send a message asking to take over, addressed to both the original author and the PEP editors <peps@python.org>. If the original author doesn't respond to email in a timely manner, the PEP editors will make a unilateral decision (it's not like such decisions can't be reversed :).

PEP Editor Responsibilities & Workflow

A PEP editor must subscribe to the <peps@python.org> list. All correspondence related to PEP administration should be sent (or forwarded) to <peps@python.org> (but please do not cross-post!).

For each new PEP that comes in, an editor does the following:

If the PEP isn't ready, an editor will send it back to the author for revision, with specific instructions.

Once the PEP is ready for the repository, a PEP editor will:

Updates to existing PEPs also come in to peps@python.org. Many PEP authors are not Python committers yet, so PEP editors do the commits for them.

Many PEPs are written and maintained by developers with write access to the Python codebase. The PEP editors monitor the python-checkins list for PEP changes, and correct any structure, grammar, spelling, or markup mistakes they see.

PEP editors don't pass judgment on PEPs. They merely do the administrative & editorial part (which is generally a low volume task).

References and Footnotes

[1]This historical record is available by the normal hg commands for retrieving older revisions, and can also be browsed via HTTP here: http://hg.python.org/peps/
[2]PEP 2, Procedure for Adding New Modules, Faassen (http://www.python.org/dev/peps/pep-0002)
[3](1, 2) PEP 9, Sample Plaintext PEP Template, Warsaw (http://www.python.org/dev/peps/pep-0009)
[4](1, 2) PEP 12, Sample reStructuredText PEP Template, Goodger, Warsaw (http://www.python.org/dev/peps/pep-0012)
[5]The script referred to here is pep2pyramid.py, the successor to pep2html.py, both of which live in the same directory in the hg repo as the PEPs themselves. Try pep2html.py --help for details. The URL for viewing PEPs on the web is http://www.python.org/dev/peps/.
[6](1, 2) http://bugs.python.org/
[7]http://www.opencontent.org/openpub/
[8](1, 2) http://docutils.sourceforge.net/rst.html
[9]http://docutils.sourceforge.net/
[10]http://hg.python.org/peps

pep-0002 Procedure for Adding New Modules

PEP: 2
Title: Procedure for Adding New Modules
Version: $Revision$
Last-Modified: $Date$
Author: Martijn Faassen <faassen at infrae.com>
Status: Final
Type: Process
Created: 07-Jul-2001
Post-History: 07-Jul-2001, 09-Mar-2002

PEP Replacement

    This PEP has been superseded by the updated material in the Python
    Developer's Guide [1].

Introduction

    The Python Standard Library contributes significantly to Python's
    success.  The language comes with "batteries included", so it is
    easy for people to become productive with just the standard
    library alone.  It is therefore important that this library grows
    with the language, and that such growth is supported and
    encouraged.

    Many contributions to the library are not created by core
    developers but by people from the Python community who are experts
    in their particular field.  Furthermore, community members are
    also the users of the standard library, applying it in a great
    diversity of settings.  This makes the community well equipped to
    detect and report gaps in the library; things that are missing but
    should be added.

    New functionality is commonly added to the library in the form of
    new modules.  This PEP will describe the procedure for the
    _addition_ of new modules.  PEP 4 deals with procedures for
    deprecation of modules; the _removal_ of old and unused modules
    from the standard library.  Finally there is also the issue of
    _changing_ existing modules to make the picture of library
    evolution complete.  PEP 3 and PEP 5 give some guidelines on this.
    The continued maintenance of existing modules is an integral part
    of the decision on whether to add a new module to the standard
    library.  Therefore, this PEP also introduces concepts
    (integrators, maintainers) relevant to the maintenance issue.
    

Integrators

    The integrators are a group of people with the following
    responsibilities:

    - They determine if a proposed contribution should become part of
      the standard library.

    - They integrate accepted contributions into the standard library.

    - They produce standard library releases.

    This group of people shall be PythonLabs, led by Guido.


Maintainer(s)

    All contributions to the standard library need one or more
    maintainers.  This can be an individual, but it is frequently a
    group of people such as the XML-SIG.  Groups may subdivide
    maintenance tasks among themselves.  One or more maintainers
    shall be the _head maintainer_ (usually this is also the main
    developer).  Head maintainers are convenient people the
    integrators can address if they want to resolve specific issues,
    such as the ones detailed later in this document.


Developer(s)

    Contributions to the standard library have been developed by one
    or more developers.  The initial maintainers are the original
    developers unless there are special circumstances (which should be
    detailed in the PEP proposing the contribution).


Acceptance Procedure

    When developers wish to have a contribution accepted into the
    standard library, they will first form a group of maintainers
    (normally initially consisting of themselves).

    Then, this group shall produce a PEP called a library PEP. A
    library PEP is a special form of standards track PEP.  The library
    PEP gives an overview of the proposed contribution, along with the
    proposed contribution as the reference implementation.  This PEP
    should also contain a motivation on why this contribution should
    be part of the standard library.

    One or more maintainers shall step forward as PEP champion (the
    people listed in the Author field are the champions).  The PEP
    champion(s) shall be the initial head maintainer(s).
    
    As described in PEP 1, a standards track PEP should consist of a
    design document and a reference implementation.  The library PEP
    differs from a normal standard track PEP in that the reference
    implementation should in this case always already have been
    written before the PEP is to be reviewed for inclusion by the
    integrators and to be commented upon by the community; the
    reference implementation _is_ the proposed contribution.

    This different requirement exists for the following reasons:

    - The integrators can only properly evaluate a contribution to the
      standard library when there is source code and documentation to
      look at; i.e. the reference implementation is always necessary
      to aid people in studying the PEP.

    - Even rejected contributions will be useful outside the standard
      library, so there will be a lower risk of wasted effort by the
      developers.
  
    - It will impress upon the integrators the seriousness of the
      contribution and will help guard them against having to evaluate
      too many frivolous proposals.

    Once the library PEP has been submitted for review, the
    integrators will then evaluate it.  The PEP will follow the normal
    PEP work flow as described in PEP 1.  If the PEP is accepted, they
    will work through the head maintainers to make the contribution
    ready for integration.


Maintenance Procedure

    After a contribution has been accepted, the job is not over for
    both integrators and maintainers.  The integrators will forward
    any bug reports in the standard library to the appropriate head
    maintainers.

    Before the feature freeze preparing for a release of the standard
    library, the integrators will check with the head maintainers for
    all contributions, to see if there are any updates to be included
    in the next release.  The integrators will evaluate any such
    updates for issues like backwards compatibility and may require
    PEPs if the changes are deemed to be large.

    The head maintainers should take an active role in keeping up to
    date with the Python development process.  If a head maintainer is
    unable to function in this way, he or she should announce the
    intention to step down to the integrators and the rest of the
    maintainers, so that a replacement can step forward.  The
    integrators should at all times be capable of reaching the head
    maintainers by email.

    In the case where no head maintainer can be found (possibly
    because there are no maintainers left), the integrators will issue
    a call to the community at large asking for new maintainers to
    step forward.  If no one does, the integrators can decide to
    declare the contribution deprecated as described in PEP 4.


Open issues

    There needs to be some procedure so that the integrators can
    always reach the maintainers (or at least the head maintainers).
    This could be accomplished by a mailing list to which all head
    maintainers should be subscribed (this could be python-dev).
    Another possibility, which may be useful in any case, is the
    maintenance of a list similar to that of the list of PEPs which
    lists all the contributions and their head maintainers with
    contact info.  This could in fact be part of the list of the PEPs,
    as a new contribution requires a PEP.  But since the
    authors/owners of a PEP introducing a new module may eventually be
    different from those who maintain it, this wouldn't resolve all
    issues yet.

    Should there be a list of what criteria integrators use for
    evaluating contributions?  (Source code but also things like
    documentation and a test suite, as well as such vague things like
    'dependability of the maintainers'.)
    
    This relates to all the technical issues; check-in privileges,
    coding style requirements, documentation requirements, test suite
    requirements.  These are preferably part of another PEP.

    Should the current standard library be subdivided among
    maintainers?  Many parts already have (informal) maintainers; it
    may be good to make this more explicit.

    Perhaps there is a better word for 'contribution'; the word
    'contribution' may not imply enough that the process (of
    development and maintenance) does not stop after the contribution
    is accepted and integrated into the library.

    Relationship to the mythical Catalog?

References

    [1] Adding to the Stdlib
        http://docs.python.org/devguide/stdlibchanges.html

Copyright

    This document has been placed in the public domain.



pep-0003 Guidelines for Handling Bug Reports

PEP: 3
Title: Guidelines for Handling Bug Reports
Version: $Revision$
Last-Modified: $Date$
Author: Jeremy Hylton <jeremy at alum.mit.edu>
Status: Withdrawn
Type: Process
Created: 25-Sep-2000
Post-History: 

Introduction

    This PEP contained guidelines for handling bug reports in 
    the Python bug tracker.  It has been replaced by the Developer's
    Guide description of issue triaging at

        https://docs.python.org/devguide/triaging.html

    Guidelines for people submitting Python bugs are at

        http://docs.python.org/bugs.html

Original Guidelines

    1. Make sure the bug category and bug group are correct.  If they
       are correct, it is easier for someone interested in helping to
       find out, say, what all the open Tkinter bugs are.

    2. If it's a minor feature request that you don't plan to address
       right away, add it to PEP 42 or ask the owner to add it for
       you.  If you add the bug to PEP 42, mark the bug as "feature
       request", "later", and "closed"; and add a comment to the bug
       saying that this is the case (mentioning the PEP explicitly).

       XXX do we prefer the tracker or PEP 42?

    3. Assign the bug a reasonable priority.  We don't yet have a
       clear sense of what each priority should mean.  One rule,
       however, is that bugs with priority "urgent" or higher must
       be fixed before the next release.

    4. If a bug report doesn't have enough information to allow you to
       reproduce or diagnose it, ask the original submitter for more
       information.  If the original report is really thin and your
       email doesn't get a response after a reasonable waiting period,
       you can close the bug.

    5. If you fix a bug, mark the status as "Fixed" and close it.  In
       the comments, include the SVN revision numbers of the commit(s).
       In the SVN checkin message, include the issue number *and* a
       normal description of the change, mentioning the contributor
       if a patch was applied.

    6. If you are assigned a bug that you are unable to deal with,
       assign it to someone else if you think they will be able to
       deal with it, otherwise it's probably best to unassign it.


References

    [1] http://bugs.python.org/

pep-0004 Deprecation of Standard Modules

PEP: 4
Title: Deprecation of Standard Modules
Version: $Revision$
Last-Modified: $Date$
Author: Martin von Löwis <martin at v.loewis.de>
Status: Active
Type: Process
Content-Type: text/x-rst
Created: 01-Oct-2000
Post-History:

Introduction

When new modules were added to the standard Python library in the past, it was not possible to foresee whether they would still be useful in the future. Even though Python "Comes With Batteries Included", batteries may discharge over time. Carrying old modules around is a burden on the maintainer, especially when there is no interest in the module anymore.

At the same time, removing a module from the distribution is difficult, as it is not known in general whether anybody is still using it. This PEP defines a procedure for removing modules from the standard Python library. Usage of a module may be 'deprecated', which means that it may be removed from a future Python release. The rationale for deprecating a module is also collected in this PEP. If the rationale turns out faulty, the module may become 'undeprecated'.

Procedure for declaring a module deprecated

Since the status of module deprecation is recorded in this PEP, proposals for deprecating modules MUST be made by providing a change to the text of this PEP, which SHOULD be a patch posted to bugs.python.org.

A proposal for deprecation of the module MUST include the date of the proposed deprecation and a rationale for deprecating it. In addition, the proposal MUST include a change to the documentation of the module; deprecation is indicated by saying that the module is "obsolete" or "deprecated". The proposal SHOULD include a patch for the module's source code to indicate deprecation there as well, by raising a DeprecationWarning. The proposal MUST include patches to remove any use of the deprecated module from the standard library.

It is expected that deprecated modules are included in the Python release that immediately follows the deprecation; later releases may ship without the deprecated modules.
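The warning step of the procedure above can be sketched as follows. This is a minimal illustration, not code from the PEP; the module name "spam" and its replacement "eggs" are hypothetical.

```python
import warnings

def deprecated_module_init():
    """What a deprecated module (here a hypothetical 'spam') would run
    at import time, so any importer sees the DeprecationWarning."""
    warnings.warn(
        "the spam module is deprecated; use the eggs module instead",
        DeprecationWarning,
        stacklevel=2,  # point the warning at the importing code
    )
```

A real module would execute this body at the top of its source file, alongside the documentation change marking it "obsolete" or "deprecated".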

Procedure for declaring a module undeprecated

When a module becomes deprecated, a rationale is given for its deprecation. In some cases, an alternative interface for the same functionality is provided, so the old interface is deprecated. In other cases, the need for having the functionality of the module may not exist anymore.

If the rationale is faulty, again a change to this PEP's text MUST be submitted. This change MUST include the date of undeprecation and a rationale for undeprecation. Modules that are undeprecated under this procedure MUST be listed in this PEP for at least one major release of Python.

Obsolete modules

A number of modules are already listed as obsolete in the library documentation. These are listed here for completeness.

cl, sv, timing

All these modules have been declared as obsolete in Python 2.0, some even earlier.

The following obsolete modules were removed in Python 2.5:

addpack, cmp, cmpcache, codehack, dircmp, dump, find, fmt, grep, lockfile, newdir, ni, packmail, Para, poly, rand, reconvert, regex, regsub, statcache, tb, tzparse, util, whatsound, whrandom, zmod

The following modules were removed in Python 2.6:

gopherlib, rgbimg, macfs

The following modules currently lack a DeprecationWarning:

rfc822, mimetools, multifile

Deprecated modules

Module name:   posixfile
Rationale:     Locking is better done by fcntl.lockf().
Date:          Before 1-Oct-2000.
Documentation: Already documented as obsolete.  Deprecation
               warning added in Python 2.6.

Module name:   gopherlib
Rationale:     The gopher protocol is not in active use anymore.
Date:          1-Oct-2000.
Documentation: Documented as deprecated since Python 2.5.  Removed
               in Python 2.6.

Module name:   rgbimgmodule
Rationale:     In a 2001-04-24 c.l.py post, Jason Petrone mentions
               that he occasionally uses it; no other references to
               its use can be found as of 2003-11-19.
Date:          1-Oct-2000
Documentation: Documented as deprecated since Python 2.5.  Removed
               in Python 2.6.

Module name:   pre
Rationale:     The underlying PCRE engine doesn't support Unicode, and
               has been unmaintained since Python 1.5.2.
Date:          10-Apr-2002
Documentation: It was only mentioned as an implementation detail,
               and never had a section of its own.   This mention
               has now been removed.

Module name:   whrandom
Rationale:     The module's default seed computation was
               inherently insecure; the random module should be
               used instead.
Date:          11-Apr-2002
Documentation: This module has been documented as obsolete since
               Python 2.1, but listing in this PEP was neglected.
               The deprecation warning will be added to the module
               one year after Python 2.3 is released, and the
               module will be removed one year after that.

Module name:   rfc822
Rationale:     Supplanted by Python 2.2's email package.
Date:          18-Mar-2002
Documentation: Documented as "deprecated since release 2.3" since
               Python 2.2.2.

Module name:   mimetools
Rationale:     Supplanted by Python 2.2's email package.
Date:          18-Mar-2002
Documentation: Documented as "deprecated since release 2.3" since
               Python 2.2.2.

Module name:   MimeWriter
Rationale:     Supplanted by Python 2.2's email package.
Date:          18-Mar-2002
Documentation: Documented as "deprecated since release 2.3" since
               Python 2.2.2.  Raises a DeprecationWarning as of
               Python 2.6.

Module name:   mimify
Rationale:     Supplanted by Python 2.2's email package.
Date:          18-Mar-2002
Documentation: Documented as "deprecated since release 2.3" since
               Python 2.2.2.  Raises a DeprecationWarning as of
               Python 2.6.

Module name:   rotor
Rationale:     Uses insecure algorithm.
Date:          24-Apr-2003
Documentation: The documentation has been removed from the library
               reference in Python 2.4.

Module name:   TERMIOS.py
Rationale:     The constants in this file are now in the 'termios' module.
Date:          10-Aug-2004
Documentation: This module has been documented as obsolete since
               Python 2.1, but listing in this PEP was neglected.
               Removed from the library reference in Python 2.4.

Module name:   statcache
Rationale:     Using the cache can be fragile and error-prone;
               applications should just use os.stat() directly.
Date:          10-Aug-2004
Documentation: This module has been documented as obsolete since
               Python 2.2, but listing in this PEP was neglected.
               Removed from the library reference in Python 2.5.

Module name:   mpz
Rationale:     Third-party packages provide similar features
               and wrap more of GMP's API.
Date:          10-Aug-2004
Documentation: This module has been documented as obsolete since
               Python 2.2, but listing in this PEP was neglected.
               Removed from the library reference in Python 2.4.

Module name:   xreadlines
Rationale:     Using 'for line in file', introduced in 2.3, is preferable.
Date:          10-Aug-2004
Documentation: This module has been documented as obsolete since
               Python 2.3, but listing in this PEP was neglected.
               Removed from the library reference in Python 2.4.

Module name:   multifile
Rationale:     Supplanted by the email package.
Date:          21-Feb-2006
Documentation: Documented as deprecated as of Python 2.5.

Module name:   sets
Rationale:     The built-in set/frozenset types, introduced in
               Python 2.4, supplant the module.
Date:          12-Jan-2007
Documentation: Documented as deprecated as of Python 2.6.

Module name:   buildtools
Rationale:     Unknown.
Date:          15-May-2007
Documentation: Documented as deprecated as of Python 2.3, but
               listing in this PEP was neglected.  Raised a
               DeprecationWarning as of Python 2.6.

Module name:   cfmfile
Rationale:     Unknown.
Date:          15-May-2007
Documentation: Documented as deprecated as of Python 2.4, but
               listing in this PEP was neglected.  A
               DeprecationWarning was added in Python 2.6.

Module name:   macfs
Rationale:     Unknown.
Date:          15-May-2007
Documentation: Documented as deprecated as of Python 2.3, but
               listing in this PEP was neglected.  Removed in
               Python 2.6.

Module name:   md5
Rationale:     Replaced by the 'hashlib' module.
Date:          15-May-2007
Documentation: Documented as deprecated as of Python 2.5, but
               listing in this PEP was neglected.
               DeprecationWarning raised as of Python 2.6.

Module name:   sha
Rationale:     Replaced by the 'hashlib' module.
Date:          15-May-2007
Documentation: Documented as deprecated as of Python 2.5, but
               listing in this PEP was neglected.
               DeprecationWarning added in Python 2.6.

Module name:   plat-freebsd2/IN and plat-freebsd3/IN
Rationale:     Platforms are obsolete (last released in 2000).
               Removed in Python 2.6.
Date:          15-May-2007
Documentation: None

Module name:   plat-freebsd4/IN and possibly plat-freebsd5/IN
Rationale:     Platforms are obsolete/unsupported.  To be removed
               in Python 2.7.
Date:          15-May-2007
Documentation: None

Module name:   formatter
Rationale:     Lack of use in the community, no tests to keep
               code working.
Documentation: Deprecated as of Python 3.4 by raising
               PendingDeprecationWarning. Slated for removal in
               Python 3.6.

Deprecation of modules removed in Python 3.0

PEP 3108 lists all modules that have been removed from Python 3.0. They all are documented as deprecated in Python 2.6, and raise a DeprecationWarning if the -3 flag is activated.

pep-0005 Guidelines for Language Evolution

PEP: 5
Title: Guidelines for Language Evolution
Version: $Revision$
Last-Modified: $Date$
Author: Paul Prescod <paul at prescod.net>
Status: Active
Type: Process
Created: 26-Oct-2000
Post-History: 

Abstract

    In the natural evolution of programming languages it is sometimes
    necessary to make changes that modify the behavior of older
    programs.  This PEP proposes a policy for implementing these
    changes in a manner respectful of the installed base of Python
    users.


Implementation Details

    Implementation of this PEP requires the addition of a formal
    warning and deprecation facility that will be described in another
    proposal.


Scope

    These guidelines apply to future versions of Python that introduce
    backward-incompatible behavior.  Backward incompatible behavior is
    a major deviation in Python interpretation from an earlier
    behavior described in the standard Python documentation.  Removal
    of a feature also constitutes a change of behavior.

    This PEP does not replace or preclude other compatibility
    strategies such as dynamic loading of backwards-compatible
    parsers.  On the other hand, if execution of "old code" requires a
    special switch or pragma then that is indeed a change of behavior
    from the point of view of the user and that change should be
    implemented according to these guidelines.

    In general, common sense must prevail in the implementation of
    these guidelines.  For instance changing "sys.copyright" does not
    constitute a backwards-incompatible change of behavior!


Steps For Introducing Backwards-Incompatible Features

    1. Propose backwards-incompatible behavior in a PEP.  The PEP must
       include a section on backwards compatibility that describes in
       detail a plan to complete the remainder of these steps.

    2. Once the PEP is accepted as a productive direction, implement
       an alternate way to accomplish the task previously provided by
       the feature that is being removed or changed.  For instance if
       the addition operator were scheduled for removal, a new version
       of Python could implement an "add()" built-in function.

    3. Formally deprecate the obsolete construct in the Python
       documentation.

    4. Add an optional warning mode to the parser that will inform
       users when the deprecated construct is used.  In other words,
       all programs that will behave differently in the future must
       trigger warnings in this mode.  Compile-time warnings are
       preferable to runtime warnings.  The warning messages should
       steer people from the deprecated construct to the alternative
       construct.

    5. There must be at least a one-year transition period between the
       release of the transitional version of Python and the release
       of the backwards incompatible version.  Users will have at
       least a year to test their programs and migrate them from use
       of the deprecated construct to the alternative one.
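The middle steps above can be sketched in present-day Python. This is only an illustration of the pattern, using the PEP's own hypothetical example of replacing the addition operator with an add() function; none of these names are real proposals.

```python
import warnings

def add(a, b):
    """Step 2: the alternative construct (the PEP's hypothetical
    add() function standing in for the ``+`` operator)."""
    return a + b

def old_plus(a, b):
    """Step 4: the deprecated construct warns and steers users
    toward the alternative construct."""
    warnings.warn(
        "this construct is deprecated; use add() instead",
        DeprecationWarning,
        stacklevel=2,
    )
    return add(a, b)
```

During the transition period both spellings keep working, but every use of the deprecated one can be made to trigger a warning in the optional warning mode.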


pep-0006 Bug Fix Releases

PEP: 6
Title: Bug Fix Releases
Version: $Revision$
Last-Modified: $Date$
Author: Aahz <aahz at pythoncraft.com>, Anthony Baxter <anthony at interlink.com.au>
Status: Active
Type: Process
Created: 15-Mar-2001
Post-History: 15-Mar-2001, 18-Apr-2001, 19-Aug-2004

Abstract

    Python has historically had only a single fork of development,
    with releases having the combined purpose of adding new features
    and delivering bug fixes (these kinds of releases will be referred
    to as "major releases").  This PEP describes how to fork off
    maintenance, or bug fix, releases of old versions for the primary 
    purpose of fixing bugs.

    This PEP is not, repeat NOT, a guarantee of the existence of bug fix
    releases; it only specifies a procedure to be followed if bug fix
    releases are desired by enough of the Python community willing to
    do the work.


Motivation

    With the move to SourceForge, Python development has accelerated.
    There is a sentiment among part of the community that there was
    too much acceleration, and many people are uncomfortable with
    upgrading to new versions to get bug fixes when so many features
    have been added, sometimes late in the development cycle.

    One solution for this issue is to maintain the previous major
    release, providing bug fixes until the next major release.  This
    should make Python more attractive for enterprise development,
    where Python may need to be installed on hundreds or thousands of
    machines.


Prohibitions

    Bug fix releases are required to adhere to the following restrictions:

    1. There must be zero syntax changes.  All .pyc and .pyo files
       must work (no regeneration needed) with all bugfix releases
       forked off from a major release.

    2. There must be zero pickle changes.

    3. There must be no incompatible C API changes.  All extensions
       must continue to work without recompiling in all bugfix releases
       in the same fork as a major release.

    Breaking any of these prohibitions requires a BDFL proclamation
    (and a prominent warning in the release notes). 


Not-Quite-Prohibitions

    Where possible, bug fix releases should also:

    1. Have no new features. The purpose of a bug fix release is to 
       fix bugs, not add the latest and greatest whizzo feature from
       the HEAD of the CVS root.

    2. Be a painless upgrade. Users should feel confident that an
       upgrade from 2.x.y to 2.x.(y+1) will not break their running
       systems. This means that, unless it is necessary to fix a bug,
       the standard library should not change behavior, or worse yet,
       APIs.


Applicability of Prohibitions

    The above prohibitions and not-quite-prohibitions apply both
    for a final release to a bugfix release (for instance, 2.4 to
    2.4.1) and for one bugfix release to the next in a series 
    (for instance 2.4.1 to 2.4.2).

    Following the prohibitions listed in this PEP should help keep
    the community happy that a bug fix release is a painless and safe
    upgrade.


Helping the Bug Fix Releases Happen

    Here are a few pointers on helping the bug fix release process along.

    1. Backport bug fixes. If you fix a bug, and it seems appropriate,
       port it to the CVS branch for the current bug fix release. If
       you're unwilling or unable to backport it yourself, make a note
       in the commit message, with words like 'Bugfix candidate' or
       'Backport candidate'.

    2. If you're not sure, ask. Ask the person managing the current bug
       fix releases if they think a particular fix is appropriate.

    3. If there's a particular bug you'd like fixed in a bug fix
       release, jump up and down and try to get it done. Do not wait
       until 48 hours before a bug fix release is due, and then start
       asking for bug fixes to be included.


Version Numbers

    Starting with Python 2.0, all major releases are required to have
    a version number of the form X.Y; bugfix releases will always be of
    the form X.Y.Z.

    The current major release under development is referred to as
    release N; the just-released major version is referred to as N-1.

    In CVS, the bug fix releases happen on a branch. For release 2.x,
    the branch is named 'release2x-maint'. For example, the branch for
    the 2.3 maintenance releases is 'release23-maint'.
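The numbering and branch-naming scheme above can be captured in a small sketch. The patterns and the helper name are illustrative, not part of the PEP:

```python
import re

# Major releases are of the form X.Y; bugfix releases are X.Y.Z.
MAJOR = re.compile(r"^(\d+)\.(\d+)$")
BUGFIX = re.compile(r"^(\d+)\.(\d+)\.(\d+)$")

def maint_branch(major_version):
    """Derive the CVS maintenance branch name for a major release,
    e.g. '2.3' -> 'release23-maint'."""
    m = MAJOR.match(major_version)
    if m is None:
        raise ValueError("expected a major release of the form X.Y")
    return "release%s%s-maint" % m.groups()
```

For example, maint_branch("2.3") yields "release23-maint", while a bugfix version such as "2.3.1" matches BUGFIX rather than MAJOR.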


Procedure

    The process for managing bugfix releases is modeled in part on the
    Tcl system [1].

    The Patch Czar is the counterpart to the BDFL for bugfix releases.
    However, the BDFL and designated appointees retain veto power over
    individual patches. A Patch Czar might only be looking after a single
    branch of development - it's quite possible that a different person
    might be maintaining the 2.3.x and the 2.4.x releases.

    As individual patches get contributed to the current trunk of CVS,
    each patch committer is requested to consider whether the patch is
    a bug fix suitable for inclusion in a bugfix release. If the patch
    is considered suitable, the committer can either commit the patch
    to the maintenance branch, or else mark the patch in the commit
    message.

    In addition, anyone from the Python community is free to suggest
    patches for inclusion. Patches may be submitted specifically for
    bugfix releases; they should follow the guidelines in PEP 3 [2].
    In general, though, it's probably better that a bug in a specific
    release be fixed on the HEAD as well as on the branch.

    The Patch Czar decides when there are a sufficient number of patches
    to warrant a release. The release gets packaged up, including a
    Windows installer, and made public. If any new bugs are found, they
    must be fixed immediately and a new bugfix release publicized (with
    an incremented version number). For the 2.3.x cycle, the Patch Czar
    (Anthony) has been trying for a release approximately every six 
    months, but this should not be considered binding in any way on 
    any future releases. 

    Bug fix releases are expected to occur at an interval of roughly
    six months. This is only a guideline, however - obviously, if a
    major bug is found, a bugfix release may be appropriate sooner. In
    general, only the N-1 release will be under active maintenance at
    any time. That is, during Python 2.4's development, Python 2.3 gets
    bugfix releases. If, however, someone qualified wishes to continue
    the work to maintain an older release, they should be encouraged.


Patch Czar History

    Anthony Baxter is the Patch Czar for 2.3.1 through 2.3.4.

    Barry Warsaw is the Patch Czar for 2.2.3.

    Guido van Rossum is the Patch Czar for 2.2.2.

    Michael Hudson is the Patch Czar for 2.2.1.

    Anthony Baxter is the Patch Czar for 2.1.2 and 2.1.3.

    Thomas Wouters is the Patch Czar for 2.1.1.

    Moshe Zadka is the Patch Czar for 2.0.1.


History

    This PEP started life as a proposal on comp.lang.python.  The
    original version suggested a single patch for the N-1 release to
    be released concurrently with the N release.  The original version
    also argued for sticking with a strict bug fix policy.

    Following feedback from the BDFL and others, the draft PEP was
    written containing an expanded bugfix release cycle that permitted
    any previous major release to obtain patches and also relaxed
    the strict bug fix requirement (mainly due to the example of PEP
    235 [3], which could be argued as either a bug fix or a feature).

    Discussion then mostly moved to python-dev, where the BDFL finally
    issued a proclamation basing the Python bugfix release process on
    Tcl's, which essentially returned to the original proposal in
    terms of being only the N-1 release and only bug fixes, but
    allowing multiple bugfix releases until release N is published.

    Anthony Baxter then took this PEP and revised it, based on 
    lessons from the 2.3 release cycle. 


References

    [1] http://www.tcl.tk/cgi-bin/tct/tip/28.html

    [2] PEP 3, Guidelines for Handling Bug Reports, Hylton
        http://www.python.org/dev/peps/pep-0003/

    [3] PEP 235, Import on Case-Insensitive Platforms, Peters
        http://www.python.org/dev/peps/pep-0235/


Copyright

    This document has been placed in the public domain.



pep-0007 Style Guide for C Code

PEP: 7
Title: Style Guide for C Code
Version: $Revision$
Last-Modified: $Date$
Author: Guido van Rossum <guido at python.org>
Status: Active
Type: Process
Content-Type: text/x-rst
Created: 05-Jul-2001
Post-History:

Introduction

This document gives coding conventions for the C code comprising the C implementation of Python. Please see the companion informational PEP describing style guidelines for Python code [1].

Note, rules are there to be broken. Two good reasons to break a particular rule:

  1. When applying the rule would make the code less readable, even for someone who is used to reading code that follows the rules.
  2. To be consistent with surrounding code that also breaks it (maybe for historic reasons) -- although this is also an opportunity to clean up someone else's mess (in true XP style).

C dialect

  • Use ANSI/ISO standard C (the 1989 version of the standard). This means (amongst many other things) that all declarations must be at the top of a block (not necessarily at the top of a function).
  • Don't use GCC extensions (e.g. don't write multi-line strings without trailing backslashes).
  • All function declarations and definitions must use full prototypes (i.e. specify the types of all arguments).
  • Never use C++ style // one-line comments.
  • No compiler warnings with major compilers (gcc, VC++, a few others).

Code lay-out

  • Use 4-space indents and no tabs at all.

  • No line should be longer than 79 characters. If this and the previous rule together don't give you enough room to code, your code is too complicated -- consider using subroutines.

  • No line should end in whitespace. If you think you need significant trailing whitespace, think again -- somebody's editor might delete it as a matter of routine.

  • Function definition style: function name in column 1, outermost curly braces in column 1, blank line after local variable declarations.

    static int
    extra_ivars(PyTypeObject *type, PyTypeObject *base)
    {
        int t_size = PyType_BASICSIZE(type);
        int b_size = PyType_BASICSIZE(base);
    
        assert(t_size >= b_size); /* type smaller than base! */
        ...
        return 1;
    }
    
  • Code structure: one space between keywords like if, for and the following left paren; no spaces inside the paren; braces may be omitted where C permits but when present, they should be formatted as shown:

    if (mro != NULL) {
        ...
    }
    else {
        ...
    }
    
  • The return statement should not get redundant parentheses:

    return Py_None; /* correct */
    return(Py_None); /* incorrect */
    
  • Function and macro call style: foo(a, b, c) -- no space before the open paren, no spaces inside the parens, no spaces before commas, one space after each comma.

  • Always put spaces around assignment, Boolean and comparison operators. In expressions using a lot of operators, add spaces around the outermost (lowest-priority) operators.

  • Breaking long lines: if you can, break after commas in the outermost argument list. Always indent continuation lines appropriately, e.g.:

    PyErr_Format(PyExc_TypeError,
                 "cannot create '%.100s' instances",
                 type->tp_name);
    
  • When you break a long expression at a binary operator, the operator goes at the end of the previous line, e.g.:

    if (type->tp_dictoffset != 0 && base->tp_dictoffset == 0 &&
        type->tp_dictoffset == b_size &&
        (size_t)t_size == b_size + sizeof(PyObject *))
        return 0; /* "Forgive" adding a __dict__ only */
    
  • Put blank lines around functions, structure definitions, and major sections inside functions.

  • Comments go before the code they describe.

  • All functions and global variables should be declared static unless they are to be part of a published interface.

  • For external functions and variables, we always have a declaration in an appropriate header file in the "Include" directory, which uses the PyAPI_FUNC() macro, like this:

    PyAPI_FUNC(PyObject *) PyObject_Repr(PyObject *);
    

Naming conventions

  • Use a Py prefix for public functions; never for static functions. The Py_ prefix is reserved for global service routines like Py_FatalError; specific groups of routines (e.g. specific object type APIs) use a longer prefix, e.g. PyString_ for string functions.
  • Public functions and variables use MixedCase with underscores, like this: PyObject_GetAttr, Py_BuildValue, PyExc_TypeError.
  • Occasionally an "internal" function has to be visible to the loader; we use the _Py prefix for this, e.g.: _PyObject_Dump.
  • Macros should have a MixedCase prefix and then use upper case, for example: PyString_AS_STRING, Py_PRINT_RAW.

Documentation Strings

  • Use the PyDoc_STR() or PyDoc_STRVAR() macro for docstrings to support building Python without docstrings (./configure --without-doc-strings).

    For C code that needs to support versions of Python older than 2.3, you can include this after including Python.h:

    #ifndef PyDoc_STR
    #define PyDoc_VAR(name)         static char name[]
    #define PyDoc_STR(str)          (str)
    #define PyDoc_STRVAR(name, str) PyDoc_VAR(name) = PyDoc_STR(str)
    #endif
    
  • The first line of each function docstring should be a "signature line" that gives a brief synopsis of the arguments and return value. For example:

    PyDoc_STRVAR(myfunction__doc__,
    "myfunction(name, value) -> bool\n\n\
    Determine whether name and value make a valid pair.");
    

    Always include a blank line between the signature line and the text of the description.

    If the return value for the function is always None (because there is no meaningful return value), do not include the indication of the return type.

  • When writing multi-line docstrings, be sure to always use backslash continuations, as in the example above, or string literal concatenation:

    PyDoc_STRVAR(myfunction__doc__,
    "myfunction(name, value) -> bool\n\n"
    "Determine whether name and value make a valid pair.");
    

    Though some C compilers accept string literals without either:

    /* BAD -- don't do this! */
    PyDoc_STRVAR(myfunction__doc__,
    "myfunction(name, value) -> bool\n\n
    Determine whether name and value make a valid pair.");
    

    not all do; the MSVC compiler is known to complain about this.

References

[1] PEP 8, "Style Guide for Python Code", van Rossum, Warsaw (http://www.python.org/dev/peps/pep-0008)

pep-0008 Style Guide for Python Code

PEP: 8
Title: Style Guide for Python Code
Version: $Revision$
Last-Modified: $Date$
Author: Guido van Rossum <guido at python.org>, Barry Warsaw <barry at python.org>, Nick Coghlan <ncoghlan at gmail.com>
Status: Active
Type: Process
Content-Type: text/x-rst
Created: 05-Jul-2001
Post-History: 05-Jul-2001, 01-Aug-2013

Introduction

This document gives coding conventions for the Python code comprising the standard library in the main Python distribution. Please see the companion informational PEP describing style guidelines for the C code in the C implementation of Python [1].

This document and PEP 257 (Docstring Conventions) were adapted from Guido's original Python Style Guide essay, with some additions from Barry's style guide [2].

This style guide evolves over time as additional conventions are identified and past conventions are rendered obsolete by changes in the language itself.

Many projects have their own coding style guidelines. In the event of any conflicts, such project-specific guides take precedence for that project.

A Foolish Consistency is the Hobgoblin of Little Minds

One of Guido's key insights is that code is read much more often than it is written. The guidelines provided here are intended to improve the readability of code and make it consistent across the wide spectrum of Python code. As PEP 20 says, "Readability counts".

A style guide is about consistency. Consistency with this style guide is important. Consistency within a project is more important. Consistency within one module or function is most important.

But most importantly: know when to be inconsistent -- sometimes the style guide just doesn't apply. When in doubt, use your best judgment. Look at other examples and decide what looks best. And don't hesitate to ask!

In particular: do not break backwards compatibility just to comply with this PEP!

Some other good reasons to ignore a particular guideline:

  1. When applying the guideline would make the code less readable, even for someone who is used to reading code that follows this PEP.
  2. To be consistent with surrounding code that also breaks it (maybe for historic reasons) -- although this is also an opportunity to clean up someone else's mess (in true XP style).
  3. Because the code in question predates the introduction of the guideline and there is no other reason to be modifying that code.
  4. When the code needs to remain compatible with older versions of Python that don't support the feature recommended by the style guide.

Code lay-out

Indentation

Use 4 spaces per indentation level.

Continuation lines should align wrapped elements either vertically using Python's implicit line joining inside parentheses, brackets and braces, or using a hanging indent [5]. When using a hanging indent the following considerations should be applied: there should be no arguments on the first line, and further indentation should be used to clearly distinguish the continuation lines from the rest of the code.

Yes:

# Aligned with opening delimiter.
foo = long_function_name(var_one, var_two,
                         var_three, var_four)

# More indentation included to distinguish this from the rest.
def long_function_name(
        var_one, var_two, var_three,
        var_four):
    print(var_one)

# Hanging indents should add a level.
foo = long_function_name(
    var_one, var_two,
    var_three, var_four)

No:

# Arguments on first line forbidden when not using vertical alignment.
foo = long_function_name(var_one, var_two,
    var_three, var_four)

# Further indentation required as indentation is not distinguishable.
def long_function_name(
    var_one, var_two, var_three,
    var_four):
    print(var_one)

The 4-space rule is optional for continuation lines.

Optional:

# Hanging indents *may* be indented to other than 4 spaces.
foo = long_function_name(
  var_one, var_two,
  var_three, var_four)

When the conditional part of an if-statement is long enough to require that it be written across multiple lines, it's worth noting that the combination of a two character keyword (i.e. if), plus a single space, plus an opening parenthesis creates a natural 4-space indent for the subsequent lines of the multiline conditional. This can produce a visual conflict with the indented suite of code nested inside the if-statement, which would also naturally be indented to 4 spaces. This PEP takes no explicit position on how (or whether) to further visually distinguish such conditional lines from the nested suite inside the if-statement. Acceptable options in this situation include, but are not limited to:

# No extra indentation.
if (this_is_one_thing and
    that_is_another_thing):
    do_something()

# Add a comment, which will provide some distinction in editors
# supporting syntax highlighting.
if (this_is_one_thing and
    that_is_another_thing):
    # Since both conditions are true, we can frobnicate.
    do_something()

# Add some extra indentation on the conditional continuation line.
if (this_is_one_thing
        and that_is_another_thing):
    do_something()

The closing brace/bracket/parenthesis on multi-line constructs may either line up under the first non-whitespace character of the last line of the list, as in:

my_list = [
    1, 2, 3,
    4, 5, 6,
    ]
result = some_function_that_takes_arguments(
    'a', 'b', 'c',
    'd', 'e', 'f',
    )

or it may be lined up under the first character of the line that starts the multi-line construct, as in:

my_list = [
    1, 2, 3,
    4, 5, 6,
]
result = some_function_that_takes_arguments(
    'a', 'b', 'c',
    'd', 'e', 'f',
)

Tabs or Spaces?

Spaces are the preferred indentation method.

Tabs should be used solely to remain consistent with code that is already indented with tabs.

Python 3 disallows mixing the use of tabs and spaces for indentation.

Python 2 code indented with a mixture of tabs and spaces should be converted to using spaces exclusively.

When invoked with the -t option, the Python 2 command line interpreter issues warnings about code that illegally mixes tabs and spaces. When using -tt these warnings become errors. These options are highly recommended!

Maximum Line Length

Limit all lines to a maximum of 79 characters.

For flowing long blocks of text with fewer structural restrictions (docstrings or comments), the line length should be limited to 72 characters.

Limiting the required editor window width makes it possible to have several files open side-by-side, and works well when using code review tools that present the two versions in adjacent columns.

The default wrapping in most tools disrupts the visual structure of the code, making it more difficult to understand. The limits are chosen to avoid wrapping in editors with the window width set to 80, even if the tool places a marker glyph in the final column when wrapping lines. Some web based tools may not offer dynamic line wrapping at all.

Some teams strongly prefer a longer line length. For code maintained exclusively or primarily by a team that can reach agreement on this issue, it is okay to increase the nominal line length from 80 to 100 characters (effectively increasing the maximum length to 99 characters), provided that comments and docstrings are still wrapped at 72 characters.

The Python standard library is conservative and requires limiting lines to 79 characters (and docstrings/comments to 72).

The preferred way of wrapping long lines is by using Python's implied line continuation inside parentheses, brackets and braces. Long lines can be broken over multiple lines by wrapping expressions in parentheses. These should be used in preference to using a backslash for line continuation.

Backslashes may still be appropriate at times. For example, long, multiple with-statements cannot use implicit continuation, so backslashes are acceptable:

with open('/path/to/some/file/you/want/to/read') as file_1, \
     open('/path/to/some/file/being/written', 'w') as file_2:
    file_2.write(file_1.read())

(See the previous discussion on multiline if-statements for further thoughts on the indentation of such multiline with-statements.)

Another such case is with assert statements.
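A long assert statement can be continued the same way; a minimal sketch, with illustrative names:

```python
# Illustrative names; the backslashes continue the long assert line.
session_is_valid = True
user_has_permission = True

assert session_is_valid and user_has_permission, \
    "request rejected: the session must be valid and " \
    "the user must hold the required permission"
```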

Make sure to indent the continued line appropriately. The preferred place to break around a binary operator is after the operator, not before it. Some examples:

class Rectangle(Blob):

    def __init__(self, width, height,
                 color='black', emphasis=None, highlight=0):
        if (width == 0 and height == 0 and
                color == 'red' and emphasis == 'strong' or
                highlight > 100):
            raise ValueError("sorry, you lose")
        if width == 0 and height == 0 and (color == 'red' or
                                           emphasis is None):
            raise ValueError("I don't think so -- values are %s, %s" %
                             (width, height))
        Blob.__init__(self, width, height,
                      color, emphasis, highlight)

Blank Lines

Surround top-level function and class definitions with two blank lines.

Method definitions inside a class are surrounded by a single blank line.

Extra blank lines may be used (sparingly) to separate groups of related functions. Blank lines may be omitted between a bunch of related one-liners (e.g. a set of dummy implementations).

Use blank lines in functions, sparingly, to indicate logical sections.
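The blank-line rules above can be sketched as follows (the names are illustrative):

```python
import math


def area(radius):
    """Top-level functions get two blank lines above and below."""
    return math.pi * radius ** 2


class Shape:
    """Methods inside a class are separated by a single blank line."""

    def perimeter(self):
        return 0

    def describe(self):
        return "generic shape"
```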

Python accepts the control-L (i.e. ^L) form feed character as whitespace; many tools treat these characters as page separators, so you may use them to separate pages of related sections of your file. Note that some editors and web-based code viewers may not recognize control-L as a form feed and will show another glyph in its place.

Source File Encoding

Code in the core Python distribution should always use UTF-8 (or ASCII in Python 2).

Files using ASCII (in Python 2) or UTF-8 (in Python 3) should not have an encoding declaration.

In the standard library, non-default encodings should be used only for test purposes or when a comment or docstring needs to mention an author name that contains non-ASCII characters; otherwise, using \x, \u, \U, or \N escapes is the preferred way to include non-ASCII data in string literals.
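For instance, a string literal containing non-ASCII data can stay in a pure-ASCII source file by using such escapes (the value is illustrative):

```python
# ASCII-only source: the non-ASCII characters are spelled with
# \u and \N escapes instead of being embedded literally.
GREETING = "Z\u00fcrich says \N{SNOWMAN}"
```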

For Python 3.0 and beyond, the following policy is prescribed for the standard library (see PEP 3131): All identifiers in the Python standard library MUST use ASCII-only identifiers, and SHOULD use English words wherever feasible (in many cases, abbreviations and technical terms are used which aren't English). In addition, string literals and comments must also be in ASCII. The only exceptions are (a) test cases testing the non-ASCII features, and (b) names of authors. Authors whose names are not based on the latin alphabet MUST provide a latin transliteration of their names.

Open source projects with a global audience are encouraged to adopt a similar policy.

Imports

  • Imports should usually be on separate lines, e.g.:

    Yes: import os
         import sys
    
    No:  import sys, os
    

    It's okay to say this though:

    from subprocess import Popen, PIPE
    
  • Imports are always put at the top of the file, just after any module comments and docstrings, and before module globals and constants.

    Imports should be grouped in the following order:

    1. standard library imports
    2. related third party imports
    3. local application/library specific imports

    You should put a blank line between each group of imports.

    Put any relevant __all__ specification after the imports.

  • Absolute imports are recommended, as they are usually more readable and tend to be better behaved (or at least give better error messages) if the import system is incorrectly configured (such as when a directory inside a package ends up on sys.path):

    import mypkg.sibling
    from mypkg import sibling
    from mypkg.sibling import example
    

    However, explicit relative imports are an acceptable alternative to absolute imports, especially when dealing with complex package layouts where using absolute imports would be unnecessarily verbose:

    from . import sibling
    from .sibling import example
    

    Standard library code should avoid complex package layouts and always use absolute imports.

    Implicit relative imports should never be used and have been removed in Python 3.

  • When importing a class from a class-containing module, it's usually okay to spell this:

    from myclass import MyClass
    from foo.bar.yourclass import YourClass
    

    If this spelling causes local name clashes, then spell them

    import myclass
    import foo.bar.yourclass
    

    and use "myclass.MyClass" and "foo.bar.yourclass.YourClass".

  • Wildcard imports (from <module> import *) should be avoided, as they make it unclear which names are present in the namespace, confusing both readers and many automated tools. There is one defensible use case for a wildcard import, which is to republish an internal interface as part of a public API (for example, overwriting a pure Python implementation of an interface with the definitions from an optional accelerator module, when exactly which definitions will be overwritten isn't known in advance).

    When republishing names this way, the guidelines below regarding public and internal interfaces still apply.

String Quotes

In Python, single-quoted strings and double-quoted strings are the same. This PEP makes no recommendation between them: pick a rule and stick to it. When a string contains single or double quote characters, however, use the other one to avoid backslashes in the string. It improves readability.

For triple-quoted strings, always use double quote characters to be consistent with the docstring convention in PEP 257.

Whitespace in Expressions and Statements

Pet Peeves

Avoid extraneous whitespace in the following situations:

  • Immediately inside parentheses, brackets or braces.

    Yes: spam(ham[1], {eggs: 2})
    No:  spam( ham[ 1 ], { eggs: 2 } )
    
  • Immediately before a comma, semicolon, or colon:

    Yes: if x == 4: print x, y; x, y = y, x
    No:  if x == 4 : print x , y ; x , y = y , x
    
  • However, in a slice the colon acts like a binary operator, and should have equal amounts of space on either side (treating it as the operator with the lowest priority). In an extended slice, both colons must have the same amount of spacing applied. Exception: when a slice parameter is omitted, the space is omitted.

    Yes:

    ham[1:9], ham[1:9:3], ham[:9:3], ham[1::3], ham[1:9:]
    ham[lower:upper], ham[lower:upper:], ham[lower::step]
    ham[lower+offset : upper+offset]
    ham[: upper_fn(x) : step_fn(x)], ham[:: step_fn(x)]
    ham[lower + offset : upper + offset]
    

    No:

    ham[lower + offset:upper + offset]
    ham[1: 9], ham[1 :9], ham[1:9 :3]
    ham[lower : : upper]
    ham[ : upper]
    
  • Immediately before the open parenthesis that starts the argument list of a function call:

    Yes: spam(1)
    No:  spam (1)
    
  • Immediately before the open parenthesis that starts an indexing or slicing:

    Yes: dct['key'] = lst[index]
    No:  dct ['key'] = lst [index]
    
  • More than one space around an assignment (or other) operator to align it with another.

    Yes:

    x = 1
    y = 2
    long_variable = 3
    

    No:

    x             = 1
    y             = 2
    long_variable = 3
    

Other Recommendations

  • Always surround these binary operators with a single space on either side: assignment (=), augmented assignment (+=, -= etc.), comparisons (==, <, >, !=, <>, <=, >=, in, not in, is, is not), Booleans (and, or, not).

  • If operators with different priorities are used, consider adding whitespace around the operators with the lowest priority(ies). Use your own judgment; however, never use more than one space, and always have the same amount of whitespace on both sides of a binary operator.

    Yes:

    i = i + 1
    submitted += 1
    x = x*2 - 1
    hypot2 = x*x + y*y
    c = (a+b) * (a-b)
    

    No:

    i=i+1
    submitted +=1
    x = x * 2 - 1
    hypot2 = x * x + y * y
    c = (a + b) * (a - b)
    
  • Don't use spaces around the = sign when used to indicate a keyword argument or a default parameter value.

    Yes:

    def complex(real, imag=0.0):
        return magic(r=real, i=imag)
    

    No:

    def complex(real, imag = 0.0):
        return magic(r = real, i = imag)
    
  • Do use spaces around the = sign of an annotated function definition. Additionally, use a single space after the :, as well as a single space on either side of the -> sign representing an annotated return value.

    Yes:

    def munge(input: AnyStr):
    def munge(sep: AnyStr = None):
    def munge() -> AnyStr:
    def munge(input: AnyStr, sep: AnyStr = None, limit=1000):
    

    No:

    def munge(input: AnyStr=None):
    def munge(input:AnyStr):
    def munge(input: AnyStr)->PosInt:
    
  • Compound statements (multiple statements on the same line) are generally discouraged.

    Yes:

    if foo == 'blah':
        do_blah_thing()
    do_one()
    do_two()
    do_three()
    

    Rather not:

    if foo == 'blah': do_blah_thing()
    do_one(); do_two(); do_three()
    
  • While sometimes it's okay to put an if/for/while with a small body on the same line, never do this for multi-clause statements. Also avoid folding such long lines!

    Rather not:

    if foo == 'blah': do_blah_thing()
    for x in lst: total += x
    while t < 10: t = delay()
    

    Definitely not:

    if foo == 'blah': do_blah_thing()
    else: do_non_blah_thing()
    
    try: something()
    finally: cleanup()
    
    do_one(); do_two(); do_three(long, argument,
                                 list, like, this)
    
    if foo == 'blah': one(); two(); three()
    

Comments

Comments that contradict the code are worse than no comments. Always make a priority of keeping the comments up-to-date when the code changes!

Comments should be complete sentences. If a comment is a phrase or sentence, its first word should be capitalized, unless it is an identifier that begins with a lower case letter (never alter the case of identifiers!).

If a comment is short, the period at the end can be omitted. Block comments generally consist of one or more paragraphs built out of complete sentences, and each sentence should end in a period.

You should use two spaces after a sentence-ending period.

When writing English, follow Strunk and White.

Python coders from non-English speaking countries: please write your comments in English, unless you are 120% sure that the code will never be read by people who don't speak your language.

Block Comments

Block comments generally apply to some (or all) code that follows them, and are indented to the same level as that code. Each line of a block comment starts with a # and a single space (unless it is indented text inside the comment).

Paragraphs inside a block comment are separated by a line containing a single #.
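A minimal sketch of both rules:

```python
# This block comment explains the statement that follows.  Each line
# starts with a # and a single space.
#
# A line containing a single # separates the two paragraphs of this
# block comment.
total = sum(range(10))
```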

Inline Comments

Use inline comments sparingly.

An inline comment is a comment on the same line as a statement. Inline comments should be separated by at least two spaces from the statement. They should start with a # and a single space.

Inline comments are unnecessary and in fact distracting if they state the obvious. Don't do this:

x = x + 1                 # Increment x

But sometimes, this is useful:

x = x + 1                 # Compensate for border

Documentation Strings

Conventions for writing good documentation strings (a.k.a. "docstrings") are immortalized in PEP 257.

  • Write docstrings for all public modules, functions, classes, and methods. Docstrings are not necessary for non-public methods, but you should have a comment that describes what the method does. This comment should appear after the def line.

  • PEP 257 describes good docstring conventions. Note that most importantly, the """ that ends a multiline docstring should be on a line by itself, e.g.:

    """Return a foobang
    
    Optional plotz says to frobnicate the bizbaz first.
    """
    
  • For one liner docstrings, please keep the closing """ on the same line.
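The first guideline above, a comment in place of a docstring for a non-public method, might look like this sketch (the class and names are illustrative):

```python
class Account:
    """Public class, so it gets a docstring."""

    def balance(self):
        """Return the current balance (public, so documented)."""
        return self._recompute()

    def _recompute(self):
        # Non-public method: a comment after the def line, rather
        # than a docstring, describes what it does.
        return 0
```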

Version Bookkeeping

If you have to have Subversion, CVS, or RCS crud in your source file, do it as follows.

__version__ = "$Revision$"
# $Source$

These lines should be included after the module's docstring, before any other code, separated by a blank line above and below.

Naming Conventions

The naming conventions of Python's library are a bit of a mess, so we'll never get this completely consistent -- nevertheless, here are the currently recommended naming standards. New modules and packages (including third party frameworks) should be written to these standards, but where an existing library has a different style, internal consistency is preferred.

Overriding Principle

Names that are visible to the user as public parts of the API should follow conventions that reflect usage rather than implementation.

Descriptive: Naming Styles

There are a lot of different naming styles. It helps to be able to recognize what naming style is being used, independently from what they are used for.

The following naming styles are commonly distinguished:

  • b (single lowercase letter)

  • B (single uppercase letter)

  • lowercase

  • lower_case_with_underscores

  • UPPERCASE

  • UPPER_CASE_WITH_UNDERSCORES

  • CapitalizedWords (or CapWords, or CamelCase -- so named because of the bumpy look of its letters [3]). This is also sometimes known as StudlyCaps.

    Note: When using abbreviations in CapWords, capitalize all the letters of the abbreviation. Thus HTTPServerError is better than HttpServerError.

  • mixedCase (differs from CapitalizedWords by initial lowercase character!)

  • Capitalized_Words_With_Underscores (ugly!)

There's also the style of using a short unique prefix to group related names together. This is not used much in Python, but it is mentioned for completeness. For example, the os.stat() function returns a tuple whose items traditionally have names like st_mode, st_size, st_mtime and so on. (This is done to emphasize the correspondence with the fields of the POSIX system call struct, which helps programmers familiar with that.)

The X11 library uses a leading X for all its public functions. In Python, this style is generally deemed unnecessary because attribute and method names are prefixed with an object, and function names are prefixed with a module name.

In addition, the following special forms using leading or trailing underscores are recognized (these can generally be combined with any case convention):

  • _single_leading_underscore: weak "internal use" indicator. E.g. from M import * does not import objects whose name starts with an underscore.

  • single_trailing_underscore_: used by convention to avoid conflicts with a Python keyword, e.g.

    Tkinter.Toplevel(master, class_='ClassName')
    
  • __double_leading_underscore: when naming a class attribute, invokes name mangling (inside class FooBar, __boo becomes _FooBar__boo; see below).

  • __double_leading_and_trailing_underscore__: "magic" objects or attributes that live in user-controlled namespaces. E.g. __init__, __import__ or __file__. Never invent such names; only use them as documented.

Prescriptive: Naming Conventions

Names to Avoid

Never use the characters 'l' (lowercase letter el), 'O' (uppercase letter oh), or 'I' (uppercase letter eye) as single character variable names.

In some fonts, these characters are indistinguishable from the numerals one and zero. When tempted to use 'l', use 'L' instead.

Package and Module Names

Modules should have short, all-lowercase names. Underscores can be used in the module name if it improves readability. Python packages should also have short, all-lowercase names, although the use of underscores is discouraged.

Since module names are mapped to file names, and some file systems are case insensitive and truncate long names, it is important that module names be chosen to be fairly short -- this won't be a problem on Unix, but it may be a problem when the code is transported to older Mac or Windows versions, or DOS.

When an extension module written in C or C++ has an accompanying Python module that provides a higher level (e.g. more object oriented) interface, the C/C++ module has a leading underscore (e.g. _socket).

Class Names

Class names should normally use the CapWords convention.

The naming convention for functions may be used instead in cases where the interface is documented and used primarily as a callable.

Note that there is a separate convention for builtin names: most builtin names are single words (or two words run together), with the CapWords convention used only for exception names and builtin constants.

Exception Names

Because exceptions should be classes, the class naming convention applies here. However, you should use the suffix "Error" on your exception names (if the exception actually is an error).

Global Variable Names

(Let's hope that these variables are meant for use inside one module only.) The conventions are about the same as those for functions.

Modules that are designed for use via from M import * should use the __all__ mechanism to prevent exporting globals, or use the older convention of prefixing such globals with an underscore (which you might want to do to indicate these globals are "module non-public").

Function Names

Function names should be lowercase, with words separated by underscores as necessary to improve readability.

mixedCase is allowed only in contexts where that's already the prevailing style (e.g. threading.py), to retain backwards compatibility.

Function and method arguments

Always use self for the first argument to instance methods.

Always use cls for the first argument to class methods.

If a function argument's name clashes with a reserved keyword, it is generally better to append a single trailing underscore rather than use an abbreviation or spelling corruption. Thus class_ is better than clss. (Perhaps better is to avoid such clashes by using a synonym.)

Method Names and Instance Variables

Use the function naming rules: lowercase with words separated by underscores as necessary to improve readability.

Use one leading underscore only for non-public methods and instance variables.

To avoid name clashes with subclasses, use two leading underscores to invoke Python's name mangling rules.

Python mangles these names with the class name: if class Foo has an attribute named __a, it cannot be accessed by Foo.__a. (An insistent user could still gain access by calling Foo._Foo__a.) Generally, double leading underscores should be used only to avoid name conflicts with attributes in classes designed to be subclassed.
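Name mangling in action; a minimal sketch:

```python
class Foo:
    def __init__(self):
        self.__a = 1   # stored under the mangled name _Foo__a


foo = Foo()
# foo.__a raises AttributeError outside the class; the mangled
# name is still reachable for insistent users:
value = foo._Foo__a
```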

Note: there is some controversy about the use of __names (see below).

Constants

Constants are usually defined on a module level and written in all capital letters with underscores separating words. Examples include MAX_OVERFLOW and TOTAL.

Designing for inheritance

Always decide whether a class's methods and instance variables (collectively: "attributes") should be public or non-public. If in doubt, choose non-public; it's easier to make it public later than to make a public attribute non-public.

Public attributes are those that you expect unrelated clients of your class to use, with your commitment to avoid backward incompatible changes. Non-public attributes are those that are not intended to be used by third parties; you make no guarantees that non-public attributes won't change or even be removed.

We don't use the term "private" here, since no attribute is really private in Python (without a generally unnecessary amount of work).

Another category of attributes consists of those that are part of the "subclass API" (often called "protected" in other languages). Some classes are designed to be inherited from, either to extend or modify aspects of the class's behavior. When designing such a class, take care to make explicit decisions about which attributes are public, which are part of the subclass API, and which are truly only to be used by your base class.

With this in mind, here are the Pythonic guidelines:

  • Public attributes should have no leading underscores.

  • If your public attribute name collides with a reserved keyword, append a single trailing underscore to your attribute name. This is preferable to an abbreviation or corrupted spelling. (However, notwithstanding this rule, 'cls' is the preferred spelling for any variable or argument which is known to be a class, especially the first argument to a class method.)

    Note 1: See the argument name recommendation above for class methods.

  • For simple public data attributes, it is best to expose just the attribute name, without complicated accessor/mutator methods. Keep in mind that Python provides an easy path to future enhancement, should you find that a simple data attribute needs to grow functional behavior. In that case, use properties to hide functional implementation behind simple data attribute access syntax.

    Note 1: Properties only work on new-style classes.

    Note 2: Try to keep the functional behavior side-effect free, although side-effects such as caching are generally fine.

    Note 3: Avoid using properties for computationally expensive operations; the attribute notation makes the caller believe that access is (relatively) cheap.

  • If your class is intended to be subclassed, and you have attributes that you do not want subclasses to use, consider naming them with double leading underscores and no trailing underscores. This invokes Python's name mangling algorithm, where the name of the class is mangled into the attribute name. This helps avoid attribute name collisions should subclasses inadvertently contain attributes with the same name.

    Note 1: Note that only the simple class name is used in the mangled name, so if a subclass chooses both the same class name and attribute name, you can still get name collisions.

    Note 2: Name mangling can make certain uses, such as debugging and __getattr__(), less convenient. However the name mangling algorithm is well documented and easy to perform manually.

    Note 3: Not everyone likes name mangling. Try to balance the need to avoid accidental name clashes with potential use by advanced callers.
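For example, the property guideline above can be sketched as follows (the Temperature class is an illustrative choice):

```python
class Temperature:
    def __init__(self, celsius=0.0):
        self._celsius = celsius   # simple data attribute, non-public

    @property
    def fahrenheit(self):
        # Cheap, side-effect-free computation hidden behind simple
        # attribute access syntax.
        return self._celsius * 9 / 5 + 32

    @fahrenheit.setter
    def fahrenheit(self, value):
        self._celsius = (value - 32) * 5 / 9
```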

Public and internal interfaces

Any backwards compatibility guarantees apply only to public interfaces. Accordingly, it is important that users be able to clearly distinguish between public and internal interfaces.

Documented interfaces are considered public, unless the documentation explicitly declares them to be provisional or internal interfaces exempt from the usual backwards compatibility guarantees. All undocumented interfaces should be assumed to be internal.

To better support introspection, modules should explicitly declare the names in their public API using the __all__ attribute. Setting __all__ to an empty list indicates that the module has no public API.

Even with __all__ set appropriately, internal interfaces (packages, modules, classes, functions, attributes or other names) should still be prefixed with a single leading underscore.

An interface is also considered internal if any containing namespace (package, module or class) is considered internal.

Imported names should always be considered an implementation detail. Other modules must not rely on indirect access to such imported names unless they are an explicitly documented part of the containing module's API, such as os.path or a package's __init__ module that exposes functionality from submodules.

Programming Recommendations

  • Code should be written in a way that does not disadvantage other implementations of Python (PyPy, Jython, IronPython, Cython, Psyco, and such).

    For example, do not rely on CPython's efficient implementation of in-place string concatenation for statements in the form a += b or a = a + b. This optimization is fragile even in CPython (it only works for some types) and isn't present at all in implementations that don't use refcounting. In performance sensitive parts of the library, the ''.join() form should be used instead. This will ensure that concatenation occurs in linear time across various implementations.

  • Comparisons to singletons like None should always be done with is or is not, never the equality operators.

    Also, beware of writing if x when you really mean if x is not None -- e.g. when testing whether a variable or argument that defaults to None was set to some other value. The other value might have a type (such as a container) that could be false in a boolean context!

  • Use is not operator rather than not ... is. While both expressions are functionally identical, the former is more readable and preferred.

    Yes:

    if foo is not None:
    

    No:

    if not foo is None:
    
  • When implementing ordering operations with rich comparisons, it is best to implement all six operations (__eq__, __ne__, __lt__, __le__, __gt__, __ge__) rather than relying on other code to only exercise a particular comparison.

    To minimize the effort involved, the functools.total_ordering() decorator provides a tool to generate missing comparison methods.

    PEP 207 indicates that reflexivity rules are assumed by Python. Thus, the interpreter may swap y > x with x < y, y >= x with x <= y, and may swap the arguments of x == y and x != y. The sort() and min() operations are guaranteed to use the < operator and the max() function uses the > operator. However, it is best to implement all six operations so that confusion doesn't arise in other contexts.
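
    A minimal sketch using functools.total_ordering (the Version class is invented for illustration): define __eq__ and one ordering method, and the decorator supplies the remaining comparisons.

```python
from functools import total_ordering

@total_ordering
class Version:
    """Only __eq__ and __lt__ are written out; total_ordering
    generates __le__, __gt__, and __ge__ from them."""

    def __init__(self, major, minor):
        self.major, self.minor = major, minor

    def __eq__(self, other):
        if not isinstance(other, Version):
            return NotImplemented
        return (self.major, self.minor) == (other.major, other.minor)

    def __lt__(self, other):
        if not isinstance(other, Version):
            return NotImplemented
        return (self.major, self.minor) < (other.major, other.minor)
```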

  • Always use a def statement instead of an assignment statement that binds a lambda expression directly to an identifier.

    Yes:

    def f(x): return 2*x
    

    No:

    f = lambda x: 2*x
    

    The first form means that the name of the resulting function object is specifically 'f' instead of the generic '<lambda>'. This is more useful for tracebacks and string representations in general. The use of the assignment statement eliminates the sole benefit a lambda expression can offer over an explicit def statement (i.e. that it can be embedded inside a larger expression).

  • Derive exceptions from Exception rather than BaseException. Direct inheritance from BaseException is reserved for exceptions where catching them is almost always the wrong thing to do.

    Design exception hierarchies based on the distinctions that code catching the exceptions is likely to need, rather than the locations where the exceptions are raised. Aim to answer the question "What went wrong?" programmatically, rather than only stating that "A problem occurred" (see PEP 3151 for an example of this lesson being learned for the builtin exception hierarchy).

    Class naming conventions apply here, although you should add the suffix "Error" to your exception classes if the exception is an error. Non-error exceptions that are used for non-local flow control or other forms of signaling need no special suffix.
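
    For illustration (both class names are invented):

```python
# An exception that reports an error gets the "Error" suffix.
class ConfigurationError(Exception):
    """Raised when a configuration file cannot be parsed."""

# A non-error exception used for flow control needs no suffix.
class Rewind(Exception):
    """Signals that the parser should back up one token."""
```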

  • Use exception chaining appropriately. In Python 3, "raise X from Y" should be used to indicate explicit replacement without losing the original traceback.

    When deliberately replacing an inner exception (using "raise X" in Python 2 or "raise X from None" in Python 3.3+), ensure that relevant details are transferred to the new exception (such as preserving the attribute name when converting KeyError to AttributeError, or embedding the text of the original exception in the new exception message).
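
    A sketch of the KeyError-to-AttributeError case mentioned above (the helper function is invented):

```python
def get_option(options, name):
    try:
        return options[name]
    except KeyError as exc:
        # "from exc" preserves the original traceback, and passing
        # the name along transfers the relevant detail to the new
        # exception.
        raise AttributeError(name) from exc
```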

  • When raising an exception in Python 2, use raise ValueError('message') instead of the older form raise ValueError, 'message'.

    The latter form is not legal Python 3 syntax.

    The paren-using form also means that when the exception arguments are long or include string formatting, you don't need to use line continuation characters thanks to the containing parentheses.
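
    For example, a long message can span lines inside the parentheses without backslash continuations (the message itself is invented):

```python
filename = 'data.csv'
limit = 100

# The containing parentheses let the argument wrap across lines
# with no line continuation characters.
err = ValueError(
    'could not process %r: the file exceeds the configured '
    'limit of %d rows' % (filename, limit))
```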

  • When catching exceptions, mention specific exceptions whenever possible instead of using a bare except: clause.

    For example, use:

    try:
        import platform_specific_module
    except ImportError:
        platform_specific_module = None
    

    A bare except: clause will catch SystemExit and KeyboardInterrupt exceptions, making it harder to interrupt a program with Control-C, and can disguise other problems. If you want to catch all exceptions that signal program errors, use except Exception: (bare except is equivalent to except BaseException:).

    A good rule of thumb is to limit use of bare 'except' clauses to two cases:

    1. If the exception handler will be printing out or logging the traceback; at least the user will be aware that an error has occurred.
    2. If the code needs to do some cleanup work, but then lets the exception propagate upwards with raise. try...finally can be a better way to handle this case.
  • When binding caught exceptions to a name, prefer the explicit name binding syntax added in Python 2.6:

    try:
        process_data()
    except Exception as exc:
        raise DataProcessingFailedError(str(exc))
    

    This is the only syntax supported in Python 3, and avoids the ambiguity problems associated with the older comma-based syntax.

  • When catching operating system errors, prefer the explicit exception hierarchy introduced in Python 3.3 over introspection of errno values.
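
    A sketch of both styles (the function names are invented); the Python 3.3+ form catches FileNotFoundError directly instead of inspecting errno:

```python
import errno

def read_config(path):
    # Preferred (Python 3.3+): the explicit OSError subclass.
    try:
        with open(path) as f:
            return f.read()
    except FileNotFoundError:
        return ''

def read_config_old(path):
    # Older style: catch the base error and introspect errno.
    try:
        with open(path) as f:
            return f.read()
    except OSError as exc:
        if exc.errno != errno.ENOENT:
            raise
        return ''
```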

  • Additionally, for all try/except clauses, limit the try clause to the absolute minimum amount of code necessary. Again, this avoids masking bugs.

    Yes:

    try:
        value = collection[key]
    except KeyError:
        return key_not_found(key)
    else:
        return handle_value(value)
    

    No:

    try:
        # Too broad!
        return handle_value(collection[key])
    except KeyError:
        # Will also catch KeyError raised by handle_value()
        return key_not_found(key)
    
  • When a resource is local to a particular section of code, use a with statement to ensure it is cleaned up promptly and reliably after use. A try/finally statement is also acceptable.
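
    Both forms guarantee cleanup; a sketch using a temporary file:

```python
import tempfile

# Preferred: the with statement closes the file even if an
# exception occurs inside the block.
with tempfile.TemporaryFile('w+') as f:
    f.write('hello')
    f.seek(0)
    data = f.read()

# Acceptable: the equivalent try/finally form.
f = tempfile.TemporaryFile('w+')
try:
    f.write('hello')
finally:
    f.close()
```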

  • Context managers should be invoked through separate functions or methods whenever they do something other than acquire and release resources. For example:

    Yes:

    with conn.begin_transaction():
        do_stuff_in_transaction(conn)
    

    No:

    with conn:
        do_stuff_in_transaction(conn)
    

    The latter example doesn't provide any information to indicate that the __enter__ and __exit__ methods are doing something other than closing the connection after a transaction. Being explicit is important in this case.

  • Be consistent in return statements. Either all return statements in a function should return an expression, or none of them should. If any return statement returns an expression, any return statements where no value is returned should explicitly state this as return None, and an explicit return statement should be present at the end of the function (if reachable).

    Yes:

    def foo(x):
        if x >= 0:
            return math.sqrt(x)
        else:
            return None
    
    def bar(x):
        if x < 0:
            return None
        return math.sqrt(x)
    

    No:

    def foo(x):
        if x >= 0:
            return math.sqrt(x)
    
    def bar(x):
        if x < 0:
            return
        return math.sqrt(x)
    
  • Use string methods instead of the string module.

    String methods are always much faster and share the same API with unicode strings. Override this rule if backward compatibility with Pythons older than 2.0 is required.

  • Use ''.startswith() and ''.endswith() instead of string slicing to check for prefixes or suffixes.

    startswith() and endswith() are cleaner and less error prone. For example:

    Yes: if foo.startswith('bar'):
    No:  if foo[:3] == 'bar':
    
  • Object type comparisons should always use isinstance() instead of comparing types directly.

    Yes: if isinstance(obj, int):
    
    No:  if type(obj) is type(1):
    

    When checking if an object is a string, keep in mind that it might be a unicode string too! In Python 2, str and unicode have a common base class, basestring, so you can do:

    if isinstance(obj, basestring):
    

    Note that in Python 3, unicode and basestring no longer exist (there is only str) and a bytes object is no longer a kind of string (it is a sequence of integers instead).

  • For sequences (strings, lists, tuples), use the fact that empty sequences are false.

    Yes: if not seq:
         if seq:
    
    No: if len(seq):
        if not len(seq):
    
  • Don't write string literals that rely on significant trailing whitespace. Such trailing whitespace is visually indistinguishable and some editors (or more recently, reindent.py) will trim them.

  • Don't compare boolean values to True or False using ==.

    Yes:   if greeting:
    No:    if greeting == True:
    Worse: if greeting is True:
    
  • The Python standard library will not use function annotations as that would result in a premature commitment to a particular annotation style. Instead, the annotations are left for users to discover and experiment with useful annotation styles.

    It is recommended that third party experiments with annotations use an associated decorator to indicate how the annotation should be interpreted.

    Early core developer attempts to use function annotations revealed inconsistent, ad-hoc annotation styles. For example:

    • [str] was ambiguous as to whether it represented a list of strings or a value that could be either str or None.
    • The notation open(file:(str,bytes)) was used for a value that could be either bytes or str rather than a 2-tuple containing a str value followed by a bytes value.
    • The annotation seek(whence:int) exhibited a mix of over-specification and under-specification: int is too restrictive (anything with __index__ would be allowed) and it is not restrictive enough (only the values 0, 1, and 2 are allowed). Likewise, the annotation write(b: bytes) was also too restrictive (anything supporting the buffer protocol would be allowed).
    • Annotations such as read1(n: int=None) were self-contradictory since None is not an int. Annotations such as source_path(self, fullname:str) -> object were confusing about what the return type should be.
    • In addition to the above, annotations were inconsistent in the use of concrete types versus abstract types: int versus Integral and set/frozenset versus MutableSet/Set.
    • Some annotations in the abstract base classes were incorrect specifications. For example, set-to-set operations require other to be another instance of Set rather than just an Iterable.
    • A further issue was that annotations became part of the specification but weren't being tested.
    • In most cases, the docstrings already included the type specifications and did so with greater clarity than the function annotations. In the remaining cases, the docstrings were improved once the annotations were removed.
    • The observed function annotations were too ad-hoc and inconsistent to work with a coherent system of automatic type checking or argument validation. Leaving these annotations in the code would have made it more difficult to make changes later so that automated utilities could be supported.
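
    The decorator recommendation above can be sketched as follows; the type_checked decorator and its behavior are hypothetical, not part of any existing library:

```python
import functools

def type_checked(func):
    """Hypothetical decorator declaring that this function's
    annotations are concrete types to verify at call time."""
    @functools.wraps(func)
    def wrapper(*args):
        names = func.__code__.co_varnames[:func.__code__.co_argcount]
        for name, value in zip(names, args):
            expected = func.__annotations__.get(name)
            if expected is not None and not isinstance(value, expected):
                raise TypeError('%s must be %s, got %s'
                                % (name, expected.__name__,
                                   type(value).__name__))
        return func(*args)
    return wrapper

# The decorator makes the intended interpretation of the
# annotations explicit, instead of leaving it ambiguous.
@type_checked
def repeat(text: str, times: int):
    return text * times
```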

Footnotes

[5]Hanging indentation is a type-setting style where all the lines in a paragraph are indented except the first line. In the context of Python, the term is used to describe a style where the opening parenthesis of a parenthesized statement is the last non-whitespace character of the line, with subsequent lines being indented until the closing parenthesis.
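
For example (the values are invented):

```python
# Hanging indentation: the opening parenthesis is the last
# non-whitespace character of its line, and the continuation
# lines are indented until the closing parenthesis.
invoice_total = sum(
    [19.99,
     5.00,
     0.01]
)
```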

References

[1]PEP 7, Style Guide for C Code, van Rossum
[2]Barry's GNU Mailman style guide http://barry.warsaw.us/software/STYLEGUIDE.txt
[3]http://www.wikipedia.com/wiki/CamelCase
[4]PEP 8 modernisation, July 2013 http://bugs.python.org/issue18472

pep-0009 Sample Plaintext PEP Template

PEP: 9
Title: Sample Plaintext PEP Template
Version: $Revision$
Last-Modified: $Date$
Author: Barry Warsaw <barry at python.org>
Status: Active
Type: Process
Content-Type: text/plain
Created: 14-Aug-2001
Post-History: 

Abstract

    This PEP provides a boilerplate or sample template for creating
    your own plaintext PEPs.  In conjunction with the content
    guidelines in PEP 1 [1], this should make it easy for you to
    conform your own PEPs to the format outlined below.

    Note: if you are reading this PEP via the web, you should first
    grab the plaintext source of this PEP in order to complete the
    steps below.  DO NOT USE THE HTML FILE AS YOUR TEMPLATE!

    To get the source of this (or any) PEP, look at the top of the HTML
    page and click on the date & time on the "Last-Modified" line.  It
    is a link to the source text in the Python repository.

    If you would prefer to use lightweight markup in your PEP, please
    see PEP 12, "Sample reStructuredText PEP Template" [2].


Rationale

    PEP submissions come in a wide variety of forms, not all adhering
    to the format guidelines set forth below.  Use this template, in
    conjunction with the content guidelines in PEP 1, to ensure that
    your PEP submission won't get automatically rejected because of
    form.


How to Use This Template

    To use this template you must first decide whether your PEP is
    going to be an Informational or Standards Track PEP.  Most PEPs
    are Standards Track because they propose a new feature for the
    Python language or standard library.  When in doubt, read PEP 1
    for details or contact the PEP editors <peps@python.org>.

    Once you've decided which type of PEP yours is going to be, follow
    the directions below.

    - Make a copy of this file (.txt file, not HTML!) and perform the
      following edits.

    - Replace the "PEP: 9" header with "PEP: XXX" since you don't yet
      have a PEP number assignment.

    - Change the Title header to the title of your PEP.

    - Leave the Version and Last-Modified headers alone; we'll take
      care of those when we check your PEP into Python's Subversion
      repository.  These headers consist of keywords ("Revision" and
      "Date" enclosed in "$"-signs) which are automatically expanded
      by the repository.  Please do not edit the expanded date or
      revision text.

    - Change the Author header to include your name, and optionally
      your email address.  Be sure to follow the format carefully:
      your name must appear first, and it must not be contained in
      parentheses.  Your email address may appear second (or it can be
      omitted) and if it appears, it must appear in angle brackets.
      It is okay to obfuscate your email address.

    - If there is a mailing list for discussion of your new feature,
      add a Discussions-To header right after the Author header.  You
      should not add a Discussions-To header if the mailing list to be
      used is either python-list@python.org or python-dev@python.org,
      or if discussions should be sent to you directly.  Most
      Informational PEPs don't have a Discussions-To header.

    - Change the Status header to "Draft".

    - For Standards Track PEPs, change the Type header to "Standards
      Track".

    - For Informational PEPs, change the Type header to
      "Informational".

    - For Standards Track PEPs, if your feature depends on the
      acceptance of some other currently in-development PEP, add a
      Requires header right after the Type header.  The value should
      be the PEP number of the PEP yours depends on.  Don't add this
      header if your dependent feature is described in a Final PEP.

    - Change the Created header to today's date.  Be sure to follow
      the format carefully: it must be in dd-mmm-yyyy format, where
      the mmm is the 3 English letter month abbreviation, e.g. one of
      Jan, Feb, Mar, Apr, May, Jun, Jul, Aug, Sep, Oct, Nov, Dec.

    - For Standards Track PEPs, after the Created header, add a
      Python-Version header and set the value to the next planned
      version of Python, i.e. the one your new feature will hopefully
      make its first appearance in.  Do not use an alpha or beta
      release designation here.  Thus, if the last version of Python
      was 2.2 alpha 1 and you're hoping to get your new feature into
      Python 2.2, set the header to:

      Python-Version: 2.2

    - Leave Post-History alone for now; you'll add dates to this
      header each time you post your PEP to python-list@python.org or
      python-dev@python.org.  E.g. if you posted your PEP to the lists
      on August 14, 2001 and September 3, 2001, the Post-History
      header would look like:

      Post-History: 14-Aug-2001, 03-Sep-2001

      You must manually add new dates and check them in.  If you don't
      have check-in privileges, send your changes to the PEP editor.

    - Add a Replaces header if your PEP obsoletes an earlier PEP.  The
      value of this header is the number of the PEP that your new PEP
      is replacing.  Only add this header if the older PEP is in
      "final" form, i.e. is either Accepted, Final, or Rejected.  You
      aren't replacing an older open PEP if you're submitting a
      competing idea.

    - Now write your Abstract, Rationale, and other content for your
      PEP, replacing all this gobbledygook with your own text. Be sure
      to adhere to the format guidelines below, specifically on the
      prohibition of tab characters and the indentation requirements.

    - Update your References and Copyright section.  Usually you'll
      place your PEP into the public domain, in which case just leave
      the "Copyright" section alone.  Alternatively, you can use the
      Open Publication License[3], but public domain is still strongly
      preferred.

    - Leave the little Emacs turd at the end of this file alone,
      including the formfeed character ("^L", or \f).

    - Send your PEP submission to the PEP editors (peps@python.org),
      along with $100k in unmarked pennies.  (Just kidding, I wanted
      to see if you were still awake. :)


Plaintext PEP Formatting Requirements

    PEP headings must begin in column zero and the initial letter of
    each word must be capitalized as in book titles.  Acronyms should
    be in all capitals.  The body of each section must be indented 4
    spaces.  Code samples inside body sections should be indented a
    further 4 spaces, and other indentation can be used as required to
    make the text readable.  You must use two blank lines between the
    last line of a section's body and the next section heading.

    You must adhere to the Emacs convention of adding two spaces at
    the end of every sentence.  You should fill your paragraphs to
    column 70, but under no circumstances should your lines extend
    past column 79.  If your code samples spill over column 79, you
    should rewrite them.

    Tab characters must never appear in the document at all.  A PEP
    should include the standard Emacs stanza included by example at
    the bottom of this PEP.

    When referencing an external web page in the body of a PEP, you
    should include the title of the page in the text, with a
    footnote reference to the URL.  Do not include the URL in the body
    text of the PEP.  E.g.

        Refer to the Python Language web site [1] for more details.
        ...
        [1] http://www.python.org

    When referring to another PEP, include the PEP number in the body
    text, such as "PEP 1".  The title may optionally appear.  Add a
    footnote reference, a number in square brackets.  The footnote
    body should include the PEP's title and author.  It may optionally
    include the explicit URL on a separate line, but only in the
    References section.  Note that the pep2html.py script will
    calculate URLs automatically.  For example:

            ...
            Refer to PEP 1 [7] for more information about PEP style
            ...

        References

            [7] PEP 1, PEP Purpose and Guidelines, Warsaw, Hylton
                http://www.python.org/dev/peps/pep-0001/

    If you decide to provide an explicit URL for a PEP, please use
    this as the URL template:

        http://www.python.org/dev/peps/pep-xxxx/

    PEP numbers in URLs must be padded with zeros from the left, so as
    to be exactly 4 characters wide, however PEP numbers in the text
    are never padded.


References

    [1] PEP 1, PEP Purpose and Guidelines, Warsaw, Hylton
        http://www.python.org/dev/peps/pep-0001/

    [2] PEP 12, Sample reStructuredText PEP Template, Goodger, Warsaw
        http://www.python.org/dev/peps/pep-0012/

    [3] http://www.opencontent.org/openpub/



Copyright

    This document has been placed in the public domain.



pep-0010 Voting Guidelines

PEP: 10
Title: Voting Guidelines
Version: $Revision$
Last-Modified: $Date$
Author: Barry Warsaw <barry at python.org>
Status: Active
Type: Process
Created: 07-Mar-2002
Post-History: 07-Mar-2002

Abstract

    This PEP outlines the python-dev voting guidelines.  These
    guidelines serve to provide feedback or gauge the "wind direction"
    on a particular proposal, idea, or feature.  They don't have a
    binding force.


Rationale

    When a new idea, feature, patch, etc. is floated in the Python
    community, either through a PEP or on the mailing lists (most
    likely on python-dev [1]), it is sometimes helpful to gauge the
    community's general sentiment.  Sometimes people just want to
    register their opinion of an idea.  Sometimes the BDFL wants to
    take a straw poll.  Whatever the reason, these guidelines have
    been adopted so as to provide a common language for developers.

    While opinions are (sometimes) useful, they are never binding.
    Opinions that are accompanied by rationales are always valued
    higher than bare scores (this is especially true with -1 votes).


Voting Scores

    The scoring guidelines are loosely derived from the Apache voting
    procedure [2], with of course our own spin on things.  There are 4
    possible vote scores:

    +1 I like it

    +0 I don't care, but go ahead

    -0 I don't care, so why bother?

    -1 I hate it

    You may occasionally see wild flashes of enthusiasm (either for or
    against) with vote scores like +2, +1000, or -1000.  These aren't
    really valued much beyond the above scores, but it's nice to see
    people get excited about such geeky stuff.


References

    [1] Python Developer's Guide,
        http://www.python.org/dev/

    [2] Apache Project Guidelines and Voting Rules
        http://httpd.apache.org/dev/guidelines.html


Copyright

    This document has been placed in the public domain.



pep-0011 Removing support for little used platforms

PEP:11
Title:Removing support for little used platforms
Version:$Revision$
Last-Modified:$Date$
Author:Martin von Löwis <martin at v.loewis.de>, Brett Cannon <brett at python.org>
Status:Active
Type:Process
Content-Type:text/x-rst
Created:07-Jul-2002
Post-History:18-Aug-2007, 16-May-2014, 20-Feb-2015

Abstract

This PEP documents how an operating system (platform) becomes supported in CPython and documents past support.

Rationale

Over time, the CPython source code has collected various pieces of platform-specific code, which, at some point in time, was considered necessary to use Python on a specific platform. Without access to this platform, it is not possible to determine whether this code is still needed. As a result, this code may either break during Python's evolution, or it may become unnecessary as the platforms evolve as well.

The growing amount of these fragments poses the risk of unmaintainability: without having experts for a large number of platforms, it is not possible to determine whether a certain change to the CPython source code will work on all supported platforms.

To reduce this risk, this PEP specifies what is required for a platform to be considered supported by Python as well as providing a procedure to remove code for platforms with few or no Python users.

Supporting platforms

Gaining official platform support requires two things. First, a core developer needs to volunteer to maintain platform-specific code. This core developer can either already be a member of the Python development team or be given contributor rights on the basis of maintaining platform support (it is at the discretion of the Python development team to decide if a person is ready to have such rights even if it is just for supporting a specific platform).

Second, a stable buildbot must be provided [2]. This guarantees that platform support will not be accidentally broken by a Python core developer who does not have personal access to the platform. For a buildbot to be considered stable it requires that the machine be reliably up and functioning (but it is up to the Python core developers to decide whether to promote a buildbot to being considered stable).

This policy does not disqualify supporting other platforms indirectly. Patches which are not platform-specific but still done to add platform support will be considered for inclusion. For example, if platform-independent changes were necessary in the configure script which were motivated to support a specific platform that could be accepted. Patches which add platform-specific code such as the name of a specific platform to the configure script will generally not be accepted without the platform having official support.

CPU architecture and compiler support are viewed in a similar manner as platforms. For example, to consider the ARM architecture supported a buildbot running on ARM would be required along with support from the Python development team. In general it is not required to have a CPU architecture run under every possible platform in order to be considered supported.

Unsupporting platforms

If a certain platform that currently has special code in CPython is deemed to be without enough Python users or lacks proper support from the Python development team and/or a buildbot, a note must be posted in this PEP that this platform is no longer actively supported. This note must include:

  • the name of the system
  • the first release number that does not support this platform anymore, and
  • the first release where the historical support code is actively removed

In some cases, it is not possible to identify the specific list of systems for which some code is used (e.g. when autoconf tests for absence of some feature which is considered present on all supported systems). In this case, the name will give the precise condition (usually a preprocessor symbol) that will become unsupported.

At the same time, the CPython source code must be changed to produce a build-time error if somebody tries to install Python on this platform. On platforms using autoconf, configure must fail. This gives potential users of the platform a chance to step forward and offer maintenance.

Re-supporting platforms

If a user of a platform wants to see this platform supported again, they may volunteer to maintain the platform support. Such an offer must be recorded in the PEP, and the user can then submit patches to remove the build-time errors and perform any other maintenance work for the platform.

Microsoft Windows

Microsoft has established a policy called product support lifecycle [1]. Each product's lifecycle has a mainstream support phase, where the product is generally commercially available, and an extended support phase, where paid support is still available, and certain bug fixes are released (in particular security fixes).

CPython's Windows support now follows this lifecycle. A new feature release X.Y.0 will support all Windows releases whose extended support phase is not yet expired. Subsequent bug fix releases will support the same Windows releases as the original feature release (even if the extended support phase has ended).

Because of this policy, no further Windows releases need to be listed in this PEP.

Each feature release is built by a specific version of Microsoft Visual Studio. That version should have mainstream support when the release is made. Developers of extension modules will generally need to use the same Visual Studio release; they are concerned both with the availability of the versions they need to use, and with keeping the zoo of versions small. The CPython source tree will keep unmaintained build files for older Visual Studio releases, for which patches will be accepted. Such build files will be removed from the source tree 3 years after the extended support for the compiler has ended (but continue to remain available in revision control).

No-longer-supported platforms

  • Name: MS-DOS, MS-Windows 3.x
    Unsupported in: Python 2.0
    Code removed in: Python 2.1
  • Name: SunOS 4
    Unsupported in: Python 2.3
    Code removed in: Python 2.4
  • Name: DYNIX
    Unsupported in: Python 2.3
    Code removed in: Python 2.4
  • Name: dgux
    Unsupported in: Python 2.3
    Code removed in: Python 2.4
  • Name: Minix
    Unsupported in: Python 2.3
    Code removed in: Python 2.4
  • Name: Irix 4 and --with-sgi-dl
    Unsupported in: Python 2.3
    Code removed in: Python 2.4
  • Name: Linux 1
    Unsupported in: Python 2.3
    Code removed in: Python 2.4
  • Name: Systems defining __d6_pthread_create (configure.in)
    Unsupported in: Python 2.3
    Code removed in: Python 2.4
  • Name: Systems defining PY_PTHREAD_D4, PY_PTHREAD_D6, or PY_PTHREAD_D7 in thread_pthread.h
    Unsupported in: Python 2.3
    Code removed in: Python 2.4
  • Name: Systems using --with-dl-dld
    Unsupported in: Python 2.3
    Code removed in: Python 2.4
  • Name: Systems using --without-universal-newlines
    Unsupported in: Python 2.3
    Code removed in: Python 2.4
  • Name: MacOS 9
    Unsupported in: Python 2.4
    Code removed in: Python 2.4
  • Name: Systems using --with-wctype-functions
    Unsupported in: Python 2.6
    Code removed in: Python 2.6
  • Name: Win9x, WinME, NT4
    Unsupported in: Python 2.6 (warning in 2.5 installer)
    Code removed in: Python 2.6
  • Name: AtheOS
    Unsupported in: Python 2.6 (with "AtheOS" changed to "Syllable")
    Build broken in: Python 2.7 (edit configure to reenable)
    Code removed in: Python 3.0
  • Name: BeOS
    Unsupported in: Python 2.6 (warning in configure)
    Build broken in: Python 2.7 (edit configure to reenable)
    Code removed in: Python 3.0
  • Name: Systems using Mach C Threads
    Unsupported in: Python 3.2
    Code removed in: Python 3.3
  • Name: SunOS lightweight processes (LWP)
    Unsupported in: Python 3.2
    Code removed in: Python 3.3
  • Name: Systems using --with-pth (GNU pth threads)
    Unsupported in: Python 3.2
    Code removed in: Python 3.3
  • Name: Systems using Irix threads
    Unsupported in: Python 3.2
    Code removed in: Python 3.3
  • Name: OSF* systems (issue 8606)
    Unsupported in: Python 3.2
    Code removed in: Python 3.3
  • Name: OS/2 (issue 16135)
    Unsupported in: Python 3.3
    Code removed in: Python 3.4
  • Name: VMS (issue 16136)
    Unsupported in: Python 3.3
    Code removed in: Python 3.4
  • Name: Windows 2000
    Unsupported in: Python 3.3
    Code removed in: Python 3.4
  • Name: Windows systems where COMSPEC points to command.com
    Unsupported in: Python 3.3
    Code removed in: Python 3.4
  • Name: RISC OS
    Unsupported in: Python 3.0 (some code actually removed)
    Code removed in: Python 3.4

pep-0012 Sample reStructuredText PEP Template

PEP:12
Title:Sample reStructuredText PEP Template
Version:$Revision$
Last-Modified:$Date$
Author:David Goodger <goodger at python.org>, Barry Warsaw <barry at python.org>
Status:Active
Type:Process
Content-Type:text/x-rst
Created:05-Aug-2002
Post-History:30-Aug-2002

Abstract

This PEP provides a boilerplate or sample template for creating your own reStructuredText PEPs. In conjunction with the content guidelines in PEP 1 [1], this should make it easy for you to conform your own PEPs to the format outlined below.

Note: if you are reading this PEP via the web, you should first grab the text (reStructuredText) source of this PEP in order to complete the steps below. DO NOT USE THE HTML FILE AS YOUR TEMPLATE!

The source for this (or any) PEP can be found in the PEPs repository, viewable on the web at https://hg.python.org/peps/file/tip .

If you would prefer not to use markup in your PEP, please see PEP 9, "Sample Plaintext PEP Template" [2].

Rationale

PEP submissions come in a wide variety of forms, not all adhering to the format guidelines set forth below. Use this template, in conjunction with the format guidelines below, to ensure that your PEP submission won't get automatically rejected because of form.

ReStructuredText is offered as an alternative to plaintext PEPs, to allow PEP authors more functionality and expressivity, while maintaining easy readability in the source text. The processed HTML form makes the functionality accessible to readers: live hyperlinks, styled text, tables, images, and automatic tables of contents, among other advantages. For an example of a PEP marked up with reStructuredText, see PEP 287.

How to Use This Template

To use this template you must first decide whether your PEP is going to be an Informational or Standards Track PEP. Most PEPs are Standards Track because they propose a new feature for the Python language or standard library. When in doubt, read PEP 1 for details or contact the PEP editors <peps@python.org>.

Once you've decided which type of PEP yours is going to be, follow the directions below.

  • Make a copy of this file (.txt file, not HTML!) and perform the following edits.

  • Replace the "PEP: 12" header with "PEP: XXX" since you don't yet have a PEP number assignment.

  • Change the Title header to the title of your PEP.

  • Leave the Version and Last-Modified headers alone; we'll take care of those when we check your PEP into Python's Subversion repository. These headers consist of keywords ("Revision" and "Date" enclosed in "$"-signs) which are automatically expanded by the repository. Please do not edit the expanded date or revision text.

  • Change the Author header to include your name, and optionally your email address. Be sure to follow the format carefully: your name must appear first, and it must not be contained in parentheses. Your email address may appear second (or it can be omitted) and if it appears, it must appear in angle brackets. It is okay to obfuscate your email address.

  • If there is a mailing list for discussion of your new feature, add a Discussions-To header right after the Author header. You should not add a Discussions-To header if the mailing list to be used is either python-list@python.org or python-dev@python.org, or if discussions should be sent to you directly. Most Informational PEPs don't have a Discussions-To header.

  • Change the Status header to "Draft".

  • For Standards Track PEPs, change the Type header to "Standards Track".

  • For Informational PEPs, change the Type header to "Informational".

  • For Standards Track PEPs, if your feature depends on the acceptance of some other currently in-development PEP, add a Requires header right after the Type header. The value should be the PEP number of the PEP yours depends on. Don't add this header if your dependent feature is described in a Final PEP.

  • Change the Created header to today's date. Be sure to follow the format carefully: it must be in dd-mmm-yyyy format, where mmm is the three-letter English month abbreviation, i.e. one of Jan, Feb, Mar, Apr, May, Jun, Jul, Aug, Sep, Oct, Nov, Dec.

  • For Standards Track PEPs, after the Created header, add a Python-Version header and set the value to the next planned version of Python, i.e. the one your new feature will hopefully make its first appearance in. Do not use an alpha or beta release designation here. Thus, if the last version of Python was 2.2 alpha 1 and you're hoping to get your new feature into Python 2.2, set the header to:

    Python-Version: 2.2
    
  • Leave Post-History alone for now; you'll add dates to this header each time you post your PEP to python-list@python.org or python-dev@python.org. If you posted your PEP to the lists on August 14, 2001 and September 3, 2001, the Post-History header would look like:

    Post-History: 14-Aug-2001, 03-Sep-2001
    

    You must manually add new dates and check them in. If you don't have check-in privileges, send your changes to the PEP editors.

  • Add a Replaces header if your PEP obsoletes an earlier PEP. The value of this header is the number of the PEP that your new PEP is replacing. Only add this header if the older PEP is in "final" form, i.e. is either Accepted, Final, or Rejected. You aren't replacing an older open PEP if you're submitting a competing idea.

  • Now write your Abstract, Rationale, and other content for your PEP, replacing all this gobbledygook with your own text. Be sure to adhere to the format guidelines below, specifically on the prohibition of tab characters and the indentation requirements.

  • Update your References and Copyright section. Usually you'll place your PEP into the public domain, in which case just leave the Copyright section alone. Alternatively, you can use the Open Publication License [6], but public domain is still strongly preferred.

  • Leave the Emacs stanza at the end of this file alone, including the formfeed character ("^L", or \f).

  • Send your PEP submission to the PEP editors at peps@python.org.
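The dd-mmm-yyyy date format required by the Created and Post-History headers can be produced mechanically. A minimal sketch (the month table is written out explicitly so the result does not depend on the current locale; the helper name is ours):

```python
import datetime

# Three-letter English month abbreviations, exactly as the
# dd-mmm-yyyy PEP header format requires.
MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

def pep_date(d: datetime.date) -> str:
    """Format a date as dd-mmm-yyyy, e.g. 05-Aug-2002."""
    return f"{d.day:02d}-{MONTHS[d.month - 1]}-{d.year}"

print(pep_date(datetime.date(2002, 8, 5)))  # 05-Aug-2002
```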

ReStructuredText PEP Formatting Requirements

The following is a PEP-specific summary of reStructuredText syntax. For the sake of simplicity and brevity, much detail is omitted. For more detail, see Resources below. Literal blocks (in which no markup processing is done) are used for examples throughout, to illustrate the plaintext markup.

General

You must adhere to the Emacs convention of adding two spaces at the end of every sentence. You should fill your paragraphs to column 70, but under no circumstances should your lines extend past column 79. If your code samples spill over column 79, you should rewrite them.

Tab characters must never appear in the document at all. A PEP should include the standard Emacs stanza included by example at the bottom of this PEP.

Section Headings

PEP headings must begin in column zero and the initial letter of each word must be capitalized as in book titles. Acronyms should be in all capitals. Section titles must be adorned with an underline, a single repeated punctuation character, which begins in column zero and must extend at least as far as the right edge of the title text (4 characters minimum). First-level section titles are underlined with "=" (equals signs), second-level section titles with "-" (hyphens), and third-level section titles with "'" (single quotes or apostrophes). For example:

First-Level Title
=================

Second-Level Title
------------------

Third-Level Title
'''''''''''''''''

If there are more than three levels of sections in your PEP, you may insert overline/underline-adorned titles for the first and second levels as follows:

============================
First-Level Title (optional)
============================

-----------------------------
Second-Level Title (optional)
-----------------------------

Third-Level Title
=================

Fourth-Level Title
------------------

Fifth-Level Title
'''''''''''''''''

You shouldn't have more than five levels of sections in your PEP. If you do, you should consider rewriting it.

You must use two blank lines between the last line of a section's body and the next section heading. If a subsection heading immediately follows a section heading, a single blank line in-between is sufficient.

The body of each section is not normally indented, although some constructs do use indentation, as described below. Blank lines are used to separate constructs.

Paragraphs

Paragraphs are left-aligned text blocks separated by blank lines. Paragraphs are not indented unless they are part of an indented construct (such as a block quote or a list item).

Inline Markup

Portions of text within paragraphs and other text blocks may be styled. For example:

Text may be marked as *emphasized* (single asterisk markup,
typically shown in italics) or **strongly emphasized** (double
asterisks, typically boldface).  ``Inline literals`` (using double
backquotes) are typically rendered in a monospaced typeface.  No
further markup recognition is done within the double backquotes,
so they're safe for any kind of code snippets.

Block Quotes

Block quotes consist of indented body elements. For example:

This is a paragraph.

    This is a block quote.

    A block quote may contain many paragraphs.

Block quotes are used to quote extended passages from other sources. Block quotes may be nested inside other body elements. Use 4 spaces per indent level.

Literal Blocks

Literal blocks are used for code samples or preformatted ASCII art. To indicate a literal block, preface the indented text block with "::" (two colons). The literal block continues until the end of the indentation. Indent the text block by 4 spaces. For example:

This is a typical paragraph.  A literal block follows.

::

    for a in [5,4,3,2,1]:   # this is program code, shown as-is
        print a
    print "it's..."
    # a literal block continues until the indentation ends

The paragraph containing only "::" will be completely removed from the output; no empty paragraph will remain. "::" is also recognized at the end of any paragraph. If immediately preceded by whitespace, both colons will be removed from the output. When text immediately precedes the "::", one colon will be removed from the output, leaving only one colon visible (i.e., "::" will be replaced by ":"). For example, one colon will remain visible here:

Paragraph::

    Literal block

Lists

Bullet list items begin with one of "-", "*", or "+" (hyphen, asterisk, or plus sign), followed by whitespace and the list item body. List item bodies must be left-aligned and indented relative to the bullet; the text immediately after the bullet determines the indentation. For example:

This paragraph is followed by a list.

* This is the first bullet list item.  The blank line above the
  first list item is required; blank lines between list items
  (such as below this paragraph) are optional.

* This is the first paragraph in the second item in the list.

  This is the second paragraph in the second item in the list.
  The blank line above this paragraph is required.  The left edge
  of this paragraph lines up with the paragraph above, both
  indented relative to the bullet.

  - This is a sublist.  The bullet lines up with the left edge of
    the text blocks above.  A sublist is a new list so requires a
    blank line above and below.

* This is the third item of the main list.

This paragraph is not part of the list.

Enumerated (numbered) list items are similar, but use an enumerator instead of a bullet. Enumerators are numbers (1, 2, 3, ...), letters (A, B, C, ...; uppercase or lowercase), or Roman numerals (i, ii, iii, iv, ...; uppercase or lowercase), formatted with a period suffix ("1.", "2."), parentheses ("(1)", "(2)"), or a right-parenthesis suffix ("1)", "2)"). For example:

1. As with bullet list items, the left edge of paragraphs must
   align.

2. Each list item may contain multiple paragraphs, sublists, etc.

   This is the second paragraph of the second list item.

   a) Enumerated lists may be nested.
   b) Blank lines may be omitted between list items.

Definition lists are written like this:

what
    Definition lists associate a term with a definition.

how
    The term is a one-line phrase, and the definition is one
    or more paragraphs or body elements, indented relative to
    the term.

Tables

Simple tables are easy and compact:

=====  =====  =======
  A      B    A and B
=====  =====  =======
False  False  False
True   False  False
False  True   False
True   True   True
=====  =====  =======

There must be at least two columns in a table (to differentiate it from section titles). Column spans are indicated by underlines made of hyphens ("Inputs" spans the first two columns):

=====  =====  ======
   Inputs     Output
------------  ------
  A      B    A or B
=====  =====  ======
False  False  False
True   False  True
False  True   True
True   True   True
=====  =====  ======

Text in a first-column cell starts a new row. No text in the first column indicates a continuation line; the rest of the cells may consist of multiple lines. For example:

=====  =========================
col 1  col 2
=====  =========================
1      Second column of row 1.
2      Second column of row 2.
       Second line of paragraph.
3      - Second column of row 3.

       - Second item in bullet
         list (row 3, column 2).
=====  =========================

Footnotes

Footnote references consist of a left square bracket, a number, a right square bracket, and a trailing underscore:

This sentence ends with a footnote reference [1]_.

Whitespace must precede the footnote reference. Leave a space between the footnote reference and the preceding word.

When referring to another PEP, include the PEP number in the body text, such as "PEP 1". The title may optionally appear. Add a footnote reference following the title. For example:

Refer to PEP 1 [2]_ for more information.

Add a footnote that includes the PEP's title and author. It may optionally include the explicit URL on a separate line, but only in the References section. Footnotes begin with ".. " (the explicit markup start), followed by the footnote marker (no underscores), followed by the footnote body. For example:

References
==========

.. [2] PEP 1, "PEP Purpose and Guidelines", Warsaw, Hylton
   (http://www.python.org/dev/peps/pep-0001)

If you decide to provide an explicit URL for a PEP, please use this as the URL template:

http://www.python.org/dev/peps/pep-xxxx

PEP numbers in URLs must be padded with zeros on the left so that they are exactly four characters wide; however, PEP numbers in the body text are never padded.
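The zero-padding rule is easy to apply mechanically; a small illustrative sketch (the helper name is ours):

```python
def pep_url(number: int) -> str:
    """Build the canonical PEP URL; the number is left-padded with
    zeros to exactly four digits (pep-0001, pep-0287, ...)."""
    return f"http://www.python.org/dev/peps/pep-{number:04d}"

print(pep_url(1))    # http://www.python.org/dev/peps/pep-0001
print(pep_url(287))  # http://www.python.org/dev/peps/pep-0287
```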

During the course of developing your PEP, you may have to add, remove, and rearrange footnote references, possibly resulting in mismatched references, obsolete footnotes, and confusion. Auto-numbered footnotes allow more freedom. Instead of a number, use a label of the form "#word", where "word" is a mnemonic consisting of alphanumerics plus internal hyphens, underscores, and periods (no whitespace or other characters are allowed). For example:

Refer to PEP 1 [#PEP-1]_ for more information.

References
==========

.. [#PEP-1] PEP 1, "PEP Purpose and Guidelines", Warsaw, Hylton

   http://www.python.org/dev/peps/pep-0001

Footnotes and footnote references will be numbered automatically, and the numbers will always match. Once a PEP is finalized, auto-numbered labels should be replaced by numbers for simplicity.

Images

If your PEP contains a diagram, you may include it in the processed output using the "image" directive:

.. image:: diagram.png

Any browser-friendly graphics format is possible: .png, .jpeg, .gif, .tiff, etc.

Since this image will not be visible to readers of the PEP in source text form, you should consider including a description or ASCII art alternative, using a comment (below).

Comments

A comment block is an indented block of arbitrary text immediately following an explicit markup start: two periods and whitespace. Leave the ".." on a line by itself to ensure that the comment is not misinterpreted as another explicit markup construct. Comments are not visible in the processed document. For the benefit of those reading your PEP in source form, please consider including a description of, or an ASCII art alternative to, any images you include. For example:

.. image:: dataflow.png

..
   Data flows from the input module, through the "black box"
   module, and finally into (and through) the output module.

The Emacs stanza at the bottom of this document is inside a comment.

Escaping Mechanism

reStructuredText uses backslashes ("\") to override the special meaning given to markup characters and get the literal characters themselves. To get a literal backslash, use an escaped backslash ("\\"). There are two contexts in which backslashes have no special meaning: literal blocks and inline literals (see Inline Markup above). In these contexts, no markup recognition is done, and a single backslash represents a literal backslash, without having to double up.

If you find that you need to use a backslash in your text, consider using inline literals or a literal block instead.

Habits to Avoid

Many programmers who are familiar with TeX often write quotation marks like this:

`single-quoted' or ``double-quoted''

Backquotes are significant in reStructuredText, so this practice should be avoided. For ordinary text, use ordinary 'single-quotes' or "double-quotes". For inline literal text (see Inline Markup above), use double-backquotes:

``literal text: in here, anything goes!``

Resources

Many other constructs and variations are possible. For more details about the reStructuredText markup, in increasing order of thoroughness, please see:

The processing of reStructuredText PEPs is done using Docutils [3]. If you have a question or require assistance with reStructuredText or Docutils, please post a message [4] to the Docutils-users mailing list [5]. The Docutils project web site [3] has more information.

pep-0020 The Zen of Python

PEP: 20
Title: The Zen of Python
Version: $Revision$
Last-Modified: $Date$
Author: Tim Peters <tim at zope.com>
Status: Active
Type: Informational
Content-Type: text/plain
Created: 19-Aug-2004
Post-History: 22-Aug-2004

Abstract

    Long time Pythoneer Tim Peters succinctly channels the BDFL's
    guiding principles for Python's design into 20 aphorisms, only 19
    of which have been written down.


The Zen of Python

    Beautiful is better than ugly.
    Explicit is better than implicit.
    Simple is better than complex.
    Complex is better than complicated.
    Flat is better than nested.
    Sparse is better than dense.
    Readability counts.
    Special cases aren't special enough to break the rules.
    Although practicality beats purity.
    Errors should never pass silently.
    Unless explicitly silenced.
    In the face of ambiguity, refuse the temptation to guess.
    There should be one-- and preferably only one --obvious way to do it.
    Although that way may not be obvious at first unless you're Dutch.
    Now is better than never.
    Although never is often better than *right* now.
    If the implementation is hard to explain, it's a bad idea.
    If the implementation is easy to explain, it may be a good idea.
    Namespaces are one honking great idea -- let's do more of those!


Easter Egg

    >>> import this


Copyright

    This document has been placed in the public domain.



pep-0042 Feature Requests

PEP: 42
Title: Feature Requests
Version: $Revision$
Last-Modified: $Date$
Author: Jeremy Hylton <jeremy at alum.mit.edu>
Status: Final
Type: Process
Created: 12-Sep-2000
Post-History: 

Introduction

    This PEP contains a list of feature requests that may be
    considered for future versions of Python.  Large feature requests
    should not be included here, but should be described in separate
    PEPs; however a large feature request that doesn't have its own
    PEP can be listed here until its own PEP is created.  See
    PEP 0 for details.

    This PEP was created to allow us to close bug reports that are really
    feature requests.  Marked as Open, they distract from the list of real
    bugs (which should ideally be less than a page).  Marked as Closed, they
    tend to be forgotten.  The procedure now is:  if a bug report is really
    a feature request, add the feature request to this PEP; mark the bug as
    "feature request", "later", and "closed"; and add a comment to the bug
    saying that this is the case (mentioning the PEP explicitly).  It is
    also acceptable to move large feature requests directly from the bugs
    database to a separate PEP.

    This PEP should really be separated into four different categories
    (categories due to Laura Creighton):

    1. BDFL rejects as a bad idea.  Don't come back with it.

    2. BDFL will put in if somebody writes the code.  (Or at any rate,
       BDFL will say 'change this and I will put it in' if you show up
       with code.)

      (possibly divided into:

            2a)  BDFL would really like to see some code!

            2b)  BDFL is never going to be enthusiastic about this, but
                 will work it in when it's easy.
       )

    3. If you show up with code, BDFL will make a pronouncement.  It
       might be ICK.

    4. This is too vague.  This is rejected, but only on the grounds
       of vagueness.  If you like this enhancement, make a new PEP.


Core Language / Builtins

    - The parser should handle more deeply nested parse trees.

      The following will fail -- eval("["*50 + "]"*50) -- because the
      parser has a hard-coded limit on stack size.  This limit should
      be raised or removed.  Removal would be hard because the
      current compiler can overflow the C stack if the nesting is too
      deep.

      http://www.python.org/sf/215555

    - Non-accidental IEEE-754 support (Infs, NaNs, settable traps, etc).
      Big project.

    - Windows:  Trying to create (or even access) files with certain magic
      names can hang or crash Windows systems.  This is really a bug in the
      OSes, but some apps try to shield users from it.  When it happens,
      the symptoms are very confusing.

      Hang using files named prn.txt, etc
      http://www.python.org/sf/481171

    - eval and free variables: It might be useful if there was a way
      to pass bindings for free variables to eval when a code object
      with free variables is passed.
      http://www.python.org/sf/443866

Standard Library

    - The urllib module should support proxies which require
      authentication.  See SourceForge bug #210619 for information:

      http://www.python.org/sf/210619

    - os.rename() should be modified to handle EXDEV errors on
      platforms that don't allow rename() to operate across filesystem
      boundaries by copying the file over and removing the original.
      Linux is one system that requires this treatment.

      http://www.python.org/sf/212317
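A copy-and-delete fallback of the kind requested can be sketched as follows; this is essentially what the standard library's shutil.move() later came to do for files, and the helper name here is ours:

```python
import errno
import os
import shutil

def rename_across_filesystems(src, dst):
    """os.rename() with a fallback: if the OS refuses a cross-device
    rename (EXDEV), copy the file (with metadata) to the destination
    and remove the original instead."""
    try:
        os.rename(src, dst)
    except OSError as exc:
        if exc.errno != errno.EXDEV:
            raise
        shutil.copy2(src, dst)
        os.remove(src)
```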

    - signal handling doesn't always work as expected.  E.g. if
      sys.stdin.readline() is interrupted by a (returning) signal
      handler, it returns "".  It would be better to make it raise an
      exception (corresponding to EINTR) or to restart.  But these
      changes would have to be applied to all places that can do blocking
      interruptable I/O.  So it's a big project.

      http://www.python.org/sf/210599

    - Extend Windows utime to accept directory paths.

      http://www.python.org/sf/214245

    - Extend copy.py to module & function types.

      http://www.python.org/sf/214553

    - Better checking for bad input to marshal.load*().

      http://www.python.org/sf/214754

    - rfc822.py should be more lenient than the spec in the types of
      address fields it parses.  Specifically, an invalid address of
      the form "From: Amazon.com <delivers-news2@amazon.com>" should
      be parsed correctly.

      http://www.python.org/sf/210678

    - cgi.py's FieldStorage class should be more conservative with
      memory in the face of large binary file uploads.

      http://www.python.org/sf/210674

      There are two issues here: first, because
      read_lines_to_outerboundary() uses readline() it is possible
      that a large amount of data will be read into memory for a
      binary file upload.  This should probably look at the
      Content-Type header of the section and do a chunked read if it's
      a binary type.

      The second issue was related to the self.lines attribute, which
      was removed in revision 1.56 of cgi.py (see also):

      http://www.python.org/sf/219806

    - urllib should support proxy definitions that contain just the
      host and port

      http://www.python.org/sf/210849

    - urlparse should be updated to comply with RFC 2396, which
      defines optional parameters for each segment of the path.

      http://www.python.org/sf/210834

    - The exceptions raised by pickle and cPickle are currently
      different; these should be unified (probably the exceptions
      should be defined in a helper module that's imported by both).
      [No bug report; I just thought of this.]

    - More standard library routines should support Unicode.  For
      example, urllib.quote() could convert Unicode strings to UTF-8
      and then do the usual %HH conversion.  But this is not the only
      one!

      http://www.python.org/sf/216716

    - There should be a way to say that you don't mind if str() or
      __str__() return a Unicode string object.  Or a different
      function -- ustr() has been proposed.  Or something...

      http://sf.net/patch/?func=detailpatch&patch_id=101527&group_id=5470

    - Killing a thread from another thread.  Or maybe sending a
      signal.  Or maybe raising an asynchronous exception.

      http://www.python.org/sf/221115

    - The debugger (pdb) should understand packages.

      http://www.python.org/sf/210631

    - Jim Fulton suggested the following:

        I wonder if it would be a good idea to have a new kind of
        temporary file that stored data in memory unless:

        - The data exceeds some size, or

        - Somebody asks for a fileno.

        Then the cgi module (and other apps) could use this thing in a
        uniform way.

      http://www.python.org/sf/415692
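This request was eventually met: Python 2.6 added tempfile.SpooledTemporaryFile, which keeps its data in memory until the data exceeds max_size or until fileno() is called. A brief illustration:

```python
import tempfile

# Data stays in memory until it grows past max_size (or until
# fileno() is called, which forces a real file into existence).
with tempfile.SpooledTemporaryFile(max_size=1024) as f:
    f.write(b"small payload")
    f.seek(0)
    print(f.read())  # b'small payload'
```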

    - Jim Fulton pointed out that binascii's b2a_base64() function
      has situations where it makes sense not to append a newline,
      or to append something else than a newline.

      Proposal:

        - add an optional argument giving the delimiter string to be
          appended, defaulting to "\n"

        - possibly special-case None as the delimiter string to avoid
          adding the pad bytes too???

      http://www.python.org/sf/415694
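This, too, landed eventually: since Python 3.6, binascii.b2a_base64() accepts a newline keyword argument (a boolean, rather than the arbitrary delimiter string proposed here). For example:

```python
import binascii

# By default a trailing newline is appended; newline=False omits it.
print(binascii.b2a_base64(b"abc"))                 # b'YWJj\n'
print(binascii.b2a_base64(b"abc", newline=False))  # b'YWJj'
```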

    - pydoc should be integrated with the HTML docs, or at least
      be able to link to them.

      http://www.python.org/sf/405554

    - Distutils should deduce dependencies for .c and .h files.

      http://www.python.org/sf/472881

    - asynchat is buggy in the face of multithreading.

      http://www.python.org/sf/595217

    - It would be nice if the higher level modules (httplib, smtplib,
      nntplib, etc.) had options for setting socket timeouts.

      http://www.python.org/sf/723287
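Timeout options of this kind were later added: Python 2.6 gave httplib, smtplib, and friends a timeout constructor argument. A sketch using the Python 3 module name (example.com is just a placeholder host; no network traffic occurs here):

```python
import http.client

# The timeout (in seconds) is applied to the underlying socket when
# the connection is actually made; constructing the object is cheap.
conn = http.client.HTTPConnection("example.com", 80, timeout=5)
print(conn.timeout)  # 5
```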

    - The curses library is missing two important calls: newterm() and
      delscreen().

      http://www.python.org/sf/665572, http://bugs.debian.org/175590

    - It would be nice if the built-in SSL socket type could be used
      for non-blocking SSL I/O.  Currently packages such as Twisted 
      which implement async servers using SSL have to require third-party
      packages such as pyopenssl.  

    - reST as a standard library module

    - The import lock could use some redesign.
    
      http://www.python.org/sf/683658

    - A nicer API to open text files, replacing the ugly (in some
      people's eyes) "U" mode flag.  There's a proposal out there to
      have a new built-in type textfile(filename, mode, encoding).
      (Shouldn't it have a bufsize argument too?)
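Python 3 ultimately answered this with the io module: the built-in open() grew encoding, errors, newline, and (yes) buffering parameters, making the "U" mode flag unnecessary. A sketch:

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "note.txt")

# Explicit encoding and buffer size; no mode-flag tricks needed.
with open(path, "w", encoding="utf-8", buffering=8192) as f:
    f.write("héllo\n")

with open(path, encoding="utf-8") as f:
    print(f.read())  # héllo
```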

    - Support new widgets and/or parameters for Tkinter

    - For a class defined inside another class, the __name__ should be
      "outer.inner", and pickling should work.  (GvR is no longer certain
      this is easy or even right.)

      http://www.python.org/sf/633930

    - Decide on a clearer deprecation policy (especially for modules)
      and act on it.

      http://mail.python.org/pipermail/python-dev/2002-April/023165.html

    - Provide alternatives for common uses of the types module;
      Skip Montanaro has posted a proto-PEP for this idea:

      http://mail.python.org/pipermail/python-dev/2002-May/024346.html

    - Use pending deprecation for the types and string modules.  This
      requires providing alternatives for the parts that aren't
      covered yet (e.g. string.whitespace and types.TracebackType).
      It seems we can't get consensus on this.

    - Lazily tracking tuples?

      http://mail.python.org/pipermail/python-dev/2002-May/023926.html
      http://www.python.org/sf/558745

    - Make 'as' a keyword.  It has been a pseudo-keyword long enough.
      (It's deprecated in 2.5, and will become a keyword in 2.6.)


C API wishes

    - Add C API functions to help Windows users who are building
      embedded applications where the FILE * structure does not match
      the FILE * the interpreter was compiled with.

      http://www.python.org/sf/210821

      See this bug report for a specific suggestion that will allow a
      Borland C++ builder application to interact with a python.dll
      built with MSVC.


Tools

    - Python could use a GUI builder.

      http://www.python.org/sf/210820


Building and Installing

    - Modules/makesetup should make sure the 'config.c' file it
      generates from the various Setup files, is valid C. It currently
      accepts module names with characters that are not allowable in
      Python or C identifiers.

      http://www.python.org/sf/216326

    - Building from source should not attempt to overwrite the
      Include/graminit.h and Parser/graminit.c files, at least for
      people downloading a source release rather than working from
      Subversion or snapshots.  Some people find this a problem in unusual
      build environments.

      http://www.python.org/sf/219221

    - The configure script has probably grown a bit crufty with age and may
      not track autoconf's more recent features very well.  It should be
      looked at and possibly cleaned up.

      http://mail.python.org/pipermail/python-dev/2004-January/041790.html

    - Make Python compliant to the FHS (the Filesystem Hierarchy Standard)

      http://bugs.python.org/issue588756


pep-0100 Python Unicode Integration

PEP: 100
Title: Python Unicode Integration
Version: $Revision$
Last-Modified: $Date$
Author: Marc-André Lemburg <mal at lemburg.com>
Status: Final
Type: Standards Track
Created: 10-Mar-2000
Python-Version: 2.0
Post-History: 

Historical Note

    This document was first written by Marc-Andre in the pre-PEP days,
    and was originally distributed as Misc/unicode.txt in Python
    distributions up to and including Python 2.1.  The last revision of
    the proposal in that location was labeled version 1.7 (CVS
    revision 3.10).  Because the document clearly serves the purpose
    of an informational PEP in the post-PEP era, it has been moved
    here and reformatted to comply with PEP guidelines.  Future
    revisions will be made to this document, while Misc/unicode.txt
    will contain a pointer to this PEP.

    -Barry Warsaw, PEP editor


Introduction

    The idea of this proposal is to add native Unicode 3.0 support to
    Python in a way that makes use of Unicode strings as simple as
    possible without introducing too many pitfalls along the way.

    Since this goal is not easy to achieve -- strings being one of the
    most fundamental objects in Python -- we expect this proposal to
    undergo some significant refinements.

    Note that the current version of this proposal is still a bit
    unsorted due to the many different aspects of the Unicode-Python
    integration.

    The latest version of this document is always available at:

            http://starship.python.net/~lemburg/unicode-proposal.txt

    Older versions are available as:

            http://starship.python.net/~lemburg/unicode-proposal-X.X.txt

    [ed. note: new revisions should be made to this PEP document,
     while the historical record previous to version 1.7 should be
     retrieved from MAL's url, or Misc/unicode.txt]


Conventions

    - In examples we use u = Unicode object and s = Python string

    - 'XXX' markings indicate points of discussion (PODs)


General Remarks

    - Unicode encoding names should be lower case on output and
      case-insensitive on input (they will be converted to lower case
      by all APIs taking an encoding name as input).

    - Encoding names should follow the name conventions as used by the
      Unicode Consortium: spaces are converted to hyphens, e.g. 'utf
      16' is written as 'utf-16'.

    - Codec modules should use the same names, but with hyphens
      converted to underscores, e.g. utf_8, utf_16, iso_8859_1.
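Modern codecs.lookup() behaves as these conventions require: lookup is case-insensitive and tolerant of spaces, hyphens, and underscores, and the resulting codec reports a normalized lowercase name. A quick check:

```python
import codecs

# All of these aliases resolve to the same codec; .name reports the
# canonical lowercase, hyphenated form.
for alias in ("UTF-8", "utf_8", "UTF 8"):
    print(codecs.lookup(alias).name)  # utf-8 (three times)
```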


Unicode Default Encoding

    The Unicode implementation has to make some assumption about the
    encoding of 8-bit strings passed to it for coercion and about the
    encoding to use as default for conversion of Unicode to strings when
    no specific encoding is given.  This encoding is called <default
    encoding> throughout this text.

    For this, the implementation maintains a global which can be set
    in the site.py Python startup script.  Subsequent changes are not
    possible.  The <default encoding> can be set and queried using the
    two sys module APIs:

      sys.setdefaultencoding(encoding)
        --> Sets the <default encoding> used by the Unicode implementation.
            encoding has to be an encoding which is supported by the
            Python installation, otherwise, a LookupError is raised.

            Note: This API is only available in site.py!  It is
            removed from the sys module by site.py after usage.

      sys.getdefaultencoding()
        --> Returns the current <default encoding>.

    If not otherwise defined or set, the <default encoding> defaults
    to 'ascii'.  This encoding is also the startup default of Python
    (and in effect before site.py is executed).

    Note that the default site.py startup module contains disabled
    optional code which can set the <default encoding> according to
    the encoding defined by the current locale.  The locale module is
    used to extract the encoding from the locale default settings
    defined by the OS environment (see locale.py).  If the encoding
    cannot be determined, is unknown or unsupported, the code defaults
    to setting the <default encoding> to 'ascii'.  To enable this
    code, edit the site.py file or place the appropriate code into the
    sitecustomize.py module of your Python installation.
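
    The restriction described above did end up in Python exactly as
    proposed: sys.setdefaultencoding() is stripped from the sys
    module after site.py runs (and is gone entirely in Python 3,
    where the default is fixed to 'utf-8').  A sketch of the query
    half, which survived unchanged:

```python
import sys

# Query the system-wide default encoding used for implicit
# str/Unicode conversions; it cannot be changed after startup.
default = sys.getdefaultencoding()
print(default)
```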


Unicode Constructors

    Python should provide a built-in constructor for Unicode strings
    which is available through __builtins__:

    u = unicode(encoded_string[,encoding=<default encoding>][,errors="strict"])

    u = u'<unicode-escape encoded Python string>'

    u = ur'<raw-unicode-escape encoded Python string>'

    With the 'unicode-escape' encoding being defined as:

    - all non-escape characters represent themselves as Unicode
      ordinal (e.g. 'a' -> U+0061).

    - all existing defined Python escape sequences are interpreted as
      Unicode ordinals; note that \xXXXX can represent all Unicode
      ordinals, and \OOO (octal) can represent Unicode ordinals up to
      U+01FF.

    - a new escape sequence, \uXXXX, represents U+XXXX; it is a syntax
      error to have fewer than 4 digits after \u.

    For an explanation of possible values for errors see the Codec
    section below.

    Examples:

      u'abc'          -> U+0061 U+0062 U+0063
      u'\u1234'       -> U+1234
      u'abc\u1234\n'  -> U+0061 U+0062 U+0063 U+1234 U+000A

    The 'raw-unicode-escape' encoding is defined as follows:

    - a \uXXXX sequence represents the U+XXXX Unicode character if
      and only if the number of leading backslashes is odd

    - all other characters represent themselves as Unicode ordinal
      (e.g. 'b' -> U+0062)

    Note that you should provide some hint about the encoding used to
    write your program as a pragma line in one of the first few
    comment lines of the source file (e.g. '# source file encoding:
    latin-1').  If you only use 7-bit ASCII then everything is fine
    and no such notice is needed, but if you include Latin-1
    characters not defined in ASCII, it may well be worthwhile
    including a hint since people in other countries will want to be
    able to read your source strings too.
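
    Both the 'unicode-escape' and 'raw-unicode-escape' encodings
    described above still exist as codecs in today's Python; a small
    sketch of decoding through them:

```python
# A literal backslash-u-X-X-X-X sequence in the encoded form
# becomes the single character U+XXXX on decoding.
assert b'abc\\u1234'.decode('unicode-escape') == 'abc\u1234'

# raw-unicode-escape leaves ordinary escape sequences alone ...
assert b'\\n'.decode('raw-unicode-escape') == '\\n'

# ... but still interprets \uXXXX when the number of leading
# backslashes is odd.
assert b'\\u1234'.decode('raw-unicode-escape') == '\u1234'
```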


Unicode Type Object

    Unicode objects should have the type UnicodeType with type name
    'unicode', made available through the standard types module.


Unicode Output

    Unicode objects have a method .encode([encoding=<default encoding>])
    which returns a Python string encoding the Unicode string using the
    given scheme (see Codecs).

      print u := print u.encode()   # using the <default encoding>

      str(u)  := u.encode()         # using the <default encoding>

      repr(u) := "u%s" % repr(u.encode('unicode-escape'))

    Also see Internal Argument Parsing and Buffer Interface for
    details on how other APIs written in C will treat Unicode objects.


Unicode Ordinals

    Since Unicode 3.0 has a 32-bit ordinal character set, the
    implementation should provide 32-bit aware ordinal conversion
    APIs:

      ord(u[:1]) (this is the standard ord() extended to work with Unicode
                  objects)
        --> Unicode ordinal number (32-bit)

      unichr(i) 
          --> Unicode object for character i (provided it is 32-bit);
              ValueError otherwise

    Both APIs should go into __builtins__ just like their string
    counterparts ord() and chr().

    Note that Unicode provides space for private encodings.  Usage of
    these can cause different output representations on different
    machines.  This problem is not a Python or Unicode problem, but a
    machine setup and maintenance one.
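
    In today's Python the proposed unichr() was eventually folded
    into chr() itself; a sketch of the ordinal conversions as they
    exist now:

```python
# ord() maps a one-character string to its Unicode ordinal,
# chr() (the proposal's unichr()) maps an ordinal back.
assert ord('a') == 0x61
assert chr(0x1234) == '\u1234'

# Today the full 21-bit Unicode range is addressable.
assert ord(chr(0x10FFFF)) == 0x10FFFF
```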


Comparison & Hash Value

    Unicode objects should compare equal to other objects after these
    other objects have been coerced to Unicode.  For strings this
    means that they are interpreted as a Unicode string using the
    <default encoding>.

    Unicode objects should return the same hash value as their ASCII
    equivalent strings.  Unicode strings holding non-ASCII values are
    not guaranteed to return the same hash values as the default
    encoded equivalent string representation.

    When compared using cmp() (or PyObject_Compare()) the
    implementation should mask TypeErrors raised during the
    conversion to remain in sync with the string behavior.  All other
    errors, such as ValueErrors raised during coercion of strings to
    Unicode, should not be masked but instead passed through to the
    user.

    In containment tests ('a' in u'abc' and u'a' in 'abc') both sides
    should be coerced to Unicode before applying the test.  Errors
    occurring during coercion (e.g. None in u'abc') should not be
    masked.


Coercion

    Using Python strings and Unicode objects to form new objects
    should always coerce to the more precise format, i.e. Unicode
    objects.

      u + s := u + unicode(s)

      s + u := unicode(s) + u

    All string methods should delegate the call to an equivalent
    Unicode object method call by converting all involved strings to
    Unicode and then applying the arguments to the Unicode method of
    the same name, e.g.

      string.join((s,u),sep) := (s + sep) + u

      sep.join((s,u)) := (s + sep) + u

    For a discussion of %-formatting w/r to Unicode objects, see
    Formatting Markers.


Exceptions

    UnicodeError is defined in the exceptions module as a subclass of
    ValueError.  It is available at the C level via
    PyExc_UnicodeError.  All exceptions related to Unicode
    encoding/decoding should be subclasses of UnicodeError.


Codecs (Coder/Decoders) Lookup

    A Codec (see Codec Interface Definition) search registry should be
    implemented by a module "codecs":

      codecs.register(search_function)

    Search functions are expected to take one argument, the encoding
    name in all lower case letters and with hyphens and spaces
    converted to underscores, and return a tuple of functions
    (encoder, decoder, stream_reader, stream_writer) taking the
    following arguments:

      encoder and decoder:
      
        These must be functions or methods which have the same
        interface as the .encode/.decode methods of Codec instances
        (see Codec Interface). The functions/methods are expected to
        work in a stateless mode.

      stream_reader and stream_writer:

        These need to be factory functions with the following
        interface:

            factory(stream,errors='strict')

        The factory functions must return objects providing the
        interfaces defined by StreamWriter/StreamReader resp.  (see
        Codec Interface).  Stream codecs can maintain state.

        Possible values for errors are defined in the Codec section
        below.

    In case a search function cannot find a given encoding, it should
    return None.

    Aliasing support for encodings is left to the search functions to
    implement.

    The codecs module will maintain an encoding cache for performance
    reasons.  Encodings are first looked up in the cache.  If not
    found, the list of registered search functions is scanned.  If no
    codecs tuple is found, a LookupError is raised.  Otherwise, the
    codecs tuple is stored in the cache and returned to the caller.

    To query the Codec instance the following API should be used:

      codecs.lookup(encoding)

    This will either return the found codecs tuple or raise a
    LookupError.
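
    The search-function protocol sketched above survives in today's
    codecs module, with one change: search functions now return a
    codecs.CodecInfo object (or None) rather than a bare 4-tuple.  A
    minimal sketch, registering a hypothetical 'myascii' alias:

```python
import codecs

def search(name):
    # The registry hands us the encoding name already normalized:
    # lower case, hyphens and spaces converted to underscores.
    if name == 'myascii':              # hypothetical alias for this sketch
        return codecs.lookup('ascii')  # reuse the built-in ASCII codec
    return None                        # unknown: let other search
                                       # functions have a try

codecs.register(search)
assert 'abc'.encode('myascii') == b'abc'
assert codecs.lookup('myascii').name == 'ascii'
```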


Standard Codecs

    Standard codecs should live inside an encodings/ package directory
    in the Standard Python Code Library.  The __init__.py file of that
    directory should include a Codec Lookup compatible search function
    implementing a lazy module based codec lookup.

    Python should provide a few standard codecs for the most relevant
    encodings, e.g.

      'utf-8':              8-bit variable length encoding
      'utf-16':             16-bit variable length encoding (little/big endian)
      'utf-16-le':          utf-16 but explicitly little endian
      'utf-16-be':          utf-16 but explicitly big endian
      'ascii':              7-bit ASCII codepage
      'iso-8859-1':         ISO 8859-1 (Latin 1) codepage
      'unicode-escape':     See Unicode Constructors for a definition
      'raw-unicode-escape': See Unicode Constructors for a definition
      'native':             Dump of the Internal Format used by Python

    Common aliases should also be provided per default, e.g.
    'latin-1' for 'iso-8859-1'.

    Note: 'utf-16' should be implemented by using and requiring byte
    order marks (BOM) for file input/output.

    All other encodings such as the CJK ones to support Asian scripts
    should be implemented in separate packages which do not get
    included in the core Python distribution and are not a part of
    this proposal.
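
    Most of the encodings listed above exist in today's Python under
    the same names; a sketch of the BOM behavior of 'utf-16' versus
    the explicit-endian variants, and of the 'latin-1' alias:

```python
import codecs

# 'utf-16' prepends a byte order mark; the -le/-be variants do not.
data = 'a'.encode('utf-16')
assert data[:2] in (codecs.BOM_LE, codecs.BOM_BE)

assert 'a'.encode('utf-16-le') == b'a\x00'
assert 'a'.encode('utf-16-be') == b'\x00a'

# 'latin-1' resolves to the same codec as 'iso-8859-1'.
assert codecs.lookup('latin-1').name == codecs.lookup('iso-8859-1').name
```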


Codec Interface Definition

    The following base classes should be defined in the module
    "codecs".  They provide not only templates for use by encoding
    module implementors, but also define the interface which is
    expected by the Unicode implementation.

    Note that the Codec Interface defined here is well suited to a
    larger range of applications.  The Unicode implementation expects
    Unicode objects on input for .encode() and .write() and character
    buffer compatible objects on input for .decode().  Output of
    .encode() and .read() should be a Python string and .decode()
    must return a Unicode object.

    First, we have the stateless encoders/decoders.  These do not work
    in chunks as the stream codecs (see below) do, because all
    components are expected to be available in memory.

    class Codec:

        """Defines the interface for stateless encoders/decoders.

           The .encode()/.decode() methods may implement different
           error handling schemes by providing the errors argument.
           These string values are defined:

             'strict'  - raise an error (or a subclass)
             'ignore'  - ignore the character and continue with the next
             'replace' - replace with a suitable replacement character;
                         Python will use the official U+FFFD
                         REPLACEMENT CHARACTER for the builtin Unicode
                         codecs.
        """

        def encode(self,input,errors='strict'):

            """Encodes the object input and returns a tuple (output
               object, length consumed).

               errors defines the error handling to apply.  It
               defaults to 'strict' handling.

               The method may not store state in the Codec instance.
               Use StreamCodec for codecs which have to keep state in
               order to make encoding/decoding efficient.
            """

        def decode(self,input,errors='strict'):

            """Decodes the object input and returns a tuple (output
               object, length consumed).

               input must be an object which provides the
               bf_getreadbuf buffer slot.  Python strings, buffer
               objects and memory mapped files are examples of objects
               providing this slot.

               errors defines the error handling to apply.  It
               defaults to 'strict' handling.

               The method may not store state in the Codec instance.
               Use StreamCodec for codecs which have to keep state in
               order to make encoding/decoding efficient.

            """ 

    StreamWriter and StreamReader define the interface for stateful
    encoders/decoders which work on streams.  These allow processing
    of the data in chunks to efficiently use memory.  If you have
    large strings in memory, you may want to wrap them with cStringIO
    objects and then use these codecs on them to be able to do chunk
    processing as well, e.g. to provide progress information to the
    user.

    class StreamWriter(Codec):

        def __init__(self,stream,errors='strict'):

            """Creates a StreamWriter instance.

               stream must be a file-like object open for writing
               (binary) data.

               The StreamWriter may implement different error handling
               schemes by providing the errors keyword argument.
               These parameters are defined:

                 'strict' - raise a ValueError (or a subclass)
                 'ignore' - ignore the character and continue with the next
                 'replace' - replace with a suitable replacement character
            """
            self.stream = stream
            self.errors = errors

        def write(self,object):

            """Writes the object's contents encoded to self.stream.
            """
            data, consumed = self.encode(object,self.errors)
            self.stream.write(data)

        def writelines(self, list):

            """Writes the concatenated list of strings to the stream
               using .write().
            """
            self.write(''.join(list))

        def reset(self):

            """Flushes and resets the codec buffers used for keeping state.

               Calling this method should ensure that the data on the
               output is put into a clean state, that allows appending
               of new fresh data without having to rescan the whole
               stream to recover state.
            """
            pass

        def __getattr__(self,name, getattr=getattr):

            """Inherit all other methods from the underlying stream.
            """
            return getattr(self.stream,name)
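
```python
# The class above is usable today via codecs.getwriter(), which
# returns a StreamWriter subclass bound to an encoding; a sketch
# with an in-memory binary stream:
import codecs
import io

raw = io.BytesIO()
writer = codecs.getwriter('utf-8')(raw, errors='strict')

writer.write('abc\u1234')
writer.writelines(['x', 'y'])

# Unknown attributes (here: getvalue) are delegated to the
# underlying stream through __getattr__, as defined above.
assert writer.getvalue() == 'abc\u1234xy'.encode('utf-8')
```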


    class StreamReader(Codec):

        def __init__(self,stream,errors='strict'):

            """Creates a StreamReader instance.

               stream must be a file-like object open for reading
               (binary) data.

               The StreamReader may implement different error handling
               schemes by providing the errors keyword argument.
               These parameters are defined:

                 'strict' - raise a ValueError (or a subclass)
                 'ignore' - ignore the character and continue with the next
                 'replace' - replace with a suitable replacement character
            """
            self.stream = stream
            self.errors = errors

        def read(self,size=-1):

            """Decodes data from the stream self.stream and returns the
               resulting object.

               size indicates the approximate maximum number of bytes
               to read from the stream for decoding purposes.  The
               decoder can modify this setting as appropriate.  The
               default value -1 indicates to read and decode as much
               as possible.  size is intended to prevent having to
               decode huge files in one step.

               The method should use a greedy read strategy meaning
               that it should read as much data as is allowed within
               the definition of the encoding and the given size, e.g.
               if optional encoding endings or state markers are
               available on the stream, these should be read too.
            """
            # Unsliced reading:
            if size < 0:
                return self.decode(self.stream.read())[0]

            # Sliced reading:
            read = self.stream.read
            decode = self.decode
            data = read(size)
            i = 0
            while 1:
                try:
                    object, decodedbytes = decode(data)
                except ValueError,why:
                    # This method is slow but should work under pretty
                    # much all conditions; at most 10 tries are made
                    i = i + 1
                    newdata = read(1)
                    if not newdata or i > 10:
                        raise
                    data = data + newdata
                else:
                    return object

        def readline(self, size=None):

            """Read one line from the input stream and return the
               decoded data.

               Note: Unlike the .readlines() method, this method
               inherits the line breaking knowledge from the
               underlying stream's .readline() method -- there is
               currently no support for line breaking using the codec
               decoder due to lack of line buffering.  Subclasses
               should however, if possible, try to implement this
               method using their own knowledge of line breaking.

               size, if given, is passed as size argument to the
               stream's .readline() method.
            """
            if size is None:
                line = self.stream.readline()
            else:
                line = self.stream.readline(size)
            return self.decode(line)[0]

        def readlines(self, sizehint=None):

            """Read all lines available on the input stream
               and return them as list of lines.

               Line breaks are implemented using the codec's decoder
               method and are included in the list entries.

               sizehint, if given, is passed as size argument to the
               stream's .read() method.
            """
            if sizehint is None:
                data = self.stream.read()
            else:
                data = self.stream.read(sizehint)
            return self.decode(data)[0].splitlines(1)

        def reset(self):

            """Resets the codec buffers used for keeping state.

               Note that no stream repositioning should take place.
               This method is primarily intended to be able to recover
               from decoding errors.

            """
            pass

        def __getattr__(self,name, getattr=getattr):

            """ Inherit all other methods from the underlying stream.
            """
            return getattr(self.stream,name)
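
```python
# The reading side, via today's codecs.getreader(); the chunked
# .read() logic above is what lets a multi-byte encoding such as
# UTF-16 be decoded without loading malformed partial data:
import codecs
import io

raw = io.BytesIO('abc\ndef\n'.encode('utf-16'))
reader = codecs.getreader('utf-16')(raw)

assert reader.readline() == 'abc\n'
assert reader.read() == 'def\n'

reader.seek(0)  # repositions the underlying stream and resets state
assert reader.readlines() == ['abc\n', 'def\n']
```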


    Stream codec implementors are free to combine the StreamWriter and
    StreamReader interfaces into one class.  Even combining all these
    with the Codec class should be possible.

    Implementors are free to add additional methods to enhance the
    codec functionality or provide extra state information needed for
    them to work.  The internal codec implementation will only use the
    above interfaces, though.

    It is not required by the Unicode implementation to use these base
    classes, only the interfaces must match; this allows writing
    Codecs as extension types.

    As guideline, large mapping tables should be implemented using
    static C data in separate (shared) extension modules.  That way
    multiple processes can share the same data.

    A tool to auto-convert Unicode mapping files to mapping modules
    should be provided to simplify support for additional mappings
    (see References).


Whitespace

    The .split() method will have to know about what is considered
    whitespace in Unicode.
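
    This is how it turned out: str.split() with no separator splits
    on all characters with the Unicode whitespace property, not just
    the ASCII ones.  A sketch:

```python
# U+2003 (EM SPACE) and U+00A0 (NO-BREAK SPACE) are Unicode
# whitespace, so a bare split() breaks on them too.
assert 'a\u2003b\u00a0c'.split() == ['a', 'b', 'c']
assert '\u2003'.isspace()
```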


Case Conversion

    Case conversion is rather complicated with Unicode data, since
    there are many different conditions to respect.  See

      http://www.unicode.org/unicode/reports/tr13/ 

    for some guidelines on implementing case conversion.

    For Python, we should only implement the 1-1 conversions included
    in Unicode.  Locale dependent and other special case conversions
    (see the Unicode standard file SpecialCasing.txt) should be left
    to user land routines and not go into the core interpreter.

    The methods .capitalize() and .iscapitalized() should follow the
    case mapping algorithm defined in the above technical report as
    closely as possible.


Line Breaks

    Line breaking should be done for all Unicode characters having the
    B property as well as the combinations CRLF, CR, LF (interpreted
    in that order) and other special line separators defined by the
    standard.

    The Unicode type should provide a .splitlines() method which
    returns a list of lines according to the above specification. See
    Unicode Methods.
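
    The .splitlines() method as it exists today follows this
    specification; a sketch (the proposal's include_breaks argument
    became keepends):

```python
# splitlines() breaks on the Unicode line separators (here U+2028,
# LINE SEPARATOR) as well as CR, LF and CRLF.
assert 'a\u2028b\rc\r\nd'.splitlines() == ['a', 'b', 'c', 'd']

# With keepends (the proposal's include_breaks) the breaks are kept.
assert 'a\nb'.splitlines(True) == ['a\n', 'b']
```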


Unicode Character Properties

    A separate module "unicodedata" should provide a compact interface
    to all Unicode character properties defined in the standard's
    UnicodeData.txt file.

    Among other things, these properties provide ways to recognize
    numbers, digits, spaces, whitespace, etc.

    Since this module will have to provide access to all Unicode
    characters, it will eventually have to contain the data from
    UnicodeData.txt which takes up around 600kB.  For this reason,
    the data should be stored in static C data.  This enables
    compilation as a shared module which the underlying OS can share
    between processes (unlike normal Python code modules).

    There should be a standard Python interface for accessing this
    information so that other implementors can plug in their own
    possibly enhanced versions, e.g. ones that do decompressing of the
    data on-the-fly.
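
    The unicodedata module shipped with this interface; a sketch of
    the property lookups it provides:

```python
import unicodedata

# Category, name and numeric value lookups from UnicodeData.txt:
assert unicodedata.category('A') == 'Lu'     # Letter, uppercase
assert unicodedata.name('\u00e9') == 'LATIN SMALL LETTER E WITH ACUTE'
assert unicodedata.numeric('\u00bd') == 0.5  # VULGAR FRACTION ONE HALF
assert unicodedata.decimal('7') == 7
```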


Private Code Point Areas

    Support for these is left to user land Codecs and not explicitly
    integrated into the core.  Note that due to the Internal Format
    chosen, only the private use area between \uE000 and \uF8FF is
    usable for private encodings.


Internal Format

    The internal format for Unicode objects should use a Python
    specific fixed format <PythonUnicode> implemented as 'unsigned
    short' (or another unsigned numeric type having 16 bits).  Byte
    order is platform dependent.

    This format will hold UTF-16 encodings of the corresponding
    Unicode ordinals.  The Python Unicode implementation will address
    these values as if they were UCS-2 values. UCS-2 and UTF-16 are
    the same for all currently defined Unicode character points.
    UTF-16 without surrogates provides access to about 64k characters
    and covers all characters in the Basic Multilingual Plane (BMP) of
    Unicode.

    It is the Codec's responsibility to ensure that the data they pass
    to the Unicode object constructor respects this assumption.  The
    constructor does not check the data for Unicode compliance or use
    of surrogates.

    Future implementations can lift this restriction and support the
    full set of all UTF-16 addressable characters (around 1M
    characters).

    The Unicode API should provide interface routines from
    <PythonUnicode> to the compiler's wchar_t which can be 16 or 32
    bit depending on the compiler/libc/platform being used.

    Unicode objects should have a pointer to a cached Python string
    object <defenc> holding the object's value using the <default
    encoding>.  This is needed for performance and internal parsing
    (see Internal Argument Parsing) reasons.  The buffer is filled
    when the first conversion request to the <default encoding> is
    issued on the object.

    Interning is not needed (for now), since Python identifiers are
    defined as being ASCII only.

    codecs.BOM should return the byte order mark (BOM) for the format
    used internally.  The codecs module should provide the following
    additional constants for convenience and reference (codecs.BOM
    will either be BOM_BE or BOM_LE depending on the platform):

      BOM_BE: '\376\377' 
        (corresponds to Unicode U+0000FEFF in UTF-16 on big endian
         platforms == ZERO WIDTH NO-BREAK SPACE)

      BOM_LE: '\377\376' 
        (corresponds to Unicode U+0000FFFE in UTF-16 on little endian
         platforms == defined as being an illegal Unicode character)

      BOM4_BE: '\000\000\376\377'
        (corresponds to Unicode U+0000FEFF in UCS-4)

      BOM4_LE: '\377\376\000\000'
        (corresponds to Unicode U+0000FFFE in UCS-4)

    Note that Unicode sees big endian byte order as being "correct".
    The swapped order is taken to be an indicator for a "wrong"
    format, hence the illegal character definition.

    The configure script should provide aid in deciding whether Python
    can use the native wchar_t type or not (it has to be a 16-bit
    unsigned type).
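
    The BOM constants exist in today's codecs module as described,
    with the 4-byte forms renamed; a sketch:

```python
import codecs

# The 2-byte BOM constants, exactly as specified above:
assert codecs.BOM_BE == b'\xfe\xff'
assert codecs.BOM_LE == b'\xff\xfe'
assert codecs.BOM in (codecs.BOM_BE, codecs.BOM_LE)

# The proposal's BOM4_BE/BOM4_LE became BOM_UTF32_BE/BOM_UTF32_LE.
assert codecs.BOM_UTF32_BE == b'\x00\x00\xfe\xff'
assert codecs.BOM_UTF32_LE == b'\xff\xfe\x00\x00'
```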


Buffer Interface

    Implement the buffer interface using the <defenc> Python string
    object as basis for bf_getcharbuf and the internal buffer for
    bf_getreadbuf.  If bf_getcharbuf is requested and the <defenc>
    object does not yet exist, it is created first.

    Note that as a special case, the parser marker "s#" will not
    return raw Unicode UTF-16 data (which bf_getreadbuf returns), but
    instead tries to encode the Unicode object using the default
    encoding and then returns a pointer to the resulting string
    object (or raises an exception in case the conversion fails).
    This was done in order to prevent accidentally writing binary
    data to an output stream which the other end might not recognize.

    This has the advantage of being able to write to output streams
    (which typically use this interface) without additional
    specification of the encoding to use.

    If you need to access the read buffer interface of Unicode
    objects, use the PyObject_AsReadBuffer() interface.

    The internal format can also be accessed using the
    'unicode-internal' codec, e.g. via u.encode('unicode-internal').


Pickle/Marshalling

    Should have native Unicode object support.  The objects should be
    encoded using platform independent encodings.

    Marshal should use UTF-8 and Pickle should either choose
    Raw-Unicode-Escape (in text mode) or UTF-8 (in binary mode) as
    encoding.  Using UTF-8 instead of UTF-16 has the advantage of
    eliminating the need to store a BOM.


Regular Expressions

    Secret Labs AB is working on a Unicode-aware regular expression
    machinery.  It works on plain 8-bit, UCS-2, and (optionally) UCS-4
    internal character buffers.

    Also see

            http://www.unicode.org/unicode/reports/tr18/

    for some remarks on how to treat Unicode REs.


Formatting Markers

    Format markers are used in Python format strings.  If Python
    strings are used as format strings, the following interpretations
    should be in effect:

      '%s': For Unicode objects this will cause coercion of the
            whole format string to Unicode.  Note that you should use
            a Unicode format string to start with for performance
            reasons.

    In case the format string is a Unicode object, all parameters are
    coerced to Unicode first and then put together and formatted
    according to the format string.  Numbers are first converted to
    strings and then to Unicode.

      '%s': Python strings are interpreted as Unicode
            string using the <default encoding>.  Unicode objects are
            taken as is.

    All other string formatters should work accordingly.

    Example:

    u"%s %s" % (u"abc", "abc")  ==  u"abc abc"


Internal Argument Parsing

    These markers are used by the PyArg_ParseTuple() APIs:

      "U":  Check for Unicode object and return a pointer to it

      "s":  For Unicode objects: return a pointer to the object's
            <defenc> buffer (which uses the <default encoding>).

      "s#": Access to the default encoded version of the Unicode object
            (see Buffer Interface); note that the length relates to
            the length of the default encoded string rather than the
            Unicode object length.

      "t#": Same as "s#".

      "es": 
            Takes two parameters: encoding (const char *) and buffer
            (char **).

            The input object is first coerced to Unicode in the usual
            way and then encoded into a string using the given
            encoding.

            On output, a buffer of the needed size is allocated and
            returned through *buffer as a NULL-terminated string.
            The encoded string may not contain embedded NULL
            characters.  The caller is responsible for calling
            PyMem_Free() to free the allocated *buffer after usage.

      "es#":
            Takes three parameters: encoding (const char *), buffer
            (char **) and buffer_len (int *).

            The input object is first coerced to Unicode in the usual
            way and then encoded into a string using the given
            encoding.

            If *buffer is non-NULL, *buffer_len must be set to the
            size of the buffer it points to on input.  Output is then
            copied to *buffer.

            If *buffer is NULL, a buffer of the needed size is
            allocated and output copied into it.  *buffer is then
            updated to point to the allocated memory area.  The caller
            is responsible for calling PyMem_Free() to free the
            allocated *buffer after usage.

            In both cases *buffer_len is updated to the number of
            characters written (excluding the trailing NULL-byte).
            The output buffer is assured to be NULL-terminated.

    Examples:

    Using "es#" with auto-allocation:

        static PyObject *
        test_parser(PyObject *self,
                    PyObject *args)
        {
            PyObject *str;
            const char *encoding = "latin-1";
            char *buffer = NULL;
            int buffer_len = 0;

            if (!PyArg_ParseTuple(args, "es#:test_parser",
                                  encoding, &buffer, &buffer_len))
                return NULL;
            if (!buffer) {
                PyErr_SetString(PyExc_SystemError,
                                "buffer is NULL");
                return NULL;
            }
            str = PyString_FromStringAndSize(buffer, buffer_len);
            PyMem_Free(buffer);
            return str;
        }

    Using "es" with auto-allocation returning a NULL-terminated string:    

        static PyObject *
        test_parser(PyObject *self,
                    PyObject *args)
        {
            PyObject *str;
            const char *encoding = "latin-1";
            char *buffer = NULL;

            if (!PyArg_ParseTuple(args, "es:test_parser",
                                  encoding, &buffer))
                return NULL;
            if (!buffer) {
                PyErr_SetString(PyExc_SystemError,
                                "buffer is NULL");
                return NULL;
            }
            str = PyString_FromString(buffer);
            PyMem_Free(buffer);
            return str;
        }

    Using "es#" with a pre-allocated buffer:

        static PyObject *
        test_parser(PyObject *self,
                    PyObject *args)
        {
            PyObject *str;
            const char *encoding = "latin-1";
            char _buffer[10];
            char *buffer = _buffer;
            int buffer_len = sizeof(_buffer);

            if (!PyArg_ParseTuple(args, "es#:test_parser",
                                  encoding, &buffer, &buffer_len))
                return NULL;
            if (!buffer) {
                PyErr_SetString(PyExc_SystemError,
                                "buffer is NULL");
                return NULL;
            }
            str = PyString_FromStringAndSize(buffer, buffer_len);
            return str;
        }


File/Stream Output

    Since file.write(object) and most other stream writers use the
    "s#" or "t#" argument parsing marker for querying the data to
    write, the default encoded string version of the Unicode object
    will be written to the streams (see Buffer Interface).

    For explicit handling of files using Unicode, the standard stream
    codecs as available through the codecs module should be used.

    The codecs module should provide a short-cut
    open(filename,mode,encoding) which also assures that mode
    contains the 'b' character when needed.
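
    That shortcut exists in today's codecs module as codecs.open().  A
    minimal sketch of round-tripping Latin-1 text through it (the file
    name is chosen here only for illustration):

```python
import codecs
import os
import tempfile

# Write and read encoded text without touching raw bytes yourself.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with codecs.open(path, "w", encoding="latin-1") as f:
    f.write(u"caf\xe9")
with codecs.open(path, "r", encoding="latin-1") as f:
    data = f.read()
assert data == u"caf\xe9"
```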


File/Stream Input

    Only the user knows what encoding the input data uses, so no
    special magic is applied.  The user will have to explicitly
    convert the string data to Unicode objects as needed or use the
    file wrappers defined in the codecs module (see File/Stream
    Output).


Unicode Methods & Attributes

    All Python string methods, plus:

      .encode([encoding=<default encoding>][,errors="strict"]) 
         --> see Unicode Output

      .splitlines([include_breaks=0])
         --> breaks the Unicode string into a list of (Unicode) lines;
             returns the lines with line breaks included, if
             include_breaks is true.  See Line Breaks for a
             specification of how line breaking is done.
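
    In modern Python these survive as str.encode() and str.splitlines();
    the include_breaks flag became the keepends argument.  A short
    illustration:

```python
s = u"one\ntwo\r\nthree"

# Without keepends the line breaks are stripped...
assert s.splitlines() == [u"one", u"two", u"three"]
# ...with keepends (include_breaks) they are retained.
assert s.splitlines(True) == [u"one\n", u"two\r\n", u"three"]

# .encode() with an explicit encoding, as described under Unicode Output.
assert u"caf\xe9".encode("latin-1") == b"caf\xe9"
```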


Code Base

    We should use Fredrik Lundh's Unicode object implementation as a
    basis.  It already implements most of the string methods needed
    and provides a well-written code base which we can build upon.

    The object sharing implemented in Fredrik's implementation should
    be dropped.


Test Cases

    Test cases should follow those in Lib/test/test_string.py and
    include additional checks for the Codec Registry and the Standard
    Codecs.


References

    Unicode Consortium:
            http://www.unicode.org/

    Unicode FAQ:
            http://www.unicode.org/unicode/faq/

    Unicode 3.0:
            http://www.unicode.org/unicode/standard/versions/Unicode3.0.html

    Unicode-TechReports:
            http://www.unicode.org/unicode/reports/techreports.html

    Unicode-Mappings:
            ftp://ftp.unicode.org/Public/MAPPINGS/

    Introduction to Unicode (a little outdated but still nice to read):
            http://www.nada.kth.se/i18n/ucs/unicode-iso10646-oview.html

    For comparison:
            Introducing Unicode to ECMAScript (aka JavaScript) --
            http://www-4.ibm.com/software/developer/library/internationalization-support.html

    IANA Character Set Names:
            ftp://ftp.isi.edu/in-notes/iana/assignments/character-sets

    Discussion of UTF-8 and Unicode support for POSIX and Linux:
            http://www.cl.cam.ac.uk/~mgk25/unicode.html

    Encodings:

        Overview:
                http://czyborra.com/utf/

        UCS-2:
                http://www.uazone.com/multiling/unicode/ucs2.html

        UTF-7:
                Defined in RFC2152, e.g.
                http://www.uazone.com/multiling/ml-docs/rfc2152.txt

        UTF-8:
                Defined in RFC2279, e.g.
                http://info.internet.isi.edu/in-notes/rfc/files/rfc2279.txt

        UTF-16:
                http://www.uazone.com/multiling/unicode/wg2n1035.html


History of this Proposal

    [ed. note: revisions prior to 1.7 are available in the CVS history
     of Misc/unicode.txt from the standard Python distribution.  All
     subsequent history is available via the CVS revisions on this
     file.]

    1.7: Added note about the changed behaviour of "s#".
    1.6: Changed <defencstr> to <defenc> since this is the name used in the
         implementation.  Added notes about the usage of <defenc> in
         the buffer protocol implementation.
    1.5: Added notes about setting the <default encoding>.  Fixed some
         typos (thanks to Andrew Kuchling).  Changed <defencstr> to
         <utf8str>.
    1.4: Added note about mixed type comparisons and contains tests.
         Changed treating of Unicode objects in format strings (if
         used with '%s' % u they will now cause the format string to
         be coerced to Unicode, thus producing a Unicode object on
         return).  Added link to IANA charset names (thanks to Lars
         Marius Garshol).  Added new codec methods .readline(),
         .readlines() and .writelines().
    1.3: Added new "es" and "es#" parser markers
    1.2: Removed POD about codecs.open()
    1.1: Added note about comparisons and hash values.  Added note about
         case mapping algorithms.  Changed stream codecs .read() and
         .write() methods to match the standard file-like object
         methods (bytes consumed information is no longer returned by
         the methods)
    1.0: changed encode Codec method to be symmetric to the decode method
         (they both return (object, data consumed) now and thus become
         interchangeable); removed __init__ method of Codec class (the
         methods are stateless) and moved the errors argument down to
         the methods; made the Codec design more generic w/r to type
         of input and output objects; changed StreamWriter.flush to
         StreamWriter.reset in order to avoid overriding the stream's
         .flush() method; renamed .breaklines() to .splitlines();
         renamed the module unicodec to codecs; modified the File I/O
         section to refer to the stream codecs.
    0.9: changed errors keyword argument definition; added 'replace' error
         handling; changed the codec APIs to accept buffer like
         objects on input; some minor typo fixes; added Whitespace
         section and included references for Unicode characters that
         have the whitespace and the line break characteristic; added
         note that search functions can expect lower-case encoding
         names; dropped slicing and offsets in the codec APIs
    0.8: added encodings package and raw unicode escape encoding; untabified
         the proposal; added notes on Unicode format strings; added
         .breaklines() method
    0.7: added a whole new set of codec APIs; added a different
         encoder lookup scheme; fixed some names
    0.6: changed "s#" to "t#"; changed <defencbuf> to <defencstr> holding
         a real Python string object; changed Buffer Interface to
         delegate requests to <defencstr>'s buffer interface; removed
         the explicit reference to the unicodec.codecs dictionary (the
         module can implement this in way fit for the purpose);
         removed the settable default encoding; moved UnicodeError from
         unicodec to exceptions; "s#" now returns the internal data;
         passed the UCS-2/UTF-16 checking from the Unicode constructor
         to the Codecs
    0.5: moved sys.bom to unicodec.BOM; added sections on case mapping,
         private use encodings and Unicode character properties
    0.4: added Codec interface, notes on %-formatting, changed some encoding
         details, added comments on stream wrappers, fixed some
         discussion points (most important: Internal Format),
         clarified the 'unicode-escape' encoding, added encoding
         references
    0.3: added references, comments on codec modules, the internal format,
         bf_getcharbuffer and the RE engine; added 'unicode-escape'
         encoding proposed by Tim Peters and fixed repr(u) accordingly
    0.2: integrated Guido's suggestions, added stream codecs and file
         wrapping
    0.1: first version



pep-0101 Doing Python Releases 101

PEP: 101
Title: Doing Python Releases 101
Version: $Revision$
Last-Modified: $Date$
Author: Barry Warsaw <barry at python.org>, Guido van Rossum <guido at python.org>
Status: Active
Type: Informational
Created: 22-Aug-2001
Post-History: 

Abstract

    Making a Python release is a thrilling and crazy process.  You've heard
    the expression "herding cats"?  Imagine trying to also saddle those
    purring little creatures up, and ride them into town, with some of their
    buddies firmly attached to your bare back, anchored by newly sharpened
    claws.  At least they're cute, you remind yourself.

    Actually, no, that's a slight exaggeration <wink>.  The Python release
    process has steadily improved over the years and now, with the help of our
    amazing community, is really not too difficult.  This PEP attempts to
    collect, in one place, all the steps needed to make a Python release.  It
    is organized as a recipe and you can actually print this out and check
    items off as you complete them.

Things You'll Need

    As a release manager there are a lot of resources you'll need to access.
    Here's a hopefully-complete list.

    * A GPG key.

      Python releases are digitally signed with GPG; you'll need a key,
      which hopefully will be on the "web of trust" with at least one of
      the other release managers.

    * Access to ``dl-files.iad1.psf.io``, the server that hosts download files.
      You'll be uploading files directly here.

    * Shell access to ``hg.python.org``, the Python Mercurial host.  You'll
      have to adapt repository configuration there.

    * Write access to the PEP repository.

      If you're reading this, you probably already have this--the first
      task of any release manager is to draft the release schedule.  But
      in case you just signed up... sucker!  I mean, uh, congratulations!

    * Posting access to http://blog.python.org, a Blogger-hosted weblog.
      The RSS feed from this blog is used for the 'Python News' section
      on www.python.org.

    * A subscription to the super secret release manager mailing list, which may
      or may not be called ``python-cabal``. Bug Barry about this.

How to Make A Release

    Here are the steps taken to make a Python release.  Some steps are more
    fuzzy than others because there's little that can be automated (e.g.
    writing the NEWS entries).  Where a step is usually performed by An
    Expert, the role of that expert is given.  Otherwise, assume the step is
    done by the Release Manager (RM), the designated person performing the
    release.  The roles and their current experts are:

    * RM = Release Manager: Larry Hastings <larry@hastings.org> (US)
    * WE = Windows: Martin von Loewis <martin@v.loewis.de> (Central Europe) and Steve Dower <steve.dower@microsoft.com>
    * ME = Mac: Ned Deily <nad@acm.org> (US)
    * DE = Docs: Georg Brandl <georg@python.org> (Central Europe)
    * IE = Idle Expert: ??

    NOTE: It is highly recommended that the RM contact the Experts the day
          before the release.  Because the world is round and everyone lives
          in different timezones, the RM must ensure that the release tag is
          created in enough time for the Experts to cut binary releases.

          You should not make the release public (by updating the website and
          sending announcements) before all experts have updated their bits.
          In rare cases where the expert for Windows or Mac is MIA, you may add
          a message "(Platform) binaries will be provided shortly" and proceed.

    XXX: We should include a dependency graph to illustrate the steps that can
    be taken in parallel, or those that depend on other steps.

    As much as possible, the release steps are automated and guided by the
    release script, which is available in a separate repository:

        https://hg.python.org/release/

    We use the following conventions in the examples below.  Where a release
    number is given, it is of the form X.Y.ZaN, e.g. 3.3.0a3 for Python 3.3.0
    alpha 3, where "a" == alpha, "b" == beta, "rc" == release candidate.

    Release tags are named "vX.Y.ZaN".  The branch name for minor release
    maintenance branches is "X.Y".
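
    The convention can be captured in a small regular expression (a
    hypothetical helper, handy for sanity-checking a version string
    before tagging; not part of the release script itself):

```python
import re

# X.Y.Z with an optional a/b/rc suffix; release tags carry a leading "v".
VERSION_RE = re.compile(r"^v?(\d+)\.(\d+)\.(\d+)(?:(a|b|rc)(\d+))?$")

assert VERSION_RE.match("3.3.0a3").groups() == ("3", "3", "0", "a", "3")
assert VERSION_RE.match("v3.3.0rc1") is not None
assert VERSION_RE.match("3.3.0").group(4) is None  # a final release
```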

    The release script helps by performing several automatic editing
    steps, and guides you through the remaining manual editing steps.

  ___ Log into irc.freenode.net and join the #python-dev channel.

      You probably need to coordinate with other people around the world.
      This IRC channel is where we've arranged to meet.

  ___ Check to see if there are any showstopper bugs.

      Go to http://bugs.python.org and look for any open bugs that can block
      this release.  You're looking at the Priority of the open bugs for the
      release you're making; here are the relevant definitions:

      release blocker - Stops the release dead in its tracks.  You may not
                        make any release with any open release blocker bugs.

      deferred blocker - Doesn't block this release, but it will block a
                         future release.  You may not make a final or
                         candidate release with any open deferred blocker
                         bugs.

      critical - Important bugs that should be fixed, but which do not
                 block a release.

      Review the release blockers and either resolve them, bump them down to
      deferred, or stop the release and ask for community assistance.  If
      you're making a final or candidate release, do the same with any open
      deferred.
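
      The triage rules above can be summarized as a small decision
      function (an illustrative sketch, not part of any actual release
      tooling):

```python
def may_release(open_priorities, final_or_rc=False):
    """Apply the blocker rules to the priorities of open bugs."""
    if "release blocker" in open_priorities:
        return False   # stops the release dead in its tracks
    if final_or_rc and "deferred blocker" in open_priorities:
        return False   # blocks final and candidate releases only
    return True        # 'critical' and below never block

assert not may_release(["release blocker"])
assert may_release(["deferred blocker"])                  # alpha/beta is fine
assert not may_release(["deferred blocker"], final_or_rc=True)
assert may_release(["critical"], final_or_rc=True)
```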

  ___ Check the stable buildbots.

      Go to http://www.python.org/dev/buildbot/stable/

      (the trailing slash is required).  Look at the buildbots for the release
      you're making.  Ignore any that are offline (or inform the community so
      they can be restarted).  If what remains are (mostly) green buildbots,
      you're good to go.  If you have non-offline red buildbots, you may want
      to hold up the release until they are fixed.  Review the problems and
      use your judgement, taking into account whether you are making an alpha,
      beta, or final release.

  ___ Make a release clone.

      Create a local clone of the cpython repository (called the "release
      clone" from now on).

      Also clone the repo at http://hg.python.org/cpython using the
      server-side clone feature.  The name of the new clone should preferably
      have a "releasing/" prefix.  The other experts will use the release
      clone for making the binaries, so it is important that they have access
      to it!

      It's best to set up your local release clone to push to the remote
      release clone by default (by editing .hg/hgrc to that effect).

  ___ Notify all committers by sending email to python-committers@python.org.

      Since we're now working with a distributed version control system, there
      is no need to stop everyone from pushing to the main repo; you'll just
      work in your own clone.  Therefore, there won't be any checkin freezes.

      However, all committers should know the point at which your release
      clone was made, as later commits won't make it into the release without
      extra effort.

  ___ Make sure the current branch of your release clone is the branch you
      want to release from.

  ___ Check the docs for markup errors.

      cd to the Doc directory and run ``make suspicious``.  If any markup
      errors are found, fix them.

  ___ Regenerate Lib/pydoc-topics.py.

      While still in the Doc directory, run ``make pydoc-topics``.  Then copy
      ``build/pydoc-topics/topics.py`` to ``../Lib/pydoc_data/topics.py``.

  ___ Commit your changes to pydoc_topics.py
      (and any fixes you made in the docs).

  ___ Make sure the SOURCE_URI in ``Doc/tools/pyspecific.py``
      points to the right branch in the hg repository (or ``default`` for
      unstable releases of the default branch).

  ___ Bump version numbers via the release script.

      $ .../release/release.py --bump X.Y.ZaN

      This automates updating various release numbers, but you will have to
      modify a few files manually.  If your $EDITOR environment variable is
      set up correctly, release.py will pop up editor windows with the files
      you need to edit.

      It is important to update the Misc/NEWS file, however in recent years,
      this has become easier as the community is responsible for most of the
      content of this file.  You should only need to review the text for
      sanity, and update the release date with today's date.

  ___ Make sure all changes have been committed.  (``release.py --bump``
      doesn't check in its changes for you.)

  ___ Check the years on the copyright notice.  If the last release
      was some time last year, add the current year to the copyright
      notice in several places:

      ___ README
      ___ LICENSE (make sure to change on trunk and the branch)
      ___ Python/getcopyright.c
      ___ Doc/README.txt (at the end)
      ___ Doc/copyright.rst
      ___ Doc/license.rst
      ___ PC/python_nt.rc sets up the DLL version resource for Windows
          (displayed when you right-click on the DLL and select
          Properties).
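
      A tiny hypothetical helper for spotting notices that still lack
      the current year (the year formats vary between the files above,
      so treat hits only as candidates for manual review):

```python
import re

def needs_year_bump(line, year):
    """True if a copyright line does not mention the given year."""
    return str(year) not in re.findall(r"\d{4}", line)

assert needs_year_bump("Copyright (c) 2001-2014 PSF", 2015)
assert not needs_year_bump("Copyright (c) 2001-2015 PSF", 2015)
```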

  ___ Check with the IE (if there is one <wink>) to be sure that
      Lib/idlelib/NEWS.txt has been similarly updated.

  ___ For a final major release, edit the first paragraph of
      Doc/whatsnew/X.Y.rst to include the actual release date; e.g. "Python
      2.5 was released on August 1, 2003."  There's no need to edit this for
      alpha or beta releases.  Note that Andrew Kuchling often takes care of
      this.

  ___ Tag the release for X.Y.ZaN.

      $ .../release/release.py --tag X.Y.ZaN

      NOTE: when forward-merging (i.e. "null" merging) your changes to newer branches,
      e.g. 2.6 -> 2.7, do *not* revert your changes to the .hgtags file or you
      will not be able to run the --export command below.  Revert everything
      else but leave .hgtags alone.

  ___ If this is a final major release, branch the tree for X.Y.

      When making a major release (e.g., for 3.2), you must create the
      long-lived maintenance branch.

      ___ Note down the current revision ID of the tree.

          $ hg identify

      ___ First, set the original trunk up to be the next release.

          $ .../release/release.py --bump 3.3a0

          ___ Edit all version references in the README

          ___ Move any historical "what's new" entries from Misc/NEWS to
              Misc/HISTORY.

          ___ Edit Doc/tutorial/interpreter.rst (2 references to '[Pp]ython3x',
              one to 'Python 3.x', also make the date in the banner consistent).

          ___ Edit Doc/tutorial/stdlib.rst and Doc/tutorial/stdlib2.rst, which
              have each one reference to '[Pp]ython3x'.

          ___ Add a new whatsnew/3.x.rst file (with the comment near the top
              and the toplevel sections copied from the previous file) and
              add it to the toctree in whatsnew/index.rst.

          ___ Update the version number in configure.ac and re-run autoconf.

          ___ Update the version numbers for the Windows builds in PC/ and
              PCbuild/, which have references to python32.

              $ find PC/ PCbuild/ -type f | xargs sed -i 's/python32/python33/g'
              $ hg mv -f PC/os2emx/python32.def PC/os2emx/python33.def
              $ hg mv -f PC/python32stub.def PC/python33stub.def
              $ hg mv -f PC/python32gen.py PC/python33gen.py

          ___ Commit these changes to the default branch.

      ___ Now, go back to the previously noted revision and make the
          maintenance branch *from there*.

          $ hg update deadbeef    # revision ID noted down before
          $ hg branch 3.2

      ___ When you want to push back your new branch to the main CPython
          repository, the new branch name must be added to the "allow-branches"
          hook configuration, which protects against stray named branches being
          pushed.  Login to hg.python.org and edit (as the "hg" user)
          ``/data/hg/repos/cpython/.hg/hgrc`` to that effect.

  ___ For a final major release, Doc/tools/static/version_switch.js
      must be updated in all maintained branches, so that the new maintenance
      branch is not "dev" anymore and there is a new "dev" version.

  ___ Push your commits to the remote release clone.

      $ hg push ssh://hg.python.org/releasing/...

  ___ Notify the experts that they can start building binaries.

  ___ STOP STOP STOP STOP STOP STOP STOP STOP

      At this point you must receive the "green light" from other experts in
      order to create the release.  There are things you can do while you wait
      though, so keep reading until you hit the next STOP.

  ___ The WE builds the Windows helpfile, using (in Doc/)

      > make.bat htmlhelp   (on Windows)

      to create suitable input for HTML Help Workshop in build/htmlhelp. HTML
      Help Workshop is then fired up on the created python33.hhp file, finally
      resulting in a python33.chm file.

  ___ The WE then generates Windows installer files for each Windows
      target architecture (for Python 3.3, this means x86 and AMD64).

      - He has one checkout tree per target architecture, and builds the
        pcbuild.sln project for the appropriate architecture.

      - PC\icons.mak must have been run with nmake.

      - The cmd.exe window in which this is run must have Cygwin/bin in its
        path (at least for x86).

      - The cmd.exe window must have MS compiler tools for the target
        architecture in its path (VS 2010 for Python 3.3).

      - The WE then edits Tools/msi/config.py (a file only present locally)
        to update full_current_version and sets snapshot to false.  Currently
        for a release config.py looks like

            snapshot=0
            full_current_version="3.3.5rc2"
            certname="Python Software Foundation"
            PCBUILD='PCbuild\\amd64'

        The last line is only present for the amd64 checkout.

      - Now he runs msi.py with ActivePython or Python with pywin32.

      The WE checksums the files (*.msi, *.chm, *-pdb.zip), uploads them to
      dl-files.iad1.psf.io together with gpg signature files, and emails you the
      location and md5sums.

  ___ The ME builds Mac installer packages and uploads them to
      dl-files.iad1.psf.io together with gpg signature files.

  ___ Time to build the source tarball.  Be sure to update your clone to the
      correct branch.  E.g.

      $ hg update 3.2

  ___ Do a "hg status" in this directory.

      You should not see any files.  I.e. you better not have any uncommitted
      changes in your working directory.

  ___ Make sure you have virtualenv installed (for Python 2.7). The release
      script installs Sphinx in a virtualenv when building the docs.

      For building the PDF docs, you also need a fairly complete installation
      of a recent TeX distribution such as texlive.

  ___ Use the release script to create the source gzip and xz tarballs,
      documentation tar and zip files, and gpg signature files.

      $ .../release/release.py --export X.Y.ZaN

      This can take a while for final releases, and it will leave all the
      tarballs and signatures in a subdirectory called 'X.Y.ZaN/src', and the
      built docs in 'X.Y.ZaN/docs' (for final releases).

  ___ scp or rsync all the files to your home directory on dl-files.iad1.psf.io.

      While you're waiting for the files to finish uploading, you can continue
      on with the remaining tasks.  You can also ask folks on #python-dev
      and/or python-committers to download the files as they finish uploading
      so that they can test them on their platforms as well.

  ___ Now you want to perform the very important step of checking the
      tarball you just created, to make sure a completely clean,
      virgin build passes the regression test.  Here are the best
      steps to take:

      $ cd /tmp
      $ tar xvf ~/Python-3.2rc2.tgz
      $ cd Python-3.2rc2
      $ ls
      (Do things look reasonable?)
      $ ls Lib
      (Are there stray .pyc files?)
      $ ./configure
      (Loads of configure output)
      $ make test
      (Do all the expected tests pass?)

      If you're feeling lucky and have some time to kill, or if you are making
      a release candidate or final release, run the full test suite:

      $ make testall

      If the tests pass, then you can feel good that the tarball is
      fine.  If some of the tests fail, or anything else about the
      freshly unpacked directory looks weird, you better stop now and
      figure out what the problem is.

  ___ Now you need to go to dl-files.iad1.psf.io and move all the files in place
      over there.  Our policy is that every Python version gets its own
      directory, but each directory contains all releases of that version.

      ___ On dl-files.iad1.psf.io, cd /srv/www.python.org/ftp/python/X.Y.Z
          creating it if necessary.  Make sure it is owned by group 'downloads'
          and group-writable.

      ___ Move the release .tgz, and .tar.xz files into place, as well as the
          .asc GPG signature files.  The Win/Mac binaries are usually put there
          by the experts themselves.

          Make sure they are world readable.  They should also be group
          writable, and group-owned by downloads.

      ___ Use ``gpg --verify`` to make sure they got uploaded intact.

      ___ If this is a final release: Move the doc zips and tarballs to
          /srv/www.python.org/ftp/python/doc/X.Y.Z, creating the directory
          if necessary, and adapt the "current" symlink in .../doc to point to
          that directory.  Note though that if you're releasing a maintenance
          release for an older version, don't change the current link.

      ___ If this is a final release (even a maintenance release), also unpack
          the HTML docs to /srv/docs.python.org/release/X.Y.Z on
          docs.iad1.psf.io. Make sure the files are in group "docs".  If it is a
          release of a security-fix-only version, tell the DE to build a version
          with the "version switcher" and put it there.

      ___ Let the DE check if the docs are built and work all right.

      ___ If this is a final major release: Tell the DE to adapt redirects for
          docs.python.org/X.Y in the Apache config for docs.python.org, update
          the script Doc/tools/dailybuild.py to point to the right
          stable/development branches, and to install it and make the initial
          checkout.  The Doc's version_switcher.js script also needs to be
          updated.  In general, please don't touch things in the toplevel
          /srv/docs.python.org/ directory unless you know what you're doing.

      ___ Note both the documentation and downloads are behind a caching CDN. If
          you change archives after downloading them through the website, you'll
          need to purge the stale data in the CDN like this:

          $ curl -X PURGE https://www.python.org/ftp/python/2.7.5/Python-2.7.5.tar.xz

  ___ For the extra paranoid, do a completely clean test of the release.
      This includes downloading the tarball from www.python.org.

      Make sure the md5 checksums match.  Then unpack the tarball,
      and do a clean make test.

      $ make distclean
      $ ./configure
      $ make test

      To ensure that the regression test suite passes.  If not, you
      screwed up somewhere!
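
      Comparing checksums is easy to script; a minimal sketch using
      hashlib (MD5 matching what the download pages historically
      published):

```python
import hashlib
import os
import tempfile

def md5sum(path, chunk_size=1 << 16):
    """Stream a file through MD5 and return the hex digest."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()

# Self-check against a known digest.
blob = os.path.join(tempfile.mkdtemp(), "blob")
with open(blob, "wb") as f:
    f.write(b"hello")
assert md5sum(blob) == "5d41402abc4b2a76b9719d911017c592"
```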

  Now it's time to twiddle the web site.

  To do these steps, you must have the permission to edit the website.  If you
  don't have that, ask someone on pydotorg@python.org for the proper
  permissions.  It's insane for you not to have it.

  XXX This is completely out of date for Django based python.org.

  This page will probably come in handy:

  http://docutils.sourceforge.net/docs/user/rst/quickref.html

  None of the web site updates are automated by release.py.

  ___ Build the basic site.

      In the top directory, do an `svn update` to get the latest code.  In the
      build subdirectory, do `make` to build the site.  Do `make serve` to
      start serving the pages on localhost:8005.  Hit that url to see the site
      as it is right now.  At any time you can re-run `make` to update the
      local site.  You don't have to restart the server.

      Don't `svn commit` until you're all done!

  ___ If this is the first release for this version (even a new patch
      version), you'll need to create a subdirectory inside download/releases
      to hold the new version files.  It's probably a good idea to copy an
      existing recent directory and twiddle the files in there for the new
      version number.

  ___ Update the version specific pages.

      ___ cd to `download/releases/X.Y.Z`
      ___ Edit the version numbers in content.ht
      ___ Update the md5 checksums

      ___ Comment out the "This is a preview release" or the "This is a
          production release" paragraph as appropriate

      Note, you don't have to copy any release files into this directory;
      they only live on dl-files.iad1.psf.io in the ftp directory.

  ___ Edit `download/releases/content.ht` to update the version numbers for
      this release.  There are a bunch of places you need to touch:

      ___ The subdirectory name as the first element in the Nav rows.
      ___ Possibly the Releases section, and possibly in the experimental
          releases section if this is an alpha, beta or release candidate.

  ___ Update the download page, editing `download/content.ht`.  Pre-releases are
      added only to the "Testing versions" list.

  ___ If this is a final release...

      ___ Update the 'Quick Links' section on the front page.  Edit the
          top-level `content.ht` file.

      ___ For X.Y.Z, edit all the previous X.Y releases' content.ht page to
          point to the new release.

      ___ Update `doc/content.ht` to indicate the new current documentation
          version, and remove the current version from any 'in development'
          section. Update the version in the "What's New" link.

      ___ Add the new version to `doc/versions/content.ht`.

  ___ Add a news section item to the front page by editing newsindex.yml.  The
      format should be self-evident.

  ___ When everything looks good, `svn commit` in the data directory.  This
      will trigger the live site to update itself, and at that point the
      release is live.

  ___ If this is a final release, create a new python.org/X.Y Apache alias
      (or ask pydotorg to do so for you).

  Now it's time to write the announcement for the mailing lists.  This is the
  fuzzy bit because not much can be automated.  You can use an earlier
  announcement as a template, but edit it for content!

  ___ STOP STOP STOP STOP STOP STOP STOP STOP

      ___ Have you gotten the green light from the WE?

      ___ Have you gotten the green light from the DE?


  ___ Once the announcement is ready, send it to the following
      addresses:

      python-list@python.org
      python-announce@python.org
      python-dev@python.org

  ___ Also post the announcement to `The Python Insider blog
      <http://blog.python.org>`_.  To add a new entry, go to
      `your Blogger home page, here. <https://www.blogger.com/home>`_

  Now it's time to do some cleaning up.  These steps are very important!

  ___ Do the guided post-release steps with the release script.

      $ .../release/release.py --done X.Y.ZaN

      Review and commit these changes.

  ___ Merge your release clone into the main development repo:

      $ cd ../cpython                         # your clone of the main repo
      $ hg pull ssh://hg.python.org/cpython   # update from remote first
      $ hg pull ../cpython-releaseX.Y         # now pull from release clone

      Now merge your release clone's changes in every branch you touched
      (usually only one, except if you made a new maintenance release).
      Easily resolvable conflicts may appear in Misc/NEWS.

  ___ If releasing from other than the default branch, remember to carefully
      merge any touched branches with higher level branches, up to default.  For
      example:

      $ hg update -C default
      $ hg resolve --list
      $ hg merge --tool "internal:fail" 3.4

      ... here, revert changes that are not relevant for the default branch...

      $ hg resolve --mark

  ___ Commit and push to the main repo.

  ___ You can delete the remote release clone, or simply reuse it for the next
      release.

  ___ Send email to python-committers informing them that the release has been
      published.

  ___ Update any release PEPs (e.g. 361) with the release dates.

  ___ Update the tracker at http://bugs.python.org:

      ___ Flip all the deferred blocker issues back to release blocker
          for the next release.

      ___ Add version X.Y+1 when version X.Y enters alpha.

      ___ Change non-doc RFEs to version X.Y+1 when version X.Y enters beta.

      ___ Update 'behavior' issues from versions that your release makes
          unsupported to the next supported version.

      ___ Review open issues, as this might find lurking showstopper bugs,
          besides reminding people to fix the easy ones they forgot about.


What Next?

  ___ Verify!  Pretend you're a user: download the files from python.org, and
      build Python from them.  This step is too easy to overlook, and on several
      occasions we've had useless release files.  Once a general server problem
      caused mysterious corruption of all files; once the source tarball got
      built incorrectly; more than once the file upload process on SF truncated
      files; and so on.

  ___ Rejoice.  Drink.  Be Merry.  Write a PEP like this one.  Or be
      like unto Guido and take A Vacation.

  You've just made a Python release!


Windows Notes

    Windows has an MSI installer, various flavors of Windows have
    "special limitations", and the Windows installer also packs
    precompiled "foreign" binaries (Tcl/Tk, expat, etc).  So Windows
    testing is tiresome but very necessary.

    Concurrent with uploading the installer, the WE installs Python
    from it twice: once into the default directory suggested by the
    installer, and later into a directory with embedded spaces in its
    name.  For each installation, he runs the full regression suite
    from a DOS box, both with and without -O.  For maintenance
    releases, he also tests whether upgrade installations succeed.

    He also tries *every* shortcut created under Start -> Menu -> the
    Python group.  When trying IDLE this way, you need to verify that
    Help -> Python Documentation works.  When trying pydoc this way
    (the "Module Docs" Start menu entry), make sure the "Start
    Browser" button works, and make sure you can search for a random
    module (like "random" <wink>) and then that the "go to selected"
    button works.

    It's amazing how much can go wrong here -- and even more amazing
    how often last-second checkins break one of these things.  If
    you're "the Windows geek", keep in mind that you're likely the
    only person routinely testing on Windows, and that Windows is
    simply a mess.

    Repeat the testing for each target architecture.  Try both an
    Admin and a plain User (not Power User) account.


Copyright

    This document has been placed in the public domain.



pep-0102 Doing Python Micro Releases

PEP: 102
Title: Doing Python Micro Releases
Version: $Revision$
Last-Modified: $Date$
Author: Anthony Baxter <anthony at interlink.com.au>, Barry Warsaw <barry at python.org>, Guido van Rossum <guido at python.org>
Status: Superseded
Type: Informational
Created: 22-Aug-2001 (edited down on 9-Jan-2002 to become PEP 102)
Post-History: 
Superseded-By: 101

Replacement Note

    Although the size of the to-do list in this PEP is much less scary
    than that in PEP 101, it turns out not to be enough justification
    for the duplication of information, and with it, the danger of one
    of the copies becoming out of date.  Therefore, this PEP is not
    maintained anymore, and micro releases are fully covered by PEP 101.


Abstract

    Making a Python release is an arduous process that takes a
    minimum of half a day's work even for an experienced releaser.
    Until recently, most -- if not all -- of that burden was borne by
    Guido himself.  But several recent releases have been performed by
    other folks, so this PEP attempts to collect, in one place, all
    the steps needed to make a Python bugfix release.

    The major Python release process is covered in PEP 101 - this PEP
    is just PEP 101, trimmed down to only include the bits that are
    relevant for micro releases, a.k.a. patch or bugfix releases.

    It is organized as a recipe and you can actually print this out and 
    check items off as you complete them.


How to Make A Release

    Here are the steps taken to make a Python release.  Some steps are
    more fuzzy than others because there's little that can be
    automated (e.g. writing the NEWS entries).  Where a step is
    usually performed by An Expert, the name of that expert is given.
    Otherwise, assume the step is done by the Release Manager (RM),
    the designated person performing the release.  Almost every place
    the RM is mentioned below, this step can also be done by the BDFL
    of course!

    XXX: We should include a dependency graph to illustrate the steps
    that can be taken in parallel, or those that depend on other
    steps.

    We use the following conventions in the examples below.  Where a
    release number is given, it is of the form X.Y.MaA, e.g. 2.1.2c1
    for Python 2.1.2 release candidate 1, where "a" == alpha, "b" ==
    beta, "c" == release candidate.  Final releases are tagged with
    "releaseXYZ" in CVS.  The micro releases are made from the
    maintenance branch of the major release, e.g. Python 2.1.2 is made
    from the release21-maint branch.
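    As a rough illustration (not part of the actual release tooling),
    the X.Y.MaN convention above can be captured in a short Python
    sketch; the pattern and helper name here are ours:

```python
import re

# Matches version strings like "2.1.2c1" or "2.1.2", following the
# X.Y.MaN convention above ("a" == alpha, "b" == beta, "c" == release
# candidate); no level/serial suffix means a final release.
VERSION_RE = re.compile(
    r"^(?P<major>\d+)\.(?P<minor>\d+)\.(?P<micro>\d+)"
    r"(?:(?P<level>[abc])(?P<serial>\d+))?$"
)

def parse_version(s):
    """Return a dict describing the release, or None if malformed."""
    m = VERSION_RE.match(s)
    if m is None:
        return None
    parts = m.groupdict()
    # A missing level/serial means a final release.
    parts["final"] = parts["level"] is None
    return parts
```

    For example, parse_version("2.1.2c1") reports level "c" and
    serial "1", while parse_version("2.1.2") is flagged as final.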

  ___ Send an email to python-dev@python.org indicating the release is 
      about to start.

  ___ Put a freeze on checkins into the maintenance branch.  At this
      point, nobody except the RM should make any commits to the branch 
      (or his duly assigned agents, i.e. Guido the BDFL, Fred Drake for
      documentation, or Thomas Heller for Windows).  If the RM screwed up
      and some desperate last minute change to the branch is
      necessary, it can mean extra work for Fred and Thomas.  So try to
      avoid this!

  ___ On the branch, change Include/patchlevel.h in two places, to
      reflect the new version number you've just created.  You'll want
      to change the PY_VERSION macro, and one or several of the
      version subpart macros just above PY_VERSION, as appropriate.
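      The PY_VERSION edit is mechanical enough to sketch in Python
      (the helper name is ours, and a real release must also edit the
      numeric subpart macros by hand):

```python
import re

def bump_py_version(header_text, new_version):
    """Rewrite the PY_VERSION macro in patchlevel.h-style header text.

    Only PY_VERSION is rewritten here; the version subpart macros
    just above it would need matching edits in a real release.
    """
    return re.sub(
        r'(#define\s+PY_VERSION\s+)"[^"]*"',
        lambda m: m.group(1) + '"' + new_version + '"',
        header_text,
    )
```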

  ___ Change the "%define version" line of Misc/RPM/python-2.3.spec to the
      same string as PY_VERSION was changed to above.  E.g:

      %define version 2.3.1

      You also probably want to reset the %define release line
      to '1pydotorg' if it's not already that.

  ___ If you're changing the version number for Python (e.g. from
      Python 2.1.1 to Python 2.1.2), you also need to update the
      README file, which has a big banner at the top proclaiming its
      identity.  Don't do this if you're just releasing a new alpha or
      beta release, but /do/ do this if you're releasing a new micro,
      minor or major release.

  ___ The LICENSE file also needs to be changed, due to several
      references to the release number.  As with the README file, these
      changes are necessary for a new micro, minor or major release.

      The LICENSE file contains a table that describes the legal
      heritage of Python; you should add an entry for the X.Y.Z
      release you are now making.  You should update this table in the
      LICENSE file on the CVS trunk too.

  ___ When the year changes, copyright legends need to be updated in
      many places, including the README and LICENSE files.

  ___ For the Windows build, additional files have to be updated.

      PCbuild/BUILDno.txt contains the Windows build number; see the
      instructions in this file for how to change it.  Saving the project
      file PCbuild/pythoncore.dsp results in a change to
      PCbuild/pythoncore.dsp as well.

      PCbuild/python20.wse sets up the Windows installer version
      resource (displayed when you right-click on the installer .exe
      and select Properties), and also contains the Python version
      number.

      (Before version 2.3.2, it was required to manually edit
      PC/python_nt.rc, this step is now automated by the build
      process.)

  ___ After starting the process, the most important thing to do next
      is to update the Misc/NEWS file.  Thomas will need this in order to
      do the Windows release and he likes to stay up late.  This step
      can be pretty tedious, so it's best to get to it immediately
      after making the branch, or even before you've made the branch.
      The sooner the better (but again, watch for new checkins up
      until the release is made!)

      Add high level items new to this release.  E.g. if we're
      releasing 2.2a3, there must be a section at the top of the file
      explaining "What's new in Python 2.2a3".  It will be followed by
      a section entitled "What's new in Python 2.2a2".

      Note that you /hope/ that as developers add new features to the
      trunk, they've updated the NEWS file accordingly.  You can't be
      positive, so double check.  If you're a Unix weenie, it helps to
      verify with Thomas about changes on Windows, and Jack Jansen
      about changes on the Mac.

      This command should help you (but substitute the correct -r tag!):

      % cvs log -rr22a1: | python Tools/scripts/logmerge.py > /tmp/news.txt

      IOW, you're printing out all the cvs log entries from the
      previous release until now.  You can then troll through the
      news.txt file looking for interesting things to add to NEWS.

  ___ Check your NEWS changes into the maintenance branch.  It's easy
      to forget to update the release date in this file!

  ___ Check in any changes to IDLE's NEWS.txt.  Update the header in
      Lib/idlelib/NEWS.txt to reflect its release version and date.
      Update the IDLE version in Lib/idlelib/idlever.py to match.

  ___ Once the release process has started, the documentation needs to
      be built and posted on python.org according to the instructions
      in PEP 101.

      Note that Fred is responsible both for merging doc changes from
      the trunk to the branch AND for merging any branch changes from
      the branch to the trunk during the cleaning up phase.
      Basically, if it's in Doc/ Fred will take care of it.

  ___ Thomas compiles everything with MSVC 6.0 SP5, and moves the
      python23.chm file into the src/chm directory.  The installer
      executable is then generated with Wise Installation System.

      The installer includes the MSVC 6.0 runtime in the files
      MSVCRT.DLL and MSVCIRT.DLL.  It leads to disaster if these files
      are taken from the system directory of the machine where the
      installer is built; instead, make absolutely sure that
      these files come from the VCREDIST.EXE redistributable package
      contained in the MSVC SP5 CD.  VCREDIST.EXE must be unpacked
      with winzip, and the Wise Installation System prompts for the
      directory.

      After building the installer, it should be opened with winzip,
      and the MS DLLs extracted again and checked for the same version
      number as those unpacked from VCREDIST.EXE.

      Thomas uploads this file to the starship.  He then sends the RM
      a notice which includes the location and MD5 checksum of the
      Windows executable.

      Note that Thomas's creation of the Windows executable may generate
      a few more commits on the branch.  Thomas will be responsible for
      merging Windows-specific changes from trunk to branch, and from
      branch to trunk.

  ___ Sean performs his Red Hat magic, generating a set of RPMs.  He 
      uploads these files to python.org.  He then sends the RM a notice 
      which includes the location and MD5 checksum of the RPMs.

  ___ It's Build Time!

      Now, you're ready to build the source tarball.  First cd to your
      working directory for the branch.  E.g.
      % cd .../python-22a3

  ___ Do a "cvs update" in this directory.  Do NOT include the -A flag!

      You should not see any "M" files, but you may see several "P"
      and/or "U" files.  I.e. you better not have any uncommitted
      changes in your working directory, but you may pick up some of
      Fred's or Thomas's last minute changes.

  ___ Now tag the branch using a symbolic name like "rXYMaZ",
      e.g. r212
      % cvs tag r212

      Be sure to tag only the python/dist/src subdirectory of the
      Python CVS tree!

  ___ Change to a neutral directory, i.e. one in which you can do a
      fresh, virgin, cvs export of the branch.  You will be creating a
      new directory at this location, to be named "Python-X.Y.M".  Do
      a CVS export of the tagged branch.

      % cd ~
      % cvs -d cvs.sf.net:/cvsroot/python export -rr212 \
                            -d Python-2.1.2 python/dist/src

  ___ Generate the tarball.  Note that we're not using the `z' option
      on the tar command because 1) that's only supported by GNU tar
      as far as we know, and 2) we're going to max out the compression
      level, which isn't a supported option.  We generate both tar.gz
      and tar.bz2 formats, as the latter is about 1/6th smaller.

      % tar -cf - Python-2.1.2 | gzip -9 > Python-2.1.2.tgz
      % tar -cf - Python-2.1.2 | bzip2 -9 > Python-2.1.2.tar.bz2

  ___ Calculate the MD5 checksum of the tgz and tar.bz2 files you 
      just created
      % md5sum Python-2.1.2.tgz

      Note that if you don't have the md5sum program, there is a
      Python replacement in the Tools/scripts/md5sum.py file.

  ___ Create GPG keys for each of the files.

      % gpg -ba Python-2.1.2.tgz
      % gpg -ba Python-2.1.2.tar.bz2
      % gpg -ba Python-2.1.2.exe

  ___ Now you want to perform the very important step of checking the
      tarball you just created, to make sure a completely clean,
      virgin build passes the regression test.  Here are the best
      steps to take:

      % cd /tmp
      % tar zxvf ~/Python-2.1.2.tgz
      % cd Python-2.1.2
      % ls
      (Do things look reasonable?)
      % ./configure
      (Loads of configure output)
      % make test
      (Do all the expected tests pass?)

      If the tests pass, then you can feel good that the tarball is
      fine.  If some of the tests fail, or anything else about the
      freshly unpacked directory looks weird, you better stop now and
      figure out what the problem is.

  ___ You need to upload the tgz and the exe file to creosote.python.org.
      This step can take a long time depending on your network
      bandwidth.  scp both files from your own machine to creosote.

  ___ While you're waiting, you can start twiddling the web pages to
      include the announcement.

    ___ In the top of the python.org web site CVS tree, create a
        subdirectory for the X.Y.Z release.  You can actually copy an
        earlier patch release's subdirectory, but be sure to delete
        the X.Y.Z/CVS directory and "cvs add X.Y.Z", for example:

        % cd .../pydotorg
        % cp -r 2.2.2 2.2.3
        % rm -rf 2.2.3/CVS
        % cvs add 2.2.3
        % cd 2.2.3

    ___ Edit the files for content: usually you can globally replace
        X.Ya(Z-1) with X.YaZ.  However, you'll need to think about the
        "What's New?" section.
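        The global replace itself is trivial to script; here is a
        hedged Python sketch (the function name is ours):

```python
import re

def replace_release(text, old, new):
    """Replace every occurrence of the previous release string
    (e.g. "2.2a2") with the new one (e.g. "2.2a3").  re.escape
    keeps the dots in version numbers from matching any character."""
    return re.sub(re.escape(old), new, text)
```

        The "What's New?" section still needs a human pass afterwards.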

    ___ Copy the Misc/NEWS file to NEWS.txt in the X.Y.Z directory for
        python.org; this contains the "full scoop" of changes to
        Python since the previous release for this version of Python.

    ___ Copy the .asc GPG signatures you created earlier here as well.

    ___ Also, update the MD5 checksums.

    ___ Preview the web page by doing a "make" or "make install" (as
        long as you've created a new directory for this release!)

    ___ Similarly, edit the ../index.ht file, i.e. the python.org home
        page.  In the Big Blue Announcement Block, move the paragraph
        for the new version up to the top and boldify the phrase
        "Python X.YaZ is out".  Edit for content, and preview locally,
        but do NOT do a "make install" yet!

  ___ Now we're waiting for the scp to creosote to finish.  Da de da,
      da de dum, hmm, hmm, dum de dum.

  ___ Once that's done you need to go to creosote.python.org and move
      all the files in place over there.  Our policy is that every
      Python version gets its own directory, but each directory may
      contain several releases.  We keep all old releases, moving them
      into a "prev" subdirectory when we have a new release.

      So, there's a directory called "2.2" which contains
      Python-2.2a2.exe and Python-2.2a2.tgz, along with a "prev"
      subdirectory containing Python-2.2a1.exe and Python-2.2a1.tgz.

      So...

    ___ On creosote, cd to ~ftp/pub/python/X.Y creating it if
        necessary.

    ___ Move the previous release files to a directory called "prev"
        creating the directory if necessary (make sure the directory
        has g+ws bits on).  If this is the first alpha release of a
        new Python version, skip this step.

    ___ Move the .tgz file and the .exe file to this directory.  Make
        sure they are world readable.  They should also be group
        writable, and group-owned by webmaster.

    ___ md5sum the files and make sure they got uploaded intact.


  ___ Update the X.Y/bugs.ht file if necessary.  It is best to get
      BDFL input for this step.

  ___ Go up to the parent directory (i.e. the root of the web page
      hierarchy) and do a "make install" there.  Your release is now
      live!

  ___ Now it's time to write the announcement for the mailing lists.
      This is the fuzzy bit because not much can be automated.  You
      can use one of Guido's earlier announcements as a template, but
      please edit it for content!

      Once the announcement is ready, send it to the following
      addresses:

      python-list@python.org
      python-announce@python.org
      python-dev@python.org

  ___ Send a SourceForge News Item about the release.  From the
      project's "menu bar", select the "News" link; once in News,
      select the "Submit" link.  Type a suitable subject (e.g. "Python
      2.2c1 released" :-) in the Subject box, add some text to the
      Details box (at the very least including the release URL at
      www.python.org and the fact that you're happy with the release)
      and click the SUBMIT button.

      Feel free to remove any old news items.

    Now it's time to do some cleanup.  These steps are very important!

  ___ Edit the file Include/patchlevel.h so that the PY_VERSION
      string says something like "X.YaZ+".  Note the trailing `+'
      indicating that the trunk is going to be moving forward with
      development.  E.g. the line should look like:

      #define PY_VERSION              "2.1.2+"

      Make sure that the other PY_ version macros contain the
      correct values.  Commit this change.
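      A trivial sanity check for this edit, as a throwaway sketch
      (the helper name is ours):

```python
def is_trunk_version(py_version):
    """True if a PY_VERSION string carries the trailing '+' that
    marks the trunk as moving forward with development."""
    return py_version.endswith("+")
```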

  ___ For the extra paranoid, do a completely clean test of the
      release.  This includes downloading the tarball from
      www.python.org.

  ___ Make sure the md5 checksums match.  Then unpack the tarball,
      and do a clean make test.

      % make distclean
      % ./configure
      % make test

      To ensure that the regression test suite passes.  If not, you
      screwed up somewhere!

    Step 5 ...

    Verify!  This can be interleaved with Step 4.  Pretend you're a
    user:  download the files from python.org, and build Python from
    them.  This step is too easy to overlook, and on several occasions
    we've had useless release files.  Once a general server problem
    caused mysterious corruption of all files; once the source tarball
    got built incorrectly; more than once the file upload process on
    SF truncated files; and so on.


What Next?

    Rejoice.  Drink.  Be Merry.  Write a PEP like this one.  Or be
    like unto Guido and take A Vacation.

    You've just made a Python release!

    Actually, there is one more step.  You should turn over ownership
    of the branch to Jack Jansen.  All this means is that now he will
    be responsible for making commits to the branch.  He's going to
    use this to build the MacOS versions.  He may send you information
    about the Mac release that should be merged into the informational
    pages on www.python.org.  When he's done, he'll tag the branch
    something like "rX.YaZ-mac".  He'll also be responsible for
    merging any Mac-related changes back into the trunk.


Final Release Notes

    The final release of any major release, e.g. Python 2.2 final, has
    special requirements, specifically because it will be one of the
    longest lived releases (i.e. betas don't last more than a couple
    of weeks, but final releases can last for years!).

    For this reason we want to have a higher coordination between the
    three major releases: Windows, Mac, and source.  The Windows and
    source releases benefit from the close proximity of the respective
    release-bots.  But the Mac-bot, Jack Jansen, is 6 hours away.  So
    we add this extra step to the release process for a final
    release:

    ___ Hold up the final release until Jack approves, or until we
        lose patience <wink>.

    The python.org site also needs some tweaking when a new bugfix release
    is issued.  

    ___ The documentation should be installed at doc/<version>/.

    ___ Add a link from doc/<previous-minor-release>/index.ht to the 
        documentation for the new version.

    ___ All older doc/<old-release>/index.ht files should be updated to 
        point to the documentation for the new version.

    ___ /robots.txt should be modified to prevent the old version's 
        documentation from being crawled by search engines.


Windows Notes

    Windows has a GUI installer, various flavors of Windows have
    "special limitations", and the Windows installer also packs
    precompiled "foreign" binaries (Tcl/Tk, expat, etc).  So Windows
    testing is tiresome but very necessary.

    Concurrent with uploading the installer, Thomas installs Python
    from it twice: once into the default directory suggested by the
    installer, and later into a directory with embedded spaces in its
    name.  For each installation, he runs the full regression suite
    from a DOS box, both with and without -O.

    He also tries *every* shortcut created under Start -> Menu -> the
    Python group.  When trying IDLE this way, you need to verify that
    Help -> Python Documentation works.  When trying pydoc this way
    (the "Module Docs" Start menu entry), make sure the "Start
    Browser" button works, and make sure you can search for a random
    module (Thomas uses "random" <wink>) and then that the "go to
    selected" button works.

    It's amazing how much can go wrong here -- and even more amazing
    how often last-second checkins break one of these things.  If
    you're "the Windows geek", keep in mind that you're likely the
    only person routinely testing on Windows, and that Windows is
    simply a mess.

    Repeat all of the above on at least one flavor of Win9x, and one
    of NT/2000/XP.  On NT/2000/XP, try both an Admin and a plain User
    (not Power User) account.

    WRT Step 5 above (verify the release media), since by the time
    release files are ready to download Thomas has generally run many
    Windows tests on the installer he uploaded, he usually doesn't do
    anything for Step 5 except a full byte-comparison ("fc /b" if
    using a Windows shell) of the downloaded file against the file he
    uploaded.


Copyright

    This document has been placed in the public domain.



pep-0160 Python 1.6 Release Schedule

PEP: 160
Title: Python 1.6 Release Schedule
Version: $Revision$
Last-Modified: $Date$
Author: Fred L. Drake, Jr. <fdrake at acm.org>
Status: Final
Type: Informational
Created: 25-Jul-2000
Python-Version: 1.6
Post-History: 

Introduction

    This PEP describes the Python 1.6 release schedule.  The CVS
    revision history of this file contains the definitive historical
    record.

    This release will be produced by BeOpen PythonLabs staff for the
    Corporation for National Research Initiatives (CNRI).


Schedule

    August 1     1.6 beta 1 release (planned).
    August 3     1.6 beta 1 release (actual).
    August 15    1.6 final release (planned).
    September 5  1.6 final release (actual).


Features

    A number of features are required for Python 1.6 in order to
    fulfill the various promises that have been made.  The following
    are required to be fully operational, documented, and forward
    compatible with the plans for Python 2.0:

    * Unicode support: The Unicode object defined for Python 2.0 must
      be provided, including all methods and codec support.

    * SRE: Fredrik Lundh's new regular expression engine will be used
      to provide support for both 8-bit strings and Unicode strings.
      It must pass the regression test used for the pcre-based version
      of the re module.

    * The curses module was in the middle of a transformation to a
      package, so the final form was adopted.


Mechanism

    The release will be created as a branch from the development tree
    rooted at CNRI's close of business on 16 May 2000.  Patches
    required from more recent checkins will be merged in by moving the
    branch tag on individual files whenever possible in order to
    reduce mailing list clutter and avoid divergent and incompatible
    implementations.

    The branch tag is "cnri-16-start".

    Patches and features will be merged to the extent required to pass
    regression tests in effect on 16 May 2000.

    The beta release is tagged "r16b1" in the CVS repository, and the
    final Python 1.6 release is tagged "release16" in the repository.


Copyright

    This document has been placed in the public domain.



pep-0200 Python 2.0 Release Schedule

PEP: 200
Title: Python 2.0 Release Schedule
Version: $Revision$
Last-Modified: $Date$
Author: Jeremy Hylton <jeremy at alum.mit.edu>
Status: Final
Type: Informational
Created: 
Python-Version: 2.0
Post-History: 

Introduction

    This PEP describes the Python 2.0 release schedule, tracking the
    status and ownership of the major new features, summarizes
    discussions held in mailing list forums, and provides URLs for
    further information, patches, and other outstanding issues.  The
    CVS revision history of this file contains the definitive
    historical record.

Release Schedule

    [revised 5 Oct 2000]

    26-Sep-2000: 2.0 beta 2
     9-Oct-2000: 2.0 release candidate 1 (2.0c1)
    16-Oct-2000: 2.0 final

Previous milestones

    14-Aug-2000: All 2.0 PEPs finished / feature freeze
     5-Sep-2000: 2.0 beta 1

What is release candidate 1?

    We believe that release candidate 1 will fix all known bugs that
    we intend to fix for the 2.0 final release.  This release should
    be a bit more stable than the previous betas.  We would like to
    see even more widespread testing before the final release, so we
    are producing this release candidate.  The final release will be
    exactly the same unless any show-stopping (or brown bag) bugs are
    found by testers of the release candidate.

Guidelines for submitting patches and making changes

    Use good sense when committing changes.  You should know what we
    mean by good sense or we wouldn't have given you commit privileges
    <0.5 wink>.  Some specific examples of good sense include:

    - Do whatever the dictator tells you.

    - Discuss any controversial changes on python-dev first.  If you
      get a lot of +1 votes and no -1 votes, make the change.  If you
      get some -1 votes, think twice; consider asking Guido what he
      thinks.

    - If the change is to code you contributed, it probably makes
      sense for you to fix it.

    - If the change affects code someone else wrote, it probably makes
      sense to ask him or her first.

    - You can use the SF Patch Manager to submit a patch and assign it
      to someone for review.

    Any significant new feature must be described in a PEP and
    approved before it is checked in.

    Any significant code addition, such as a new module or large
    patch, must include test cases for the regression test and
    documentation.  A patch should not be checked in until the tests
    and documentation are ready.

    If you fix a bug, you should write a test case that would have
    caught the bug.

    If you commit a patch from the SF Patch Manager or fix a bug from
    the Jitterbug database, be sure to reference the patch/bug number
    in the CVS log message.  Also be sure to change the status in the
    patch manager or bug database (if you have access to the bug
    database).

    It is not acceptable for any checked in code to cause the
    regression test to fail.  If a checkin causes a failure, it must
    be fixed within 24 hours or it will be backed out.

    All contributed C code must be ANSI C.  If possible check it with
    two different compilers, e.g. gcc and MSVC.

    All contributed Python code must follow Guido's Python style
    guide.  http://www.python.org/doc/essays/styleguide.html

    It is understood that any code contributed will be released under
    an Open Source license.  Do not contribute code if it can't be
    released this way.


Failing test cases need to get fixed

    We need to resolve errors in the regression test suite quickly.
    Changes should not be committed to the CVS tree unless the
    regression test runs cleanly with the changes applied.  If it
    fails, there may be bugs lurking in the code.  (There may be bugs
    anyway, but that's another matter.)  If the test cases are known
    to fail, they serve no useful purpose.

    test case         platform    date reported
    ---------         --------    -------------
    test_mmap         Win ME      03-Sep-2000       Windows 2b1p2 prerelease
        [04-Sep-2000 tim
         reported by Audun S. Runde mailto:audun@mindspring.com
         the mmap constructor fails w/
            WindowsError: [Errno 6] The handle is invalid
         since there are no reports of this failing on other
         flavors of Windows, this looks to be an ME bug
        ]

Open items -- Need to be resolved before 2.0 final release

    Decide whether cycle-gc should be enabled by default.

    Resolve compatibility issues between core xml package and the
    XML-SIG XML package.

    Update Tools/compiler so that it is compatible with list
    comprehensions, import as, and any other new language features.

    Improve code coverage of test suite.

    Finish writing the PEPs for the features that went out with
    2.0b1 (sad, but realistic -- we'll get better with practice).

    Major effort to whittle the bug database down to size.  I've (tim)
    seen this before: if you can keep all the open bugs fitting on one
    screen, people will generally keep it that way.  But let it
    slobber over a screen for a month, & it just goes to hell (no
    "visible progress" indeed!).

Accepted and in progress

    * Currently none left. [4-Sep-2000 guido]

Open: proposed but not accepted or rejected

    * There are a number of open patches again.  We need to clear
      these out soon.  

Previously failing test cases

    If you find a test bouncing between this section and the previous one,
    the code it's testing is in trouble!

    test case         platform    date reported
    ---------         --------    -------------
    test_fork1        Linux       26-Jul-2000
        [28-aug-2000 fixed by cgw; solution is to create copies of
        lock in child process]
        [19-Aug-2000 tim
         Charles Waldman whipped up a patch to give child processes a new
         "global lock":
         http://sourceforge.net/patch/?func=detailpatch&patch_id=101226&group_id=5470
         While this doesn't appear to address the symptoms we *saw*, it
         *does* so far appear to be fixing the failing cases anyway
        ]

    test_parser       all         22-Aug-2000
    test_posixpath    all         22-Aug-2000

    test_popen2       Win32       26-Jul-2000
        [31-Aug-2000 tim
         This died again, but for an entirely different reason:  it uses a
         dict to map file pointers to process handles, and calls a dict
         access function during popen.close().  But .close releases threads,
         which left the internal popen code accessing the dict without a
         valid thread state.  The dict implementation changed so that's no
         longer accepted.  Fixed by creating a temporary thread state in the
         guts of popen's close routine, and grabbing the global lock with
         it for the duration]
        [20-Aug-2000 tim
         changed the popen2.py _test function to use the "more" cmd
         when os.name == "nt".  This makes test_popen2 pass under
         Win98SE.
         HOWEVER, the Win98 "more" invents a leading newline out
         of thin air, and I'm not sure that the other Windows flavors
         of "more" also do that.
         So, somebody please try under other Windows flavors!
        ]
        [still fails 15-Aug-2000 for me, on Win98 - tim
             test test_popen2 crashed -- exceptions.AssertionError :
         The problem is that the test uses "cat", but there is
         no such thing under Windows (unless you install it).
         So it's the test that's broken here, not (necessarily)
         the code.
        ]

    test_winreg        Win32      26-Jul-2000
        [works 15-Aug-2000 for me, on Win98 - tim]

    test_mmap          Win32      26-Jul-2000
        [believe that was fixed by Mark H.]
        [works 15-Aug-2000 for me, on Win98 - tim]

    test_longexp      Win98+?     15-Aug-2000
        [fails in release build,
         passes in release build under verbose mode but doesn't
             look like it should pass,
         passes in debug build,
         passes in debug build under verbose mode and looks like
             it should pass
        ]
        [18-Aug-2000, tim:  can't reproduce, and nobody else
         saw it.  I believe there *is* a subtle bug in
         regrtest.py when using -v, and I'll pursue that,
         but can't provoke anything wrong with test_longexp
         anymore; eyeballing Fred's changes didn't turn up
         a suspect either
         19-Aug-2000, tim: the "subtle bug" in regrtest.py -v is
         actually a feature:  -v masks *some* kinds of failures,
         since it doesn't compare test output with the canned
         output; this is what makes it say "test passed" even
         in some cases where the test fails without -v
        ]

    test_winreg2      Win32       26-Jul-2000
        [20-Aug-2000 tim - the test has been removed from the project]
        [19-Aug-2000 tim
         This test will never work on Win98, because it's looking for
         a part of registry that doesn't exist under W98.
         The module (winreg.py) and this test case will be removed
         before 2.0 for other reasons, though.
        ]
        [still fails 15-Aug-2000 for me, on Win98 - tim
         test test_winreg2 failed -- Writing: 'Test Failed: testHives',
         expected: 'HKEY_PERFORMANCE_DATA\012'
        ]


Open items -- completed/fixed

    [4-Sep-2000 guido: Fredrik finished this on 1-Sep]
    * PyErr_Format - Fredrik Lundh
      Make this function safe from buffer overflows.

    [4-Sep-2000 guido: Fred has added popen2, popen3 on 28-Sep]
    Add popen2 support for Linux -- Fred Drake

    [4-Sep-2000 guido: done on 1-Sep]
    Deal with buffering problem with SocketServer

    [04-Sep-2000 tim:  done; installer runs; w9xpopen not an issue]
    [01-Sep-2000 tim:  make a prerelease available]
    Windows ME:  Don't know anything about it.  Will the installer
    even run?  Does it need the w9xpopen hack?

    [04-Sep-2000 tim:  done; tested on several Windows flavors now]
    [01-Sep-2000 tim:  completed but untested except on Win98SE]
    Windows installer:  If HKLM isn't writable, back off to HKCU (so
    Python can be installed on NT & 2000 without admin privileges).

    [01-Sep-2000 tim - as Guido said, runtime code in posixmodule.c doesn't
     call this on NT/2000, so no need to avoid installing it everywhere.
     Added code to the installer *to* install it, though.]
    Windows installer:  Install w9xpopen.exe only under Win95/98.

    [23-Aug-2000 jeremy - tim reports "completed recently"]
    Windows:  Look for registry info in HKCU before HKLM - Mark
    Hammond.

    [20-Aug-2000 tim - done]
    Remove winreg.py and test_winreg2.py.  Paul Prescod (the author)
    now wants to make a registry API more like the MS .NET API.  Unclear
    whether that can be done in time for 2.0, but, regardless, if we
    let winreg.py out the door we'll be stuck with it forever, and not
    even Paul wants it anymore.

    [24-Aug-2000 tim+guido - done]
    Win98 Guido:  popen is hanging on Guido, and even freezing the
    whole machine.  Was caused by Norton Antivirus 2000 (6.10.20) on
    Windows 9x.  Resolution: disable virus protection.


Accepted and completed

    * Change meaning of \x escapes - PEP 223 - Fredrik Lundh

    * Add \U12345678 escapes in u"" strings - Fredrik Lundh

    * Support for opcode arguments > 2**16 - Charles Waldman
      SF Patch 100893

    * "import as" - Thomas Wouters
      Extend the 'import' and 'from ... import' mechanism to enable
      importing a symbol as another name. (Without adding a new keyword.)

    * List comprehensions - Skip Montanaro
      Tim Peters still needs to do PEP.

    * Restore old os.path.commonprefix behavior
      Do we have test cases that work on all platforms?

    * Tim O'Malley's cookie module with good license

    * Lockstep iteration ("zip" function) - Barry Warsaw

    * SRE - Fredrik Lundh
      [at least I *think* it's done, as of 15-Aug-2000 - tim]

    * Fix xrange printing behavior - Fred Drake
      Remove the tp_print handler for the xrange type; it produced a
      list display instead of 'xrange(...)'.  The new code produces a
      minimal call to xrange(), enclosed in (... * N) when N != 1.
      This makes the repr() more human readable while making it do
      what reprs are advertised as doing.  It also makes the xrange
      objects obvious when working in the interactive interpreter.

    * Extended print statement - Barry Warsaw
      PEP 214
      http://www.python.org/dev/peps/pep-0214/
      SF Patch #100970
      http://sourceforge.net/patch/?func=detailpatch&patch_id=100970&group_id=5470

    * interface to poll system call - Andrew Kuchling
      SF Patch 100852

    * Augmented assignment - Thomas Wouters
      Add += and family, plus Python and C hooks, and API functions.

    * gettext.py module - Barry Warsaw


Postponed

    * Extended slicing on lists - Michael Hudson
      Make lists (and other builtin types) handle extended slices.

    * Compression of Unicode database - Fredrik Lundh
      SF Patch 100899
      At least for 2.0b1.  May be included in 2.0 as a bug fix.

    * Range literals - Thomas Wouters
      SF Patch 100902
      We ended up having a lot of doubt about the proposal.

    * Eliminated SET_LINENO opcode - Vladimir Marangozov
      Small optimization achieved by using the code object's lnotab
      instead of the SET_LINENO instruction.  Uses code rewriting
      technique (which Guido frowns on) to support the debugger, which
      uses SET_LINENO.

      http://starship.python.net/~vlad/lineno/
      for (working at the time) patches

      Discussions on python-dev:

      - http://www.python.org/pipermail/python-dev/2000-April/subject.html
        Subject: "Why do we need Traceback Objects?"

      - http://www.python.org/pipermail/python-dev/1999-August/002252.html

    * test harness for C code - Trent Mick


Rejected

    * 'indexing-for' - Thomas Wouters
      Special syntax to give Python code access to the loop-counter in 'for'
      loops. (Without adding a new keyword.)



pep-0201 Lockstep Iteration

PEP: 201
Title: Lockstep Iteration
Version: $Revision$
Last-Modified: $Date$
Author: Barry Warsaw <barry at python.org>
Status: Final
Type: Standards Track
Created: 13-Jul-2000
Python-Version: 2.0
Post-History: 27-Jul-2000

Introduction

    This PEP describes the `lockstep iteration' proposal.  This PEP
    tracks the status and ownership of this feature, slated for
    introduction in Python 2.0.  It contains a description of the
    feature and outlines changes necessary to support the feature.
    This PEP summarizes discussions held in mailing list forums, and
    provides URLs for further information, where appropriate.  The CVS
    revision history of this file contains the definitive historical
    record.


Motivation

    Standard for-loops in Python iterate over every element in a
    sequence until the sequence is exhausted[1].  However, for-loops
    iterate over only a single sequence, and it is often desirable to
    loop over more than one sequence in a lock-step fashion.  In other
    words, such that the i-th iteration through the loop
    returns an object containing the i-th element from each sequence.

    The common idioms used to accomplish this are unintuitive.  This
    PEP proposes a standard way of performing such iterations by
    introducing a new builtin function called `zip'.

    While the primary motivation for zip() comes from lock-step
    iteration, by implementing zip() as a built-in function, it has
    additional utility in contexts other than for-loops.

Lockstep For-Loops

    Lockstep for-loops are non-nested iterations over two or more
    sequences, such that at each pass through the loop, one element
    from each sequence is taken to compose the target.  This behavior
    can already be accomplished in Python through the use of the map()
    built-in function:

    >>> a = (1, 2, 3)
    >>> b = (4, 5, 6)
    >>> for i in map(None, a, b): print i
    ... 
    (1, 4)
    (2, 5)
    (3, 6)
    >>> map(None, a, b)
    [(1, 4), (2, 5), (3, 6)]

    The for-loop simply iterates over this list as normal.

    While the map() idiom is a common one in Python, it has several
    disadvantages:

    - It is non-obvious to programmers without a functional
      programming background.

    - The use of the magic `None' first argument is non-obvious.

    - It has arbitrary, often unintended, and inflexible semantics
      when the lists are not of the same length: the shorter sequences
      are padded with `None'.

      >>> c = (4, 5, 6, 7)
      >>> map(None, a, c)
      [(1, 4), (2, 5), (3, 6), (None, 7)]

    For these reasons, several proposals were floated in the Python
    2.0 beta time frame for syntactic support of lockstep for-loops.
    Here are two suggestions:

    for x in seq1, y in seq2:
        # stuff

    for x, y in seq1, seq2:
        # stuff

    Neither of these forms would work, since they both already mean
    something in Python and changing the meanings would break existing
    code.  All other suggestions for new syntax suffered the same
    problem, or were in conflict with another proposed feature
    called `list comprehensions' (see PEP 202).

The Proposed Solution

    The proposed solution is to introduce a new built-in sequence
    generator function, available in the __builtin__ module.  This
    function is to be called `zip' and has the following signature:

    zip(seqa, [seqb, [...]])

    zip() takes one or more sequences and weaves their elements
    together, just as map(None, ...) does with sequences of equal
    length.  The weaving stops when the shortest sequence is
    exhausted.


Return Value

    zip() returns a real Python list, the same way map() does.


Examples

    Here are some examples, based on the reference implementation
    below.

    >>> a = (1, 2, 3, 4)
    >>> b = (5, 6, 7, 8)
    >>> c = (9, 10, 11)
    >>> d = (12, 13)

    >>> zip(a, b)
    [(1, 5), (2, 6), (3, 7), (4, 8)]

    >>> zip(a, d)
    [(1, 12), (2, 13)]

    >>> zip(a, b, c, d)
    [(1, 5, 9, 12), (2, 6, 10, 13)]

    Note that when the sequences are of the same length, zip() is
    reversible:

    >>> a = (1, 2, 3)
    >>> b = (4, 5, 6)
    >>> x = zip(a, b)
    >>> y = zip(*x) # alternatively, apply(zip, x)
    >>> z = zip(*y) # alternatively, apply(zip, y)
    >>> x
    [(1, 4), (2, 5), (3, 6)]
    >>> y
    [(1, 2, 3), (4, 5, 6)]
    >>> z
    [(1, 4), (2, 5), (3, 6)]
    >>> x == z
    1

    It is not possible to reverse zip this way when the sequences are
    not all the same length.


Reference Implementation

    Here is a reference implementation, in Python, of the zip()
    built-in function.  This will be replaced with a C implementation
    after final approval.

    def zip(*args):
        if not args:
            raise TypeError('zip() expects one or more sequence arguments')
        ret = []
        i = 0
        try:
            while 1:
                item = []
                for s in args:
                    item.append(s[i])
                ret.append(tuple(item))
                i = i + 1
        except IndexError:
            return ret


BDFL Pronouncements

    Note: the BDFL refers to Guido van Rossum, Python's Benevolent
    Dictator For Life.

    - The function's name.  An earlier version of this PEP included an
      open issue listing 20+ proposed alternative names to zip().  In
      the face of no overwhelmingly better choice, the BDFL strongly
      prefers zip() due to its Haskell[2] heritage.  See version 1.7
      of this PEP for the list of alternatives.

    - zip() shall be a built-in function.

    - Optional padding.  An earlier version of this PEP proposed an
      optional `pad' keyword argument, which would be used when the
      argument sequences were not the same length.  This is similar
      behavior to the map(None, ...) semantics except that the user
      would be able to specify the pad object.  This has been rejected by
      the BDFL in favor of always truncating to the shortest sequence,
      because of the KISS principle.  If there's a true need, it is
      easier to add later.  If it is not needed, it would still be
      impossible to delete it in the future.

    - Lazy evaluation.  An earlier version of this PEP proposed that
      zip() return a built-in object that performed lazy evaluation
      using the __getitem__() protocol.  This has been strongly rejected
      by the BDFL in favor of returning a real Python list.  If lazy
      evaluation is desired in the future, the BDFL suggests an xzip()
      function be added.

    - zip() with no arguments.  The BDFL strongly prefers that this raise a
      TypeError exception.

    - zip() with one argument.  The BDFL strongly prefers that this
      return a list of 1-tuples.

    - Inner and outer container control.  An earlier version of this
      PEP contains a rather lengthy discussion on a feature that some
      people wanted, namely the ability to control what the inner and
      outer container types were (they are tuple and list,
      respectively, in this version of the PEP).  Given the simplified
      API and implementation, this elaboration is rejected.  For a
      more detailed analysis, see version 1.7 of this PEP.
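
    The pronouncements on the edge cases above can be sketched in a few
    lines (a hypothetical `zip0' stands in for the 2.0-era built-in,
    since the modern built-in zip() is lazy and allows zero arguments):

```python
def zip0(*args):
    """Sketch of the Python 2.0 zip() semantics described above:
    no arguments raises TypeError; one argument yields 1-tuples;
    otherwise truncate to the shortest sequence."""
    if not args:
        raise TypeError('zip() expects one or more sequence arguments')
    # The modern built-in zip() already truncates to the shortest input;
    # we only wrap it to return a real list, as the PEP specifies.
    return [tuple(items) for items in zip(*args)]

print(zip0((1, 2, 3)))           # [(1,), (2,), (3,)]
print(zip0((1, 2, 3), (4, 5)))   # [(1, 4), (2, 5)]
```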

Subsequent Change to zip()

    In Python 2.4, zip() with no arguments was modified to return an
    empty list rather than raising a TypeError exception.  The rationale
    for the original behavior was that the absence of arguments was
    thought to indicate a programming error.  However, that thinking
    did not anticipate the use of zip() with the * operator for unpacking
    variable length argument lists.  For example, the inverse of zip
    could be defined as:  unzip = lambda s: zip(*s).  That transformation
    also defines a matrix transpose or an equivalent row/column swap for
    tables defined as lists of tuples.  The latter transformation is
    commonly used when reading data files with records as rows and fields
    as columns.  For example, the code:

        date, rain, high, low = zip(*csv.reader(file("weather.csv")))

    rearranges columnar data so that each field is collected into
    individual tuples for straightforward looping and summarization:

        print "Total rainfall", sum(rain)

    Using zip(*args) is more easily coded if zip(*[]) is handled as an
    allowable case rather than an exception.  This is especially helpful
    when data is either built up from or recursed down to a null case
    with no records.

    Seeing this possibility, the BDFL agreed (with some misgivings) to
    have the behavior changed for Py2.4.
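
    The unzip/transpose identity described above, including the null
    case, can be checked directly (modern Python 3 syntax, where zip()
    returns an iterator):

```python
# zip(*s) transposes a table given as a list of equal-length rows.
table = [(1, 'a'), (2, 'b'), (3, 'c')]
cols = list(zip(*table))          # unzip: one tuple per column
print(cols)                       # [(1, 2, 3), ('a', 'b', 'c')]

# Applying the transform twice recovers the original rows, and the
# degenerate zip(*[]) case yields an empty result rather than an error.
assert list(zip(*cols)) == table
assert list(zip(*[])) == []
```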

Other Changes

    - The xzip() function discussed above was implemented in Py2.3 in
      the itertools module as itertools.izip().  This function provides
      lazy behavior, consuming single elements and producing a single
      tuple on each pass.  The "just-in-time" style saves memory and
      runs faster than its list-based counterpart, zip().

    - The itertools module also added itertools.repeat() and
      itertools.chain().  These tools can be used together to pad
      sequences with None (to match the behavior of map(None, seqn)):

          zip(firstseq, chain(secondseq, repeat(None)))
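
    A runnable version of that padding recipe, using modern Python 3
    names (itertools.izip became the lazy built-in zip()); note the
    recipe pads only the second sequence, so it assumes the first one
    is the longer:

```python
from itertools import chain, repeat

a = (1, 2, 3)
b = (4, 5)

# Pad the shorter sequence with None, matching old map(None, a, b):
# zip() stops when `a' is exhausted, and chain(b, repeat(None))
# supplies None once `b' runs out.
padded = list(zip(a, chain(b, repeat(None))))
print(padded)   # [(1, 4), (2, 5), (3, None)]
```

    For the general case where either sequence may be the shorter,
    itertools.zip_longest() provides the same behavior directly.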


References

    [1] http://docs.python.org/reference/compound_stmts.html#for
    [2] http://www.haskell.org/onlinereport/standard-prelude.html#$vzip

    Greg Wilson's questionnaire on proposed syntax to some CS grad students
    http://www.python.org/pipermail/python-dev/2000-July/013139.html


Copyright

    This document has been placed in the public domain.



pep-0202 List Comprehensions

PEP: 202
Title: List Comprehensions
Version: $Revision$
Last-Modified: $Date$
Author: Barry Warsaw <barry at python.org>
Status: Final
Type: Standards Track
Created: 13-Jul-2000
Python-Version: 2.0
Post-History: 

Introduction

    This PEP describes a proposed syntactical extension to Python,
    list comprehensions.


The Proposed Solution

    It is proposed to allow conditional construction of list literals
    using for and if clauses.  They would nest in the same way that
    for loops and if statements nest now.
    

Rationale

    List comprehensions provide a more concise way to create lists in
    situations where map() and filter() and/or nested loops would
    currently be used.
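
    The equivalence claimed above can be illustrated with a small
    sketch (modern Python 3 syntax, where print is a function):

```python
nums = [1, 2, 3, 4, 5, 6]

# map()/filter() style: square the even numbers.
a = list(map(lambda x: x * x, filter(lambda x: x % 2 == 0, nums)))

# The same computation as a list comprehension.
b = [x * x for x in nums if x % 2 == 0]

print(a, b)   # [4, 16, 36] [4, 16, 36]
assert a == b
```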


Examples

    >>> print [i for i in range(10)]
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

    >>> print [i for i in range(20) if i%2 == 0]
    [0, 2, 4, 6, 8, 10, 12, 14, 16, 18]

    >>> nums = [1,2,3,4]
    >>> fruit = ["Apples", "Peaches", "Pears", "Bananas"]
    >>> print [(i,f) for i in nums for f in fruit]
    [(1, 'Apples'), (1, 'Peaches'), (1, 'Pears'), (1, 'Bananas'),
     (2, 'Apples'), (2, 'Peaches'), (2, 'Pears'), (2, 'Bananas'),
     (3, 'Apples'), (3, 'Peaches'), (3, 'Pears'), (3, 'Bananas'),
     (4, 'Apples'), (4, 'Peaches'), (4, 'Pears'), (4, 'Bananas')]
    >>> print [(i,f) for i in nums for f in fruit if f[0] == "P"]
    [(1, 'Peaches'), (1, 'Pears'),
     (2, 'Peaches'), (2, 'Pears'),
     (3, 'Peaches'), (3, 'Pears'),
     (4, 'Peaches'), (4, 'Pears')]
    >>> print [(i,f) for i in nums for f in fruit if f[0] == "P" if i%2 == 1]
    [(1, 'Peaches'), (1, 'Pears'), (3, 'Peaches'), (3, 'Pears')]
    >>> print [i for i in zip(nums,fruit) if i[0]%2==0]
    [(2, 'Peaches'), (4, 'Bananas')]


Reference Implementation

    List comprehensions become part of the Python language with
    release 2.0, documented in [1].


BDFL Pronouncements

    - The syntax proposed above is the Right One.

    - The form [x, y for ...] is disallowed; one is required to write
      [(x, y) for ...].

    - The form [... for x... for y...] nests, with the last index
      varying fastest, just like nested for loops.


References

    [1] http://docs.python.org/reference/expressions.html#list-displays



pep-0203 Augmented Assignments

PEP: 203
Title: Augmented Assignments
Version: $Revision$
Last-Modified: $Date$
Author: Thomas Wouters <thomas at python.org>
Status: Final
Type: Standards Track
Created: 13-Jul-2000
Python-Version: 2.0
Post-History: 14-Aug-2000

Introduction

    This PEP describes the `augmented assignment' proposal for Python
    2.0.  This PEP tracks the status and ownership of this feature,
    slated for introduction in Python 2.0.  It contains a description
    of the feature and outlines changes necessary to support the
    feature.  This PEP summarizes discussions held in mailing list
    forums, and provides URLs for further information where
    appropriate.  The CVS revision history of this file contains the
    definitive historical record.


Proposed semantics

    The proposed patch that adds augmented assignment to Python
    introduces the following new operators:
    
       += -= *= /= %= **= <<= >>= &= ^= |=
    
    They implement the same operator as their normal binary form,
    except that the operation is done `in-place' when the left-hand
    side object supports it, and that the left-hand side is only
    evaluated once.
    
    They truly behave as augmented assignment, in that they perform
    all of the normal load and store operations, in addition to the
    binary operation they are intended to do. So, given the expression:
    
       x += y
    
    The object `x' is loaded, then `y' is added to it, and the
    resulting object is stored back in the original place. The precise
    action performed on the two arguments depends on the type of `x',
    and possibly of `y'.

    The idea behind augmented assignment in Python is that it isn't
    just an easier way to write the common practice of storing the
    result of a binary operation in its left-hand operand, but also a
    way for the left-hand operand in question to know that it should
    operate `on itself', rather than creating a modified copy of
    itself.

    To make this possible, a number of new `hooks' are added to Python
    classes and C extension types, which are called when the object in
    question is used as the left hand side of an augmented assignment
    operation. If the class or type does not implement the `in-place'
    hooks, the normal hooks for the particular binary operation are
    used.
    
    So, given an instance object `x', the expression
    
        x += y
    
    tries to call x.__iadd__(y), which is the `in-place' variant of
    __add__. If __iadd__ is not present, x.__add__(y) is attempted,
    and finally y.__radd__(x) if __add__ is missing too.  There is no
    `right-hand-side' variant of __iadd__, because that would require
    for `y' to know how to in-place modify `x', which is unsafe to say
    the least. The __iadd__ hook should behave similarly to __add__,
    returning the result of the operation (which could be `self')
    which is to be assigned to the variable `x'.
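
    A minimal sketch of the hook resolution just described (modern
    class syntax; the class names are illustrative):

```python
class Accumulator:
    """Defines __iadd__, so `x += y` mutates x in place."""
    def __init__(self, items):
        self.items = list(items)

    def __iadd__(self, other):
        self.items.extend(other)   # operate 'on itself'
        return self                # the result is re-bound to the name

class Plain:
    """No __iadd__: `x += y` falls back to __add__ and re-binds."""
    def __init__(self, items):
        self.items = list(items)

    def __add__(self, other):
        return Plain(self.items + list(other))

x = Accumulator([1, 2])
alias = x
x += [3]
assert alias is x and x.items == [1, 2, 3]    # mutated in place

p = Plain([1, 2])
old = p
p += [3]
assert old is not p and p.items == [1, 2, 3]  # new object created
```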
 
    For C extension types, the `hooks' are members of the
    PyNumberMethods and PySequenceMethods structures.  Some special
    semantics apply to make the use of these methods, and the mixing
    of Python instance objects and C types, as unsurprising as
    possible.

    In the generic case of `x <augop> y' (or a similar case using the
    PyNumber_InPlace API functions) the principal object being
    operated on is `x'.  This differs from normal binary operations,
    where `x' and `y' could be considered `co-operating', because
    unlike in binary operations, the operands in an in-place operation
    cannot be swapped.  However, in-place operations do fall back to
    normal binary operations when in-place modification is not
    supported, resulting in the following rules:
    
    - If the left-hand object (`x') is an instance object, and it
      has a `__coerce__' method, call that function with `y' as the
      argument. If coercion succeeds, and the resulting left-hand
      object is a different object than `x', stop processing it as
      in-place and call the appropriate function for the normal binary
      operation, with the coerced `x' and `y' as arguments. The result
      of the operation is whatever that function returns.
      
      If coercion does not yield a different object for `x', or `x'
      does not define a `__coerce__' method, and `x' has the
      appropriate `__ihook__' for this operation, call that method
      with `y' as the argument, and the result of the operation is
      whatever that method returns.

    - Otherwise, if the left-hand object is not an instance object,
      but its type does define the in-place function for this
      operation, call that function with `x' and `y' as the arguments,
      and the result of the operation is whatever that function
      returns.
      
      Note that no coercion on either `x' or `y' is done in this case,
      and it's perfectly valid for a C type to receive an instance
      object as the second argument; that is something that cannot
      happen with normal binary operations.

    - Otherwise, process it exactly as a normal binary operation (not
      in-place), including argument coercion. In short, if either
      argument is an instance object, resolve the operation through
      `__coerce__', `__hook__' and `__rhook__'. Otherwise, both
      objects are C types, and they are coerced and passed to the
      appropriate function.
   
    - If no way to process the operation can be found, raise a
      TypeError with an error message specific to the operation.

    - Some special casing exists to account for the case of `+' and
      `*', which have a special meaning for sequences: for `+',
      sequence concatenation, no coercion whatsoever is done if a C
      type defines sq_concat or sq_inplace_concat. For `*', sequence
      repeating, `y' is converted to a C integer before calling either
      sq_inplace_repeat or sq_repeat. This is done even if `y' is an
      instance, though not if `x' is an instance.

    The in-place function should always return a new reference, either
    to the old `x' object if the operation was indeed performed
    in-place, or to a new object.
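
    The built-in types show both sides of the fallback rules above:
    lists support in-place concatenation, while tuples define no
    in-place slot and fall back to the normal binary operation:

```python
t = (1, 2)
old = t
t += (3,)            # tuples have no in-place slot: falls back to
                     # normal concatenation, producing a new object
assert old is not t and t == (1, 2, 3)

lst = [1, 2]
same = lst
lst += [3]           # lists support in-place concatenation:
                     # the existing object is extended
assert same is lst and lst == [1, 2, 3]
```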


Rationale

    There are two main reasons for adding this feature to Python:
    simplicity of expression, and support for in-place operations. The
    end result is a tradeoff between simplicity of syntax and
    simplicity of expression; like most new features, augmented
    assignment doesn't add anything that was previously impossible. It
    merely makes these things easier to do.
    
    Adding augmented assignment will make Python's syntax more complex. 
    Instead of a single assignment operation, there are now twelve
    assignment operations, eleven of which also perform a binary
    operation. However, these eleven new forms of assignment are easy
    to understand as the coupling between assignment and the binary
    operation, and they require no large conceptual leap to
    understand. Furthermore, languages that do have augmented
    assignment have shown that they are a popular, much used feature.
    Expressions of the form
    
        <x> = <x> <operator> <y>
        
    are common enough in those languages to make the extra syntax
    worthwhile, and Python does not have significantly fewer of those
    expressions. Quite the opposite, in fact, since in Python you can
    also concatenate lists with a binary operator, something that is
    done quite frequently. Writing the above expression as
    
        <x> <operator>= <y> 
    
    is both more readable and less error prone, because it is
    instantly obvious to the reader that it is <x> that is being
    changed, and not <x> that is being replaced by something almost,
    but not quite, entirely unlike <x>.
    
    The new in-place operations are especially useful to matrix
    calculation and other applications that require large objects. In
    order to efficiently deal with the available program memory, such
    packages cannot blindly use the current binary operations. Because
    these operations always create a new object, adding a single item
    to an existing (large) object would result in copying the entire
    object (which may cause the application to run out of memory),
    adding the single item, and then possibly deleting the original
    object, depending on its reference count.
    
    To work around this problem, the packages currently have to use
    methods or functions to modify an object in-place, which is
    definitely less readable than an augmented assignment expression. 
    Augmented assignment won't solve all the problems for these
    packages, since some operations cannot be expressed in the limited
    set of binary operators to start with, but it is a start. A
    different PEP[2] is looking at adding new operators.


New methods

    The proposed implementation adds the following 11 possible `hooks'
    which Python classes can implement to overload the augmented
    assignment operations:
    
        __iadd__
        __isub__
        __imul__
        __idiv__
        __imod__
        __ipow__
        __ilshift__
        __irshift__
        __iand__
        __ixor__
        __ior__
    
    The `i' in `__iadd__' stands for `in-place'.

    For C extension types, the following struct members are added:
    
    To PyNumberMethods:
        binaryfunc nb_inplace_add;
        binaryfunc nb_inplace_subtract;
        binaryfunc nb_inplace_multiply;
        binaryfunc nb_inplace_divide;
        binaryfunc nb_inplace_remainder;
        binaryfunc nb_inplace_power;
        binaryfunc nb_inplace_lshift;
        binaryfunc nb_inplace_rshift;
        binaryfunc nb_inplace_and;
        binaryfunc nb_inplace_xor;
        binaryfunc nb_inplace_or;

    To PySequenceMethods:
        binaryfunc sq_inplace_concat;
        intargfunc sq_inplace_repeat;

    In order to keep binary compatibility, the tp_flags TypeObject
    member is used to determine whether the TypeObject in question has
    allocated room for these slots. Until a clean break in binary
    compatibility is made (which may or may not happen before 2.0)
    code that wants to use one of the new struct members must first
    check that they are available with the `PyType_HasFeature()'
    macro:
    
    if (PyType_HasFeature(x->ob_type, Py_TPFLAGS_HAVE_INPLACE_OPS) &&
        x->ob_type->tp_as_number && x->ob_type->tp_as_number->nb_inplace_add) {
            /* ... */

    This check must be made even before testing the method slots for
    NULL values! The macro only tests whether the slots are available,
    not whether they are filled with methods or not.


Implementation

    The current implementation of augmented assignment[1] adds, in
    addition to the methods and slots already covered, 13 new bytecodes
    and 13 new API functions.
    
    The API functions are simply in-place versions of the current
    binary-operation API functions:
    
        PyNumber_InPlaceAdd(PyObject *o1, PyObject *o2);
        PyNumber_InPlaceSubtract(PyObject *o1, PyObject *o2);
        PyNumber_InPlaceMultiply(PyObject *o1, PyObject *o2);
        PyNumber_InPlaceDivide(PyObject *o1, PyObject *o2);
        PyNumber_InPlaceRemainder(PyObject *o1, PyObject *o2);
        PyNumber_InPlacePower(PyObject *o1, PyObject *o2);
        PyNumber_InPlaceLshift(PyObject *o1, PyObject *o2);
        PyNumber_InPlaceRshift(PyObject *o1, PyObject *o2);
        PyNumber_InPlaceAnd(PyObject *o1, PyObject *o2);
        PyNumber_InPlaceXor(PyObject *o1, PyObject *o2);
        PyNumber_InPlaceOr(PyObject *o1, PyObject *o2);
        PySequence_InPlaceConcat(PyObject *o1, PyObject *o2);
        PySequence_InPlaceRepeat(PyObject *o, int count);

    They call either the Python class hooks (if either of the objects
    is a Python class instance) or the C type's number or sequence
    methods.
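    The lookup order can be sketched in Python (a simplified analogue;
    the real functions are written in C, and the name `inplace_add' is
    invented here for illustration):

```python
def inplace_add(o1, o2):
    # Simplified model of PyNumber_InPlaceAdd: prefer the type's
    # in-place hook (__iadd__), fall back to the ordinary binary add.
    cls = type(o1)
    if hasattr(cls, '__iadd__'):
        return cls.__iadd__(o1, o2)
    return o1 + o2

a = [1]
assert inplace_add(a, [2]) is a and a == [1, 2]   # list mutates in place
assert inplace_add(1, 2) == 3                     # ints fall back to +
```

    The sketch omits details such as the reflected-operand fallback,
    but shows the central point: the in-place functions try an in-place
    method first and degrade gracefully to the binary operation.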

    The new bytecodes are:
        INPLACE_ADD
        INPLACE_SUBTRACT
        INPLACE_MULTIPLY
        INPLACE_DIVIDE
        INPLACE_REMAINDER
        INPLACE_POWER
        INPLACE_LEFTSHIFT
        INPLACE_RIGHTSHIFT
        INPLACE_AND
        INPLACE_XOR
        INPLACE_OR
        ROT_FOUR
        DUP_TOPX
    
    The INPLACE_* bytecodes mirror the BINARY_* bytecodes, except that
    they are implemented as calls to the `InPlace' API functions. The
    other two bytecodes are `utility' bytecodes: ROT_FOUR behaves like
    ROT_THREE except that the four topmost stack items are rotated.
    
    DUP_TOPX is a bytecode that takes a single argument, which should
    be an integer between 1 and 5 (inclusive) which is the number of
    items to duplicate in one block. Given a stack like this (where
    the right side of the list is the `top' of the stack):

        [1, 2, 3, 4, 5]
    
    "DUP_TOPX 3" would duplicate the top 3 items, resulting in this
    stack:
    
        [1, 2, 3, 4, 5, 3, 4, 5]

    DUP_TOPX with an argument of 1 is the same as DUP_TOP. The limit
    of 5 is purely an implementation limit. The implementation of
    augmented assignment requires only DUP_TOPX with an argument of 2
    and 3, and could do without this new opcode at the cost of a fair
    number of DUP_TOP and ROT_*.
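    The stack effect of DUP_TOPX can be modelled with a plain Python
    list standing in for the value stack (a sketch, not the actual
    interpreter code):

```python
def dup_topx(stack, n):
    # Duplicate the top n items (1 <= n <= 5) as one block,
    # mirroring the effect of "DUP_TOPX n".
    if not 1 <= n <= 5:
        raise ValueError("DUP_TOPX argument must be between 1 and 5")
    return stack + stack[-n:]

assert dup_topx([1, 2, 3, 4, 5], 3) == [1, 2, 3, 4, 5, 3, 4, 5]
assert dup_topx([1, 2], 1) == [1, 2, 2]   # argument 1 behaves like DUP_TOP
```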


Open Issues

    The PyNumber_InPlace API is only a subset of the normal PyNumber
    API: only those functions that are required to support the
    augmented assignment syntax are included. If other in-place API
    functions are needed, they can be added later.


    The DUP_TOPX bytecode is a convenience bytecode, and is not
    actually necessary. It should be considered whether this bytecode
    is worth having. There seems to be no other possible use for this
    bytecode at this time.
    

Copyright

    This document has been placed in the public domain.


References

    [1] http://www.python.org/pipermail/python-list/2000-June/059556.html

    [2] http://sourceforge.net/patch?func=detailpatch&patch_id=100699&group_id=5470

    [3] PEP 211, Adding A New Outer Product Operator, Wilson
        http://www.python.org/dev/peps/pep-0211/



pep-0204 Range Literals

PEP: 204
Title: Range Literals
Version: $Revision$
Last-Modified: $Date$
Author: Thomas Wouters <thomas at python.org>
Status: Rejected
Type: Standards Track
Created: 14-Jul-2000
Python-Version: 2.0
Post-History: 

Introduction

    This PEP describes the `range literal' proposal for Python 2.0.
    This PEP tracks the status and ownership of this feature, slated
    for introduction in Python 2.0.  It contains a description of the
    feature and outlines changes necessary to support the feature.
    This PEP summarizes discussions held in mailing list forums, and
    provides URLs for further information, where appropriate.  The CVS
    revision history of this file contains the definitive historical
    record.


List ranges

    Ranges are sequences of numbers of a fixed stepping, often used in
    for-loops.  The Python for-loop is designed to iterate over a
    sequence directly:
    
        >>> l = ['a', 'b', 'c', 'd']
        >>> for item in l:
        ...     print item
        a
        b
        c
        d
    
    However, this solution is not always prudent.  Firstly, problems
    arise when altering the sequence in the body of the for-loop,
    resulting in the for-loop skipping items.  Secondly, it is not
    possible to iterate over, say, every second element of the
    sequence.  And thirdly, it is sometimes necessary to process an
    element based on its index, which is not readily available in the
    above construct.
    
    For these instances, and others where a range of numbers is
    desired, Python provides the `range' builtin function, which
    creates a list of numbers.  The `range' function takes three
    arguments, `start', `end' and `step'.  `start' and `step' are
    optional, and default to 0 and 1, respectively.
    
    The `range' function creates a list of numbers, starting at
    `start', with a step of `step', up to, but not including `end', so
    that `range(10)' produces a list that has exactly 10 items, the
    numbers 0 through 9.
    
    Using the `range' function, the above example would look like
    this:
    
        >>> for i in range(len(l)):
        ...     print l[i]
        a
        b
        c
        d
    
    Or, to start at the second element of `l' and process only
    every second element from then on:
    
        >>> for i in range(1, len(l), 2):
        ...     print l[i]
        b
        d
    
    There are several disadvantages with this approach:
    
    - Clarity of purpose: Adding another function call, possibly with
      extra arithmetic to determine the desired length and step of the
      list, does not improve readability of the code.  Also, it is
      possible to `shadow' the builtin `range' function by supplying a
      local or global variable with the same name, effectively
      replacing it.  This may or may not be a desired effect.
      
    - Efficiency: because the `range' function can be overridden, the
      Python compiler cannot make assumptions about the for-loop, and
      has to maintain a separate loop counter.
      
    - Consistency: There already is a syntax that is used to denote
      ranges, as shown below.  This syntax uses the exact same
      arguments, though all optional, in the exact same way.  It seems
      logical to extend this syntax to ranges, to form `range
      literals'.


Slice Indices

    In Python, a sequence can be indexed in one of two ways:
    retrieving a single item, or retrieving a range of items. 
    Retrieving a range of items results in a new object of the same
    type as the original sequence, containing zero or more items from
    the original sequence.  This is done using a `range notation':
    
        >>> l[2:4]
        ['c', 'd']
    
    This range notation consists of zero, one or two indices separated
    by a colon.  The first index is the `start' index, the second the
    `end'.  When either is left out, it defaults to the start or the
    end of the sequence, respectively.
    
    There is also an extended range notation, which incorporates
    `step' as well.  Though this notation is not currently supported
    by most builtin types, if it were, it would work as follows:
    
        >>> l[1:4:2]
        ['b', 'd']

    The third `argument' to the slice syntax is exactly the same as
    the `step' argument to range().  The underlying mechanisms of the
    standard, and these extended slices, are sufficiently different
    and inconsistent that many classes and extensions outside of
    mathematical packages do not implement support for the extended
    variant.  While this should be resolved, it is beyond the scope of
    this PEP.
    
    Extended slices do show, however, that there is already a
    perfectly valid and applicable syntax to denote ranges in a way
    that solves all of the earlier stated disadvantages of the use of
    the range() function:
    
    - It is clearer, more concise syntax, which has already proven to
      be both intuitive and easy to learn.
      
    - It is consistent with the other use of ranges in Python
      (e.g. slices).
      
    - Because it is built-in syntax, instead of a builtin function, it
      cannot be overridden.  This means both that a viewer can be
      certain about what the code does, and that an optimizer will not
      have to worry about range() being `shadowed'.
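    A sketch using current Python, where the builtin list type does
    support the extended notation, makes the correspondence between the
    slice step and the range() step concrete:

```python
l = ['a', 'b', 'c', 'd']
# The third slice argument plays exactly the role of range()'s step.
assert l[1:4:2] == [l[i] for i in range(1, 4, 2)] == ['b', 'd']
```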


The Proposed Solution

    The proposed implementation of range-literals combines the syntax
    for list literals with the syntax for (extended) slices, to form
    range literals:
    
        >>> [1:10]
        [1, 2, 3, 4, 5, 6, 7, 8, 9]
        >>> [:5]
        [0, 1, 2, 3, 4]
        >>> [5:1:-1]
        [5, 4, 3, 2]
    
    There is one minor difference between range literals and the slice
    syntax: though it is possible to omit all of `start', `end' and
    `step' in slices, it does not make sense to omit `end' in range
    literals.  In slices, `end' would default to the end of the list,
    but this has no meaning in range literals.
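    Each proposed literal corresponds directly to a range() call with
    the same three arguments (shown here wrapped in list(), since in
    current Python range() no longer returns a list):

```python
assert list(range(1, 10)) == [1, 2, 3, 4, 5, 6, 7, 8, 9]   # [1:10]
assert list(range(5)) == [0, 1, 2, 3, 4]                   # [:5]
assert list(range(5, 1, -1)) == [5, 4, 3, 2]               # [5:1:-1]
```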


Reference Implementation

    The proposed implementation can be found on SourceForge[1].  It
    adds a new bytecode, BUILD_RANGE, that takes three arguments from
    the stack and builds a list on the basis of those.  The list is
    pushed back on the stack.
    
    The use of a new bytecode is necessary to be able to build ranges
    based on other calculations, whose outcome is not known at compile
    time.
    
    The code introduces two new functions to listobject.c, which are
    currently hovering between private functions and full-fledged API
    calls.

    PyList_FromRange() builds a list from start, end and step,
    returning NULL if an error occurs.  Its prototype is:

        PyObject * PyList_FromRange(long start, long end, long step)
    
    PyList_GetLenOfRange() is a helper function used to determine the
    length of a range.  Previously, it was a static function in
    bltinmodule.c, but is now necessary in both listobject.c and
    bltinmodule.c (for xrange).  It is made non-static solely to avoid
    code duplication.  Its prototype is:

        long PyList_GetLenOfRange(long start, long end, long step) 
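    The arithmetic PyList_GetLenOfRange() performs can be sketched in
    Python (a sketch assuming the usual half-open definition of a
    range; integer division throughout):

```python
def get_len_of_range(start, end, step):
    # Number of items in the half-open range [start, end) with the
    # given nonzero step; 0 when the range is empty.
    if step > 0 and start < end:
        return (end - start - 1) // step + 1
    if step < 0 and start > end:
        return (start - end - 1) // (-step) + 1
    return 0

assert get_len_of_range(0, 10, 1) == 10
assert get_len_of_range(1, 10, 2) == 5    # 1, 3, 5, 7, 9
assert get_len_of_range(5, 1, -1) == 4    # 5, 4, 3, 2
```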


Open issues

    - One possible solution to the discrepancy of requiring the `end'
      argument in range literals is to allow the range syntax to
      create a `generator', rather than a list, such as the `xrange'
      builtin function does.  However, a generator would not be a
      list, and it would be impossible, for instance, to assign to
      items in the generator, or append to it.

      The range syntax could conceivably be extended to include tuples
      (i.e. immutable lists), which could then be safely implemented
      as generators.  This may be a desirable solution, especially for
      large number arrays: generators require very little in the way
      of storage and initialization, and there is only a small
      performance impact in calculating and creating the appropriate
      number on request.  (TBD: is there any at all? Cursory testing
      suggests equal performance even in the case of ranges of length
      1)

      However, even if this idea were adopted, would it be wise to
      `special case' the second argument, making it optional in one
      instance of the syntax, and non-optional in other cases?

    - Should it be possible to mix range syntax with normal list
      literals, creating a single list?  E.g.:

          >>> [5, 6, 1:6, 7, 9]

    to create

          [5, 6, 1, 2, 3, 4, 5, 7, 9]

    - How should range literals interact with another proposed new
      feature, `list comprehensions'[2]?  Specifically, should it be
      possible to create lists in list comprehensions?  E.g.:
    
          >>> [x:y for x in (1, 2) for y in (3, 4)]

      Should this example return a single list with multiple ranges:

          [1, 2, 1, 2, 3, 2, 2, 3]

      Or a list of lists, like so:

          [[1, 2], [1, 2, 3], [2], [2, 3]]

      However, as the syntax and semantics of list comprehensions are
      still subject of hot debate, these issues are probably best
      addressed by the `list comprehensions' PEP.

    - Range literals accept objects other than integers: PyInt_AsLong()
      is performed on the objects passed in, so as long as the objects
      can be coerced into integers, they will be accepted.  The
      resulting list, however, is always composed of standard
      integers.

      Should range literals create a list of the passed-in type?  It
      might be desirable in the cases of other builtin types, such as
      longs and strings:

          >>> [ 1L : 2L<<64 : 2<<32L ]    
          >>> ["a":"z":"b"]
          >>> ["a":"z":2]

      However, this might be too much `magic' to be obvious.  It might
      also present problems with user-defined classes: even if the
      base class can be found and a new instance created, the instance
      may require additional arguments to __init__, causing the
      creation to fail.
    
    - The PyList_FromRange() and PyList_GetLenOfRange() functions need
      to be classified: are they part of the API, or should they be
      made private functions?


Rejection

    After careful consideration, and a period of meditation, this
    proposal has been rejected. The open issues, as well as some
    confusion between ranges and slice syntax, raised enough questions
    for Guido not to accept it for Python 2.0, and later to reject the
    proposal altogether. The new syntax and its intentions were deemed
    not obvious enough.

    [ TBD: Guido, amend/confirm this, please. Preferably both; this
      is a PEP, it should contain *all* the reasons for rejection
      and/or reconsideration, for future reference. ]


Copyright

    This document has been placed in the Public Domain.


References:

    [1] http://sourceforge.net/patch/?func=detailpatch&patch_id=100902&group_id=5470
    [2] PEP 202, List Comprehensions



pep-0205 Weak References

PEP: 205
Title: Weak References
Version: $Revision$
Last-Modified: $Date$
Author: Fred L. Drake, Jr. <fdrake at acm.org>
Status: Final
Type: Standards Track
Created: 
Python-Version: 2.1
Post-History: 11-Jan-2001

Motivation

    There are two basic applications for weak references which have
    been noted by Python programmers: object caches and reduction of
    pain from circular references.

    Caches (weak dictionaries)

        There is a need to allow objects to be maintained that represent
        external state, mapping a single instance to the external
        reality, where allowing multiple instances to be mapped to the
        same external resource would create unnecessary difficulty
        maintaining synchronization among instances.  In these cases,
        a common idiom is to support a cache of instances; a factory
        function is used to return either a new or existing instance.

        The difficulty in this approach is that one of two things must
        be tolerated: either the cache grows without bound, or there
        needs to be explicit management of the cache elsewhere in the
        application.  The latter can be very tedious and leads to more
        code than is really necessary to solve the problem at hand,
        and the former can be unacceptable for long-running processes
        or even relatively short processes with substantial memory
        requirements.

        - External objects that need to be represented by a single
          instance, no matter how many internal users there are.  This
          can be useful for representing files that need to be written
          back to disk in whole rather than locked & modified for
          every use.

        - Objects that are expensive to create, but may be needed by
          multiple internal consumers.  Similar to the first case, but
          not necessarily bound to external resources, and possibly
          not an issue for shared state.  Weak references are only
          useful in this case if there is some flavor of "soft"
          references or if there is a high likelihood that users of
          individual objects will overlap in lifespan.

    Circular references

        - DOMs require a huge amount of circular (to parent & document
          nodes) references, but these could be eliminated using a weak
          dictionary mapping from each node to its parent.  This
          might be especially useful in the context of something like
          xml.dom.pulldom, allowing the .unlink() operation to become
          a no-op.

    This proposal is divided into the following sections:

        - Proposed Solution
        - Implementation Strategy
        - Possible Applications
        - Previous Weak Reference Work in Python
        - Weak References in Java

    The full text of one early proposal is included as an appendix
    since it does not appear to be available on the net.


Aspects of the Solution Space

    There are two distinct aspects to the weak references problem:

        - Invalidation of weak references
        - Presentation of weak references to Python code

    Invalidation:

    Past approaches to weak reference invalidation have often hinged
    on storing a strong reference and being able to examine all the
    instances of weak reference objects, and invalidating them when
    the reference count of their referent goes to one (indicating that
    the reference stored by the weak reference is the last remaining
    reference).  This has the advantage that the memory management
    machinery in Python need not change, and that any type can be
    weakly referenced.

    The disadvantage of this approach to invalidation is that it
    assumes that the management of the weak references is called
    sufficiently frequently that weakly-referenced objects are noticed
    within a reasonably short time frame; since this means a scan over
    some data structure to invalidate references, an operation which
    is O(N) on the number of weakly referenced objects, this is not
    effectively amortized for any single object which is weakly
    referenced.  This also assumes that the application is calling
    into code which handles weakly-referenced objects with some
    frequency, which makes weak-references less attractive for library
    code.

    An alternate approach to invalidation is for the de-allocation
    code to be aware of the possibility of weak references and to make
    a specific call into the weak-reference management code to perform
    invalidation whenever an object is deallocated.  This requires a
    change in the tp_dealloc handler for weakly-referencable objects;
    an additional call is needed at the "top" of the handler for
    objects which support weak-referencing, and an efficient way to
    map from an object to a chain of weak references for that object
    is needed as well.
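    The dealloc-hook idea can be illustrated with a toy Python model
    (all names invented here; the real mechanism lives in tp_dealloc at
    the C level):

```python
class Handle:
    # Stand-in for a weak reference structure on the per-object chain.
    def __init__(self, referent):
        self.referent = referent

class Tracked:
    def __init__(self):
        self._handles = []        # chain of weak handles on this object

    def make_handle(self):
        h = Handle(self)
        self._handles.append(h)
        return h

    def dealloc(self):
        # The extra call at the "top" of the dealloc handler: walk the
        # chain and invalidate every outstanding handle.
        for h in self._handles:
            h.referent = None
        del self._handles[:]

obj = Tracked()
h = obj.make_handle()
assert h.referent is obj
obj.dealloc()
assert h.referent is None   # cost is proportional to this object's handles
```

    Unlike the scan-based scheme, the work done here is proportional to
    the number of handles on the dying object alone, not to the number
    of weakly referenced objects in the whole program.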

    Presentation:

    Two ways that weak references are presented to the Python layer
    have been as explicit reference objects upon which some operation
    is required in order to retrieve a usable reference to the
    underlying object, and proxy objects which masquerade as the
    original objects as much as possible.

    Reference objects are easy to work with when some additional layer
    of object management is being added in Python; references can be
    checked for liveness explicitly, without having to invoke
    operations on the referents and catching some special exception
    raised when an invalid weak reference is used.

    However, a number of users favor the proxy approach simply because
    the weak reference looks so much like the original object.


Proposed Solution

    Weak references should be able to point to any Python object that
    may have substantial memory size (directly or indirectly), or hold
    references to external resources (database connections, open
    files, etc.).

    A new module, weakref, will contain new functions used to create
    weak references.  weakref.ref() will create a "weak reference
    object" and optionally attach a callback which will be called when
    the object is about to be finalized.  weakref.mapping() will
    create a "weak dictionary".  A third function, weakref.proxy(),
    will create a proxy object that behaves somewhat like the original
    object.

    A weak reference object will allow access to the referenced object
    if it hasn't been collected and to determine if the object still
    exists in memory.  Retrieving the referent is done by calling the
    reference object.  If the referent is no longer alive, this will
    return None instead.
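    The weakref module as it eventually shipped matches this part of
    the proposal; a minimal demonstration (gc.collect() is called
    explicitly, since only CPython reclaims objects immediately by
    reference count):

```python
import gc
import weakref

class Resource:
    pass

obj = Resource()
r = weakref.ref(obj)        # a weak reference object
assert r() is obj           # calling it retrieves the live referent
obj = None                  # drop the last strong reference
gc.collect()
assert r() is None          # a dead reference returns None
```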

    A weak dictionary maps arbitrary keys to values, but does not own
    a reference to the values.  When a value is finalized, the (key,
    value) pairs of which it is the value are removed from all the
    mappings containing such pairs.  Like dictionaries, weak
    dictionaries are not hashable.
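    The proposed weakref.mapping() ultimately shipped as
    weakref.WeakValueDictionary; a sketch of the behaviour described:

```python
import gc
import weakref

class Value:
    pass

cache = weakref.WeakValueDictionary()
v = Value()
cache['key'] = v            # the mapping does not own a reference to v
assert cache['key'] is v
v = None                    # finalize the value
gc.collect()
assert 'key' not in cache   # the (key, value) pair was removed
```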

    Proxy objects are weak references that attempt to behave like the
    object they proxy, as much as they can.  Regardless of the
    underlying type, proxies are not hashable since their ability to
    act as a weak reference relies on a fundamental mutability that
    will cause failures when used as dictionary keys -- even if the
    proper hash value is computed before the referent dies, the
    resulting proxy cannot be used as a dictionary key since it cannot
    be compared once the referent has expired, and comparability is
    necessary for dictionary keys.  Operations on proxy objects after
    the referent dies cause weakref.ReferenceError to be raised in
    most cases.  "is" comparisons, type(), and id() will continue to
    work, but always refer to the proxy and not the referent.
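    A short sketch of proxy behaviour as it shipped (ReferenceError
    later became a builtin exception, with weakref.ReferenceError kept
    as an alias):

```python
import gc
import weakref

class Point:
    x = 3

p = Point()
prox = weakref.proxy(p)
assert prox.x == 3                 # operations forwarded to the referent
assert type(prox) is not Point     # but type() reports the proxy itself
p = None
gc.collect()
raised = False
try:
    prox.x                         # any operation after the referent dies
except ReferenceError:
    raised = True
assert raised
```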

    The callbacks registered with weak references must accept a single
    parameter, which will be the weak reference or proxy object
    itself.  The object cannot be accessed or resurrected in the
    callback.


Implementation Strategy

    The implementation of weak references will include a list of
    reference containers that must be cleared for each weakly-
    referencable object.  If the reference is from a weak dictionary,
    the dictionary entry is cleared first.  Then, any associated
    callback is called with the object passed as a parameter.  Once
    all callbacks have been called, the object is finalized and
    deallocated.

    Many built-in types will participate in the weak-reference
    management, and any extension type can elect to do so.  The type
    structure will contain an additional field which provides an
    offset into the instance structure which contains a list of weak
    reference structures.  If the value of the field is <= 0, the
    object does not participate.  In this case, weakref.ref(),
    <weakdict>.__setitem__() and .setdefault(), and item assignment will
    raise TypeError.  If the value of the field is > 0, a new weak
    reference can be generated and added to the list.

    This approach is taken to allow arbitrary extension types to
    participate, without taking a memory hit for numbers or other
    small types.

    Standard types which support weak references include instances,
    functions, and bound & unbound methods.  With the addition of
    class types ("new-style classes") in Python 2.2, types grew
    support for weak references.  Instances of class types are weakly
    referencable if they have a base type which is weakly referencable,
    if the class does not specify __slots__, or if a slot is named
    __weakref__.
    Generators also support weak references.
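    The __slots__ rule can be observed from Python code (a sketch; the
    underlying mechanism is the weak-reference offset stored in the
    type structure):

```python
import weakref

class NoWeak:
    __slots__ = ()                  # no __weakref__ slot is allocated

class YesWeak:
    __slots__ = ('__weakref__',)    # explicitly request the slot

blocked = False
try:
    weakref.ref(NoWeak())           # type does not participate
except TypeError:
    blocked = True
assert blocked

y = YesWeak()
assert weakref.ref(y)() is y        # participates once the slot exists
```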


Possible Applications

    PyGTK+ bindings?

    Tkinter -- could avoid circular references by using weak
    references from widgets to their parents.  Objects won't be
    discarded any sooner in the typical case, but there won't be so
    much dependence on the programmer calling .destroy() before
    releasing a reference.  This would mostly benefit long-running
    applications.

    DOM trees.


Previous Weak Reference Work in Python

    Dianne Hackborn has proposed something called "virtual references".
    'vref' objects are very similar to java.lang.ref.WeakReference
    objects, except there is no equivalent to the invalidation
    queues.  Implementing a "weak dictionary" would be just as
    difficult as using only weak references (without the invalidation
    queue) in Java.  Information on this has disappeared from the Web,
    but is included below as an Appendix.

    Marc-AndrĂŠ Lemburg's mx.Proxy package:

        http://www.lemburg.com/files/python/mxProxy.html

    The weakdict module by Dieter Maurer is implemented in C and
    Python.  It appears that the Web pages have not been updated since
    Python 1.5.2a, so I'm not yet sure if the implementation is
    compatible with Python 2.0.

        http://www.handshake.de/~dieter/weakdict.html

    PyWeakReference by Alex Shindich:

        http://sourceforge.net/projects/pyweakreference/

    Eric Tiedemann has a weak dictionary implementation:

        http://www.hyperreal.org/~est/python/weak/


Weak References in Java

    http://java.sun.com/j2se/1.3/docs/api/java/lang/ref/package-summary.html

    Java provides three forms of weak references, and one interesting
    helper class.  The three forms are called "weak", "soft", and
    "phantom" references.  The relevant classes are defined in the
    java.lang.ref package.

    For each of the reference types, there is an option to add the
    reference to a queue when it is invalidated by the memory
    allocator.  The primary purpose of this facility seems to be that
    it allows larger structures to be composed to incorporate
    weak-reference semantics without having to impose substantial
    additional locking requirements.  For instance, it would not be
    difficult to use this facility to create a "weak" hash table which
    removes keys and referents when a reference is no longer used
    elsewhere.  Using weak references for the objects without some
    sort of notification queue for invalidations leads to much more
    tedious implementation of the various operations required on hash
    tables.  This can be a performance bottleneck if deallocations of
    the stored objects are infrequent.

    Java's "weak" references are most like Dianne Hackborn's old vref
    proposal: a reference object refers to a single Python object,
    but does not own a reference to that object.  When that object is
    deallocated, the reference object is invalidated.  Users of the
    reference object can easily determine that the reference has been
    invalidated, or a NullObjectDereferenceError can be raised when
    an attempt is made to use the referred-to object.

    The "soft" references are similar, but are not invalidated as soon
    as all other references to the referred-to object have been
    released.  The "soft" reference does own a reference, but allows
    the memory allocator to free the referent if the memory is needed
    elsewhere.  It is not clear whether this means soft references are
    released before the malloc() implementation calls sbrk() or its
    equivalent, or if soft references are only cleared when malloc()
    returns NULL.

    "Phantom" references are a little different; unlike weak and soft
    references, the referent is not cleared when the reference is
    added to its queue.  When all phantom references for an object
    are dequeued, the object is cleared.  This can be used to keep an
    object alive until some additional cleanup is performed which
    needs to happen before the object's finalize() method is called.

    Unlike the other two reference types, "phantom" references must be
    associated with an invalidation queue.


Appendix -- Dianne Hackborn's vref proposal (1995)

    [This has been indented and paragraphs reflowed, but there have
    been no content changes.  --Fred]

    Proposal: Virtual References

    In an attempt to partly address the recurring discussion
    concerning reference counting vs. garbage collection, I would like
    to propose an extension to Python which should help in the
    creation of "well structured" cyclic graphs.  In particular, it
    should allow at least trees with parent back-pointers and
    doubly-linked lists to be created without worry about cycles.

    The basic mechanism I'd like to propose is that of a "virtual
    reference," or a "vref" from here on out.  A vref is essentially a
    handle on an object that does not increment the object's reference
    count.  This means that holding a vref on an object will not keep
    the object from being destroyed.  This would allow the Python
    programmer, for example, to create the aforementioned tree
    structure, which is automatically destroyed when it
    is no longer in use -- by making all of the parent back-references
    into vrefs, they no longer create reference cycles which keep the
    tree from being destroyed.

    In order to implement this mechanism, the Python core must ensure
    that no -real- pointers are ever left referencing objects that no
    longer exist.  The implementation I would like to propose involves
    two basic additions to the current Python system:

    1. A new "vref" type, through which the Python programmer creates
       and manipulates virtual references.  Internally, it is
       basically a C-level Python object with a pointer to the Python
       object it is a reference to.  Unlike all other Python code,
       however, it does not change the reference count of this object.
       In addition, it includes two pointers to implement a
       doubly-linked list, which is used below.

    2. The addition of a new field to the basic Python object
       [PyObject_Head in object.h], which is either NULL, or points to
       the head of a list of all vref objects that reference it.  When
       a vref object attaches itself to another object, it adds itself
       to this linked list.  Then, if an object with any vrefs on it
       is deallocated, it may walk this list and ensure that all of
       the vrefs on it point to some safe value, e.g. Nothing.


    This implementation should hopefully have a minimal impact on the
    current Python core -- when no vrefs exist, it should only add one
    pointer to all objects, and a check for a NULL pointer every time
    an object is deallocated.

    Back at the Python language level, I have considered two possible
    semantics for the vref object --

    ==> Pointer semantics:

      In this model, a vref behaves essentially like a Python-level
      pointer; the Python program must explicitly dereference the vref
      to manipulate the actual object it references.

      An example vref module using this model could include the
      function "new"; When used as 'MyVref = vref.new(MyObject)', it
      returns a new vref object such that that MyVref.object ==
      MyObject.  MyVref.object would then change to Nothing if
      MyObject is ever deallocated.

      For a concrete example, we may introduce some new C-style syntax:

      & -- unary operator, creates a vref on an object, same as vref.new().
      * -- unary operator, dereference a vref, same as VrefObject.object.

      We can then define:

      1.     type(&MyObject) == vref.VrefType
      2.        *(&MyObject) == MyObject
      3. (*(&MyObject)).attr == MyObject.attr
      4.          &&MyObject == Nothing
      5.           *MyObject -> exception

      Rule #4 is subtle, but comes about because we have made a vref
      to (a vref with no real references).  Thus the outer vref is
      cleared to Nothing when the inner one inevitably disappears.

    ==> Proxy semantics:

      In this model, the Python programmer manipulates vref objects
      just as if she were manipulating the object it is a reference
      of.  This is accomplished by implementing the vref so that all
      operations on it are redirected to its referenced object.  With
      this model, the dereference operator (*) no longer makes sense;
      instead, we have only the reference operator (&), and define:

      1.  type(&MyObject) == type(MyObject)
      2.        &MyObject == MyObject
      3. (&MyObject).attr == MyObject.attr
      4.       &&MyObject == MyObject

      Again, rule #4 is important -- here, the outer vref is in fact a
      reference to the original object, and -not- the inner vref.
      This is because all operations applied to a vref actually apply
      to its object, so that creating a vref of a vref actually
      results in creating a vref of the latter's object.
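      The weakref.proxy type that later entered the standard library
      implements proxy semantics almost exactly as described, except
      that a dead proxy raises ReferenceError instead of turning into
      Nothing, and type() is not forwarded.  A small sketch:

```python
import weakref

class Point:
    def __init__(self, x):
        self.x = x

p = Point(42)
prox = weakref.proxy(p)         # plays the role of &p under proxy semantics
assert prox.x == 42             # operations forward to the referent (rule #3)
assert type(prox) is not Point  # unlike rule #1, type() is not transparent

del p                           # the referent goes away
dead = False
try:
    prox.x                      # a dead proxy raises, rather than yielding Nothing
except ReferenceError:
    dead = True
assert dead
```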

    The first, pointer semantics, has the advantage that it would be
    very easy to implement; the vref type is extremely simple,
    requiring at minimum a single attribute, object, and a function to
    create a reference.

    However, I really like the proxy semantics.  Not only does it put
    less of a burden on the Python programmer, but it allows you to do
    nice things like use a vref anywhere you would use the actual
    object.  Unfortunately, it would probably be an extreme pain, if
    not practically impossible, to implement in the current Python
    implementation.  I do have some thoughts, though, on how to do
    this, if it seems interesting; one possibility is to introduce new
    type-checking functions which handle the vref.  This would
    hopefully let older C modules which don't expect vrefs simply
    return a type error, until they can be fixed.

    Finally, there are some additional capabilities that this
    system could provide.  One that seems particularly interesting to
    me involves allowing the Python programmer to add a "destructor"
    function to a vref -- this Python function would be called
    immediately prior to the referenced object being deallocated,
    allowing a Python program to invisibly attach itself to another
    object and watch for it to disappear.  This seems neat, though I
    haven't actually come up with any practical uses for it, yet... :)
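    The "destructor" idea also survives in modern Python:
    weakref.ref accepts a callback that fires just after the referent
    is finalized.  A sketch of watching an object disappear:

```python
import gc
import weakref

events = []

class Resource:
    pass

r = Resource()
# The callback runs right after the referent is finalized, letting
# unrelated code watch for the object's disappearance.
ref = weakref.ref(r, lambda wr: events.append("gone"))

assert events == []
del r
gc.collect()
assert events == ["gone"]
```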

    -- Dianne


Copyright

    This document has been placed in the public domain.



pep-0206 Python Advanced Library

PEP: 206
Title: Python Advanced Library
Version: $Revision$
Last-Modified: $Date$
Author: A.M. Kuchling <amk at amk.ca>
Status: Withdrawn
Type: Informational
Created: 
Post-History: 

Introduction

    This PEP describes the Python Advanced Library, a collection of
    high-quality and frequently-used third party extension modules.

Batteries Included Philosophy

    The Python source distribution has long maintained the philosophy
    of "batteries included" -- having a rich and versatile standard
    library which is immediately available, without making the user
    download separate packages.  This gives the Python language a head
    start in many projects.

    However, the standard library modules aren't always the best
    choices for a job.  Some library modules were quick hacks
    (e.g. calendar, commands), some were designed poorly and are now
    near-impossible to fix (cgi), and some have been rendered obsolete
    by other, more complete modules (binascii offers the same features
    as the binhex, uu, base64 modules).  This PEP describes a list of
    third-party modules that make Python more competitive for various
    application domains, forming the Python Advanced Library.

    The deliverable is a set of scripts that will retrieve, build, and
    install the packages for a particular application domain.  The
    Python Package Index now contains enough information to let
    software automatically find packages and download them, so the
    time is ripe to implement this.
    
    Currently this document doesn't suggest *removing* modules from
    the standard library that are superseded by a third-party module.
    That's difficult to do because it entails many backward-compatibility 
    problems, so it's not worth bothering with now.

    Please suggest additional domains of interest.


Domain: Web tasks

    XML parsing: ElementTree + SAX.

    URL retrieval: libcurl? other possibilities?

    HTML parsing: mxTidy? HTMLParser?

    Async network I/O: Twisted

    RDF parser: ???

    HTTP serving: ???

    HTTP cookie processing: ???

    Web framework: A WSGI gateway, perhaps?  Paste?

    Graphics: PIL, Chaco.


Domain: Scientific Programming

    Numeric: Numeric, SciPy

    Graphics: PIL, Chaco.


Domain: Application Development

    GUI toolkit: ???

    Graphics: Reportlab for PDF generation.


Domain: Education

    Graphics: PyGame


Software covered by the GNU General Public License

    Some of these third-party modules are covered by the GNU General
    Public License and the GNU Lesser General Public License.
    Providing a script to download and install such packages, or even
    assembling all these packages into a single tarball or CD-ROM,
    shouldn't cause any difficulties with the GPL, under the "mere
    aggregation" clause of the license.
   

Open Issues

    What other application domains are important?

    Should this just be a set of Ubuntu or Debian packages?  Compiling
    things such as PyGame can be very complicated and may be too
    difficult to automate.


Acknowledgements

    The PEP is based on an earlier draft PEP by Moshe Zadka, titled
    "2.0 Batteries Included."


pep-0207 Rich Comparisons

PEP: 207
Title: Rich Comparisons
Version: $Revision$
Last-Modified: $Date$
Author: Guido van Rossum <guido at python.org>, David Ascher <DavidA at ActiveState.com>
Status: Final
Type: Standards Track
Created: 
Python-Version: 2.1
Post-History: 

Abstract

    This PEP proposes several new features for comparisons:

    - Allow separate overloading of <, >, <=, >=, ==, !=, both in
      classes and in C extensions.

    - Allow any of those overloaded operators to return something else
      besides a Boolean result.


Motivation

    The main motivation comes from NumPy, whose users agree that A<B
    should return an array of elementwise comparison outcomes; they
    currently have to spell this as less(A,B) because A<B can only
    return a Boolean result or raise an exception.

    An additional motivation is that frequently, types don't have a
    natural ordering, but still need to be compared for equality.
    Currently such a type *must* implement comparison and thus define
    an arbitrary ordering, just so that equality can be tested.

    Also, for some object types an equality test can be implemented
    much more efficiently than an ordering test; for example, lists
    and dictionaries that differ in length are unequal, but the
    ordering requires inspecting some (potentially all) items.


Previous Work

    Rich Comparisons have been proposed before; in particular by David
    Ascher, after experience with Numerical Python:

      http://starship.python.net/crew/da/proposals/richcmp.html

    It is also included below as an Appendix.  Most of the material in
    this PEP is derived from David's proposal.


Concerns

    1 Backwards compatibility, both at the Python level (classes using
      __cmp__ need not be changed) and at the C level (extensions
      defining tp_compare need not be changed, code using
      PyObject_Compare() must work even if the compared objects use
      the new rich comparison scheme).

    2 When A<B returns a matrix of elementwise comparisons, an easy
      mistake to make is to use this expression in a Boolean context.
      Without special precautions, it would always be true.  This use
      should raise an exception instead.

    3 If a class overrides x==y but nothing else, should x!=y be
      computed as not(x==y), or fail?  What about the similar
      relationship between < and >=, or between > and <=?

    4 Similarly, should we allow x<y to be calculated from y>x?  And
      x<=y from not(x>y)?  And x==y from y==x, or x!=y from y!=x?

    5 When comparison operators return elementwise comparisons, what
      to do about shortcut operators like A<B<C, "A<B and C<D",
      "A<B or C<D"?

    6 What to do about min() and max(), the 'in' and 'not in'
      operators, list.sort(), dictionary key comparison, and other
      uses of comparisons by built-in operations?


Proposed Resolutions

    1 Full backwards compatibility can be achieved as follows.  When
      an object defines tp_compare() but not tp_richcompare(), and a
      rich comparison is requested, the outcome of tp_compare() is
      used in the obvious way.  E.g. if "<" is requested, an exception
      is raised if tp_compare() raises an exception, the outcome is 1
      if tp_compare() is negative, and 0 if it is zero or positive.
      Etc.

      Full forward compatibility can be achieved as follows.  When a
      classic comparison is requested on an object that implements
      tp_richcompare(), up to three comparisons are used: first == is
      tried, and if it returns true, 0 is returned; next, < is tried
      and if it returns true, -1 is returned; next, > is tried and if
      it returns true, +1 is returned.  If any operator tried returns
      a non-Boolean value (see below), the exception raised by
      conversion to Boolean is passed through.  If none of the
      operators tried returns true, the classic comparison fallbacks
      are tried next.

      (I thought long and hard about the order in which the three
      comparisons should be tried.  At one point I had a convincing
      argument for doing it in this order, based on the behavior of
      comparisons for cyclical data structures.  But since that code
      has changed again, I'm not so sure that it makes a difference
      any more.)
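      The classic-from-rich fallback above can be sketched in Python
      (a toy model of the C logic; classic_cmp is a name invented
      here):

```python
def classic_cmp(x, y):
    """Toy model of deriving a classic -1/0/+1 comparison from rich
    comparisons, in the order the PEP specifies: ==, then <, then >."""
    if x == y:
        return 0
    if x < y:
        return -1
    if x > y:
        return 1
    raise TypeError("objects are unorderable and unequal")

assert classic_cmp(1, 1) == 0
assert classic_cmp(1, 2) == -1
assert classic_cmp(3, 2) == 1
```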

    2 Any type that returns a collection of Booleans instead of a
      single Boolean should define nb_nonzero() to raise an exception.
      Such a type is considered a non-Boolean.

    3 The == and != operators are not assumed to be each other's
      complement (e.g. IEEE 754 floating point numbers do not satisfy
      this).  It is up to the type to implement this if desired.
      Similar for < and >=, or > and <=; there are lots of examples
      where these assumptions aren't true (e.g. tabnanny).

    4 The reflexivity rules *are* assumed by Python.  Thus, the
      interpreter may swap y>x with x<y, y>=x with x<=y, and may swap
      the arguments of x==y and x!=y.  (Note: Python currently assumes
      that x==x is always true and x!=x is never true; this should not
      be assumed.)

    5 In the current proposal, when A<B returns an array of
      elementwise comparisons, this outcome is considered non-Boolean,
      and its interpretation as Boolean by the shortcut operators
      raises an exception.  David Ascher's proposal tries to deal
      with this; I don't think this is worth the additional complexity
      in the code generator.  Instead of A<B<C, you can write
      (A<B)&(B<C).

    6 The min() and list.sort() operations will only use the
      < operator; max() will only use the > operator.  The 'in' and
      'not in' operators and dictionary lookup will only use the ==
      operator.
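    Resolutions 2 and 5 can be illustrated with a toy elementwise
    type (Vec is invented here; a real array type would behave
    analogously):

```python
class Vec:
    """Toy sequence whose comparisons are elementwise.  Truth-testing
    a comparison result raises (resolution 2), so chained comparisons
    like A<B<C fail loudly instead of silently being true."""
    def __init__(self, items):
        self.items = list(items)
    def __lt__(self, other):
        return Vec(int(a < b) for a, b in zip(self.items, other.items))
    def __and__(self, other):
        return Vec(int(a and b) for a, b in zip(self.items, other.items))
    def __bool__(self):
        raise TypeError("elementwise comparison has no single truth value")

A, B, C = Vec([1, 5]), Vec([2, 4]), Vec([3, 9])
mask = (A < B) & (B < C)     # the spelling recommended in resolution 5
assert mask.items == [1, 0]  # only the first element satisfies A < B < C
```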


Implementation Proposal

    This closely follows David Ascher's proposal.

    C API

    - New functions:

      PyObject *PyObject_RichCompare(PyObject *, PyObject *, int)

      This performs the requested rich comparison, returning a Python
      object or raising an exception.  The 3rd argument must be one of
      Py_LT, Py_LE, Py_EQ, Py_NE, Py_GT or Py_GE.

      int PyObject_RichCompareBool(PyObject *, PyObject *, int)

      This performs the requested rich comparison, returning a
      Boolean: -1 for exception, 0 for false, 1 for true.  The 3rd
      argument must be one of Py_LT, Py_LE, Py_EQ, Py_NE, Py_GT or
      Py_GE.  Note that when PyObject_RichCompare() returns a
      non-Boolean object, PyObject_RichCompareBool() will raise an
      exception.

    - New typedef:

      typedef PyObject *(*richcmpfunc) (PyObject *, PyObject *, int);

    - New slot in type object, replacing spare tp_xxx7:

      richcmpfunc tp_richcompare;

      This should be a function with the same signature as
      PyObject_RichCompare(), and performing the same comparison.
      At least one of the arguments is of the type whose
      tp_richcompare slot is being used, but the other may have a
      different type.  If the function cannot compare the particular
      combination of objects, it should return a new reference to
      Py_NotImplemented.

    - PyObject_Compare() is changed to try rich comparisons if they
      are defined (but only if classic comparisons aren't defined).

    Changes to the interpreter

    - Whenever PyObject_Compare() is called with the intent of getting
      the outcome of a particular comparison (e.g. in list.sort(), and
      of course for the comparison operators in ceval.c), the code is
      changed to call PyObject_RichCompare() or
      PyObject_RichCompareBool() instead; if the C code needs to know
      the outcome of the comparison, PyObject_IsTrue() is called on
      the result (which may raise an exception).

    - Most built-in types that currently define a comparison will be
      modified to define a rich comparison instead.  (This is
      optional; I've converted lists, tuples, complex numbers, and
      arrays so far, and am not sure whether I will convert others.)

    Classes

    - Classes can define new special methods __lt__, __le__, __eq__,
      __ne__,__gt__, __ge__ to override the corresponding operators.
      (I.e., <, <=, ==, !=, >, >=. You gotta love the Fortran
      heritage.)  If a class defines __cmp__ as well, it is only used
      when __lt__ etc. have been tried and return NotImplemented.
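      In modern Python these six methods are exactly the ones that
      survived; a minimal sketch (the Money class is invented here),
      returning NotImplemented for foreign types so the interpreter
      can try the other operand:

```python
class Money:
    """Defines only __eq__ and __lt__; returns NotImplemented for
    foreign types so Python can try the reflected operation."""
    def __init__(self, cents):
        self.cents = cents
    def __eq__(self, other):
        if not isinstance(other, Money):
            return NotImplemented
        return self.cents == other.cents
    def __lt__(self, other):
        if not isinstance(other, Money):
            return NotImplemented
        return self.cents < other.cents

assert Money(100) == Money(100)
assert Money(50) < Money(100)
assert Money(1) != "1 cent"   # both sides decline ==; falls back to identity
```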


Copyright

    This document has been placed in the public domain.


Appendix

    Here is most of David Ascher's original proposal (version 0.2.1,
    dated Wed Jul 22 16:49:28 1998; I've left the Contents, History
    and Patches sections out).  It addresses almost all concerns
    above.


Abstract

    A new mechanism allowing comparisons of Python objects to return
    values other than -1, 0, or 1 (or raise exceptions) is
    proposed. This mechanism is entirely backwards compatible, and can
    be controlled at the level of the C PyObject type or of the Python
    class definition. There are three cooperating parts to the
    proposed mechanism:

    - the use of the last slot in the type object structure to store a
    pointer to a rich comparison function

    - the addition of special methods for classes

    - the addition of an optional argument to the builtin cmp()
    function.


Motivation

    The current comparison protocol for Python objects assumes that
    any two Python objects can be compared (as of Python 1.5, object
    comparisons can raise exceptions), and that the return value for
    any comparison should be -1, 0 or 1. -1 indicates that the first
    argument to the comparison function is less than the second one,
    +1 indicates the opposite, and 0 indicates that the two objects
    are equal. While this mechanism allows the establishment of an
    order relationship (e.g. for use by the sort() method of list
    objects), it has proven to be limited in the context of Numeric
    Python (NumPy).

    Specifically, NumPy allows the creation of multidimensional
    arrays, which support most of the numeric operators. Thus:

             x = array((1,2,3,4))        y = array((2,2,4,4))

    are two NumPy arrays. While they can be added elementwise:

             z = x + y   # z == array((3,4,7,8))

    they cannot be compared in the current framework -- the released
    version of NumPy compares the pointers (thus yielding junk
    information), which was the only solution before the recent
    addition of the ability (in 1.5) to raise exceptions in comparison
    functions.

    Even with the ability to raise exceptions, the current protocol
    makes array comparisons useless. To deal with this fact, NumPy
    includes several functions which perform the comparisons: less(),
    less_equal(), greater(), greater_equal(), equal(),
    not_equal(). These functions return arrays with the same shape as
    their arguments (modulo broadcasting), filled with 0's and 1's
    depending on whether the comparison is true or not for each
    element pair. Thus, for example, using the arrays x and y defined
    above:

             less(x,y) 

    would be an array containing the numbers (1,0,0,0).

    The current proposal is to modify the Python object interface to
    allow the NumPy package to make it so that x < y returns the same
    thing as less(x,y). The exact return value is up to the NumPy
    package -- what this proposal really asks for is changing the
    Python core so that extension objects have the ability to return
    something other than -1, 0, 1, should their authors choose to do
    so.

Current State of Affairs

    The current protocol is, at the C level, that each object type
    defines a tp_compare slot, which is a pointer to a function which
    takes two PyObject* references and returns -1, 0, or 1. This
    function is called by the PyObject_Compare() function defined in
    the C API. PyObject_Compare() is also called by the builtin
    function cmp() which takes two arguments.

Proposed Mechanism

    1. Changes to the C structure for type objects

    The last available slot in the PyTypeObject, reserved up to now
    for future expansion, is used to optionally store a pointer to a
    new comparison function, of type richcmpfunc defined by:

           typedef PyObject *(*richcmpfunc)
                Py_PROTO((PyObject *, PyObject *, int));

    This function takes three arguments. The first two are the objects
    to be compared, and the third is an integer corresponding to an
    opcode (one of LT, LE, EQ, NE, GT, GE). If this slot is left NULL,
    then rich comparison for that object type is not supported (except
    for class instances whose class provide the special methods
    described below).

    The above opcodes need to be added to the published Python/C API
    (probably under the names Py_LT, Py_LE, etc.)

    2. Additions of special methods for classes

    Classes wishing to support the rich comparison mechanisms must add
    one or more of the following new special methods:

             def __lt__(self, other):
                ...
             def __le__(self, other):
                ...
             def __gt__(self, other):
                ...
             def __ge__(self, other):
                ...
             def __eq__(self, other):
                ...
             def __ne__(self, other):
                ...

    Each of these is called when the class instance is on the
    left-hand side of the corresponding operators (<, <=, >, >=, ==,
    and != or <>). The argument other is set to the object on the
    right side of the operator. The return value of these methods is
    up to the class implementor (after all, that's the entire point of
    the proposal).

    If the object on the left side of the operator does not define an
    appropriate rich comparison operator (either at the C level or
    with one of the special methods), then the comparison is reversed:
    the right-hand operand's method is called with the opposite
    operator, and the two objects are swapped. This assumes that a < b
    and b > a are equivalent, as are a <= b and b >= a, and that ==
    and != are commutative (e.g. a == b if and only if b == a).

    For example, if obj1 is an object which supports the rich
    comparison protocol and x and y are objects which do not support
    the rich comparison protocol, then obj1 < x will call the __lt__
    method of obj1 with x as the second argument. x < obj1 will call
    obj1's __gt__ method with x as a second argument, and x < y will
    just use the existing (non-rich) comparison mechanism.
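    The same reflection rule survives in modern Python; a small
    demonstration (Rich is a name invented here):

```python
class Rich:
    """Supports rich comparison; plain ints know nothing about Rich."""
    def __init__(self, v):
        self.v = v
        self.calls = []
    def __lt__(self, other):
        self.calls.append("__lt__")
        return self.v < other
    def __gt__(self, other):
        self.calls.append("__gt__")
        return self.v > other

obj1 = Rich(10)
assert (obj1 < 20) is True   # calls obj1.__lt__(20)
assert (5 < obj1) is True    # int declines; reflected to obj1.__gt__(5)
assert obj1.calls == ["__lt__", "__gt__"]
```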

    The above mechanism is such that classes can get away with not
    implementing either __lt__ and __le__ or __gt__ and
    __ge__. Further smarts could have been added to the comparison
    mechanism, but this limited set of allowed "swaps" was chosen
    because it doesn't require the infrastructure to do any processing
    (negation) of return values. The choice of six special methods was
    made over a single (e.g. __richcmp__) method to allow the
    dispatching on the opcode to be performed at the level of the C
    implementation rather than the user-defined method.

    3. Addition of an optional argument to the builtin cmp()

    The builtin cmp() is still used for simple comparisons. For rich
    comparisons, it is called with a third argument, one of "<", "<=",
    ">", ">=", "==", "!=", "<>" (the last two have the same
    meaning). When called with one of these strings as the third
    argument, cmp() can return any Python object. Otherwise, it can
    only return -1, 0 or 1 as before.

Chained Comparisons

    Problem

    It would be nice to allow objects for which the comparison returns
    something other than -1, 0, or 1 to be used in chained
    comparisons, such as:

             x < y < z

    Currently, this is interpreted by Python as:

             temp1 = x < y
             if temp1:
               return y < z
             else:
               return temp1     

    Note that this requires testing the truth value of the result of
    comparisons, with potential "shortcutting" of the right-side
    comparison testing. In other words, the truth value of the
    result of the comparison determines the result of a chained
    operation. This is problematic in the case of arrays, since if x,
    y and z are three arrays, then the user expects:

            x < y < z

    to be an array of 0's and 1's where 1's are in the locations
    corresponding to the elements of y which are between the
    corresponding elements in x and z. In other words, the right-hand
    side must be evaluated regardless of the result of x < y, which is
    incompatible with the mechanism currently in use by the parser.

    Solution

    Guido mentioned that one possible way out would be to change the
    code generated by chained comparisons to allow arrays to be
    chained-compared intelligently. What follows is a mixture of his
    idea and my suggestions. The code generated for x < y < z would be
    equivalent to:

             temp1 = x < y
             if temp1:
               temp2 = y < z
               return boolean_combine(temp1, temp2)
             else:
               return temp1     

    where boolean_combine is a new function which does something like
    the following:

             def boolean_combine(a, b):
                 if hasattr(a, '__boolean_and__') or \
                    hasattr(b, '__boolean_and__'):
                     try:
                         return a.__boolean_and__(b)
                     except:
                         return b.__boolean_and__(a)
                 else: # standard behavior
                     if a:
                         return b
                     else: 
                         return 0

    where the __boolean_and__ special method is implemented for
    C-level types by another value of the third argument to the
    richcmp function. This method would perform a boolean comparison
    of the arrays (currently implemented in the umath module as the
    logical_and ufunc).

    Thus, objects returned by rich comparisons should always test
    true, but should define another special method which creates
    boolean combinations of them and their argument.

    This solution has the advantage of allowing chained comparisons to
    work for arrays, but the disadvantage that it requires comparison
    arrays to always return true (in an ideal world, I'd have them
    always raise an exception on truth testing, since the meaning of
    testing "if a>b:" is massively ambiguous).

    The inlining already present which deals with integer comparisons
    would still apply, resulting in no performance cost for the most
    common cases.


pep-0208 Reworking the Coercion Model

PEP: 208
Title: Reworking the Coercion Model
Version: $Revision$
Last-Modified: $Date$
Author: Neil Schemenauer <nas at arctrix.com>, Marc-AndrĂŠ Lemburg <mal at lemburg.com>
Status: Final
Type: Standards Track
Created: 04-Dec-2000
Python-Version: 2.1
Post-History: 

Abstract

    Many Python types implement numeric operations.  When the arguments of
    a numeric operation are of different types, the interpreter tries to
    coerce the arguments into a common type.  The numeric operation is
    then performed using this common type.  This PEP proposes a new type
    flag to indicate that arguments to a type's numeric operations should
    not be coerced.  Operations that do not support the supplied types
    indicate it by returning a new singleton object.  Types which do not
    set the type flag are handled in a backwards compatible manner.
    Allowing operations to handle different types is often simpler,
    more flexible, and faster than having the interpreter do coercion.


Rationale

    When implementing numeric or other related operations, it is often
    desirable to provide not only operations between operands of one type
    only, e.g. integer + integer, but to generalize the idea behind the
    operation to other type combinations as well, e.g. integer + float.

    A common approach to this mixed type situation is to provide a method
    of "lifting" the operands to a common type (coercion) and then use
    that type's operand method as execution mechanism.  Yet, this strategy
    has a few drawbacks:

        * the "lifting" process creates at least one new (temporary)
          operand object,

        * since the coercion method is not being told about the operation
          that is to follow, it is not possible to implement operation
          specific coercion of types,

        * there is no elegant way to solve situations where a common
          type is not at hand, and

        * the coercion method will always have to be called prior to the
          operation's method itself.

    A fix for this situation is obviously needed, since these drawbacks
    make implementations of types needing these features very cumbersome,
    if not impossible.  As an example, have a look at the DateTime and
    DateTimeDelta[1] types, the first being absolute, the second
    relative.  You can always add a relative value to an absolute one,
    giving a new absolute value.  Yet, there is no common type which the
    existing coercion mechanism could use to implement that operation.

    Currently, PyInstance types are treated specially by the interpreter
    in that their numeric methods are passed arguments of different types.
    Removing this special case simplifies the interpreter and allows other
    types to implement numeric methods that behave like instance types.
    This is especially useful for extension types like ExtensionClass.


Specification

    Instead of using a central coercion method, the process of handling
    different operand types is simply left to the operation.  If the
    operation finds that it cannot handle the given operand type
    combination, it may return a special singleton as indicator.

    Note that "numbers" (anything that implements the number protocol, or
    part of it) written in Python already use the first part of this
    strategy - it is the C level API that we focus on here.

    To maintain nearly 100% backward compatibility we have to be very
    careful to make numbers that don't know anything about the new
    strategy (old style numbers) work just as well as those that expect
    the new scheme (new style numbers).  Furthermore, binary compatibility
    is a must, meaning that the interpreter may only access and use new
    style operations if the number indicates the availability of these.

    A new style number is considered by the interpreter as such if and
    only if it sets the type flag Py_TPFLAGS_CHECKTYPES.  The main
    difference between an old style number and a new style one is that the
    numeric slot functions can no longer assume to be passed arguments of
    identical type.  New style slots must check all arguments for proper
    type and implement the necessary conversions themselves.  This may seem
    to cause more work on the behalf of the type implementor, but is in
    fact no more difficult than writing the same kind of routines for an
    old style coercion slot.

    If a new style slot finds that it cannot handle the passed argument
    type combination, it may return a new reference of the special
    singleton Py_NotImplemented to the caller.  This will cause the
    caller to try the other operand's operation slots until it finds a
    slot that does implement the operation for the specific type
    combination.  If none of the possible slots succeeds, a TypeError
    is raised.

    To make the implementation easy to understand (the whole topic is
    esoteric enough), a new layer in the handling of numeric operations is
    introduced.  This layer takes care of all the different cases that need
    to be taken into account when dealing with all the possible
    combinations of old and new style numbers.  It is implemented by the
    two static functions binary_op() and ternary_op(), which are both
    internal functions that only the functions in Objects/abstract.c
    have access to.  The numeric API (PyNumber_*) is easy to adapt to
    this new layer.

    As a side-effect all numeric slots can be NULL-checked (this has to be
    done anyway, so the added feature comes at no extra cost).


    The scheme used by the layer to execute a binary operation is as
    follows:

      v        | w          | Action taken
      ---------+------------+----------------------------------
      new      | new        | v.op(v,w), w.op(v,w)
      new      | old        | v.op(v,w), coerce(v,w), v.op(v,w)
      old      | new        | w.op(v,w), coerce(v,w), v.op(v,w)
      old      | old        | coerce(v,w), v.op(v,w)

    The indicated action sequence is executed from left to right until
    either the operation succeeds and a valid result (!=
    Py_NotImplemented) is returned or an exception is raised.  Exceptions
    are returned to the calling function as-is.  If a slot returns
    Py_NotImplemented, the next item in the sequence is executed.

    Note that coerce(v,w) will use the old style nb_coerce slot methods
    via a call to PyNumber_Coerce().
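    The dispatch order for the new-style cases can be modeled at the
    Python level with the NotImplemented protocol (binary_op below is
    a toy model invented here; it ignores the subclass-priority
    refinements of the real interpreter):

```python
def binary_op(v, w, op_name):
    """Toy model of new/new dispatch: try v's slot, then w's
    reflected slot; each may decline by returning NotImplemented."""
    rop = "__r" + op_name[2:]                  # e.g. __add__ -> __radd__
    for obj, name in ((v, op_name), (w, rop)):
        slot = getattr(type(obj), name, None)
        if slot is not None:
            other = w if obj is v else v
            result = slot(obj, other)
            if result is not NotImplemented:   # a real answer: done
                return result
    raise TypeError("unsupported operand type combination")

class Half:
    """Invented type: only knows how to be the right operand of +."""
    def __radd__(self, other):
        return other + 0.5

assert binary_op(2, 3, "__add__") == 5
assert binary_op(1, Half(), "__add__") == 1.5  # int declines, Half answers
```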

    Ternary operations have a few more cases to handle:

      v   | w   | z   | Action taken
      ----+-----+-----+------------------------------------
      new | new | new | v.op(v,w,z), w.op(v,w,z), z.op(v,w,z)
      new | old | new | v.op(v,w,z), z.op(v,w,z), coerce(v,w,z), v.op(v,w,z)
      old | new | new | w.op(v,w,z), z.op(v,w,z), coerce(v,w,z), v.op(v,w,z)
      old | old | new | z.op(v,w,z), coerce(v,w,z), v.op(v,w,z)
      new | new | old | v.op(v,w,z), w.op(v,w,z), coerce(v,w,z), v.op(v,w,z)
      new | old | old | v.op(v,w,z), coerce(v,w,z), v.op(v,w,z)
      old | new | old | w.op(v,w,z), coerce(v,w,z), v.op(v,w,z)
      old | old | old | coerce(v,w,z), v.op(v,w,z)

    The same notes as above, except that coerce(v,w,z) actually does:

        if z != Py_None:
            coerce(v,w), coerce(v,z), coerce(w,z)
        else:
            # treat z as absent variable
            coerce(v,w)


    The current implementation uses this scheme already (there's only one
    ternary slot: nb_pow(a,b,c)).

    Note that the numeric protocol is also used for some other related
    tasks, e.g. sequence concatenation.  These can also benefit from the
    new mechanism by implementing right-hand operations for type
    combinations that would otherwise fail to work.  As an example, take
    string concatenation: currently you can only do string + string.  With
    the new mechanism, a new string-like type could implement new_type +
    string and string + new_type, even though strings don't know anything
    about new_type.
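    In current Python terms this corresponds to implementing __add__ and
    __radd__ and returning NotImplemented to decline; a minimal sketch
    with an invented string-like type:

```python
class TaggedString:
    # Hypothetical string-like type; plain str knows nothing about it.
    def __init__(self, text):
        self.text = text

    def __add__(self, other):           # TaggedString + str
        if isinstance(other, str):
            return TaggedString(self.text + other)
        return NotImplemented           # decline; the other side is tried

    def __radd__(self, other):          # str + TaggedString
        if isinstance(other, str):
            return TaggedString(other + self.text)
        return NotImplemented
```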

    Since comparisons also rely on coercion (every time you compare an
    integer to a float, the integer is first converted to float and then
    compared...), a new slot to handle numeric comparisons is needed:

        PyObject *nb_cmp(PyObject *v, PyObject *w)

    This slot should compare the two objects and return an integer object
    stating the result.  Currently, this result integer may only be -1, 0,
    or 1.  If the slot cannot handle the type combination, it may return a
    reference to Py_NotImplemented.  [XXX Note that this slot is still
    in flux since it should take into account rich comparisons
    (i.e. PEP 207).]

    Numeric comparisons are handled by a new numeric protocol API:

        PyObject *PyNumber_Compare(PyObject *v, PyObject *w)

    This function compares the two objects as "numbers" and returns an
    integer object stating the result.  Currently, this result integer
    may only be -1, 0, or 1.  In case the operation cannot be handled by
    the given objects, a TypeError is raised.

    The PyObject_Compare() API needs to be adjusted accordingly to make
    use of this new API.

    Other changes include adapting some of the built-in functions (e.g.
    cmp()) to use this API as well.  Also, PyNumber_CoerceEx() will need to
    check for new style numbers before calling the nb_coerce slot.  New
    style numbers don't provide a coercion slot and thus cannot be
    explicitly coerced.


Reference Implementation

    A preliminary patch for the CVS version of Python is available
    through the SourceForge patch manager [2].


Credits

    This PEP and the patch are heavily based on work done by
    Marc-André Lemburg [3].


Copyright

    This document has been placed in the public domain.


References

    [1] http://www.lemburg.com/files/python/mxDateTime.html
    [2] http://sourceforge.net/patch/?func=detailpatch&patch_id=102652&group_id=5470
    [3] http://www.lemburg.com/files/python/CoercionProposal.html




pep-0209 Multi-dimensional Arrays

PEP: 209
Title: Multi-dimensional Arrays
Version: $Revision$
Last-Modified: $Date$
Author: Paul Barrett <barrett at stsci.edu>, Travis Oliphant <oliphant at ee.byu.edu>
Status: Withdrawn
Type: Standards Track
Created: 03-Jan-2001
Python-Version: 2.2
Post-History: 

Abstract

    This PEP proposes a redesign and re-implementation of the multi-
    dimensional array module, Numeric, to make it easier to add new
    features and functionality to the module.  Aspects of Numeric 2
    that will receive special attention are efficient access to arrays
    exceeding a gigabyte in size and composed of inhomogeneous data
    structures or records.  The proposed design uses four Python
    classes: ArrayType, UFunc, Array, and ArrayView; and a low-level
    C-extension module, _ufunc, to handle the array operations
    efficiently.  In addition, each array type has its own C-extension
    module which defines the coercion rules, operations, and methods
    for that type.  This design enables new types, features, and
    functionality to be added in a modular fashion.  The new version
    will introduce some incompatibilities with the current Numeric.


Motivation

    Multi-dimensional arrays are commonly used to store and manipulate
    data in science, engineering, and computing.  Python currently has
    an extension module, named Numeric (henceforth called Numeric 1),
    which provides a satisfactory set of functionality for users
    manipulating homogeneous arrays of data of moderate size (of order
    10 MB).  For access to larger arrays (of order 100 MB or more) of
    possibly inhomogeneous data, the implementation of Numeric 1 is
    inefficient and cumbersome.  In the future, requests by the
    Numerical Python community for additional functionality are also
    likely, as PEPs 211: Adding New Linear Operators to Python, and
    225: Elementwise/Objectwise Operators illustrate.


Proposal

    This proposal recommends a re-design and re-implementation of
    Numeric 1, henceforth called Numeric 2, which will enable new
    types, features, and functionality to be added in an easy and
    modular manner.  The initial design of Numeric 2 should focus on
    providing a generic framework for manipulating arrays of various
    types and should enable a straightforward mechanism for adding new
    array types and UFuncs.  Functional methods that are more specific
    to various disciplines can then be layered on top of this core.
    This new module will still be called Numeric and most of the
    behavior found in Numeric 1 will be preserved.

    The proposed design uses four Python classes: ArrayType, UFunc,
    Array, and ArrayView; and a low-level C-extension module to handle
    the array operations efficiently.  In addition, each array type
    has its own C-extension module which defines the coercion rules,
    operations, and methods for that type.  At a later date, when core
    functionality is stable, some Python classes can be converted to
    C-extension types.

    Some planned features are:
    
    1.  Improved memory usage
    
    This feature is particularly important when handling large arrays
    and can produce significant improvements in performance as well as
    memory usage.  We have identified several areas where memory usage
    can be improved:
    
        a.  Use a local coercion model
    
        Instead of using Python's global coercion model which creates
        temporary arrays, Numeric 2, like Numeric 1, will implement a
        local coercion model as described in PEP 208 which defers the
        responsibility of coercion to the operator.  By using internal
        buffers, a coercion operation can be done for each array
        (including output arrays), if necessary, at the time of the
        operation.  Benchmarks [1] have shown that performance is at
        most degraded only slightly and is improved in cases where the
        internal buffers are less than the L2 cache size and the
        processor is under load.  To avoid array coercion altogether,
        C functions having arguments of mixed type are allowed in
        Numeric 2.
    
        b.  Avoid creation of temporary arrays
    
        In complex array expressions (i.e. having more than one
        operation), each operation will create a temporary array which
        will be used and then deleted by the succeeding operation.  A
        better approach would be to identify these temporary arrays
        and reuse their data buffers when possible, namely when the
        array shape and type are the same as the temporary array being
        created.  This can be done by checking the temporary array's
        reference count.  If it is 1, then it will be deleted once the
        operation is done and is a candidate for reuse.
    
        c.  Optional use of memory-mapped files
    
        Numeric users sometimes need to access data from very large
        files or to handle data that is greater than the available
        memory.  Memory-mapped arrays provide a mechanism to do this
        by storing the data on disk while making it appear to be in
        memory.  Memory-mapped arrays should improve access to all
        files by eliminating one of two copy steps during a file
        access.  Numeric should be able to access in-memory and
        memory-mapped arrays transparently.
    
        d.  Record access

        In some fields of science, data is stored in files as binary
        records.  For example in astronomy, photon data is stored as a
        1 dimensional list of photons in order of arrival time.  These
        records or C-like structures contain information about the
        detected photon, such as its arrival time, its position on the
        detector, and its energy.  Each field may be of a different
        type, such as char, int, or float.  Such arrays introduce new
        issues that must be dealt with, in particular byte alignment
        or byte swapping may need to be performed for the numeric
        values to be properly accessed (though byte swapping is also
        an issue for memory mapped data).  Numeric 2 is designed to
        automatically handle alignment and representational issues
        when data is accessed or operated on.  There are two
        approaches to implementing records: as either a derived array
        class or a special array type, depending on your point of
        view.  We defer this discussion to the Open Issues section.
    
    
    2.  Additional array types
    
    Numeric 1 has 11 defined types: char, ubyte, sbyte, short, int,
    long, float, double, cfloat, cdouble, and object.  There are no
    ushort, uint, or ulong types, nor are there more complex types
    such as a bit type which is of use to some fields of science and
    possibly for implementing masked-arrays.  The design of Numeric 1
    makes the addition of these and other types a difficult and
    error-prone process.  To enable the easy addition (and deletion)
    of new array types such as a bit type described below, a re-design
    of Numeric is necessary.
    
        a.  Bit type
    
        The result of a rich comparison between arrays is an array of
        boolean values.  The result can be stored in an array of type
        char, but this is an unnecessary waste of memory.  A better
        implementation would use a bit or boolean type, compressing
        the array size by a factor of eight.  This is currently being
        implemented for Numeric 1 (by Travis Oliphant) and should be
        included in Numeric 2.
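        The factor-of-eight saving comes from storing one boolean per
        bit; a rough sketch of such packing in plain Python (helper
        names invented for illustration):

```python
def pack_bits(flags):
    # Pack a sequence of booleans into a bytearray, 8 flags per byte.
    packed = bytearray((len(flags) + 7) // 8)
    for i, flag in enumerate(flags):
        if flag:
            packed[i // 8] |= 1 << (i % 8)
    return packed

def get_bit(packed, i):
    # Read flag i back out of the packed buffer.
    return bool(packed[i // 8] & (1 << (i % 8)))
```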

    3.  Enhanced array indexing syntax
    
    The extended slicing syntax was added to Python to provide greater
    flexibility when manipulating Numeric arrays by allowing
    step-sizes greater than 1.  This syntax works well as a shorthand
    for a list of regularly spaced indices.  For those situations
    where a list of irregularly spaced indices is needed, an enhanced
    array indexing syntax would allow 1-D arrays to be arguments.
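    The intended effect of index-array arguments can be shown with a
    small helper (the name 'take' is borrowed for illustration; the PEP
    does not specify a spelling):

```python
def take(values, indices):
    # Select irregularly spaced elements by an explicit index sequence,
    # the operation enhanced indexing would express as values[indices].
    return [values[i] for i in indices]
```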
    
    4.  Rich comparisons
    
    The implementation of PEP 207: Rich Comparisons in Python 2.1
    provides additional flexibility when manipulating arrays.  We
    intend to implement this feature in Numeric 2.
    
    5. Array broadcasting rules
    
    When an operation between a scalar and an array is done, the
    implied behavior is to create a new array having the same shape as
    the array operand containing the scalar value.  This is called
    array broadcasting.  It also works with arrays of lesser rank,
    such as vectors.  This implicit behavior is implemented in Numeric
    1 and will also be implemented in Numeric 2.
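    A sketch of the implied behavior using nested lists (illustrative
    only; a real implementation would avoid materializing the stretched
    operand):

```python
def broadcast_add(matrix, other):
    # Scalar operand: behave as if an array of the same shape, filled
    # with the scalar value, had been supplied.
    if isinstance(other, (int, float)):
        return [[x + other for x in row] for row in matrix]
    # Lesser-rank operand (a vector): repeat it along the leading axis.
    return [[x + y for x, y in zip(row, other)] for row in matrix]
```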


Design and Implementation

    The design of Numeric 2 has four primary classes:
    
    1.  ArrayType:
    
    This is a simple class that describes the fundamental properties
    of an array-type, e.g. its name, its size in bytes, its coercion
    relations with respect to other types, etc., e.g.
    
    > Int32 = ArrayType('Int32', 4, 'doc-string')
    
    Its relation to the other types is defined when the C-extension
    module for that type is imported.  The corresponding Python code
    is:
    
    > Int32.astype[Real64] = Real64
    
    This says that the Real64 array-type has higher priority than the
    Int32 array-type.
    
    The following attributes and methods are proposed for the core
    implementation.  Additional attributes can be added on an
    individual basis, e.g. .bitsize or .bitstrides for the bit type.
    
    Attributes:
        .name:                  e.g. "Int32", "Float64", etc.
        .typecode:              e.g. 'i', 'f', etc.
                                (for backward compatibility)
        .size (in bytes):       e.g. 4, 8, etc.
        .array_rules (mapping): rules between array types
        .pyobj_rules (mapping): rules between array and python types
        .doc:                   documentation string
    Methods:
        __init__():             initialization
        __del__():              destruction
        __repr__():             representation
    
    C-API:
        This still needs to be fleshed-out.
    
    
    2.  UFunc:
    
    This class is the heart of Numeric 2.  Its design is similar to
    that of ArrayType in that the UFunc creates a singleton callable
    object whose attributes are its name, the total and input number
    of arguments, a documentation string, and an empty CFunc
    dictionary; e.g.
    
    > add = UFunc('add', 3, 2, 'doc-string')
    
    When defined, the add instance has no C functions associated with
    it and therefore can do no work.  The CFunc dictionary is
    populated or registered later when the C-extension module for an
    array-type is imported.  The arguments of the register method are:
    function name, function descriptor, and the CUFunc object.  The
    corresponding Python code is
    
    > add.register('add', (Int32, Int32, Int32), cfunc-add)
    
    In the initialization function of an array type module, e.g.
    Int32, there are two C API functions: one to initialize the
    coercion rules and the other to register the CFunc objects.
    
    When an operation is applied to some arrays, the __call__ method
    is invoked.  It gets the type of each array (if the output array
    is not given, it is created from the coercion rules) and checks
    the CFunc dictionary for a key that matches the argument types.
    If it exists, the operation is performed immediately, otherwise the
    coercion rules are used to search for a related operation and set
    of conversion functions.  The __call__ method then invokes a
    compute method written in C to iterate over slices of each array,
    namely:
    
    > _ufunc.compute(slice, data, func, swap, conv)
    
    The 'func' argument is a CFuncObject, while the 'swap' and 'conv'
    arguments are lists of CFuncObjects for those arrays needing pre-
    or post-processing, otherwise None is used.  The data argument is
    a list of buffer objects, and the slice argument gives the number
    of iterations for each dimension along with the buffer offset and
    step size for each array and each dimension.
    
    We have predefined several UFuncs for use by the __call__ method:
    cast, swap, getobj, and setobj.  The cast and swap functions do
    coercion and byte-swapping, respectively, and the getobj and setobj
    functions do coercion between Numeric arrays and Python sequences.
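    The register/dispatch cycle can be sketched in Python.  The bodies
    are illustrative (real CFuncs are C functions, and the coercion
    search is omitted); the attribute names follow the proposal:

```python
class UFunc:
    def __init__(self, name, nargs, iargs, doc):
        self.name, self.nargs, self.iargs, self.doc = name, nargs, iargs, doc
        self.cfuncs = {}   # (in-type, in-type, out-type) -> function

    def register(self, name, signature, cfunc):
        self.cfuncs[signature] = cfunc

    def __call__(self, a, b):
        # Sketch: assume the output type equals the first input's type;
        # the real __call__ would consult the coercion rules instead.
        key = (type(a), type(b), type(a))
        try:
            return self.cfuncs[key](a, b)
        except KeyError:
            raise TypeError("no CFunc registered for %r" % (key,))

add = UFunc('add', 3, 2, 'doc-string')
add.register('add', (int, int, int), lambda a, b: a + b)
```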
    
    The following attributes and methods are proposed for the core
    implementation.
    
    Attributes:
        .name:                  e.g. "add", "subtract", etc.
        .nargs:                 number of total arguments
        .iargs:                 number of input arguments
        .cfuncs (mapping):      the set of C functions
        .doc:                   documentation string
    Methods:
        __init__():             initialization
        __del__():              destruction
        __repr__():             representation
        __call__():             look-up and dispatch method
        initrule():             initialize coercion rule
        uninitrule():           uninitialize coercion rule
        register():             register a CUFunc
        unregister():           unregister a CUFunc

    C-API:
        This still needs to be fleshed-out.
    
    3.  Array:
    
    This class contains information about the array, such as shape,
    type, endianness of the data, etc.  Its operators, '+', '-',
    etc., just invoke the corresponding UFunc function, e.g.
    
    > def __add__(self, other):
    >     return ufunc.add(self, other)

    The following attributes, methods, and functions are proposed for
    the core implementation.
    
    Attributes:
        .shape:                 shape of the array
        .format:                type of the array
        .real (only complex):   real part of a complex array
        .imag (only complex):   imaginary part of a complex array
    Methods:
        __init__():             initialization
        __del__():              destruction
        __repr__():             representation
        __str__():              pretty representation
        __cmp__():              rich comparison
        __len__():
        __getitem__():
        __setitem__():
        __getslice__():
        __setslice__():
        numeric methods:
        copy():                 copy of array
        aslist():               create list from array
        asstring():             create string from array
        
    Functions:
        fromlist():             create array from sequence
        fromstring():           create array from string
        array():                create array with shape and value
        concat():               concatenate two arrays
        resize():               resize array

    C-API:
        This still needs to be fleshed-out.

    4.  ArrayView

    This class is similar to the Array class except that the reshape
    and flat methods will raise exceptions, since non-contiguous
    arrays cannot be reshaped or flattened using just pointer and
    step-size information.

    C-API:
        This still needs to be fleshed-out.
    
    5.  C-extension modules:
    
    Numeric 2 will have several C-extension modules.

        a.  _ufunc:

        The primary module of this set is _ufuncmodule.c.  The
        intention of this module is to do the bare minimum,
        i.e. iterate over arrays using a specified C function.  The
        interface of these functions is the same as in Numeric 1, i.e.

        int (*CFunc)(char *data, int *steps, int repeat, void *func);

        and their functionality is expected to be the same, i.e. they
        iterate over the inner-most dimension.

        The following attributes and methods are proposed for the core
        implementation.
    
        Attributes:
        
        Methods:
            compute():

        C-API:
            This still needs to be fleshed-out.

        b.  _int32, _real64, etc.:
    
        There will also be C-extension modules for each array type,
        e.g. _int32module.c, _real64module.c, etc.  As mentioned
        previously, when these modules are imported by the UFunc
        module, they will automatically register their functions and
        coercion rules.  New or improved versions of these modules can
        be easily implemented and used without affecting the rest of
        Numeric 2.


Open Issues

    1.  Does slicing syntax default to copy or view behavior?

    The default behavior of Python is to return a copy of a sub-list
    or tuple when slicing syntax is used, whereas Numeric 1 returns a
    view into the array.  The choice made for Numeric 1 was apparently
    for reasons of performance: the developers wished to avoid the
    penalty of allocating and copying the data buffer during each
    array operation and felt that the need for a deep copy of an array
    would be rare.  Yet, some have argued that Numeric's slice notation
    should also have copy behavior to be consistent with Python lists.
    In this case the performance penalty associated with copy behavior
    can be minimized by implementing copy-on-write.  This scheme has
    both arrays sharing one data buffer (as in view behavior) until
    either array is assigned new data at which point a copy of the
    data buffer is made.  View behavior would then be implemented by
    an ArrayView class, whose behavior would be similar to Numeric 1
    arrays,
    i.e. .shape is not settable for non-contiguous arrays.  The use of
    an ArrayView class also makes explicit what type of data the array
    contains.
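    The copy-on-write scheme can be sketched with a small wrapper class
    (invented names; a real implementation would track sharing at the
    data-buffer level):

```python
class COWArray:
    # Slices share one data buffer until either side is written to.
    def __init__(self, data):
        self._buf = data
        self._shared = False

    def slice(self):
        view = COWArray(self._buf)            # share, don't copy
        view._shared = self._shared = True
        return view

    def __getitem__(self, i):
        return self._buf[i]

    def __setitem__(self, i, value):
        if self._shared:                      # first write: private copy
            self._buf = list(self._buf)
            self._shared = False
        self._buf[i] = value
```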

    2.  Does item syntax default to copy or view behavior?

    A similar question arises with the item syntax.  For example, if a
    = [[0,1,2], [3,4,5]] and b = a[0], then changing b[0] also changes
    a[0][0], because a[0] is a reference or view of the first row of
    a.  Therefore, if c is a 2-d array, it would appear that c[i]
    should return a 1-d array which is a view into, instead of a copy
    of, c for consistency.  Yet, c[i] can be considered just a
    shorthand for c[i,:] which would imply copy behavior assuming
    slicing syntax returns a copy.  Should Numeric 2 behave the same
    way as lists and return a view, or should it return a copy?
    
    3.  How is scalar coercion implemented?

    Python has fewer numeric types than Numeric, which can cause
    coercion problems.  For example, when multiplying a Python scalar
    of type float and a Numeric array of type float, the Numeric array
    is converted to a double, since the Python float type is actually
    a double.  This is often not the desired behavior, since the
    Numeric array will be doubled in size which is likely to be
    annoying, particularly for very large arrays.  We prefer that the
    array type trumps the Python type for the same type class, namely
    integer, float, and complex.  Therefore an operation between a
    Python integer and an Int16 (short) array will return an Int16
    array, whereas an operation between a Python float and an Int16
    array will return a Float64 (double) array.  Operations between
    two arrays use normal coercion rules.
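    The preferred rule can be stated as a small table lookup (the
    mapping and function below are invented for illustration and cover
    only a hypothetical subset of types):

```python
# Type class of each array type, and the default array type for a
# Python scalar of each class.
ARRAY_KIND = {'Int16': 'integer', 'Int32': 'integer',
              'Float32': 'float', 'Float64': 'float'}
SCALAR_DEFAULT = {'integer': 'Int32', 'float': 'Float64'}

def result_type(array_type, scalar):
    # Within the same type class, the array type trumps the scalar;
    # otherwise fall back to the default type for the scalar's class.
    scalar_kind = 'float' if isinstance(scalar, float) else 'integer'
    if ARRAY_KIND[array_type] == scalar_kind:
        return array_type
    return SCALAR_DEFAULT[scalar_kind]
```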
    
    4.  How is integer division handled?
    
    In a future version of Python, the behavior of integer division
    will change.  The operands will be converted to floats, so the
    result will be a float.  If we implement the proposed scalar
    coercion rules where arrays have precedence over Python scalars,
    then dividing an array by an integer will return an integer array
    and will not be consistent with a future version of Python which
    would return an array of type double.  Scientific programmers are
    familiar with the distinction between integer and floating-point
    division, so should Numeric 2 continue with this behavior?

    5.  How should records be implemented?

    There are two approaches to implementing records, depending on your
    point of view.  The first is to divide arrays into separate
    classes depending on the behavior of their types.  For example,
    numeric arrays are one class, strings a second, and records a
    third, because the range and type of operations of each class
    differ.  As such, a record array is not a new type, but a
    mechanism for a more flexible form of array.  To easily access and
    manipulate such complex data, the class is comprised of numeric
    arrays having different byte offsets into the data buffer.  For
    example, one might have a table consisting of an array of Int16,
    Real32 values.  Two numeric arrays, one with an offset of 0 bytes
    and a stride of 6 bytes to be interpreted as Int16, and one with an
    offset of 2 bytes and a stride of 6 bytes to be interpreted as
    Real32 would represent the record array.  Both numeric arrays
    would refer to the same data buffer, but have different offset and
    stride attributes, and a different numeric type.
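    The two-views-one-buffer arrangement of this example can be
    demonstrated with the struct module (little-endian layout assumed so
    that each Int16/Real32 record packs to exactly 6 bytes):

```python
import struct

def strided_field(buffer, offset, stride, fmt):
    # Interpret 'buffer' as values of format 'fmt' located at
    # offset, offset + stride, offset + 2*stride, ...
    return [struct.unpack_from(fmt, buffer, o)[0]
            for o in range(offset, len(buffer), stride)]

# Three (Int16, Real32) records packed into one shared buffer.
records = [(1, 1.5), (2, 2.5), (3, 3.5)]
buf = b''.join(struct.pack('<hf', i, x) for i, x in records)

ints  = strided_field(buf, 0, 6, '<h')   # offset 0, stride 6: Int16 field
reals = strided_field(buf, 2, 6, '<f')   # offset 2, stride 6: Real32 field
```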

    The second approach is to consider a record as one of many array
    types, albeit with fewer, and possibly different, array operations
    than for numeric arrays.  This approach considers an array type to
    be a mapping of a fixed-length string.  The mapping can either be
    simple, like integer and floating-point numbers, or complex, like
    a complex number, a byte string, and a C-structure.  The record
    type effectively merges the struct and Numeric modules into a
    multi-dimensional struct array.  This approach implies certain
    changes to the array interface.  For example, the 'typecode'
    keyword argument should probably be changed to the more
    descriptive 'format' keyword.

        a.  How are record semantics defined and implemented?

        Whichever implementation approach is taken for records, the
        syntax and semantics of how they are to be accessed and
        manipulated must be decided, if one wishes to have access to
        sub-fields of records.  In this case, the record type can
        essentially be considered an inhomogeneous list, like a tuple
        returned by the unpack method of the struct module; and a 1-d
        array of records may be interpreted as a 2-d array with the
        second dimension being the index into the list of fields.
        These enhanced array semantics make access to an array of one
        or more of the fields easy and straightforward.  It also
        allows a user to do array operations on a field in a natural
        and intuitive way.  If we assume that records are implemented
        as an array type, then the last dimension defaults to 0 and can
        therefore be neglected for arrays comprised of simple types,
        like numeric.
   
    6.  How are masked-arrays implemented?

    Masked-arrays in Numeric 1 are implemented as a separate array
    class.  With the ability to add new array types to Numeric 2, it
    is possible that masked-arrays in Numeric 2 could be implemented
    as a new array type instead of an array class.
    
    7.  How are numerical errors handled (IEEE floating-point errors in
        particular)?

    It is not clear to the proposers (Paul Barrett and Travis
    Oliphant) what is the best or preferred way of handling errors.
    Most of the C functions that do the operation iterate over the
    inner-most (last) dimension of the array.  This dimension
    could contain a thousand or more items having one or more errors
    of differing type, such as divide-by-zero, underflow, and
    overflow.  Additionally, keeping track of these errors may come at
    the expense of performance.  Therefore, we suggest several
    options:

        a.  Print a message of the most severe error, leaving it to
        the user to locate the errors.

        b.  Print a message of all errors that occurred and the number
        of occurrences, leaving it to the user to locate the errors.

        c.  Print a message of all errors that occurred and a list of
        where they occurred.

        d.  Or use a hybrid approach, printing only the most severe
        error, yet keeping track of what and where the errors
        occurred.  This would allow the user to locate the errors
        while keeping the error message brief.

    8.  What features are needed to ease the integration of FORTRAN
        libraries and code?

    It would be a good idea at this stage to consider how to ease the
    integration of FORTRAN libraries and user code in Numeric 2.


Implementation Steps

    1.  Implement basic UFunc capability
    
        a.  Minimal Array class:

        Necessary class attributes and methods, e.g. .shape, .data,
        .type, etc.

        b.  Minimal ArrayType class:

        Int32, Real64, Complex64, Char, Object

        c.  Minimal UFunc class:

        UFunc instantiation, CFunction registration, UFunc call for
        1-D arrays including the rules for doing alignment,
        byte-swapping, and coercion.

        d.  Minimal C-extension module:

        _UFunc, which does the innermost array loop in C.
    
        This step implements whatever is needed to do: 'c = add(a, b)'
        where a, b, and c are 1-D arrays.  It teaches us how to add
        new UFuncs, to coerce the arrays, to pass the necessary
        information to a C iterator method, and to do the actual
        computation.
    
    2.  Continue enhancing the UFunc iterator and Array class
    
        a.  Implement some access methods for the Array class:
            print, repr, getitem, setitem, etc.

        b.  Implement multidimensional arrays

        c.  Implement some of the basic Array methods using UFuncs:
            +, -, *, /, etc.

        d.  Enable UFuncs to use Python sequences.
    
    3.  Complete the standard UFunc and Array class behavior
    
        a.  Implement getslice and setslice behavior

        b.  Work on Array broadcasting rules

        c.  Implement Record type

    4.  Add additional functionality
    
        a.  Add more UFuncs

        b.  Implement buffer or mmap access


Incompatibilities

    The following is a list of incompatibilities in behavior between
    Numeric 1 and Numeric 2.

    1.  Scalar coercion rules

    Numeric 1 has a single set of coercion rules for array and Python
    numeric types.  This can cause unexpected and annoying problems
    during the calculation of an array expression.  Numeric 2 intends
    to overcome these problems by having two sets of coercion rules:
    one for arrays and Python numeric types, and another just for
    arrays.

    2.  No savespace attribute

    The savespace attribute in Numeric 1 makes arrays with this
    attribute set take precedence over those that do not have it set.
    Numeric 2 will not have such an attribute and therefore normal
    array coercion rules will be in effect.

    3.  Slicing syntax returns a copy

    The slicing syntax in Numeric 1 returns a view into the original
    array.  The slicing behavior for Numeric 2 will be a copy.  You
    should use the ArrayView class to get a view into an array.

    4.  Boolean comparisons return a boolean array

    A comparison between arrays in Numeric 1 results in a Boolean
    scalar, because of current limitations in Python.  The advent of
    Rich Comparisons in Python 2.1 will allow an array of Booleans to
    be returned.

    5.  Type characters are deprecated

    Numeric 2 will have an ArrayType class composed of Type instances,
    for example Int8, Int16, Int32, and Int for signed integers.  The
    typecode scheme in Numeric 1 will be available for backward
    compatibility, but will be deprecated.


Appendices

    A.  Implicit sub-arrays iteration

    A computer animation is composed of a number of 2-D images or
    frames of identical shape.  By stacking these images into a single
    block of memory, a 3-D array is created.  Yet the operations to be
    performed are not meant for the entire 3-D array, but on the set
    of 2-D sub-arrays.  In most array languages, each frame has to be
    extracted, operated on, and then reinserted into the output array
    using a for-like loop.  The J language allows the programmer to
    perform such operations implicitly by having a rank for the frame
    and array.  By default these ranks will be the same during the
    creation of the array.  It was the intention of the Numeric 1
    developers to implement this feature, since it is based on the
    language J.  The Numeric 1 code has the required variables for
    implementing this behavior, but the feature was never implemented.
    We intend
    to implement implicit sub-array iteration in Numeric 2, if the
    array broadcasting rules found in Numeric 1 do not fully support
    this behavior.
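    The for-like loop that implicit sub-array iteration would hide
    looks like this, with nested lists standing in for a 3-D array
    (function names invented for the sketch):

```python
def apply_framewise(stack, op):
    # Apply 'op' to each 2-D frame of a 3-D stack instead of to the
    # stack as a whole -- the loop implicit iteration would absorb.
    return [op(frame) for frame in stack]

def scale(frame, factor):
    # An example per-frame operation: multiply every element.
    return [[factor * x for x in row] for row in frame]
```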


Copyright

    This document is placed in the public domain.


Related PEPs

    PEP 207: Rich Comparisons
        by Guido van Rossum and David Ascher

    PEP 208: Reworking the Coercion Model
        by Neil Schemenauer and Marc-Andre' Lemburg

    PEP 211: Adding New Linear Algebra Operators to Python
        by Greg Wilson

    PEP 225: Elementwise/Objectwise Operators
        by Huaiyu Zhu

    PEP 228: Reworking Python's Numeric Model
        by Moshe Zadka


References

    [1] P. Greenfield 2000. private communication.



pep-0210 Decoupling the Interpreter Loop

PEP: 210
Title: Decoupling the Interpreter Loop
Version: $Revision$
Last-Modified: $Date$
Author: David Ascher <davida at activestate.com>
Status: Rejected
Type: Standards Track
Created: 15-Jul-2000
Python-Version: 2.1
Post-History: 

pep-0211 Adding A New Outer Product Operator

PEP: 211
Title: Adding A New Outer Product Operator
Version: $Revision$
Last-Modified: $Date$
Author: Greg Wilson <gvwilson at ddj.com>
Status: Deferred
Type: Standards Track
Created: 15-Jul-2000
Python-Version: 2.1
Post-History: 

Introduction

    This PEP describes a proposal to define "@" (pronounced "across")
    as a new outer product operator in Python 2.2.  When applied to
    sequences (or other iterable objects), this operator will combine
    their iterators, so that:

        for (i, j) in S @ T:
            pass

    will be equivalent to:

        for i in S:
            for j in T:
                pass

    Classes will be able to overload this operator using the special
    methods "__across__", "__racross__", and "__iacross__".  In
    particular, the new Numeric module (PEP 0209) will overload this
    operator for multi-dimensional arrays to implement matrix
    multiplication.
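
    For built-in sequences the proposed semantics can be previewed with
    ordinary Python; itertools.product (added later, in Python 2.6, and
    not part of this proposal) produces exactly the tuples the "@" loop
    would visit:

```python
from itertools import product

S = [10, 20, 30]
T = [1, 2, 3]

# the proposed `for (i, j) in S @ T` would visit the same tuples as:
nested = []
for i in S:
    for j in T:
        nested.append((i, j))

assert nested == list(product(S, T))
assert len(nested) == len(S) * len(T)
```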


Background

    Number-crunching is now just a small part of computing, but many
    programmers --- including many Python users --- still need to
    express complex mathematical operations in code.  Most numerical
    languages, such as APL, Fortran-90, MATLAB, IDL, and Mathematica,
    therefore provide two forms of the common arithmetic operators.
    One form works element-by-element, e.g. multiplies corresponding
    elements of its matrix arguments.  The other implements the
    "mathematical" definition of that operation, e.g. performs
    row-column matrix multiplication.

    Zhu and Lielens have proposed doubling up Python's operators in
    this way [1].  Their proposal would create six new binary infix
    operators, and six new in-place operators.

    The original version of this proposal was much more conservative.
    The author consulted the developers of GNU Octave [2], an open
    source clone of MATLAB.  Its developers agreed that providing an
    infix operator for matrix multiplication was important: numerical
    programmers really do care whether they have to write "mmul(A,B)"
    instead of "A op B".

    On the other hand, when asked how important it was to have infix
    operators for matrix solution and other operations, Prof. James
    Rawlings replied [3]:

        I DON'T think it's a must have, and I do a lot of matrix
        inversion. I cannot remember if its A\b or b\A so I always
        write inv(A)*b instead. I recommend dropping \.

    Based on this discussion, and feedback from students at the US
    national laboratories and elsewhere, we recommended adding only
    one new operator, for matrix multiplication, to Python.


Iterators

    The planned addition of iterators to Python 2.2 opens up a broader
    scope for this proposal.  As part of the discussion of PEP 201,
    Lockstep Iteration [4], the author of this proposal conducted an
    informal usability experiment [5].  The results showed that users
    are psychologically receptive to "cross-product" loop syntax.  For
    example, most users expected:

        S = [10, 20, 30]
        T = [1, 2, 3]
        for x in S; y in T:
            print x+y,

    to print "11 12 13 21 22 23 31 32 33".  We believe that users will
    have the same reaction to:

        for (x, y) in S @ T:
            print x+y

    i.e. that they will naturally interpret this as a tidy way to
    write loop nests.

    This is where iterators come in.  Actually constructing the
    cross-product of two (or more) sequences before executing the loop
    would be very expensive.  On the other hand, "@" could be defined
    to get its arguments' iterators, and then create an outer iterator
    which returns tuples of the values returned by the inner
    iterators.
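
    A minimal sketch of such an outer iterator, written as a generator
    function (the name `across` is hypothetical, and the sketch assumes
    restartable inner sequences such as lists and strings):

```python
def across(*seqs):
    """Yield cross-product tuples lazily by combining the arguments'
    iterators, without building the whole product up front."""
    if not seqs:
        yield ()
        return
    head, *rest = seqs
    for item in head:
        for tail in across(*rest):   # re-iterates the remaining sequences
            yield (item,) + tail

pairs = list(across([10, 20], "ab"))
assert pairs == [(10, 'a'), (10, 'b'), (20, 'a'), (20, 'b')]
```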


Discussion

    1. Adding a named function "across" would have less impact on
       Python than a new infix operator.  However, this would not make
       Python more appealing to numerical programmers, who really do
       care whether they can write matrix multiplication using an
       operator, or whether they have to write it as a function call.

    2. "@" would have be chainable in the same way as comparison
       operators, i.e.:

        (1, 2) @ (3, 4) @ (5, 6)

       would have to return (1, 3, 5) ... (2, 4, 6), and *not*
       ((1, 3), 5) ... ((2, 4), 6).  This should not require special
       support from the parser, as the outer iterator created by the
       first "@" could easily be taught how to combine itself with
       ordinary iterators.

    3. There would have to be some way to distinguish restartable
       iterators from ones that couldn't be restarted.  For example,
       if S is an input stream (e.g. a file), and L is a list, then "S
       @ L" is straightforward, but "L @ S" is not, since iteration
       through the stream cannot be repeated.  This could be treated
       as an error, or by having the outer iterator detect
       non-restartable inner iterators and cache their values.

    4. Whiteboard testing of this proposal in front of three novice
       Python users (all of them experienced programmers) indicates
       that users will expect:

        "ab" @ "cd"

       to return four strings, not four tuples of pairs of
       characters.  Opinion was divided on what:

        ("a", "b") @ "cd"

       ought to return...
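
    The caching strategy suggested in point 3 can be sketched as
    follows (a hypothetical helper that replays cached values of a
    one-shot inner iterator on later passes):

```python
def across_cached(outer, inner):
    """Cross `outer` with a possibly one-shot `inner` iterator by
    caching inner values on the first pass and replaying them later."""
    cache = []
    first_pass = True
    inner_it = iter(inner)
    for x in outer:
        if first_pass:
            for y in inner_it:
                cache.append(y)
                yield (x, y)
            first_pass = False
        else:
            for y in cache:          # replay the cached values
                yield (x, y)

one_shot = iter([1, 2])              # cannot be restarted
result = list(across_cached("ab", one_shot))
assert result == [('a', 1), ('a', 2), ('b', 1), ('b', 2)]
```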


Alternatives

    1. Do nothing --- keep Python simple.

    This is always the default choice.

    2. Add a named function instead of an operator.

    Python is not primarily a numerical language; it may not be worth
    complexifying it for this special case.  However, support for real
    matrix multiplication *is* frequently requested, and the proposed
    semantics for "@" for built-in sequence types would simplify
    expression of a very common idiom (nested loops).

    3. Introduce prefixed forms of all existing operators, such as
       "~*" and "~+", as proposed in PEP 225 [1].

    Our objections to this are that there isn't enough demand to
    justify the additional complexity (see Rawlings' comments [3]),
    and that the proposed syntax fails the "low toner" readability
    test.


Acknowledgments

    I am grateful to Huaiyu Zhu for initiating this discussion, and to
    James Rawlings and students in various Python courses for their
    discussions of what numerical programmers really care about.


References

    [1] PEP 225, Elementwise/Objectwise Operators, Zhu, Lielens
        http://www.python.org/dev/peps/pep-0225/

    [2] http://bevo.che.wisc.edu/octave/

    [3] http://www.egroups.com/message/python-numeric/4

    [4] PEP 201, Lockstep Iteration, Warsaw
        http://www.python.org/dev/peps/pep-0201/

    [5] http://mail.python.org/pipermail/python-dev/2000-July/006427.html



pep-0212 Loop Counter Iteration

PEP: 212
Title: Loop Counter Iteration
Version: $Revision$
Last-Modified: $Date$
Author: Peter Schneider-Kamp <nowonder at nowonder.de>
Status: Deferred
Type: Standards Track
Created: 22-Aug-2000
Python-Version: 2.1
Post-History: 

Introduction

    This PEP describes the often proposed feature of exposing the loop
    counter in for-loops.  This PEP tracks the status and ownership of
    this feature.  It contains a description of the feature and
    outlines changes necessary to support the feature.  This PEP
    summarizes discussions held in mailing list forums, and provides
    URLs for further information, where appropriate.  The CVS revision
    history of this file contains the definitive historical record.


Motivation

    Standard for-loops in Python iterate over the elements of a
    sequence[1].  Often it is desirable to loop over the indices or
    both the elements and the indices instead.

    The common idioms used to accomplish this are unintuitive.  This
    PEP proposes two different ways of exposing the indices.


Loop counter iteration

    The current idiom for looping over the indices makes use of the
    built-in 'range' function:

        for i in range(len(sequence)):
            # work with index i

    Looping over both elements and indices can be achieved either by the
    old idiom or by using the new 'zip' built-in function[2]:

        for i in range(len(sequence)):
            e = sequence[i]
            # work with index i and element e

    or

        for i, e in zip(range(len(sequence)), sequence):
            # work with index i and element e


The Proposed Solutions

    There are three solutions that have been discussed.  One adds a
    non-reserved keyword, another adds two built-in functions, and a
    third adds methods to sequence objects.


Non-reserved keyword 'indexing'

    This solution would extend the syntax of the for-loop by adding
    an optional '<variable> indexing' clause which can also be used
    instead of the '<variable> in' clause.

    Looping over the indices of a sequence would thus become:

        for i indexing sequence:
            # work with index i

    Looping over both indices and elements would similarly be:

        for i indexing e in sequence:
            # work with index i and element e


Built-in functions 'indices' and 'irange'

    This solution adds two built-in functions 'indices' and 'irange'.
    The semantics of these can be described as follows:

        def indices(sequence):
            return range(len(sequence))

        def irange(sequence):
            return zip(range(len(sequence)), sequence)

    These functions could be implemented either eagerly or lazily and
    should be easy to extend in order to accept more than one sequence
    argument.
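
    A lazy implementation can be sketched with generators; note that
    irange as specified here is what Python 2.3 eventually shipped as
    the enumerate() built-in (PEP 279):

```python
def indices(sequence):
    # lazy equivalent of range(len(sequence))
    i = 0
    for _ in sequence:
        yield i
        i += 1

def irange(sequence):
    # lazy equivalent of zip(range(len(sequence)), sequence)
    i = 0
    for e in sequence:
        yield (i, e)
        i += 1

assert list(indices("abc")) == [0, 1, 2]
assert list(irange("abc")) == [(0, 'a'), (1, 'b'), (2, 'c')]
```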

    The use of these functions would simplify the idioms for looping
    over the indices and over both elements and indices:

        for i in indices(sequence):
            # work with index i

        for i, e in irange(sequence):
            # work with index i and element e


Methods for sequence objects

    This solution proposes the addition of 'indices', 'items'
    and 'values' methods to sequences, which enable looping over
    indices only, both indices and elements, and elements only
    respectively.

    This would immensely simplify the idioms for looping over indices
    and for looping over both elements and indices:

        for i in sequence.indices():
            # work with index i

        for i, e in sequence.items():
            # work with index i and element e

    Additionally, it would allow looping over the elements of
    sequences and dictionaries in a consistent way:

        for e in sequence_or_dict.values():
            # do something with element e
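
    For illustration only, the proposed methods can be mimicked today
    on a hypothetical list subclass:

```python
class Sequence(list):
    """Hypothetical list subclass carrying the proposed methods."""
    def indices(self):
        return list(range(len(self)))
    def items(self):
        return list(zip(range(len(self)), self))
    def values(self):
        return list(self)

s = Sequence(["a", "b"])
assert s.indices() == [0, 1]
assert s.items() == [(0, "a"), (1, "b")]
assert s.values() == ["a", "b"]
```

    The names mirror the dictionary methods, which is what makes the
    consistent looping idiom above possible.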


Implementations

    For all three solutions some more or less rough patches exist
    as patches at SourceForge:

        'for i indexing a in l': exposing the for-loop counter[3]
        add indices() and irange() to built-ins[4]
        add items() method to listobject[5]

    All of them have been pronounced on and rejected by the BDFL.

    Note that the 'indexing' keyword is only a NAME in the
    grammar and so does not hinder the general use of 'indexing'.


Backward Compatibility Issues

    As no keywords are added and the semantics of existing code
    remains unchanged, all three solutions can be implemented
    without breaking existing code.


Copyright

    This document has been placed in the public domain.


References

    [1] http://docs.python.org/reference/compound_stmts.html#for
    [2] Lockstep Iteration, PEP 201
    [3] http://sourceforge.net/patch/download.php?id=101138
    [4] http://sourceforge.net/patch/download.php?id=101129
    [5] http://sourceforge.net/patch/download.php?id=101178



pep-0213 Attribute Access Handlers

PEP: 213
Title: Attribute Access Handlers
Version: $Revision$
Last-Modified: $Date$
Author: Paul Prescod <paul at prescod.net>
Status: Deferred
Type: Standards Track
Created: 21-Jul-2000
Python-Version: 2.1
Post-History: 

Introduction

     It is possible (and even relatively common) in Python code and
     in extension modules to "trap" an attempt by an instance's client
     code to set an attribute, and to execute code instead.  In other
     words, it is possible to let users use attribute assignment/
     retrieval/deletion syntax even though the underlying implementation
     is doing some computation rather than directly modifying a
     binding.

     This PEP describes a feature that makes it easier, more efficient
     and safer to implement these handlers for Python instances.


Justification

    Scenario 1:

        You have a deployed class that works on an attribute named
        "stdout". After a while, you think it would be better to
        check that stdout is really an object with a "write" method
        at the moment of assignment. Rather than change to a
        setstdout method (which would be incompatible with deployed
        code) you would rather trap the assignment and check the
        object's type.

    Scenario 2:

        You want to be as compatible as possible with an object 
        model that has a concept of attribute assignment. It could
        be the W3C Document Object Model or a particular COM 
        interface (e.g. the PowerPoint interface). In that case
        you may well want attributes in the model to show up as
        attributes in the Python interface, even though the 
        underlying implementation may not use attributes at all.

    Scenario 3:

        A user wants to make an attribute read-only.

    In short, this feature allows programmers to separate the 
    interface of their module from the underlying implementation
    for whatever purpose. Again, this is not a new feature but
    merely a new syntax for an existing convention.


Current Solution

    To make some attributes read-only:

    class foo:
        def __setattr__(self, name, val):
            if name == "readonlyattr":
                raise TypeError
            elif name == "readonlyattr2":
                raise TypeError
            ...
            else:
                self.__dict__[name] = val

     This has the following problems:

     1. The creator of the method must be intimately aware of whether
        somewhere else in the class hierarchy __setattr__ has also been
        trapped for any particular purpose. If so, she must specifically
        call that method rather than assigning to the dictionary. There
        are many different reasons to overload __setattr__ so there is a
        decent potential for clashes. For instance object database
        implementations often overload setattr for an entirely unrelated
        purpose.

     2. The string-based switch statement forces all attribute handlers 
        to be specified in one place in the code. They may then dispatch
        to task-specific methods (for modularity) but this could cause
        performance problems.

     3. Logic for the setting, getting and deleting must live in 
        __getattr__, __setattr__ and __delattr__. Once again, this can be
        mitigated through an extra level of method call but this is 
        inefficient.


Proposed Syntax

    Special methods should declare themselves with declarations of the
    following form:

    class x:
        def __attr_XXX__(self, op, val ):
            if op=="get":
                return someComputedValue(self.internal)
            elif op=="set":
                self.internal=someComputedValue(val)
            elif op=="del":
                del self.internal

    Client code looks like this:

    fooval=x.foo
    x.foo=fooval+5
    del x.foo
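
    The proposed dispatch can be emulated with the existing
    __getattr__/__setattr__/__delattr__ hooks; this is only an
    illustration of the intended semantics, not the PEP's proposed
    implementation (class and method names are hypothetical):

```python
class HandlerBase:
    """Route attribute access through __attr_XXX__ methods,
    emulating the proposed protocol with today's hooks."""

    def _handler(self, name):
        return getattr(type(self), '__attr_%s__' % name, None)

    def __getattr__(self, name):
        h = self._handler(name)
        if h is None:
            raise AttributeError(name)
        return h(self, 'get', None)

    def __setattr__(self, name, val):
        h = self._handler(name)
        if h is None:
            object.__setattr__(self, name, val)
        else:
            h(self, 'set', val)

    def __delattr__(self, name):
        h = self._handler(name)
        if h is None:
            object.__delattr__(self, name)
        else:
            h(self, 'del', None)

class X(HandlerBase):
    def __attr_foo__(self, op, val):
        if op == 'get':
            return self._internal
        elif op == 'set':
            self._internal = val
        elif op == 'del':
            del self._internal

x = X()
x.foo = 5          # routed through __attr_foo__('set', 5)
assert x.foo == 5  # routed through __attr_foo__('get', None)
```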


Semantics

     Attribute references of all three kinds should call the method.
     The op parameter can be "get"/"set"/"del". Of course this string
     will be interned so the actual checks for the string will be
     very fast.
 
     It is disallowed to actually have an attribute named XXX in the
     same instance as a method named __attr_XXX__.

     An implementation of __attr_XXX__ takes precedence over an
     implementation of __getattr__ based on the principle that
     __getattr__ is supposed to be invoked only after finding an
     appropriate attribute has failed.

     An implementation of __attr_XXX__ takes precedence over an
     implementation of __setattr__ in order to be consistent. The
     opposite choice seems fairly feasible also, however.  The same
     goes for __delattr__.


Proposed Implementation

    There is a new object type called an attribute access handler. 
    Objects of this type have the following attributes:

       name (e.g. XXX, not __attr_XXX__)
       method (pointer to a method object)
   
    In PyClass_New, methods of the appropriate form will be detected and
    converted into objects (just like unbound method objects). These are
    stored in the class __dict__ under the name XXX. The original method
    is stored as an unbound method under its original name.

    If there are any attribute access handlers in an instance at all,
    a flag is set. Let's call it "I_have_computed_attributes" for
    now. Derived classes inherit the flag from base classes. Instances
    inherit the flag from classes.
 
    A get proceeds as usual until just before the object is returned.
    In addition to the current check whether the returned object is a
    method it would also check whether a returned object is an access
    handler. If so, it would invoke the getter method and return
    the value. To remove an attribute access handler you could directly
    fiddle with the dictionary.
 
    A set proceeds by checking the "I_have_computed_attributes" flag. If
    it is not set, everything proceeds as it does today. If it is set
    then we must do a dictionary get on the requested object name. If it
    returns an attribute access handler then we call the setter function
    with the value. If it returns any other object then we discard the
    result and continue as we do today. Note that having an attribute
    access handler will mildly affect attribute "setting" performance for
    all sets on a particular instance, but no more so than today, using
    __setattr__. Gets are more efficient than they are today with
    __getattr__.
 
    The I_have_computed_attributes flag is intended to eliminate the
    performance degradation of an extra "get" per "set" for objects not
    using this feature.  Checking this flag should have minuscule
    performance implications for all objects.

    The implementation of delete is analogous to the implementation
    of set.


Caveats

    1. You might note that I have not proposed any logic to keep
       the I_have_computed_attributes flag up to date as attributes
       are added and removed from the instance's dictionary. This is
       consistent with current Python. If you add a __setattr__ method
       to an object after it is in use, that method will not behave as
       it would if it were available at "compile" time. The dynamism is
       arguably not worth the extra implementation effort. This snippet
       demonstrates the current behavior:

        >>> def prn(*args): print args
        >>> class a:
        ...     __setattr__ = prn
        >>> a().foo = 5
        (<__main__.a instance at 882890>, 'foo', 5)

        >>> class b: pass
        >>> bi = b()
        >>> bi.__setattr__ = prn
        >>> bi.foo = 5

     2. Assignment to __dict__["XXX"] can overwrite the attribute
        access handler for __attr_XXX__.  Typically the access handlers
        will store information away in private __XXX variables.

     3. An attribute access handler that attempts to call setattr or
        getattr on the object itself can cause an infinite loop (as
        with __getattr__).  Once again, the solution is to use a
        special (typically private) variable such as __XXX.


Note

    The descriptor mechanism described in PEP 252 is powerful enough
    to support this more directly.  A 'getset' constructor may be
    added to the language making this possible:

      class C:
          def get_x(self):
              return self.__x
          def set_x(self, v):
              self.__x = v
          x = getset(get_x, set_x)

    Additional syntactic sugar might be added, or a naming convention
    could be recognized.
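
    The 'getset' constructor sketched above is essentially what Python
    2.2 later shipped as the property() built-in, so the example
    translates directly:

```python
class C:
    def get_x(self):
        return self.__x
    def set_x(self, v):
        self.__x = v
    x = property(get_x, set_x)

c = C()
c.x = 42          # goes through set_x
assert c.x == 42  # goes through get_x
```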



pep-0214 Extended Print Statement

PEP: 214
Title: Extended Print Statement
Version: $Revision$
Last-Modified: $Date$
Author: Barry Warsaw <barry at python.org>
Status: Final
Type: Standards Track
Created: 24-Jul-2000
Python-Version: 2.0
Post-History: 16-Aug-2000

Introduction

    This PEP describes a syntax to extend the standard `print'
    statement so that it can be used to print to any file-like object,
    instead of the default sys.stdout.  This PEP tracks the status and
    ownership of this feature.  It contains a description of the
    feature and outlines changes necessary to support the feature.
    This PEP summarizes discussions held in mailing list forums, and
    provides URLs for further information, where appropriate.  The CVS
    revision history of this file contains the definitive historical
    record.


Proposal

    This proposal introduces a syntax extension to the print
    statement, which allows the programmer to optionally specify the
    output file target.  An example usage is as follows:

        print >> mylogfile, 'this message goes to my log file'

    Formally, the syntax of the extended print statement is
    
        print_stmt: ... | '>>' test [ (',' test)+ [','] ]

    where the ellipsis indicates the original print_stmt syntax
    unchanged.  In the extended form, the expression just after >>
    must yield an object with a write() method (i.e. a file-like
    object).  Thus these two statements are equivalent:

        print 'hello world'
        print >> sys.stdout, 'hello world'

    As are these two statements:

        print
        print >> sys.stdout

    These two statements are syntax errors:

        print ,
        print >> sys.stdout,


Justification

    `print' is a Python keyword and introduces the print statement as
    described in section 6.6 of the language reference manual[1].
    The print statement has a number of features:

    - it auto-converts the items to strings
    - it inserts spaces between items automatically
    - it appends a newline unless the statement ends in a comma

    The formatting that the print statement performs is limited; for
    more control over the output, a combination of sys.stdout.write(),
    and string interpolation can be used.

    The print statement by definition outputs to sys.stdout.  More
    specifically, sys.stdout must be a file-like object with a write()
    method, but it can be rebound to redirect output to files other
    than specifically standard output.  A typical idiom is

        save_stdout = sys.stdout
        try:
            sys.stdout = mylogfile
            print 'this message goes to my log file'
        finally:
            sys.stdout = save_stdout

    The problem with this approach is that the binding is global, and
    so affects every statement inside the try: clause.  For example,
    if we added a call to a function that actually did want to print
    to stdout, this output too would get redirected to the logfile.

    This approach is also very inconvenient for interleaving prints to
    various output streams, and complicates coding in the face of
    legitimate try/except or try/finally clauses.


Reference Implementation

    A reference implementation, in the form of a patch against the
    Python 2.0 source tree, is available on SourceForge's patch
    manager[2].  This approach adds two new opcodes, PRINT_ITEM_TO and
    PRINT_NEWLINE_TO, which simply pop the file-like object off the
    top of the stack and use it instead of sys.stdout as the output
    stream.

    (This reference implementation has been adopted in Python 2.0.)


Alternative Approaches

    An alternative to this syntax change has been proposed (originally
    by Moshe Zadka) which requires no syntax changes to Python.  A
    writeln() function could be provided (possibly as a builtin), that
    would act much like extended print, with a few additional
    features.

        def writeln(*args, **kws):
            import sys
            file = sys.stdout
            sep = ' '
            end = '\n'
            if kws.has_key('file'):
                file = kws['file']
                del kws['file']
            if kws.has_key('nl'):
                if not kws['nl']:
                    end = ' '
                del kws['nl']
            if kws.has_key('sep'):
                sep = kws['sep']
                del kws['sep']
            if kws:
                raise TypeError('unexpected keywords')
            file.write(sep.join(map(str, args)) + end)

    writeln() takes three optional keyword arguments.  In the
    context of this proposal, the relevant argument is `file' which
    can be set to a file-like object with a write() method.  Thus

        print >> mylogfile, 'this goes to my log file'

    would be written as

        writeln('this goes to my log file', file=mylogfile)

    writeln() has the additional functionality that the keyword
    argument `nl' is a flag specifying whether to append a newline or
    not, and an argument `sep' which specifies the separator to output
    in between each item.
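
    For comparison only: Python 3's print() function (PEP 3105)
    eventually adopted this exact trio of keyword arguments, with
    writeln's `file`, `sep`, and `nl` becoming file, sep, and end:

```python
import io

buf = io.StringIO()
print('this goes to my log file', file=buf)      # writeln(..., file=...)
print('a', 'b', sep='-', end='!\n', file=buf)    # writeln's `sep` and `nl`
assert buf.getvalue() == 'this goes to my log file\na-b!\n'
```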


More Justification by the BDFL

    The proposal has been challenged on the newsgroup.  One series of
    challenges doesn't like '>>' and would rather see some other
    symbol.

    Challenge: Why not one of these?

        print in stderr items,.... 
        print + stderr items,.......
        print[stderr] items,.....
        print to stderr items,.....

    Response: If we want to use a special symbol (print <symbol>
    expression), the Python parser requires that it is not already a
    symbol that can start an expression -- otherwise it can't decide
    which form of print statement is used.  (The Python parser is a
    simple LL(1) or recursive descent parser.)

    This means that we can't use the "keyword only in context trick"
    that was used for "import as", because an identifier can start an
    expression.  This rules out +stderr, [stderr], and to stderr.  It
    leaves us with binary operator symbols and other miscellaneous
    symbols that are currently illegal here, such as 'import'.

    If I had to choose between 'print in file' and 'print >> file' I
    would definitely choose '>>'.  In part because 'in' would be a new
    invention (I know of no other language that uses it, while '>>' is
    used in sh, awk, Perl, and C++), in part because '>>', being
    non-alphabetic, stands out more so is more likely to catch the
    reader's attention.

    Challenge: Why does there have to be a comma between the file and
    the rest?

    Response: The comma separating the file from the following
    expression is necessary!  Of course you want the file to be an
    arbitrary expression, not just a single word.  (You definitely want
    to be able to write print >>sys.stderr.)  Without the comma the
    parser wouldn't be able to distinguish where the file expression
    ends and where the next one begins, e.g.

        print >>i +1, 2
        print >>a [1], 2
        print >>f (1), 2

    Challenge: Why do you need a syntax extension?  Why not
    writeln(file, item, ...)?

    Response: First of all, this is lacking a feature of the print
    statement: the trailing comma to print which suppresses the final
    newline.  Note that 'print a,' still isn't equivalent to
    'sys.stdout.write(a)' -- print inserts a space between items, and
    takes arbitrary objects as arguments; write() doesn't insert a
    space and requires a single string.

    When you are considering an extension for the print statement,
    it's not right to add a function or method that adds a new feature
    in one dimension (where the output goes) but takes away in another
    dimension (spaces between items, and the choice of trailing
    newline or not).  We could add a whole slew of methods or
    functions to deal with the various cases but that seems to add
    more confusion than necessary, and would only make sense if we
    were to deprecate the print statement altogether.

    I feel that this debate is really about whether print should have
    been a function or method rather than a statement.  If you are in
    the function camp, of course adding special syntax to the existing
    print statement is not something you like.  I suspect the
    objection to the new syntax comes mostly from people who already
    think that the print statement was a bad idea.  Am I right?

    About 10 years ago I debated with myself whether to make the most
    basic form of output a function or a statement; basically I was
    trying to decide between "print(item, ...)" and "print item, ...".
    I chose to make it a statement because printing needs to be taught
    very early on, and is very important in the programs that
    beginners write.  Also, because ABC, which led the way for so
    many things, made it a statement.  In a move that's typical for
    the interaction between ABC and Python, I changed the name from
    WRITE to print, and reversed the convention for adding newlines
    from requiring extra syntax to add a newline (ABC used trailing
    slashes to indicate newlines) to requiring extra syntax (the
    trailing comma) to suppress the newline.  I kept the feature that
    items are separated by whitespace on output.

    Full example: in ABC,

        WRITE 1
        WRITE 2/

    has the same effect as

        print 1,
        print 2

    has in Python, outputting in effect "1 2\n".

    I'm not 100% sure that the choice for a statement was right (ABC
    had the compelling reason that it used statement syntax for
    anything with side effects, but Python doesn't have this
    convention), but I'm also not convinced that it's wrong.  I
    certainly like the economy of the print statement.  (I'm a rabid
    Lisp-hater -- syntax-wise, not semantics-wise! -- and excessive
    parentheses in syntax annoy me.  Don't ever write return(i) or
    if(x==y): in your Python code! :-)

    Anyway, I'm not ready to deprecate the print statement, and over
    the years we've had many requests for an option to specify the
    file.

    Challenge: Why not > instead of >>?

    Response: To DOS and Unix users, >> suggests "append", while >
    suggests "overwrite"; the semantics are closest to append.  Also,
    for C++ programmers, >> and << are I/O operators.

    Challenge: But in C++, >> is input and << is output!

    Response: doesn't matter; C++ clearly took it from Unix and
    reversed the arrows.  The important thing is that for output, the
    arrow points to the file.

    Challenge: Surely you can design a println() function that can do
    everything print>>file can do; why isn't that enough?

    Response: I think of this in terms of a simple programming
    exercise.  Suppose a beginning programmer is asked to write a
    function that prints the tables of multiplication.  A reasonable
    solution is:

        def tables(n):
            for j in range(1, n+1):
                for i in range(1, n+1):
                    print i, 'x', j, '=', i*j
                print

    Now suppose the second exercise is to add printing to a different
    file.  With the new syntax, the programmer only needs to learn one
    new thing: print >> file, and the answer can be like this:

        def tables(n, file=sys.stdout):
            for j in range(1, n+1):
                for i in range(1, n+1):
                    print >> file, i, 'x', j, '=', i*j
                print >> file
        
    With only a print statement and a println() function, the
    programmer first has to learn about println(), transforming the
    original program to use println():

        def tables(n):
            for j in range(1, n+1):
                for i in range(1, n+1):
                    println(i, 'x', j, '=', i*j)
                println()

    and *then* about the file keyword argument:

        def tables(n, file=sys.stdout):
            for j in range(1, n+1):
                for i in range(1, n+1):
                    println(i, 'x', j, '=', i*j, file=file)
                println(file=file)

    Thus, the transformation path is longer:

        (1) print
        (2) print >> file

    vs.

        (1) print
        (2) println()
        (3) println(file=...)

    Note: defaulting the file argument to sys.stdout at compile time
    is wrong, because it doesn't work right when the caller assigns to
    sys.stdout and then uses tables() without specifying the file.
    This is a common problem (and would occur with a println()
    function too).  The standard solution so far has been:

        def tables(n, file=None):
            if file is None:
                file = sys.stdout
            for j in range(1, n+1):
                for i in range(1, n+1):
                    print >> file, i, 'x', j, '=', i*j
                print >> file

    I've added a feature to the implementation (which I would also
    recommend to println()) whereby if the file argument is None,
    sys.stdout is automatically used.  Thus,

        print >> None, foo, bar

    (or, of course, print >> x where x is a variable whose value is
    None) means the same as

        print foo, bar

    and the tables() function can be written as follows:

        def tables(n, file=None):
            for j in range(1, n+1):
                for i in range(1, n+1):
                    print >> file, i, 'x', j, '=', i*j
                print >> file

    [XXX this needs more justification, and a section of its own]
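
    As it happens, this None-means-sys.stdout convention is the one
    later adopted by the print() function in Python 3, whose file
    argument likewise treats None as sys.stdout.  A sketch of tables()
    in that spelling, offered for comparison only:

        import io

        def tables(n, file=None):
            # Python 3 spelling of the tables() above: print() is a
            # function, and file=None likewise means sys.stdout.
            for j in range(1, n + 1):
                for i in range(1, n + 1):
                    print(i, 'x', j, '=', i * j, file=file)
                print(file=file)

        # Redirecting the output to an explicit file object:
        buf = io.StringIO()
        tables(1, buf)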


References

    [1] http://docs.python.org/reference/simple_stmts.html#print
    [2] http://sourceforge.net/patch/download.php?id=100970



pep-0215 String Interpolation

PEP: 215
Title: String Interpolation
Version: $Revision$
Last-Modified: $Date$
Author: Ka-Ping Yee <ping at zesty.ca>
Status: Superseded
Type: Standards Track
Created: 24-Jul-2000
Python-Version: 2.1
Post-History: 
Superseded-By: 292

Abstract

    This document proposes a string interpolation feature for Python
    to allow easier string formatting.  The suggested syntax change
    is the introduction of a '$' prefix that triggers the special
    interpretation of the '$' character within a string, in a manner
    reminiscent of the variable interpolation found in Unix shells,
    awk, Perl, or Tcl.


Copyright

    This document is in the public domain.


Specification

    Strings may be preceded with a '$' prefix that comes before the
    leading single or double quotation mark (or triplet) and before
    any of the other string prefixes ('r' or 'u').  Such a string is
    processed for interpolation after the normal interpretation of
    backslash-escapes in its contents.  The processing occurs just
    before the string is pushed onto the value stack, each time the
    string is pushed.  In short, Python behaves exactly as if '$'
    were a unary operator applied to the string.  The operation
    performed is as follows:

    The string is scanned from start to end for the '$' character
    (\x24 in 8-bit strings or \u0024 in Unicode strings).  If there
    are no '$' characters present, the string is returned unchanged.

    Any '$' found in the string, followed by one of the two kinds of
    expressions described below, is replaced with the value of the
    expression as evaluated in the current namespaces.  The value is
    converted with str() if the containing string is an 8-bit string,
    or with unicode() if it is a Unicode string.

    1.  A Python identifier optionally followed by any number of
        trailers, where a trailer consists of:
            - a dot and an identifier,
            - an expression enclosed in square brackets, or
            - an argument list enclosed in parentheses
        (This is exactly the pattern expressed in the Python grammar
        by "NAME trailer*", using the definitions in Grammar/Grammar.)

    2.  Any complete Python expression enclosed in curly braces.

    Two dollar-signs ("$$") are replaced with a single "$".
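
    A minimal sketch of this substitution rule, handling only "$$",
    "$name", and "${expression}" (the trailer forms in item 1 are
    omitted for brevity, and eval() here stands in for evaluation in
    the current namespaces):

        import re

        def interpolate(template, namespace):
            # Handle, in order: '$$' -> '$', '${expression}', '$name'.
            pattern = r'\$\$|\$\{(?P<braced>[^}]*)\}|\$(?P<name>[A-Za-z_]\w*)'
            def repl(match):
                if match.group(0) == '$$':
                    return '$'
                expr = match.group('braced')
                if expr is None:
                    expr = match.group('name')
                return str(eval(expr, dict(namespace)))
            return re.sub(pattern, repl, template)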


Examples

    Here is an example of an interactive session exhibiting the
    expected behaviour of this feature.

        >>> a, b = 5, 6
        >>> print $'a = $a, b = $b'
        a = 5, b = 6
        >>> $u'uni${a}ode'
        u'uni5ode'
        >>> print $'\$a'
        5
        >>> print $r'\$a'
        \5
        >>> print $'$$$a.$b'
        $5.6
        >>> print $'a + b = ${a + b}'
        a + b = 11
        >>> import sys
        >>> print $'References to $a: $sys.getrefcount(a)'
        References to 5: 15
        >>> print $"sys = $sys, sys = $sys.modules['sys']"
        sys = <module 'sys' (built-in)>, sys = <module 'sys' (built-in)>
        >>> print $'BDFL = $sys.copyright.split()[4].upper()'
        BDFL = GUIDO


Discussion

    '$' is chosen as the interpolation character within the
    string for the sake of familiarity, since it is already used
    for this purpose in many other languages and contexts.

    It is then natural to choose '$' as a prefix, since it is a
    mnemonic for the interpolation character.

    Trailers are permitted to give this interpolation mechanism
    even more power than the interpolation available in most other
    languages, while the expression to be interpolated remains
    clearly visible and free of curly braces.

    '$' works like an operator and could be implemented as an
    operator, but that prevents the compile-time optimization
    and presents security issues.  So, it is only allowed as a
    string prefix.


Security Issues

    "$" has the power to eval, but only to eval a literal.  As
    described here (a string prefix rather than an operator), it
    introduces no new security issues since the expressions to be
    evaluated must be literally present in the code.


Implementation

    The Itpl module at http://www.lfw.org/python/Itpl.py provides a
    prototype of this feature.  It uses the tokenize module to find
    the end of an expression to be interpolated, then calls eval()
    on the expression each time a value is needed.  In the prototype,
    the expression is parsed and compiled again each time it is
    evaluated.

    As an optimization, interpolated strings could be compiled
    directly into the corresponding bytecode; that is,

        $'a = $a, b = $b'

    could be compiled as though it were the expression

        ('a = ' + str(a) + ', b = ' + str(b))

    so that it only needs to be compiled once.



pep-0216 Docstring Format

PEP: 216
Title: Docstring Format
Version: $Revision$
Last-Modified: $Date$
Author: Moshe Zadka <moshez at zadka.site.co.il>
Status: Rejected
Type: Informational
Created: 31-Jul-2000
Post-History: 
Superseded-By: 287

Notice

    This PEP is rejected by the author.  It has been superseded by PEP
    287.

Abstract

    Named Python objects, such as modules, classes and functions, have a
    string attribute called __doc__. If the first expression inside
    the definition is a literal string, that string is assigned
    to the __doc__ attribute.

    The __doc__ attribute is called a documentation string, or docstring.
    It is often used to summarize the interface of the module, class or
    function. However, since there is no common format for documentation
    strings, tools for extracting docstrings and transforming them into
    documentation in a standard format (e.g., DocBook) have not sprung
    up in abundance, and those that do exist are for the most part
    unmaintained and unused.

Perl Documentation

    In Perl, most modules are documented in a format called POD -- Plain
    Old Documentation. This is an easy-to-type, very low level format
    which integrates well with the Perl parser. Many tools exist to turn
    POD documentation into other formats: info, HTML and man pages, among
    others. However, in Perl, the information is not available at run-time.

Java Documentation

    In Java, special comments before classes and functions serve to
    document the code. A program to extract these comments and turn
    them into HTML documentation, called javadoc, is part of the
    standard Java distribution. However, the only output format that
    is supported is HTML, and JavaDoc has a very intimate relationship
    with HTML.

Python Docstring Goals

    Python documentation strings are easy to spot during parsing, and are
    also available to the runtime interpreter. This double purpose is
    a bit problematic, sometimes: for example, some are reluctant to have
    too long docstrings, because they do not want to take much space in
    the runtime. In addition, because of the current lack of tools, people
    read objects' docstrings by "print"ing them, so a tendency to make them
    brief and free of markup has sprung up. This tendency hinders writing
    better documentation-extraction tools, since it causes docstrings to
    contain little information, which is hard to parse.

High Level Solutions

    To counter the objection that the strings take up space in the running
    program, it is suggested that documentation extraction tools
    concatenate a maximum prefix of string literals appearing at the
    beginning of a definition. The first of these will also be available
    in the interactive interpreter, so it should contain a few summary
    lines.

Docstring Format Goals

    These are the goals for the docstring format, as discussed ad nauseam
    in the doc-sig.

    1. It must be easy to type with any standard text editor.
    2. It must be readable to the casual observer.
    3. It must not contain information which can be deduced from parsing 
       the module.
    4. It must contain sufficient information so it can be converted
       to any reasonable markup format.
    5. It must be possible to write a module's entire documentation in
       docstrings, without feeling hampered by the markup language.

Docstring Contents

    To satisfy requirement 5 above, we need to specify what must be
    expressible in docstrings.

    At least the following must be available:

    a. A tag that means "this is a Python ``something'', guess what"

    Example: In the sentence "The POP3 class", we need to mark up "POP3"
    accordingly. The parser will be able to guess it is a class from the
    contents of the poplib module, but we need to make it guess.

    b. Tags that mean "this is a Python class/module/class var/instance var..."

    Example: The usual Python idiom for a singleton class A is to have _A
    as the class, and A as a function which returns _A objects. It's usual
    to document the class, nonetheless, as being A. This requires the
    ability to say "The class A" and have A hyperlinked and marked up
    as a class.

    c. An easy way to include Python source code/Python interactive sessions
    d. Emphasis/bold
    e. List/tables

Docstring Basic Structure

    The documentation strings will be in StructuredTextNG
    (http://www.zope.org/Members/jim/StructuredTextWiki/StructuredTextNG)
    Since StructuredText is not yet strong enough to handle (a) and (b)
    above, we will need to extend it. I suggest using 
    '[<optional description>:python identifier]'. 
    E.g.: [class:POP3], [:POP3.list], etc. If the description is missing,
    a guess will be made from the text.
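
    A rough sketch of how such markers could be extracted from a
    docstring (the helper name and the exact character classes are
    illustrative assumptions, not part of this PEP):

        import re

        # Matches the suggested '[<optional description>:python identifier]'
        # form, e.g. [class:POP3] or [:POP3.list].
        MARKER = re.compile(r'\[(?P<desc>[^:\]]*):(?P<ident>[A-Za-z_][\w.]*)\]')

        def find_markers(docstring):
            """Return (description, identifier) pairs found in a docstring."""
            return [m.group('desc', 'ident') for m in MARKER.finditer(docstring)]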

Unresolved Issues

    Is there a way to escape characters in ST? If so, how?
    (example: * at the beginning of a line without being a bullet symbol)

    Is my suggestion above for Python symbols compatible with ST-NG?
    How hard would it be to extend ST-NG to support it?

    How do we describe input and output types of functions?

    What additional constraints do we enforce on each docstring
    (module/class/function)?

    What are the guesser rules?

Rejected Suggestions

    XML -- it's very hard to type, and too cluttered to read
           comfortably.



pep-0217 Display Hook for Interactive Use

PEP: 217
Title: Display Hook for Interactive Use
Version: $Revision$
Last-Modified: $Date$
Author: Moshe Zadka <moshez at zadka.site.co.il>
Status: Final
Type: Standards Track
Created: 31-Jul-2000
Python-Version: 2.1
Post-History: 

Abstract

    Python's interactive mode is one of the implementation's great
    strengths -- being able to write expressions on the command line
    and get back meaningful output.  However, the output function
    cannot be all things to all people, and the current output
    function too often falls short of this goal.  This PEP describes a
    way to provide alternatives to the built-in display function in
    Python, so users will have control over the output from the
    interactive interpreter.

Interface

    The current Python solution has worked for many users, and this
    should not break it. Therefore, in the default configuration,
    nothing will change in the REPL loop. To change the way the
    interpreter prints interactively entered expressions, users
    will have to rebind sys.displayhook to a callable object.
    The result of calling this object with the result of the
    interactively entered expression should be print-able,
    and this is what will be printed on sys.stdout.

Solution

    The bytecode PRINT_EXPR will call sys.displayhook(POP()).  A
    displayhook() function will be added to the sys built-in module,
    which is equivalent to:

    import __builtin__
    def displayhook(o):
        if o is None:
            return
        __builtin__._ = None
        print `o`
        __builtin__._ = o
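
    In Python 3 spelling, an equivalent hook can be written and
    installed as follows (a sketch of the same logic as the
    pseudocode above):

        import builtins
        import sys

        def show(value):
            # Equivalent of the pseudocode above: skip None, rebind '_'
            # to the displayed value, and print its repr.
            if value is None:
                return
            builtins._ = None
            print(repr(value))
            builtins._ = value

        sys.displayhook = show  # takes effect in the interactive interpreter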
        

Jython Issues

    The method Py.printResult will be similarly changed.


pep-0218 Adding a Built-In Set Object Type

PEP: 218
Title: Adding a Built-In Set Object Type
Version: $Revision$
Last-Modified: $Date$
Author: Greg Wilson <gvwilson at ddj.com>, Raymond Hettinger <python at rcn.com>
Status: Final
Type: Standards Track
Created: 31-Jul-2000
Python-Version: 2.2
Post-History: 

Introduction

    This PEP proposes adding a Set module to the standard Python
    library, and to then make sets a built-in Python type if that
    module is widely used.  After explaining why sets are desirable,
    and why the common idiom of using dictionaries in their place is
    inadequate, we describe how we intend built-in sets to work, and
    then how the preliminary Set module will behave.  The last
    section discusses the mutability (or otherwise) of sets and set
    elements, and the solution which the Set module will implement.


Rationale

    Sets are a fundamental mathematical structure, and are very
    commonly used in algorithm specifications.  They are much less
    frequently used in implementations, even when they are the "right"
    structure.  Programmers frequently use lists instead, even when
    the ordering information in lists is irrelevant, and by-value
    lookups are frequent.  (Most medium-sized C programs contain a
    depressing number of start-to-end searches through malloc'd
    vectors to determine whether particular items are present or
    not...)

    Programmers are often told that they can implement sets as
    dictionaries with "don't care" values.  Items can be added to
    these "sets" by assigning the "don't care" value to them;
    membership can be tested using "dict.has_key"; and items can be
    deleted using "del".  However, the other main operations on sets
    (union, intersection, and difference) are not directly supported
    by this representation, since their meaning is ambiguous for
    dictionaries containing key/value pairs.


Proposal

    The long-term goal of this PEP is to add a built-in set type to
    Python.  This type will be an unordered collection of unique
    values, just as a dictionary is an unordered collection of
    key/value pairs.

    Iteration and comprehension will be implemented in the obvious
    ways, so that:

        for x in S:

    will step through the elements of S in arbitrary order, while:

        set(x**2 for x in S)

    will produce a set containing the squares of all elements in S.
    Membership will be tested using "in" and "not in", and basic set
    operations will be implemented by a mixture of overloaded
    operators:

        |               union
        &               intersection
        ^               symmetric difference
        -               asymmetric difference
        == !=           equality and inequality tests
        < <= >= >       subset and superset tests


    and methods:

        S.add(x)        Add "x" to the set.

        S.update(s)     Add all elements of sequence "s" to the set.

        S.remove(x)     Remove "x" from the set.  If "x" is not
                        present, this method raises a LookupError
                        exception.

        S.discard(x)    Remove "x" from the set if it is present, or
                        do nothing if it is not.

        S.pop()         Remove and return an arbitrary element,
                        raising a LookupError if the set is
                        empty.

        S.clear()       Remove all elements from this set.

        S.copy()        Make a new set with the same elements.

        S.issuperset(s) Check whether the set is a superset of "s".

        S.issubset(s)   Check whether the set is a subset of "s".
        

    and two new built-in conversion functions:

        set(x)          Create a set containing the elements of the
                        collection "x".

        frozenset(x)    Create an immutable set containing the elements
                        of the collection "x".

    Notes:

    1. We propose using the bitwise operators "|" and "&" for union
       and intersection.  While "+" for union would be intuitive, "*"
       for intersection is not (very few of the people asked guessed
       what it did correctly).

    2. We considered using "+" to add elements to a set, rather than
       "add".  However, Guido van Rossum pointed out that "+" is
       symmetric for other built-in types (although "*" is not).  Use
       of "add" will also avoid confusion between that operation and
       set union.
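
    These operations shipped essentially as specified; a short
    demonstration with the built-in set and frozenset types that
    landed in Python 2.4:

        S = set([1, 2, 3])
        T = set([2, 3, 4])

        assert S | T == set([1, 2, 3, 4])   # union
        assert S & T == set([2, 3])         # intersection
        assert S ^ T == set([1, 4])         # symmetric difference
        assert S - T == set([1])            # asymmetric difference

        S.add(5)                            # grow the set in place
        S.discard(99)                       # absent element: no error
        assert S.issuperset(set([1, 2]))

        F = frozenset(S)                    # immutable: usable as a dict key
        assert {F: 'ok'}[F] == 'ok'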


Set Notation

    The PEP originally proposed {1,2,3} as the set notation and {-} for
    the empty set.  Experience with Python 2.3's sets.py showed that
    the notation was not necessary.  Also, there was some risk of making
    dictionaries less instantly recognizable.

    It was also contemplated that the braced notation would support set
    comprehensions; however, Python 2.4 provided generator expressions
    which fully met that need and did so in a more general way.
    (See PEP 289 for details on generator expressions).

    So, Guido ruled that there would not be a set syntax; however, the
    issue could be revisited for Python 3000 (see PEP 3000).


History

    To gain experience with sets, a pure python module was introduced
    in Python 2.3.  Based on that implementation, the set and frozenset
    types were introduced in Python 2.4.  The improvements are:

        * Better hash algorithm for frozensets
        * More compact pickle format (storing only an element list
          instead of a dictionary of key:value pairs where the value
          is always True).
        * Use a __reduce__ function so that deep copying is automatic.
        * The BaseSet concept was eliminated.
        * The union_update() method became just update().
        * Auto-conversion between mutable and immutable sets was dropped.
        * The _repr method was dropped (the need is met by the new
          sorted() built-in function).

    Tim Peters believes that the class's constructor should take a
    single sequence as an argument, and populate the set with that
    sequence's elements.  His argument is that in most cases,
    programmers will be creating sets from pre-existing sequences, so
    that this case should be the common one.  However, this would
    require users to remember an extra set of parentheses when
    initializing a set with known values:

    >>> Set((1, 2, 3, 4))       # case 1

    On the other hand, feedback from a small number of novice Python
    users (all of whom were very experienced with other languages)
    indicates that people will find a "parenthesis-free" syntax more
    natural:

    >>> Set(1, 2, 3, 4)         # case 2

    Ultimately, we adopted the first strategy in which the initializer
    takes a single iterable argument.


Mutability

    The most difficult question to resolve in this proposal was
    whether sets ought to be able to contain mutable elements.  A
    dictionary's keys must be immutable in order to support fast,
    reliable lookup.  While it would be easy to require set elements
    to be immutable, this would preclude sets of sets (which are
    widely used in graph algorithms and other applications).

    Earlier drafts of PEP 218 had only a single set type, but the
    sets.py implementation in Python 2.3 has two, Set and
    ImmutableSet.  For Python 2.4, the new built-in types were named
    set and frozenset which are slightly less cumbersome.

    There are two classes implemented in the "sets" module.  Instances
    of the Set class can be modified by the addition or removal of
    elements, and the ImmutableSet class is "frozen", with an
    unchangeable collection of elements.  Therefore, an ImmutableSet
    may be used as a dictionary key or as a set element, but cannot be
    updated.  Both types of set require that their elements are
    immutable, hashable objects.  Parallel comments apply to the "set"
    and "frozenset" built-in types.


Copyright

    This document has been placed in the Public Domain.



pep-0219 Stackless Python

PEP: 219
Title: Stackless Python
Version: $Revision$
Last-Modified: $Date$
Author: Gordon McMillan <gmcm at hypernet.com>
Status: Deferred
Type: Standards Track
Created: 14-Aug-2000
Python-Version: 2.1
Post-History: 

Introduction

    This PEP discusses changes required to core Python in order to
    efficiently support generators, microthreads and coroutines. It is
    related to PEP 220, which describes how Python should be extended
    to support these facilities. The focus of this PEP is strictly on
    the changes required to allow these extensions to work.

    While these PEPs are based on Christian Tismer's Stackless[1]
    implementation, they do not regard Stackless as a reference
    implementation.  Stackless (with an extension module) implements
    continuations, and from continuations one can implement
    coroutines, microthreads (as has been done by Will Ware[2]) and
    generators. But in more than a year, no one has found any other
    productive use of continuations, so there seems to be no demand
    for their support.

    However, Stackless support for continuations is a relatively minor
    piece of the implementation, so one might regard it as "a"
    reference implementation (rather than "the" reference
    implementation).


Background

    Generators and coroutines have been implemented in a number of
    languages in a number of ways. Indeed, Tim Peters has done pure
    Python implementations of generators[3] and coroutines[4] using
    threads (and a thread-based coroutine implementation exists for
    Java). However, the horrendous overhead of a thread-based
    implementation severely limits the usefulness of this approach.

    Microthreads (a.k.a "green" or "user" threads) and coroutines
    involve transfers of control that are difficult to accommodate in
    a language implementation based on a single stack. (Generators can
    be done on a single stack, but they can also be regarded as a very
    simple case of coroutines.)

    Real threads allocate a full-sized stack for each thread of
    control, and this is the major source of overhead. However,
    coroutines and microthreads can be implemented in Python in a way
    that involves almost no overhead.  This PEP, therefore, offers a
    way of making Python able to realistically manage thousands of
    separate "threads" of activity (vs. today's limit of perhaps dozens
    of separate threads of activity).

    Another justification for this PEP (explored in PEP 220) is that
    coroutines and generators often allow a more direct expression of
    an algorithm than is possible in today's Python.


Discussion

    The first thing to note is that, while Python mingles
    interpreter data (normal C stack usage) with Python data (the
    state of the interpreted program) on the stack, the two are
    logically separate. They just happen to use the same stack.

    A real thread gets something approaching a process-sized stack
    because the implementation has no way of knowing how much stack
    space the thread will require. The stack space required for an
    individual frame is likely to be reasonable, but stack switching
    is an arcane and non-portable process, not supported by C.

    Once Python stops putting Python data on the C stack, however,
    stack switching becomes easy.

    The fundamental approach of the PEP is based on these two
    ideas. First, separate C's stack usage from Python's stack
    usage. Secondly, associate with each frame enough stack space to
    handle that frame's execution.

    In the normal usage, Stackless Python has a normal stack
    structure, except that it is broken into chunks. But in the
    presence of a coroutine / microthread extension, this same
    mechanism supports a stack with a tree structure.  That is, an
    extension can support transfers of control between frames outside
    the normal "call / return" path.


Problems

    The major difficulty with this approach is C calling Python. The
    problem is that the C stack now holds a nested execution of the
    byte-code interpreter. In that situation, a coroutine /
    microthread extension cannot be permitted to transfer control to a
    frame in a different invocation of the byte-code interpreter. If a
    frame were to complete and exit back to C from the wrong
    interpreter, the C stack could be trashed.

    The ideal solution is to create a mechanism where nested
    executions of the byte code interpreter are never needed. The easy
    solution is for the coroutine / microthread extension(s) to
    recognize the situation and refuse to allow transfers outside the
    current invocation.

    We can categorize code that involves C calling Python into two
    camps: Python's implementation, and C extensions. And hopefully we
    can offer a compromise: Python's internal usage (and C extension
    writers who want to go to the effort) will no longer use a nested
    invocation of the interpreter. Extensions which do not go to the
    effort will still be safe, but will not play well with coroutines
    / microthreads.

    Generally, when a recursive call is transformed into a loop, a bit
    of extra bookkeeping is required. The loop will need to keep its
    own "stack" of arguments and results since the real stack can now
    only hold the most recent. The code will be more verbose, because
    it's not quite as obvious when we're done. While Stackless is not
    implemented this way, it has to deal with the same issues.
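
    The bookkeeping point can be illustrated in miniature: a recursive
    traversal rewritten as a loop must carry its own explicit stack of
    pending work (a sketch of the transformation, not Stackless code):

        def depth(tree):
            """Depth of a nested-list tree, computed without recursion:
            the loop keeps its own stack of (node, depth) work items."""
            best = 0
            stack = [(tree, 1)]
            while stack:
                node, d = stack.pop()
                best = max(best, d)
                for child in node:
                    if isinstance(child, list):
                        stack.append((child, d + 1))
            return best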

    In normal Python, PyEval_EvalCode is used to build a frame and
    execute it. Stackless Python introduces the concept of a
    FrameDispatcher. Like PyEval_EvalCode, it executes one frame. But
    the interpreter may signal the FrameDispatcher that a new frame
    has been swapped in, and the new frame should be executed. When a
    frame completes, the FrameDispatcher follows the back pointer to
    resume the "calling" frame.

    So Stackless transforms recursions into a loop, but it is not the
    FrameDispatcher that manages the frames. This is done by the
    interpreter (or an extension that knows what it's doing).

    The general idea is that where C code needs to execute Python
    code, it creates a frame for the Python code, setting its back
    pointer to the current frame. Then it swaps in the frame, signals
    the FrameDispatcher and gets out of the way. The C stack is now
    clean - the Python code can transfer control to any other frame
    (if an extension gives it the means to do so).

    In the vanilla case, this magic can be hidden from the programmer
    (even, in most cases, from the Python-internals programmer). Many
    situations present another level of difficulty, however.

    The map builtin function presents two obstacles to this
    approach. It cannot simply construct a frame and get out of the
    way, not just because there's a loop involved, but because each
    pass through the loop requires some "post" processing. In order to
    play well with others, Stackless constructs a frame object for map
    itself.

    Most recursions of the interpreter are not this complex, but
    fairly frequently, some "post" operations are required. Stackless
    does not fix these situations, because of the amount of code
    changes required. Instead, Stackless prohibits transfers out of a
    nested interpreter. While not ideal (and sometimes puzzling), this
    limitation is hardly crippling.
    

Advantages

    For normal Python, the advantage to this approach is that C stack
    usage becomes much smaller and more predictable. Unbounded
    recursion in Python code becomes a memory error, instead of a
    stack error (and thus, in non-Cupertino operating systems,
    something that can be recovered from).  The price, of course, is
    the added complexity that comes from transforming recursions of
    the byte-code interpreter loop into a higher order loop (and the
    attendant bookkeeping involved).

    The big advantage comes from realizing that the Python stack is
    really a tree, and the frame dispatcher can transfer control
    freely between leaf nodes of the tree, thus allowing things like
    microthreads and coroutines.


References

    [1] http://www.stackless.com
    [2] http://world.std.com/~wware/uthread.html
    [3] Demo/threads/Generator.py in the source distribution
    [4] http://www.stackless.com/coroutines.tim.peters.html



pep-0220 Coroutines, Generators, Continuations

PEP: 220
Title: Coroutines, Generators, Continuations
Version: $Revision$
Last-Modified: $Date$
Author: Gordon McMillan <gmcm at hypernet.com>
Status: Rejected
Type: Informational
Created: 14-Aug-2000
Post-History: 

Abstract

    Demonstrates why the changes described in the stackless PEP are
    desirable.  A low-level continuations module exists.  With it,
    coroutines and generators and "green" threads can be written.  A
    higher level module that makes coroutines and generators easy to
    create is desirable (and being worked on).  The focus of this PEP
    is on showing how coroutines, generators, and green threads can
    simplify common programming problems.



pep-0221 Import As

PEP: 221
Title: Import As
Version: $Revision$
Last-Modified: $Date$
Author: Thomas Wouters <thomas at python.org>
Status: Final
Type: Standards Track
Created: 15-Aug-2000
Python-Version: 2.0
Post-History: 

Introduction

    This PEP describes the `import as' proposal for Python 2.0.  This
    PEP tracks the status and ownership of this feature.  It contains
    a description of the feature and outlines changes necessary to
    support the feature.  The CVS revision history of this file
    contains the definitive historical record.


Rationale

    This PEP proposes an extension of Python syntax regarding the
    `import' and `from <module> import' statements.  These statements
    load in a module, and either bind that module to a local name, or
    bind objects from that module to a local name.  However, it is
    sometimes desirable to bind those objects to a different name, for
    instance to avoid name clashes.  This can currently be achieved
    using the following idiom:

        import os
        real_os = os
        del os
    
    And similarly for the `from ... import' statement:
    
        from os import fdopen, exit, stat
        os_fdopen = fdopen
        os_stat = stat
        del fdopen, stat
    
    The proposed syntax change would add an optional `as' clause to
    both these statements, as follows:

        import os as real_os
        from os import fdopen as os_fdopen, exit, stat as os_stat
    
    The `as' name is not intended to be a keyword, and some trickery
    has to be used to convince the CPython parser it isn't one.  For
    more advanced parsers/tokenizers, however, this should not be a
    problem.

    A slightly special case exists for importing sub-modules.  The
    statement

        import os.path    

    stores the module `os' locally as `os', so that the imported
    submodule `path' is accessible as `os.path'.  As a result,
    
        import os.path as p

    stores `os.path', not `os', in `p'.  This makes it effectively the
    same as
    
        from os import path as p
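    Modern Python retains these semantics, so the equivalence can be
    checked directly (a small illustration, not part of the original
    proposal):

```python
import os.path as p        # stores `os.path', not `os', in p
from os import path as q   # the equivalent form described above

assert p is q              # both names are bound to the same module object

import os.path             # binds `os'; the submodule is reachable as os.path
assert os.path is p
```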


Implementation details

    This PEP has been accepted, and the suggested code change has been
    checked in.  The patch can still be found in the SourceForge patch
    manager[1].  Currently, a NAME field is used in the grammar rather
    than a bare string, to avoid the keyword issue.  It introduces a
    new bytecode, IMPORT_STAR, which performs the `from module import
    *' behaviour, and changes the behaviour of the IMPORT_FROM
    bytecode so that it loads the requested name (which is always a
    single name) onto the stack, to be subsequently stored by a STORE
    opcode. As a result, all names explicitly imported now follow the
    `global' directives.
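    This `global' interaction still holds in modern Python and can be
    observed directly (a minimal illustration; the function name is
    arbitrary):

```python
import os

def f():
    global path
    from os import path   # an explicit import honors the `global' directive

f()
assert path is os.path    # the module-level name was bound, not a local
```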

    The special case of `from module import *' remains a special case,
    in that it cannot accommodate an `as' clause, and that no STORE
    opcodes are generated; the objects imported are loaded directly
    into the local namespace. This also means that names imported in
    this fashion are always local, and do not follow the `global'
    directive.
    
    An additional change to this syntax has also been suggested, to
    generalize the expression given after the `as' clause.  Rather
    than a single name, it could be allowed to be any expression that
    yields a valid l-value; anything that can be assigned to.  The
    change to accommodate this is minimal, as the patch[2] proves, and
    the resulting generalization allows a number of new constructs
    that run completely parallel with other Python assignment
    constructs. However, this idea has been rejected by Guido, as
    `hypergeneralization'.


Copyright

    This document has been placed in the Public Domain.


References

    [1] http://sourceforge.net/patch/?func=detailpatch&patch_id=101135&group_id=5470

    [2] http://sourceforge.net/patch/?func=detailpatch&patch_id=101234&group_id=5470



pep-0222 Web Library Enhancements

PEP: 222
Title: Web Library Enhancements
Version: $Revision$
Last-Modified: $Date$
Author: A.M. Kuchling <amk at amk.ca>
Status: Deferred
Type: Standards Track
Created: 18-Aug-2000
Python-Version: 2.1
Post-History: 22-Dec-2000

Abstract

    This PEP proposes a set of enhancements to the CGI development
    facilities in the Python standard library.  Enhancements might be
    new features, new modules for tasks such as cookie support, or
    removal of obsolete code.

    The original intent was to make improvements to Python 2.1.
    However, there seemed little interest from the Python community,
    and time was lacking, so this PEP has been deferred to some future
    Python release.
    

Open Issues

    This section lists changes that have been suggested, but about
    which no firm decision has yet been made.  In the final version of
    this PEP, this section should be empty, as all the changes should
    be classified as accepted or rejected.

    cgi.py: We should not be told to create our own subclass just so
    we can handle file uploads. As a practical matter, I have yet to
    find the time to do this right, so I end up reading cgi.py's temp
    file into, at best, another file. Some of our legacy code actually
    reads it into a second temp file, then into a final destination!
    And even if we did, that would mean creating yet another object
    with its __init__ call and associated overhead.

    cgi.py: Currently, query data with no `=' are ignored.  Even if
    keep_blank_values is set, queries like `...?value=&...' are
    returned with blank values but queries like `...?value&...' are
    completely lost.  It would be great if such data were made
    available through the FieldStorage interface, either as entries
    with None as values, or in a separate list.
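    In the modern standard library this complaint is addressed by
    urllib.parse.parse_qsl: with keep_blank_values enabled, bare names
    such as `value' are preserved with an empty string value rather
    than being lost (an illustration of today's behavior, not of the
    cgi.py module discussed here):

```python
from urllib.parse import parse_qsl

# Both "value=" and the bare "other" survive when blanks are kept.
pairs = parse_qsl("value=&other", keep_blank_values=True)
assert pairs == [("value", ""), ("other", "")]
```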

    Utility function: build a query string from a list of 2-tuples
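    This utility now exists in the modern standard library as
    urllib.parse.urlencode, which accepts exactly such a list of
    2-tuples:

```python
from urllib.parse import urlencode

query = urlencode([("name", "guido"), ("lang", "python")])
assert query == "name=guido&lang=python"
```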

    Dictionary-related utility classes: NoKeyErrors (returns an empty
    string, never a KeyError), PartialStringSubstitution (returns 
    the original key string, never a KeyError)
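    A minimal sketch of the two suggested classes, using dict's
    __missing__ hook (the class names follow the wording above and are
    illustrative only):

```python
class NoKeyErrors(dict):
    def __missing__(self, key):
        return ""                    # unknown keys yield an empty string

class PartialStringSubstitution(dict):
    def __missing__(self, key):
        return "%(" + key + ")s"     # unknown keys reproduce the original key

d = NoKeyErrors(a="1")
assert d["missing"] == ""

# Partial substitution: known keys are filled in, unknown ones survive.
s = "%(a)s and %(b)s" % PartialStringSubstitution(a="x")
assert s == "x and %(b)s"
```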


New Modules

    This section lists details about entire new packages or modules
    that should be added to the Python standard library.

    * fcgi.py : A new module adding support for the FastCGI protocol.
      Robin Dunn's code needs to be ported to Windows, though.

Major Changes to Existing Modules

    This section lists details of major changes to existing modules,
    whether in implementation or in interface.  The changes in this
    section therefore carry greater degrees of risk, either in
    introducing bugs or a backward incompatibility.

    The cgi.py module would be deprecated.  (XXX A new module or
    package name hasn't been chosen yet: 'web'?  'cgilib'?)

Minor Changes to Existing Modules

    This section lists details of minor changes to existing modules.
    These changes should have relatively small implementations, and
    have little risk of introducing incompatibilities with previous
    versions.


Rejected Changes

    The changes listed in this section were proposed for Python 2.1,
    but were rejected as unsuitable.  For each rejected change, a
    rationale is given describing why the change was deemed
    inappropriate.

    * An HTML generation module is not part of this PEP.  Several such
      modules exist, ranging from HTMLgen's purely programming
      interface to ASP-inspired simple templating to DTML's complex
      templating.  There's no indication of which templating module to
      enshrine in the standard library, and that probably means that
      no module should be so chosen.

    * cgi.py: Allowing a combination of query data and POST data.
      This doesn't seem to be standard at all, and therefore is
      dubious practice.

Proposed Interface

    XXX open issues: naming convention (studlycaps or
    underline-separated?); need to look at the cgi.parse*() functions
    and see if they can be simplified, too.

    Parsing functions: carry over most of the parse* functions from
    cgi.py
    
    # The Response class borrows most of its methods from Zope's
    # HTTPResponse class.
    
    class Response:
        """
        Attributes:
        status: HTTP status code to return
        headers: dictionary of response headers
        body: string containing the body of the HTTP response
        """
        
        def __init__(self, status=200, headers={}, body=""):
            pass
    
        def setStatus(self, status, reason=None):
            "Set the numeric HTTP response code"
            pass
    
        def setHeader(self, name, value):
            "Set an HTTP header"
            pass
    
        def setBody(self, body):
            "Set the body of the response"
            pass
    
        def setCookie(self, name, value,
                      path = '/',  
                      comment = None, 
                      domain = None, 
                      max_age = None,
                      expires = None,
                      secure = 0
                      ):
            "Set a cookie"
            pass
    
        def expireCookie(self, name):
            "Remove a cookie from the user"
            pass
    
        def redirect(self, url):
            "Redirect the browser to another URL"
            pass
    
        def __str__(self):
            "Convert entire response to a string"
            pass
    
        def dump(self):
            "Return a string representation useful for debugging"
            pass
            
        # XXX methods for specific classes of error:serverError, 
        # badRequest, etc.?
    
    
    class Request:
    
        """
        Attributes: 

        XXX should these be dictionaries, or dictionary-like objects?
        .headers : dictionary containing HTTP headers
        .cookies : dictionary of cookies
        .fields  : data from the form
        .env     : environment dictionary
        """
        
        def __init__(self, environ=os.environ, stdin=sys.stdin,
                     keep_blank_values=1, strict_parsing=0):
            """Initialize the request object, using the provided environment
            and standard input."""
            pass
    
        # Should people just use the dictionaries directly?
        def getHeader(self, name, default=None):
            pass
    
        def getCookie(self, name, default=None):
            pass
    
        def getField(self, name, default=None):
            "Return field's value as a string (even if it's an uploaded file)"
            pass
            
        def getUploadedFile(self, name):
            """Returns a file object that can be read to obtain the contents
            of an uploaded file.  XXX should this report an error if the 
            field isn't actually an uploaded file?  Or should it wrap
            a StringIO around simple fields for consistency?
            """
            
        def getURL(self, n=0, query_string=0):
            """Return the URL of the current request, chopping off 'n' path
            components from the right.  Eg. if the URL is
            "http://foo.com/bar/baz/quux", n=2 would return
            "http://foo.com/bar".  Does not include the query string (if
            any)
            """

        def getBaseURL(self, n=0):
            """Return the base URL of the current request, adding 'n' path
            components to the end to recreate more of the whole URL.  
            
            Eg. if the request URL is
            "http://foo.com/q/bar/baz/qux", n=0 would return
            "http://foo.com/", and n=2 "http://foo.com/q/bar".
            
            Returned URL does not include the query string, if any.
            """
        
        def dump(self):
            "String representation suitable for debugging output"
            pass
    
        # Possibilities?  I don't know if these are worth doing in the 
        # basic objects.
        def getBrowser(self):
            "Returns Mozilla/IE/Lynx/Opera/whatever"
    
        def isSecure(self):
            "Return true if this is an SSLified request"
            

    # Module-level function        
    def wrapper(func, logfile=sys.stderr):
        """
        Calls the function 'func', passing it the arguments
        (request, response, logfile).  Exceptions are trapped and
        sent to the file 'logfile'.  
        """
        # This wrapper will detect if it's being called from the command-line,
        # and if so, it will run in a debugging mode; name=value pairs 
        # can be entered on standard input to set field values.
        # (XXX how to do file uploads in this syntax?)
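    A minimal runnable sketch of this wrapper; Request and Response
    stand-ins are used because the classes above are only an interface
    proposal:

```python
import sys
import traceback

def wrapper(func, logfile=sys.stderr):
    # Stand-in objects; the real proposal would construct Request/Response.
    request, response = object(), object()
    try:
        func(request, response, logfile)
    except Exception:
        # Trap the exception and send the traceback to `logfile'.
        traceback.print_exc(file=logfile)
```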

    

Copyright

    This document has been placed in the public domain.



pep-0223 Change the Meaning of \x Escapes

PEP: 223
Title: Change the Meaning of \x Escapes
Version: $Revision$
Last-Modified: $Date$
Author: Tim Peters <tim at zope.com>
Status: Final
Type: Standards Track
Created: 20-Aug-2000
Python-Version: 2.0
Post-History: 23-Aug-2000

Abstract

    Change \x escapes, in both 8-bit and Unicode strings, to consume
    exactly the two hex digits following.  The proposal views this as
    correcting an original design flaw, leading to clearer expression
    in all flavors of string, a cleaner Unicode story, better
    compatibility with Perl regular expressions, and with minimal risk
    to existing code.


Syntax

    The syntax of \x escapes, in all flavors of non-raw strings, becomes

        \xhh

    where h is a hex digit (0-9, a-f, A-F).  The exact syntax in 1.5.2 is
    not clearly specified in the Reference Manual; it says

        \xhh...

    implying "two or more" hex digits, but one-digit forms are also
    accepted by the 1.5.2 compiler, and a plain \x is "expanded" to
    itself (i.e., a backslash followed by the letter x).  It's unclear
    whether the Reference Manual intended either of the 1-digit or
    0-digit behaviors.


Semantics

    In an 8-bit non-raw string,
        \xij
    expands to the character
        chr(int(ij, 16))
    Note that this is the same as in 1.6 and before.

    In a Unicode string,
        \xij
    acts the same as
        \u00ij
    i.e. it expands to the obvious Latin-1 character from the initial
    segment of the Unicode space.

    An \x not followed by at least two hex digits is a compile-time error,
    specifically ValueError in 8-bit strings, and UnicodeError (a subclass
    of ValueError) in Unicode strings.  Note that if an \x is followed by
    more than two hex digits, only the first two are "consumed".  In 1.6
    and before all but the *last* two were silently ignored.
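    Modern Python retains these semantics; a quick illustration (not
    part of the original PEP):

```python
# \x consumes exactly two hex digits; anything after is left alone.
s = "\x413465"
assert s == "A3465"            # \x41 -> 'A', "3465" untouched

# In (Unicode) strings, \xhh means the same as \u00hh.
assert "\xe9" == "\u00e9"
```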


Example

    In 1.5.2:

        >>> "\x123465"  # same as "\x65"
        'e'
        >>> "\x65"
        'e'
        >>> "\x1"
        '\001'
        >>> "\x\x"
        '\\x\\x'
        >>>

    In 2.0:

        >>> "\x123465" # \x12 -> \022, "3465" left alone
        '\0223465'
        >>> "\x65"
        'e'
        >>> "\x1"
        [ValueError is raised]
        >>> "\x\x"
        [ValueError is raised]
        >>>


History and Rationale

    \x escapes were introduced in C as a way to specify variable-width
    character encodings.  Exactly which encodings those were, and how many
    hex digits they required, was left up to each implementation.  The
    language simply stated that \x "consumed" *all* hex digits following,
    and left the meaning up to each implementation.  So, in effect, \x in C
    is a standard hook to supply platform-defined behavior.

    Because Python explicitly aims at platform independence, the \x escape
    in Python (up to and including 1.6) has been treated the same way
    across all platforms:  all *except* the last two hex digits were
    silently ignored.  So the only actual use for \x escapes in Python was
    to specify a single byte using hex notation.

    Larry Wall appears to have realized that this was the only real use for
    \x escapes in a platform-independent language, as the proposed rule for
    Python 2.0 is in fact what Perl has done from the start (although you
    need to run in Perl -w mode to get warned about \x escapes with fewer
    than 2 hex digits following -- it's clearly more Pythonic to insist on
    2 all the time).

    When Unicode strings were introduced to Python, \x was generalized so
    as to ignore all but the last *four* hex digits in Unicode strings.
    This caused a technical difficulty for the new regular expression engine:
    SRE tries very hard to allow mixing 8-bit and Unicode patterns and
    strings in intuitive ways, and it no longer had any way to guess what,
    for example, r"\x123456" should mean as a pattern:  is it asking to match
    the 8-bit character \x56 or the Unicode character \u3456?

    There are hacky ways to guess, but it doesn't end there.  The ISO C99
    standard also introduces 8-digit \U12345678 escapes to cover the entire
    ISO 10646 character space, and it's also desired that Python 2 support
    that from the start.  But then what are \x escapes supposed to mean?
    Do they ignore all but the last *eight* hex digits then?  And if less
    than 8 following in a Unicode string, all but the last 4?  And if less
    than 4, all but the last 2?

    This was getting messier by the minute, and the proposal cuts the
    Gordian knot by making \x simpler instead of more complicated.  Note
    that the 4-digit generalization to \xijkl in Unicode strings was also
    redundant, because it meant exactly the same thing as \uijkl in Unicode
    strings.  It's more Pythonic to have just one obvious way to specify a
    Unicode character via hex notation.


Development and Discussion

    The proposal was worked out among Guido van Rossum, Fredrik Lundh and
    Tim Peters in email.  It was subsequently explained and discussed on
    Python-Dev under subject "Go \x yourself", starting 2000-08-03.
    Response was overwhelmingly positive; no objections were raised.


Backward Compatibility

    Changing the meaning of \x escapes does carry risk of breaking existing
    code, although no instances of incompatibility have yet been discovered.
    The risk is believed to be minimal.

    Tim Peters verified that, except for pieces of the standard test suite
    deliberately provoking end cases, there are no instances of \xabcdef...
    with fewer or more than 2 hex digits following, in either the Python
    CVS development tree, or in assorted Python packages sitting on his
    machine.

    It's unlikely there are any with fewer than 2, because the Reference
    Manual implied they weren't legal (although this is debatable!).  If
    there are any with more than 2, Guido is ready to argue they were buggy
    anyway <0.9 wink>.

    Guido reported that the O'Reilly Python books *already* document that
    Python works the proposed way, likely due to their Perl editing
    heritage (as above, Perl worked (very close to) the proposed way from
    its start).

    Finn Bock reported that what JPython does with \x escapes is
    unpredictable today.  This proposal gives a clear meaning that can be
    consistently and easily implemented across all Python implementations.


Effects on Other Tools

    Believed to be none.  The candidates for breakage would mostly be
    parsing tools, but the author knows of none that worry about the
    internal structure of Python strings beyond the approximation "when
    there's a backslash, swallow the next character".  Tim Peters checked
    python-mode.el, the std tokenize.py and pyclbr.py, and the IDLE syntax
    coloring subsystem, and believes there's no need to change any of
    them.  Tools like tabnanny.py and checkappend.py inherit their immunity
    from tokenize.py.


Reference Implementation

    The code changes are so simple that a separate patch will not be produced.
    Fredrik Lundh is writing the code, is an expert in the area, and will
    simply check the changes in before 2.0b1 is released.


BDFL Pronouncements

    Yes, ValueError, not SyntaxError.  "Problems with literal interpretations
    traditionally raise 'runtime' exceptions rather than syntax errors."


Copyright

    This document has been placed in the public domain.



pep-0224 Attribute Docstrings

PEP: 224
Title: Attribute Docstrings
Version: $Revision$
Last-Modified: $Date$
Author: Marc-AndrĂŠ Lemburg <mal at lemburg.com>
Status: Rejected
Type: Standards Track
Created: 23-Aug-2000
Python-Version: 2.1
Post-History: 

Introduction

    This PEP describes the "attribute docstring" proposal for Python
    2.0.  This PEP tracks the status and ownership of this feature.
    It contains a description of the feature and outlines changes
    necessary to support the feature.  The CVS revision history of
    this file contains the definitive historical record.


Rationale

    This PEP proposes a small addition to the way Python currently
    handles docstrings embedded in Python code.

    Python currently only handles the case of docstrings which appear
    directly after a class definition, a function definition or as
    first string literal in a module.  The string literals are added
    to the objects in question under the __doc__ attribute and are
    from then on available for introspection tools which can extract
    the contained information for help, debugging and documentation
    purposes.

    Docstrings appearing in locations other than the ones mentioned
    are simply ignored and don't result in any code generation.

    Here is an example:

        class C:
            "class C doc-string"

            a = 1
            "attribute C.a doc-string (1)"

            b = 2
            "attribute C.b doc-string (2)"

    The docstrings (1) and (2) are currently being ignored by the
    Python byte code compiler, but could obviously be put to good use
    for documenting the named assignments that precede them.
    
    This PEP proposes to put these cases to use by defining semantics
    for adding their content to the objects in which they appear, under
    newly generated attribute names.

    The original idea behind this approach, which also inspired the
    above example, was to enable inline documentation of class
    attributes, which can currently only be documented in the class's
    docstring or using comments, which are not available for
    introspection.


Implementation

    Docstrings are handled by the byte code compiler as expressions.
    The current implementation special cases the few locations
    mentioned above to make use of these expressions, but otherwise
    ignores the strings completely.

    To enable use of these docstrings for documenting named
    assignments (which is the natural way of defining e.g. class
    attributes), the compiler will have to keep track of the last
    assigned name and then use this name to assign the content of the
    docstring to an attribute of the containing object by means of
    storing it as a constant which is then added to the object's
    namespace during object construction time.

    In order to preserve features like inheritance and hiding of
    Python's special attributes (ones with leading and trailing double
    underscores), a special name mangling has to be applied which
    uniquely identifies the docstring as belonging to the name
    assignment and allows finding the docstring later on by inspecting
    the namespace.

    The following name mangling scheme achieves all of the above:

        __doc_<attributename>__

    To keep track of the last assigned name, the byte code compiler
    stores this name in a variable of the compiling structure.  This
    variable defaults to NULL.  When it sees a docstring, it then
    checks the variable and uses the name as basis for the above name
    mangling to produce an implicit assignment of the docstring to the
    mangled name.  It then resets the variable to NULL to avoid
    duplicate assignments.

    If the variable does not point to a name (i.e. is NULL), no
    assignments are made.  These will continue to be ignored like
    before.  All classical docstrings fall under this case, so no
    duplicate assignments are done.

    In the above example this would result in the following new class
    attributes being created:

        C.__doc_a__ == "attribute C.a doc-string (1)"
        C.__doc_b__ == "attribute C.b doc-string (2)"
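    Since the proposal was rejected, the same effect can only be had by
    writing the mangled names out explicitly, which is essentially the
    convention Guido suggests below (a small illustration):

```python
class C:
    a = 1
    __doc_a__ = "attribute C.a doc-string (1)"

    b = 2
    __doc_b__ = "attribute C.b doc-string (2)"

# The docstrings are ordinary class attributes, available at runtime.
assert C.__doc_a__ == "attribute C.a doc-string (1)"
assert C.__doc_b__ == "attribute C.b doc-string (2)"
```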

    A patch to the current CVS version of Python 2.0 which implements
    the above is available on SourceForge at [1].


Caveats of the Implementation

    Since the implementation does not reset the compiling structure
    variable when processing a non-expression, e.g. a function
    definition, the last assigned name remains active until either the
    next assignment or the next occurrence of a docstring.

    This can lead to cases where the docstring and assignment may be
    separated by other expressions:

        class C:
            "C doc string"

            b = 2

            def x(self):
                "C.x doc string"
                y = 3
                return 1

            "b's doc string"

    Since the definition of method "x" currently does not reset the
    used assignment name variable, it is still valid when the compiler
    reaches the docstring "b's doc string" and thus assigns the string
    to __doc_b__.

    A possible solution to this problem would be resetting the name
    variable for all non-expression nodes in the compiler.


Possible Problems

    Even though highly unlikely, attribute docstrings could get
    accidentally concatenated to the attribute's value:

        class C:
            x = "text" \
                "x's docstring"

    The trailing backslash would cause the Python compiler to concatenate
    the attribute value and the docstring.

    A modern syntax highlighting editor would easily make this
    accident visible, though, and by simply inserting empty lines
    between the attribute definition and the docstring you can avoid
    the possible concatenation completely, so the problem is
    negligible.

    Another possible problem is that of using triple quoted strings as
    a way to comment out parts of your code.

    If there happens to be an assignment just before the start of the
    comment string, then the compiler will treat the comment as
    docstring attribute and apply the above logic to it.

    Besides generating a docstring for an otherwise undocumented
    attribute there is no breakage.


Comments from our BDFL

    Early comments on the PEP from Guido:

        I "kinda" like the idea of having attribute docstrings (meaning
        it's not of great importance to me) but there are two things I
        don't like in your current proposal:

        1. The syntax you propose is too ambiguous: as you say,
        stand-alone string literal are used for other purposes and could
        suddenly become attribute docstrings.

        2. I don't like the access method either (__doc_<attrname>__).

    The author's reply:

        > 1. The syntax you propose is too ambiguous: as you say, stand-alone
        >    string literal are used for other purposes and could suddenly
        >    become attribute docstrings.

        This can be fixed by introducing some extra checks in the
        compiler to reset the "doc attribute" flag in the compiler
        struct.

        > 2. I don't like the access method either (__doc_<attrname>__).

        Any other name will do. It will only have to match these
        criteria:

        * must start with two underscores (to match __doc__)
        * must be extractable using some form of inspection (e.g. by using
          a naming convention which includes some fixed name part)
        * must be compatible with class inheritance (i.e. should be
          stored as attribute)

    Guido pronounced on this PEP in March 2001 (on python-dev).  Here
    are his reasons for rejection, mentioned in private mail to the
    author of this PEP:

        ...

        It might be useful, but I really hate the proposed syntax.

            a = 1
            "foo bar"
            b = 1

        I really have no way to know whether "foo bar" is a docstring
        for a or for b.

        ...
        
        You can use this convention:

            a = 1
            __doc_a__ = "doc string for a"

        This makes it available at runtime.

        > Are you completely opposed to adding attribute documentation
        > to Python or is it just the way the implementation works ? I
        > find the syntax proposed in the PEP very intuitive and many
        > other users on c.l.p and in private emails have supported it
        > at the time I wrote the PEP.

        It's not the implementation, it's the syntax.  It doesn't
        convey a clear enough coupling between the variable and the
        doc string.

        

Copyright

    This document has been placed in the Public Domain.


References

    [1] http://sourceforge.net/patch/?func=detailpatch&patch_id=101264&group_id=5470



pep-0225 Elementwise/Objectwise Operators

PEP: 225
Title: Elementwise/Objectwise Operators
Version: $Revision$
Last-Modified: $Date$
Author: Huaiyu Zhu <hzhu at users.sourceforge.net>, Gregory Lielens <gregory.lielens at fft.be>
Status: Deferred
Type: Standards Track
Created: 19-Sep-2000
Python-Version: 2.1
Post-History: 

Introduction

    This PEP describes a proposal to add new operators to Python which
    are useful for distinguishing elementwise and objectwise
    operations, and summarizes discussions in the news group
    comp.lang.python on this topic.  See Credits and Archives section
    at end.  Issues discussed here include:

    - Background.
    - Description of proposed operators and implementation issues.
    - Analysis of alternatives to new operators.
    - Analysis of alternative forms.
    - Compatibility issues
    - Description of wider extensions and other related ideas.

    A substantial portion of this PEP describes ideas that do not go
    into the proposed extension.  They are presented because the
    extension is essentially syntactic sugar, so its adoption must be
    weighed against various possible alternatives.  While many
    alternatives may be better in some aspects, the current proposal
    appears to be overall advantageous.

    The issues concerning elementwise-objectwise operations extend to
    wider areas than numerical computation.  This document also
    describes how the current proposal may be integrated with more
    general future extensions.


Background

    Python provides six binary infix math operators: + - * / % **
    hereafter generically represented by "op".  They can be overloaded
    with new semantics for user-defined classes.  However, for objects
    composed of homogeneous elements, such as arrays, vectors and
    matrices in numerical computation, there are two essentially
    distinct flavors of semantics.  The objectwise operations treat
    these objects as points in multidimensional spaces.  The
    elementwise operations treat them as collections of individual
    elements.  These two flavors of operations are often intermixed in
    the same formulas, thereby requiring syntactical distinction.

    Many numerical computation languages provide two sets of math
    operators.  For example, in MatLab, the ordinary op is used for
    objectwise operation while .op is used for elementwise operation.
    In R, op stands for elementwise operation while %op% stands for
    objectwise operation.

    In Python, there are other methods of representation, some of
    which are already used by available numerical packages, such as

    - function:   mul(a,b)
    - method:     a.mul(b)
    - casting:    a.E*b 

    In several aspects these are not as adequate as infix operators.
    More details will be shown later, but the key points are

    - Readability: Even for moderately complicated formulas, infix
      operators are much cleaner than alternatives.

    - Familiarity: Users are familiar with ordinary math operators.

    - Implementation: New infix operators will not unduly clutter
      Python syntax.  They will greatly ease the implementation of
      numerical packages.

    While it is possible to assign current math operators to one
    flavor of semantics, there are simply not enough infix operators to
    overload for the other flavor.  It is also impossible to maintain
    visual symmetry between these two flavors if one of them does not
    contain symbols for ordinary math operators.
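    The casting idiom listed above can be sketched in plain Python
    (`Vec' and the `.E' property are illustrative names, not an
    existing package's API):

```python
class Vec:
    def __init__(self, data):
        self.data = list(data)

    def __mul__(self, other):        # objectwise flavor: dot product
        return sum(x * y for x, y in zip(self.data, other.data))

    @property
    def E(self):                     # "cast" to an elementwise view
        return _Elementwise(self)

class _Elementwise:
    def __init__(self, vec):
        self.vec = vec

    def __mul__(self, other):        # elementwise flavor
        return Vec(x * y for x, y in zip(self.vec.data, other.data))

a, b = Vec([1, 2, 3]), Vec([4, 5, 6])
assert a * b == 32                    # objectwise
assert (a.E * b).data == [4, 10, 18]  # elementwise
```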


Proposed extension

    - Six new binary infix operators ~+ ~- ~* ~/ ~% ~** are added to
      core Python.  They parallel the existing operators + - * / % **.

    - Six augmented assignment operators ~+= ~-= ~*= ~/= ~%= ~**= are
      added to core Python.  They parallel the operators += -= *= /=
      %= **= available in Python 2.0.

    - Operator ~op retains the syntactical properties of operator op,
      including precedence.

    - Operator ~op retains the semantic properties of operator op on
      built-in number types.

    - Operator ~op raises a syntax error on non-number built-in types.
      This is temporary until the proper behavior can be agreed upon.

    - These operators are overloadable in classes with names that
      prepend "t" (for tilde) to names of ordinary math operators.
      For example, __tadd__ and __rtadd__ work for ~+ just as __add__
      and __radd__ work for +.

    - As with existing operators, the __r*__() methods are invoked when
      the left operand does not provide the appropriate method.

    It is intended that one of the two sets (op or ~op) be used for
    elementwise operations and the other for objectwise operations,
    but it is not specified which set stands for which; that decision
    is left to applications.
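The dispatch rules above mirror Python's existing __add__/__radd__ protocol.  A minimal sketch of that existing protocol, which the proposed (hypothetical) __tadd__/__rtadd__ methods would parallel for ~+; the class name Elem is illustrative only:

```python
class Elem:
    """Toy elementwise vector.  Its __add__/__radd__ dispatch mirrors
    how the proposed __tadd__/__rtadd__ would hook into ~+."""
    def __init__(self, data):
        self.data = list(data)
    def __add__(self, other):
        # broadcast a plain number across the elements
        o = other.data if isinstance(other, Elem) else [other] * len(self.data)
        return Elem(x + y for x, y in zip(self.data, o))
    def __radd__(self, other):
        # invoked when the left operand does not provide the method
        return self.__add__(other)

print((Elem([1, 2]) + Elem([3, 4])).data)   # [4, 6]
print((10 + Elem([1, 2])).data)             # [11, 12]
```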

    The proposed implementation is to patch several files relating to
    the tokenizer, parser, grammar and compiler to duplicate the
    functionality of corresponding existing operators as necessary.
    All new semantics are to be implemented in the classes that
    overload them.

    The symbol ~ is already used in Python as the unary "bitwise not"
    operator.  It is currently not allowed as part of binary
    operators, so the new operators are completely backward
    compatible.


Prototype Implementation

    Greg Lielens implemented the infix ~op operators as a patch
    against the Python 2.0b1 source [1].

    To allow ~ to be part of binary operators, the tokenizer would
    treat ~+ as one token.  This means that the currently valid
    expression ~+1 would be tokenized as ~+ 1 instead of ~ + 1.  The
    parser would then treat ~+ as a composite of ~ +.  The effect is
    invisible to applications.

    Notes about current patch:

    - It does not include ~op= operators yet.

    - The ~op operators behave the same as op on lists, instead of
      raising exceptions.

    These two limitations should be fixed when the final version of
    this proposal is ready.

    - It reserves xor as an infix operator with the semantics
      equivalent to:
      
        def __xor__(a, b):
            if not b: return a
            elif not a: return b
            else: return 0

      This preserves the true value as much as possible; otherwise, it
      preserves the left-hand-side value if possible.

      This is done so that bitwise operators could be regarded as
      elementwise logical operators in the future (see below).
   

Alternatives to adding new operators

    The discussions on comp.lang.python and python-dev mailing list
    explored many alternatives.  Some of the leading alternatives are
    listed here, using the multiplication operator as an example.

    1. Use function mul(a,b).

       Advantage:
       -  No need for new operators.

       Disadvantage: 
       - Prefix forms are cumbersome for composite formulas.
       - Unfamiliar to the intended users.
       - Too verbose for the intended users.
       - Unable to use natural precedence rules.

    2. Use method call a.mul(b)

       Advantage:
       - No need for new operators.

       Disadvantage:
       - Asymmetric for both operands.
       - Unfamiliar to the intended users.
       - Too verbose for the intended users.
       - Unable to use natural precedence rules.

    3. Use "shadow classes".  For matrix class define a shadow array
       class accessible through a method .E, so that for matrices a
       and b, a.E*b would be a matrix object that is
       elementwise_mul(a,b).

       Likewise define a shadow matrix class for arrays accessible
       through a method .M so that for arrays a and b, a.M*b would be
       an array that is matrixwise_mul(a,b).

       Advantage:
       - No need for new operators.
       - Benefits of infix operators with correct precedence rules.
       - Clean formulas in applications.

       Disadvantage:
       - Hard to maintain in current Python because ordinary numbers
         cannot have user defined class methods; i.e. a.E*b will fail
         if a is a pure number.
       - Difficult to implement, as this will interfere with existing
         method calls, like .T for transpose, etc.
       - Runtime overhead of object creation and method lookup.
       - The shadowing class cannot replace a true class, because it
         does not return its own type.  So there would need to be an M
         class with a shadow E class, and an E class with a shadow M
         class.
       - Unnatural to mathematicians.

    4. Implement matrixwise and elementwise classes with easy casting
       to the other class.  So matrixwise operations for arrays would
       be like a.M*b.M and elementwise operations for matrices would
       be like a.E*b.E.  For error detection a.E*b.M would raise
       exceptions.

       Advantage:
       - No need for new operators.
       - Similar to infix notation with correct precedence rules.

       Disadvantage:
       - Similar difficulty due to lack of user-methods for pure numbers.
       - Runtime overhead of object creation and method lookup.
       - More cluttered formulas
       - Switching of flavor of objects to facilitate operators
         becomes persistent.  This introduces long range context
         dependencies in application code that would be extremely hard
         to maintain.

    5. Use a mini-parser to parse formulas written in an arbitrary
       extension syntax placed inside quoted strings.

       Advantage:
       - Pure Python, without new operators.

       Disadvantage:
       - The actual syntax lives inside the quoted string, which does
         not solve the underlying problem.
       - Introduces zones of special syntax.
       - Places heavy demands on the mini-parser.

    6. Introduce a single operator, such as @, for matrix
       multiplication.

       Advantage:
       - Introduces fewer operators.

       Disadvantage:
       - The distinctions for operators like + - ** are equally
         important.  Their meaning in matrix or array-oriented
         packages would be reversed (see below).
       - The new operator occupies a special character.
       - This does not work well with more general object-element issues.

    Among these alternatives, the first and second are used in current
    applications to some extent, but have been found inadequate.  The
    third is the favorite for applications, but it would incur huge
    implementation complexity.  The fourth would make application code
    very context-sensitive and hard to maintain.  These two
    alternatives also share significant implementation difficulties
    due to the current type/class split.  The fifth appears to create
    more problems than it would solve.  The sixth does not cover the
    same range of applications.
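    To make alternative 3 concrete, here is a minimal sketch of the
    shadow-class idea in pure Python.  The names Matrix and ElemProxy
    and the .E attribute are illustrative only, not part of any
    existing package:

```python
class ElemProxy:
    """Hypothetical shadow class: a.E * b multiplies elementwise."""
    def __init__(self, data):
        self.data = data
    def __mul__(self, other):
        rows = other.data if hasattr(other, "data") else other
        return Matrix([[x * y for x, y in zip(r1, r2)]
                       for r1, r2 in zip(self.data, rows)])

class Matrix:
    def __init__(self, rows):
        self.data = rows
    @property
    def E(self):
        # the shadow object changes the meaning of the next operator
        return ElemProxy(self.data)
    def __mul__(self, other):
        # ordinary (matrixwise) multiplication
        b = other.data
        return Matrix([[sum(x * y for x, y in zip(row, col))
                        for col in zip(*b)] for row in self.data])

a = Matrix([[1, 2], [3, 4]])
b = Matrix([[5, 6], [7, 8]])
print((a * b).data)     # [[19, 22], [43, 50]]  -- matrixwise
print((a.E * b).data)   # [[5, 12], [21, 32]]   -- elementwise
```

    Note how the sketch already exhibits the listed disadvantages:
    a.E * b fails if a is a plain number, and each operation pays for
    an extra object creation and method lookup.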


Alternative forms of infix operators

    Two major forms and several minor variants of new infix operators
    were discussed:

    - Bracketed form

        (op)
        [op]
        {op}
        <op>
        :op:
        ~op~
        %op%

    - Meta character form

        .op
        @op
        ~op

      Alternatively the meta character is put after the operator.

    - Less consistent variations on these themes.  These are viewed
      unfavorably.  For completeness, some are listed here:

        - Use @/ and /@ for left and right division
        - Use [*] and (*) for outer and inner products
        - Use a single operator @ for multiplication.

    - Use __call__ to simulate multiplication.
      a(b)  or (a)(b)

    Criteria for choosing among the representations include:

        - No syntactical ambiguities with existing operators.  

        - Higher readability in actual formulas.  This makes the
          bracketed forms unfavorable.  See examples below.

        - Visually similar to existing math operators.

        - Syntactically simple, without blocking possible future
          extensions.

    With these criteria the overall winner among the bracket forms
    appears to be {op}.  The clear winner among the meta character
    forms is ~op.  Comparing the two, ~op appears to be the favorite
    overall.

    Some analysis follows:

        - The .op form is ambiguous: 1.+a would be different from 1 .+a

        - The bracket type operators are most favorable when standing
          alone, but not in formulas, as they interfere with the
          visual parsing of parentheses for precedence and function
          arguments.  This is so for (op) and [op], and somewhat less
          so for {op} and <op>.

        - The <op> form has the potential to be confused with < > and =

        - The @op form is not favored because @ is visually heavy
          (dense, more like a letter): a@+b is more readily read as
          a@ + b than as a @+ b.

        - On choosing meta characters: most of the existing ASCII
          symbols have already been used.  The only three unused are
          @ $ ?.
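    The .op ambiguity can be checked with the standard tokenize
    module: in 1.+a the characters 1. are consumed as a float
    literal, so a .+ operator could never apply there:

```python
import io
import tokenize

def toks(src):
    """Return the non-trivial token strings of a source line."""
    return [t.string
            for t in tokenize.generate_tokens(io.StringIO(src).readline)
            if t.string.strip()]

print(toks("1.+a"))    # ['1.', '+', 'a'] -- '1.' is one float token
print(toks("1 .+a"))   # ['1', '.', '+', 'a']
```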


Semantics of new operators

    There are convincing arguments for using either set of operators
    as objectwise or elementwise.  Some of them are listed here:

    1. op for element, ~op for object

       - Consistent with current multiarray interface of Numeric package
       - Consistent with some other languages
       - Perception that elementwise operations are more natural
       - Perception that elementwise operations are used more frequently

    2. op for object, ~op for element

       - Consistent with current linear algebra interface of MatPy package
       - Consistent with some other languages
       - Perception that objectwise operations are more natural
       - Perception that objectwise operations are used more frequently
       - Consistent with the current behavior of operators on lists
       - Allow ~ to be a general elementwise meta-character in future
         extensions.

    It is generally agreed that

       - there is no absolute reason to favor one semantics or the
         other;
       - it is easy to cast from one representation to the other
         within a sizable chunk of code, so the other flavor of
         operators is always in the minority;
       - there are other semantic differences that favor the existence
         of both array-oriented and matrix-oriented packages, even if
         their operators are unified;
       - whatever decision is taken, code using existing interfaces
         should not be broken for a very long time.

    Therefore not much is lost, and much flexibility retained, if the
    semantic flavors of these two sets of operators are not dictated
    by the core language.  The application packages are responsible
    for making the most suitable choice.  This is already the case for
    NumPy and MatPy which use opposite semantics.  Adding new
    operators will not break this.  See also observation after
    subsection 2 in the Examples below.

    The issue of numerical precision was raised, but if the semantics
    are left to the applications, decisions about actual precision
    should also be left to them.


Examples

    Following are examples of the actual formulas that will appear
    using various operators or other representations described above.

    1. The matrix inversion formula:

       - Using op for object and ~op for element:

         b = a.I - a.I * u / (c.I + v/a*u) * v / a

         b = a.I - a.I * u * (c.I + v*a.I*u).I * v * a.I

       - Using op for element and ~op for object:

         b = a.I @- a.I @* u @/ (c.I @+ v@/a@*u) @* v @/ a

         b = a.I ~- a.I ~* u ~/ (c.I ~+ v~/a~*u) ~* v ~/ a

         b = a.I (-) a.I (*) u (/) (c.I (+) v(/)a(*)u) (*) v (/) a

         b = a.I [-] a.I [*] u [/] (c.I [+] v[/]a[*]u) [*] v [/] a

         b = a.I <-> a.I <*> u </> (c.I <+> v</>a<*>u) <*> v </> a

         b = a.I {-} a.I {*} u {/} (c.I {+} v{/}a{*}u) {*} v {/} a

       Observation: For linear algebra using op for object is preferable.

       Observation: The ~op type operators look better than (op) type
       in complicated formulas.

       - using named operators

         b = a.I @sub a.I @mul u @div (c.I @add v @div a @mul u) @mul v @div a

         b = a.I ~sub a.I ~mul u ~div (c.I ~add v ~div a ~mul u) ~mul v ~div a

       Observation: Named operators are not suitable for math formulas.

    2. Plotting a 3d graph

       - Using op for object and ~op for element:

         z = sin(x~**2 ~+ y~**2);    plot(x,y,z)

       - Using op for element and ~op for object:

         z = sin(x**2 + y**2);   plot(x,y,z)

        Observation: Elementwise operations with broadcasting allow a
        much more efficient implementation than MatLab's.

        Observation: It is useful to have two related classes with the
        semantics of op and ~op swapped.  Using these the ~op
        operators would only need to appear in chunks of code where
        the other flavor dominates, while maintaining consistent
        semantics of the code.

    3. Using + and - with automatic broadcasting

         a = b - c;  d = a.T*a

       Observation: This would silently produce hard-to-trace bugs if
       one of b or c is a row vector while the other is a column
       vector.
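    The matrix inversion formula in example 1 is the
    Sherman-Morrison-Woodbury identity.  A scalar sanity check in
    plain Python, where .I degenerates to the reciprocal and no new
    operators are needed:

```python
# Scalar stand-ins: for numbers, the "matrix inverse" is the reciprocal.
a, u, c, v = 2.0, 3.0, 4.0, 5.0

aI = 1 / a
cI = 1 / c

# (a + u*c*v)^-1 == a.I - a.I*u*(c.I + v*a.I*u)^-1*v*a.I
lhs = 1 / (a + u * c * v)
rhs = aI - aI * u * (1 / (cI + v * aI * u)) * v * aI

print(abs(lhs - rhs) < 1e-12)   # True
```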


Miscellaneous issues:

    - Need for the ~+ ~- operators.  The objectwise + - are important
      because they provide the sanity checks required by linear
      algebra.  The elementwise + - are important because they allow
      broadcasting, which is very efficient in applications.

    - Left division (solve).  For matrices, a*x is not necessarily
      equal to x*a.  The solution of a*x==b, denoted x=solve(a,b), is
      therefore different from the solution of x*a==b, denoted
      x=div(b,a).  There are discussions about finding a new symbol
      for solve.  [Background: MatLab uses b/a for div(b,a) and a\b
      for solve(a,b).]

      It is recognized that Python provides a better solution without
      requiring a new symbol: the inverse method .I can be made
      delayed, so that a.I*b and b*a.I are equivalent to MatLab's a\b
      and b/a.  The implementation is quite simple and the resulting
      application code clean.
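      A sketch of the delayed-.I idea with a scalar stand-in for a
      matrix.  The names Mat and Inv are illustrative; a real
      implementation would call a linear solver instead of dividing:

```python
class Inv:
    """Lazy inverse: never computed alone, only combined with an operand."""
    def __init__(self, m):
        self.m = m
    def __mul__(self, b):
        # a.I * b  ~  solve(a, b), MatLab's a\b
        return b / self.m.v
    def __rmul__(self, b):
        # b * a.I  ~  div(b, a), MatLab's b/a
        return b / self.m.v

class Mat:
    def __init__(self, v):
        self.v = v               # scalar stand-in for a matrix
    @property
    def I(self):
        return Inv(self)         # delayed: no inversion happens here

a = Mat(4.0)
print(a.I * 8.0)    # 2.0  (solve(a, b))
print(8.0 * a.I)    # 2.0  (div(b, a))
```

      For scalars the two products coincide; for matrices the __mul__
      and __rmul__ branches would dispatch to left and right solves
      respectively.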

    - Power operator.  Python's use of a**b as pow(a,b) has two
      perceived disadvantages:

      - Most mathematicians are more familiar with a^b for this purpose.
      - It results in the long augmented assignment operator ~**=.

      However, this issue is distinct from the main issue here.

    - Additional multiplication operators.  Several forms of
      multiplication are used in (multi-)linear algebra.  Most can be
      seen as variations of multiplication in the linear algebra
      sense (such as the Kronecker product).  But two forms appear
      more fundamental: the outer product and the inner product.
      However, their specification includes indices, which can be
      either

      - associated with the operator, or
      - associated with the objects.

      The latter (the Einstein notation) is used extensively on paper,
      and is also the easier one to implement.  By implementing a
      tensor-with-indices class, a general form of multiplication
      would cover both outer and inner products, and specialize to
      linear algebra multiplication as well.  The index rule can be
      defined as class methods, like,

          a = b.i(1,2,-1,-2) * c.i(4,-2,3,-1)   # a_ijkl = b_ijmn c_lnkm

      Therefore one objectwise multiplication is sufficient.

    - Bitwise operators. 

      - The proposed new math operators use the symbol ~, which is
        currently the unary "bitwise not" operator.  This poses no
        compatibility problem but somewhat complicates the
        implementation.

      - The symbol ^ might be better used for pow than for bitwise
        xor.  But this depends on the future of the bitwise
        operators.  It does not immediately impact the proposed math
        operators.

      - The symbol | was suggested for matrix solve.  But the new
        solution of using a delayed .I is better in several ways.

      - The current proposal fits in a larger and more general
        extension that will remove the need for special bitwise
        operators.  (See elementization below.)

    - Alternative to special operator names used in definition,

          def "+"(a, b)      in place of       def __add__(a, b)

      This appears to require greater syntactical change, and would
      only be useful when arbitrary additional operators are allowed.


Impact on general elementization

    The distinction between objectwise and elementwise operations is
    meaningful in other contexts as well, wherever an object can be
    conceptually regarded as a collection of elements.  It is
    important that the current proposal not preclude possible future
    extensions.

    One general future extension is to use ~ as a meta operator to
    "elementize" a given operator.  Several examples are listed here:

    1. Bitwise operators.  Currently Python assigns six operators to
       bitwise operations: and (&), or (|), xor (^), complement (~),
       left shift (<<) and right shift (>>), with their own precedence
       levels.

       Among them, the & | ^ ~ operators can be regarded as
       elementwise versions of lattice operators applied to integers
       regarded as bit strings.

           5 and 6                # 6
           5 or 6                 # 5

           5 ~and 6               # 4
           5 ~or 6                # 7

       These can be regarded as general elementwise lattice operators,
       not restricted to bits in integers.

       In order to have the named operators xor and ~xor, it is
       necessary to make xor a reserved word.

    2. List arithmetics. 

           [1, 2] + [3, 4]        # [1, 2, 3, 4]
           [1, 2] ~+ [3, 4]       # [4, 6]

           ['a', 'b'] * 2         # ['a', 'b', 'a', 'b']
           'ab' * 2               # 'abab'

           ['a', 'b'] ~* 2        # ['aa', 'bb']
           [1, 2] ~* 2            # [2, 4]

       It is also consistent with the Cartesian product

           [1,2]*[3,4]            # [(1,3),(1,4),(2,3),(2,4)]

    3. List comprehension.

           a = [1, 2]; b = [3, 4]
           ~f(a,b)                # [f(x,y) for x, y in zip(a,b)]
           ~f(a*b)                # [f(x,y) for x in a for y in b]
           a ~+ b                 # [x + y for x, y in zip(a,b)]

    4. Tuple generation (the zip function in Python 2.0)

          [1, 2, 3], [4, 5, 6]   # ([1, 2, 3], [4, 5, 6])
          [1, 2, 3]~,[4, 5, 6]   # [(1, 4), (2, 5), (3, 6)]

    5. Using ~ as generic elementwise meta-character to replace map

          ~f(a, b)               # map(f, a, b)
          ~~f(a, b)              # map(lambda *x:map(f, *x), a, b)

       More generally,

          def ~f(*x): return map(f, *x)
          def ~~f(*x): return map(~f, *x)
          ...

    6. Elementwise format operator (with broadcasting)

          a = [1,2,3,4,5]
          print ["%5d "] ~% a 
          a = [[1,2],[3,4]]
          print ["%5d "] ~~% a

    7.  Rich comparison

          [1, 2, 3]  ~< [3, 2, 1]  # [1, 0, 0]
          [1, 2, 3] ~== [3, 2, 1]  # [0, 1, 0]

    8. Rich indexing

          [a, b, c, d] ~[2, 3, 1]  # [c, d, b]

    9. Tuple flattening

          a = (1,2);  b = (3,4)
          f(~a, ~b)                # f(1,2,3,4)      

    10. Copy operator

          a ~= b                   # a = b.copy()

        There can be specific levels of deep copy

          a ~~= b                  # a = b.copy(2)
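    Several of the elementizations above can already be emulated in
    present-day Python with zip, map, and comprehensions; a sketch
    (the ~ syntax itself remains hypothetical):

```python
# [1, 2] ~+ [3, 4]      -- elementwise list addition
print([x + y for x, y in zip([1, 2], [3, 4])])      # [4, 6]

# 5 ~and 6, 5 ~or 6     -- elementwise lattice operators on bits
print(5 & 6, 5 | 6)                                  # 4 7

# ['a', 'b'] ~* 2       -- elementwise repetition
print([s * 2 for s in ['a', 'b']])                   # ['aa', 'bb']

# ~f(a, b)              -- elementwise function application
print(list(map(pow, [2, 3], [3, 2])))                # [8, 9]

# [1,2,3] ~< [3,2,1]    -- elementwise comparison as 1/0 flags
print([int(x < y) for x, y in zip([1, 2, 3], [3, 2, 1])])   # [1, 0, 0]
```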

    Notes:

    1. There are probably many other similar situations.  This general
       approach seems well suited for most of them, in place of
       several separated extensions for each of them (parallel and
       cross iteration, list comprehension, rich comparison, etc).

    2. The semantics of "elementwise" depends on applications.  For
       example, an element of a matrix is two levels down from the
       list-of-list point of view.  This requires a more fundamental
       change than the current proposal.  In any case, the current
       proposal will not negatively impact future possibilities of
       this nature.

    Note that this section describes a class of future extensions
    that is consistent with the current proposal but may present
    additional compatibility or other problems.  These extensions are
    not tied to the current proposal.


Impact on named operators

    The discussions made it generally clear that infix operators are
    a scarce resource in Python, not only in numerical computation
    but in other fields as well.  Several proposals and ideas were
    put forward that would allow infix operators to be introduced in
    ways similar to named functions.  We show here that the current
    extension does not negatively impact future extensions in this
    regard.

    1. Named infix operators.

        Choose a meta character, say @, so that for any identifier
        "opname", the combination "@opname" would be a binary infix
        operator, and

        a @opname b == opname(a,b)

        Other representations mentioned include .name ~name~ :name:
        (.name) %name% and similar variations.  The pure bracket based
        operators cannot be used this way.

        This requires a change in the parser to recognize @opname, and
        parse it into the same structure as a function call.  The
        precedence of all these operators would have to be fixed at
        one level, so the implementation would be different from
        additional math operators which keep the precedence of
        existing math operators.

        The currently proposed extension does not limit possible
        future extensions of this form in any way.

    2. More general symbolic operators.

        One additional form of future extension is to use a meta
        character together with operator symbols (symbols that cannot
        be used in syntactical structures other than operators).
        Suppose @ is the meta character.  Then

            a + b,    a @+ b,    a @@+ b,  a @+- b

        would all be operators with a hierarchy of precedence, defined by

            def "+"(a, b)
            def "@+"(a, b)
            def "@@+"(a, b)
            def "@+-"(a, b)

        One advantage compared with named operators is greater
        flexibility in precedence, based either on the meta character
        or on the ordinary operator symbols.  This also allows
        operator composition.  The disadvantage is that such
        operators look more like "line noise".  In any case the
        current proposal does not impact this future possibility.

        These kinds of future extensions may not be necessary when
        Unicode becomes generally available.

        Note that this section discusses compatibility of the
        proposed extension with possible future extensions.  The
        desirability or compatibility of these other extensions
        themselves is specifically not considered here.
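    For comparison, a well-known pure-Python recipe already emulates
    named infix operators by overloading | (this is community
    folklore, not part of this proposal; the precedence is fixed at
    that of |):

```python
class Infix:
    """a |op| b  ==  f(a, b): emulate a named infix operator via |."""
    def __init__(self, f):
        self.f = f
    def __ror__(self, left):
        # (a | op) binds the left operand and returns a new Infix
        return Infix(lambda right: self.f(left, right))
    def __or__(self, right):
        # (... | b) applies the bound function to the right operand
        return self.f(right)

mul = Infix(lambda a, b: a * b)
print(3 |mul| 4)    # 12
```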


Credits and archives

    The discussions happened mostly during July and August of 2000 on
    the newsgroup comp.lang.python and the python-dev mailing list.
    There are altogether several hundred postings, most of which can
    be retrieved from these two pages (by searching for the word
    "operator"):

        http://www.python.org/pipermail/python-list/2000-July/
        http://www.python.org/pipermail/python-list/2000-August/

    The contributors are too numerous to mention here; suffice it to
    say that a large proportion of the ideas discussed here are not
    our own.

    Several key postings (from our point of view) that may help to
    navigate the discussions include:

        http://www.python.org/pipermail/python-list/2000-July/108893.html
        http://www.python.org/pipermail/python-list/2000-July/108777.html
        http://www.python.org/pipermail/python-list/2000-July/108848.html
        http://www.python.org/pipermail/python-list/2000-July/109237.html
        http://www.python.org/pipermail/python-list/2000-July/109250.html
        http://www.python.org/pipermail/python-list/2000-July/109310.html
        http://www.python.org/pipermail/python-list/2000-July/109448.html
        http://www.python.org/pipermail/python-list/2000-July/109491.html
        http://www.python.org/pipermail/python-list/2000-July/109537.html
        http://www.python.org/pipermail/python-list/2000-July/109607.html
        http://www.python.org/pipermail/python-list/2000-July/109709.html
        http://www.python.org/pipermail/python-list/2000-July/109804.html
        http://www.python.org/pipermail/python-list/2000-July/109857.html
        http://www.python.org/pipermail/python-list/2000-July/110061.html
        http://www.python.org/pipermail/python-list/2000-July/110208.html
        http://www.python.org/pipermail/python-list/2000-August/111427.html
        http://www.python.org/pipermail/python-list/2000-August/111558.html
        http://www.python.org/pipermail/python-list/2000-August/112551.html
        http://www.python.org/pipermail/python-list/2000-August/112606.html
        http://www.python.org/pipermail/python-list/2000-August/112758.html

        http://www.python.org/pipermail/python-dev/2000-July/013243.html
        http://www.python.org/pipermail/python-dev/2000-July/013364.html
        http://www.python.org/pipermail/python-dev/2000-August/014940.html

    These are earlier drafts of this PEP:

        http://www.python.org/pipermail/python-list/2000-August/111785.html
        http://www.python.org/pipermail/python-list/2000-August/112529.html
        http://www.python.org/pipermail/python-dev/2000-August/014906.html

    There is an alternative PEP (officially, PEP 211) by Greg Wilson,
    titled "Adding New Linear Algebra Operators to Python".

    Its first (and current) version is at:

        http://www.python.org/pipermail/python-dev/2000-August/014876.html
        http://www.python.org/dev/peps/pep-0211/


Additional References

    [1] http://MatPy.sourceforge.net/Misc/index.html



pep-0226 Python 2.1 Release Schedule

PEP: 226
Title: Python 2.1 Release Schedule
Version: $Revision$
Last-Modified: $Date$
Author: Jeremy Hylton <jeremy at alum.mit.edu>
Status: Final
Type: Informational
Created: 16-Oct-2000
Python-Version: 2.1
Post-History: 

Abstract

    This document describes the post Python 2.0 development and
    release schedule.  According to this schedule, Python 2.1 will be
    released in April of 2001.  The schedule primarily concerns
    itself with PEP-size items.  Small bug fixes and changes will
    occur up until the first beta release.


Release Schedule

    Tentative future release dates

    [bugfix release dates go here]

    Past release dates:

    17-Apr-2001: 2.1 final release
    15-Apr-2001: 2.1 release candidate 2
    13-Apr-2001: 2.1 release candidate 1
    23-Mar-2001: Python 2.1 beta 2 release
    02-Mar-2001: First 2.1 beta release
    02-Feb-2001: Python 2.1 alpha 2 release
    22-Jan-2001: Python 2.1 alpha 1 release
    16-Oct-2000: Python 2.0 final release

Open issues for Python 2.0 beta 2

    Add a default unit testing framework to the standard library.

Guidelines for making changes for Python 2.1

    The guidelines and schedule will be revised based on discussion in
    the python-dev@python.org mailing list.

    The PEP system was instituted late in the Python 2.0 development
    cycle and many changes did not follow the process described in PEP
    1.  The development process for 2.1, however, will follow the PEP
    process as documented.

    The first eight weeks following 2.0 final will be the design and
    review phase.  By the end of this period, any PEP that is proposed
    for 2.1 should be ready for review.  This means that the PEP is
    written and discussion has occurred on the python-dev@python.org
    and python-list@python.org mailing lists.

    The next six weeks will be spent reviewing the PEPs and
    implementing and testing the accepted PEPs.  When this period
    stops, we will end consideration of any incomplete PEPs.  Near the
    end of this period, there will be a feature freeze where any small
    features not worthy of a PEP will not be accepted.

    Before the final release, we will have six weeks of beta testing
    and a release candidate or two.

General guidelines for submitting patches and making changes

    Use good sense when committing changes.  You should know what we
    mean by good sense or we wouldn't have given you commit privileges
    <0.5 wink>.  Some specific examples of good sense include:

    - Do whatever the dictator tells you.

    - Discuss any controversial changes on python-dev first.  If you
      get a lot of +1 votes and no -1 votes, make the change.  If you
      get some -1 votes, think twice; consider asking Guido what he
      thinks.

    - If the change is to code you contributed, it probably makes
      sense for you to fix it.

    - If the change affects code someone else wrote, it probably makes
      sense to ask him or her first.

    - You can use the SourceForge (SF) Patch Manager to submit a patch
      and assign it to someone for review.

    Any significant new feature must be described in a PEP and
    approved before it is checked in.

    Any significant code addition, such as a new module or large
    patch, must include test cases for the regression test and
    documentation.  A patch should not be checked in until the tests
    and documentation are ready.

    If you fix a bug, you should write a test case that would have
    caught the bug.

    If you commit a patch from the SF Patch Manager or fix a bug from
    the Jitterbug database, be sure to reference the patch/bug number
    in the CVS log message.  Also be sure to change the status in the
    patch manager or bug database (if you have access to the bug
    database).

    It is not acceptable for any checked in code to cause the
    regression test to fail.  If a checkin causes a failure, it must
    be fixed within 24 hours or it will be backed out.

    All contributed C code must be ANSI C.  If possible check it with
    two different compilers, e.g. gcc and MSVC.

    All contributed Python code must follow Guido's Python style
    guide.  http://www.python.org/doc/essays/styleguide.html

    It is understood that any code contributed will be released under
    an Open Source license.  Do not contribute code if it can't be
    released this way.



pep-0227 Statically Nested Scopes

PEP: 227
Title: Statically Nested Scopes
Version: $Revision$
Last-Modified: $Date$
Author: Jeremy Hylton <jeremy at alum.mit.edu>
Status: Final
Type: Standards Track
Created: 01-Nov-2000
Python-Version: 2.1
Post-History: 

Abstract

    This PEP describes the addition of statically nested scoping
    (lexical scoping) for Python 2.2, and as a source-level option
    for Python 2.1.  In addition, Python 2.1 will issue warnings about
    constructs whose meaning may change when this feature is enabled.

    The old language definition (2.0 and before) defines exactly three
    namespaces that are used to resolve names -- the local, global,
    and built-in namespaces.  The addition of nested scopes allows
    resolution of unbound local names in enclosing functions'
    namespaces.

    The most visible consequence of this change is that lambdas (and
    other nested functions) can reference variables defined in the
    surrounding namespace.  Currently, lambdas must often use default
    arguments to explicitly create bindings in the lambda's
    namespace.

Introduction

    This proposal changes the rules for resolving free variables in
    Python functions.  The new name resolution semantics will take
    effect with Python 2.2.  These semantics will also be available in
    Python 2.1 by adding "from __future__ import nested_scopes" to the
    top of a module.  (See PEP 236.)

    The Python 2.0 definition specifies exactly three namespaces to
    check for each name -- the local namespace, the global namespace,
    and the builtin namespace.  According to this definition, if a
    function A is defined within a function B, the names bound in B
    are not visible in A.  The proposal changes the rules so that
    names bound in B are visible in A (unless A contains a name
    binding that hides the binding in B).

    This specification introduces rules for lexical scoping that are
    common in Algol-like languages.  The combination of lexical
    scoping and existing support for first-class functions is
    reminiscent of Scheme.

    The changed scoping rules address two problems -- the limited
    utility of lambda expressions (and nested functions in general),
    and the frequent confusion of new users familiar with other
    languages that support nested lexical scopes, e.g. the inability
    to define recursive functions except at the module level.

    The lambda expression yields an unnamed function that evaluates a
    single expression.  It is often used for callback functions.  In
    the example below (written using the Python 2.0 rules), any name
    used in the body of the lambda must be explicitly passed as a
    default argument to the lambda.

      from Tkinter import *
      root = Tk()
      Button(root, text="Click here",
             command=lambda root=root: root.test.configure(text="..."))

    This approach is cumbersome, particularly when there are several
    names used in the body of the lambda.  The long list of default
    arguments obscures the purpose of the code.  The proposed
    solution, in crude terms, implements the default argument approach
    automatically.  The "root=root" argument can be omitted.
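
    A Tkinter-free sketch of the same point (make_button_commands and
    its labels are invented for illustration): the default-argument
    trick and a closure over a helper's parameter produce the same
    behavior, but the closure needs no "l=l" boilerplate.

```python
def make_button_commands(labels):
    # Python 2.0 style: each lambda captures via a default argument.
    old_style = [lambda l=l: l.upper() for l in labels]

    # With nested scopes, a lambda can simply close over the name.
    def make(l):
        return lambda: l.upper()
    new_style = [make(l) for l in labels]
    return old_style, new_style
```

    Note that the helper function is still needed to give each lambda
    its own binding; a bare lambda in the loop would see only the
    final value of l.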

    The new name resolution semantics will cause some programs to
    behave differently than they did under Python 2.0.  In some cases,
    programs will fail to compile.  In other cases, names that were
    previously resolved using the global namespace will be resolved
    using the local namespace of an enclosing function.  In Python
    2.1, warnings will be issued for all statements that will behave
    differently.

Specification

    Python is a statically scoped language with block structure, in
    the tradition of Algol.  A code block or region, such as a
    module, class definition, or function body, is the basic unit of a
    program.

    Names refer to objects.  Names are introduced by name binding
    operations.  Each occurrence of a name in the program text refers
    to the binding of that name established in the innermost function
    block containing the use.

    The name binding operations are argument declaration, assignment,
    class and function definition, import statements, for statements,
    and except clauses.  Each name binding occurs within a block
    defined by a class or function definition or at the module level
    (the top-level code block).

    If a name is bound anywhere within a code block, all uses of the
    name within the block are treated as references to the binding in
    the current block.  (Note: This can lead to errors when a name is
    used within a block before it is bound.)

    If the global statement occurs within a block, all uses of the
    name specified in the statement refer to the binding of that name
    in the top-level namespace.  Names are resolved in the top-level
    namespace by searching the global namespace, i.e. the namespace of
    the module containing the code block, and in the builtin
    namespace, i.e. the namespace of the __builtin__ module.  The
    global namespace is searched first.  If the name is not found
    there, the builtin namespace is searched.  The global statement
    must precede all uses of the name.
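
    A minimal sketch of the global statement's effect, runnable under
    both the old and new rules:

```python
counter = 0

def bump():
    global counter           # all uses of counter refer to the module binding
    counter = counter + 1    # rebinds the global, not a new local

bump()
bump()                       # module-level counter is now 2
```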

    If a name is used within a code block, but it is not bound there
    and is not declared global, the use is treated as a reference to
    the nearest enclosing function region.  (Note: If a region is
    contained within a class definition, the name bindings that occur
    in the class block are not visible to enclosed functions.)

    A class definition is an executable statement that may contain
    uses and definitions of names.  These references follow the normal
    rules for name resolution.  The namespace of the class definition
    becomes the attribute dictionary of the class.

    The following operations are name binding operations.  If they
    occur within a block, they introduce new local names in the
    current block unless there is also a global declaration.

    Function definition: def name ...
    Argument declaration: def f(...name...), lambda ...name...
    Class definition: class name ...
    Assignment statement: name = ...    
    Import statement: import name, import module as name,
        from module import name
    Implicit assignment: names are bound by for statements and except
        clauses

    There are several cases where Python statements are illegal when
    used in conjunction with nested scopes that contain free
    variables.

    If a variable is referenced in an enclosed scope, it is an error
    to delete the name.  The compiler will raise a SyntaxError for
    'del name'.

    If the wild card form of import (import *) is used in a function
    and the function contains a nested block with free variables, the
    compiler will raise a SyntaxError.

    If exec is used in a function and the function contains a nested
    block with free variables, the compiler will raise a SyntaxError
    unless the exec explicitly specifies the local namespace for the
    exec.  (In other words, "exec obj" would be illegal, but 
    "exec obj in ns" would be legal.)

    If a name bound in a function scope is also the name of a module
    global name or a standard builtin name, and the function contains
    a nested function scope that references the name, the compiler
    will issue a warning.  The name resolution rules will result in
    different bindings under Python 2.0 than under Python 2.2.  The
    warning indicates that the program may not run correctly with all
    versions of Python.

Discussion

    The specified rules allow names defined in a function to be
    referenced in any nested function defined within that function.
    The name resolution rules are typical for statically scoped
    languages, with three primary exceptions:
    name resolution rules are typical for statically scoped languages,
    with three primary exceptions:

        - Names in class scope are not accessible.
        - The global statement short-circuits the normal rules.
        - Variables are not declared.

    Names in class scope are not accessible.  Names are resolved in
    the innermost enclosing function scope.  If a class definition
    occurs in a chain of nested scopes, the resolution process skips
    class definitions.  This rule prevents odd interactions between
    class attributes and local variable access.  If a name binding
    operation occurs in a class definition, it creates an attribute on
    the resulting class object.  To access this variable in a method,
    or in a function nested within a method, an attribute reference
    must be used, either via self or via the class name.
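
    A sketch of the class-scope rule (the Config class is invented
    for illustration): name resolution skips the class block, so a
    method must use an attribute reference to reach a class attribute.

```python
class Config:
    retries = 3                  # binding in class scope -> class attribute

    def get_retries(self):
        # "retries" is not visible as a bare name here; the lookup
        # skips the class block, so an attribute reference is needed.
        return self.retries      # or Config.retries

    def bare_name(self):
        try:
            return retries       # class scope is skipped -> NameError
        except NameError:
            return "not visible"
```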

    An alternative would have been to allow name binding in class
    scope to behave exactly like name binding in function scope.  This
    rule would allow class attributes to be referenced either via
    attribute reference or simple name.  This option was ruled out
    because it would have been inconsistent with all other forms of
    class and instance attribute access, which always use attribute
    references.  Code that used simple names would have been obscure.

    The global statement short-circuits the normal rules.  Under the
    proposal, the global statement has exactly the same effect that it
    does for Python 2.0.  It is also noteworthy because it allows name
    binding operations performed in one block to change bindings in
    another block (the module).

    Variables are not declared.  If a name binding operation occurs
    anywhere in a function, then that name is treated as local to the
    function and all references refer to the local binding.  If a
    reference occurs before the name is bound, a NameError is raised.
    The only kind of declaration is the global statement, which allows
    programs to be written using mutable global variables.  As a
    consequence, it is not possible to rebind a name defined in an
    enclosing scope.  An assignment operation can only bind a name in
    the current scope or in the global scope.  The lack of
    declarations and the inability to rebind names in enclosing scopes
    are unusual for lexically scoped languages; there is typically a
    mechanism to create name bindings (e.g. lambda and let in Scheme)
    and a mechanism to change the bindings (set! in Scheme).
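
    The inability to rebind can be seen in this sketch (Python later
    added a nonlocal statement in PEP 3104, but under this proposal
    there is no such mechanism): assignment makes the name local, so
    the reference on the right-hand side occurs before any binding.

```python
def make_counter():
    count = 0
    def bump():
        # The assignment makes "count" local to bump, so the
        # right-hand reference precedes the local binding and
        # raises UnboundLocalError when bump() is called.
        count = count + 1
        return count
    return bump
```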

    XXX Alex Martelli suggests comparison with Java, which does not
    allow name bindings to hide earlier bindings.  

Examples

    A few examples are included to illustrate the way the rules work.

    XXX Explain the examples

    >>> def make_adder(base):
    ...     def adder(x):
    ...         return base + x
    ...     return adder
    >>> add5 = make_adder(5)
    >>> add5(6)
    11

    >>> def make_fact():
    ...     def fact(n):
    ...         if n == 1:
    ...             return 1L
    ...         else:
    ...             return n * fact(n - 1)
    ...     return fact
    >>> fact = make_fact()
    >>> fact(7)    
    5040L

    >>> def make_wrapper(obj):
    ...     class Wrapper:
    ...         def __getattr__(self, attr):
    ...             if attr[0] != '_':
    ...                 return getattr(obj, attr)
    ...             else:
    ...                 raise AttributeError, attr
    ...     return Wrapper()
    >>> class Test:
    ...     public = 2
    ...     _private = 3
    >>> w = make_wrapper(Test())
    >>> w.public
    2
    >>> w._private
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
    AttributeError: _private

    An example from Tim Peters demonstrates the potential pitfalls of
    nested scopes in the absence of declarations:

    i = 6
    def f(x):
        def g():
            print i
        # ...
        # skip to the next page
        # ...
        for i in x:  # ah, i *is* local to f, so this is what g sees
            pass
        g()

    The call to g() will refer to the variable i bound in f() by the for
    loop.  If g() is called before the loop is executed, a NameError will
    be raised.

    XXX need some counterexamples

Backwards compatibility

    There are two kinds of compatibility problems caused by nested
    scopes.  In the first case, code that behaved one way in earlier
    versions behaves differently because of nested scopes.  In the
    second case, certain constructs interact badly with nested scopes
    and will trigger SyntaxErrors at compile time.

    The following example from Skip Montanaro illustrates the first
    kind of problem:

    x = 1
    def f1():
        x = 2
        def inner():
            print x
        inner()

    Under the Python 2.0 rules, the print statement inside inner()
    refers to the global variable x and will print 1 if f1() is
    called.  Under the new rules, it refers to f1()'s namespace,
    the nearest enclosing scope with a binding.
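
    Under the new rules, Skip Montanaro's example behaves as follows
    (rewritten here to return rather than print, so the result can be
    checked):

```python
x = 1

def f1():
    x = 2
    def inner():
        return x    # Python 2.0 rules: the global x (1);
                    # nested scopes: f1's x (2)
    return inner()
```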

    The problem occurs only when a global variable and a local
    variable share the same name and a nested function uses that name
    to refer to the global variable.  This is poor programming
    practice, because readers will easily confuse the two different
    variables.  One example of this problem was found in the Python
    standard library during the implementation of nested scopes.

    To address this problem, which is unlikely to occur often, the
    Python 2.1 compiler (when nested scopes are not enabled) issues a
    warning.

    The other compatibility problem is caused by the use of 'import *'
    and 'exec' in a function body, when that function contains a
    nested scope and the contained scope has free variables.  For
    example:

    y = 1
    def f():
        exec "y = 'gotcha'" # or from module import *
        def g():
            return y
        ...

    At compile-time, the compiler cannot tell whether an exec that
    operates on the local namespace or an import * will introduce
    name bindings that shadow the global y.  Thus, it is not possible
    to tell whether the reference to y in g() should refer to the
    global or to a local name in f().

    In discussion on the python-list, people argued for both possible
    interpretations.  On the one hand, some thought that the reference
    in g() should be bound to a local y if one exists.  One problem
    with this interpretation is that it is impossible for a human
    reader of the code to determine the binding of y by local
    inspection.  It seems likely to introduce subtle bugs.  The other
    interpretation is to treat exec and import * as dynamic features
    that do not affect static scoping.  Under this interpretation, the
    exec and import * would introduce local names, but those names
    would never be visible to nested scopes.  In the specific example
    above, the code would behave exactly as it did in earlier versions
    of Python.
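
    The "exec obj in ns" escape hatch can be sketched in modern
    Python, where exec is a function and the explicit namespace is
    passed as an argument: bindings created by the exec stay in the
    supplied dictionary and never shadow names seen by nested scopes.

```python
def f():
    ns = {}
    # The explicit namespace keeps the exec'd binding out of f's
    # local scope, so nested functions are unaffected.
    exec("y = 'gotcha'", ns)

    y = "local"
    def g():
        return y             # sees f's own y, never the exec'd one
    return g(), ns["y"]
```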

    Since each interpretation is problematic and the exact meaning
    ambiguous, the compiler raises an exception.  The Python 2.1
    compiler issues a warning when nested scopes are not enabled.

    A brief review of three Python projects (the standard library,
    Zope, and a beta version of PyXPCOM) found four backwards
    compatibility issues in approximately 200,000 lines of code.
    There was one example of case #1 (subtle behavior change) and two
    examples of import * problems in the standard library.

    (The interpretation of the import * and exec restriction that was
    implemented in Python 2.1a2 was much more restrictive, based on
    language in the reference manual that had never been enforced.
    These restrictions were relaxed following the release.)

Compatibility of C API

    The implementation causes several Python C API functions to
    change, including PyCode_New().  As a result, C extensions may
    need to be updated to work correctly with Python 2.1.  

locals() / vars()

    These functions return a dictionary containing the current scope's
    local variables.  Modifications to the dictionary do not affect
    the values of variables.  Under the current rules, the use of
    locals() and globals() allows the program to gain access to all
    the namespaces in which names are resolved.

    An analogous function will not be provided for nested scopes.
    Under this proposal, it will not be possible to gain
    dictionary-style access to all visible scopes.
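
    A sketch of the snapshot behavior described above, as it works in
    CPython: locals() returns a dictionary whose mutation does not
    write back to the function's variables.

```python
def f():
    x = 1
    snapshot = locals()      # a dictionary view of the local namespace
    snapshot["x"] = 99       # mutating the dictionary...
    return x                 # ...does not change the variable itself
```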

Warnings and Errors

    The compiler will issue warnings in Python 2.1 to help identify
    programs that may not compile or run correctly under future
    versions of Python.  Under Python 2.2, or under Python 2.1 when
    the nested_scopes future statement is used (collectively referred
    to as "future semantics" in this section), the compiler will
    issue SyntaxErrors in some cases.

    The warnings typically apply to a function that contains a nested
    function with free variables.  For example, if function F
    contains a function G and G uses the builtin len(), then F is a
    function that contains a nested function (G) with a free variable
    (len).  The label "free-in-nested" will be used to describe these
    functions.

    import * used in function scope

        The language reference specifies that import * may only occur
        in a module scope.  (Sec. 6.11)  The implementation of C
        Python has supported import * at the function scope.

        If import * is used in the body of a free-in-nested function,
        the compiler will issue a warning.  Under future semantics,
        the compiler will raise a SyntaxError.

    bare exec in function scope

        The exec statement allows two optional expressions following
        the keyword "in" that specify the namespaces used for locals
        and globals.  An exec statement that omits both of these
        namespaces is a bare exec.

        If a bare exec is used in the body of a free-in-nested
        function, the compiler will issue a warning.  Under future
        semantics, the compiler will raise a SyntaxError.

    local shadows global

        If a free-in-nested function has a binding for a local
        variable that (1) is used in a nested function and (2) is the
        same as a global variable, the compiler will issue a warning.

Rebinding names in enclosing scopes

    There are technical issues that make it difficult to support
    rebinding of names in enclosing scopes, but the primary reason
    that it is not allowed in the current proposal is that Guido is
    opposed to it.  His motivation: it is difficult to support,
    because it would require a new mechanism that would allow the
    programmer to specify that an assignment in a block is supposed to
    rebind the name in an enclosing block; presumably a keyword or
    special syntax (x := 3) would make this possible.  Given that this
    would encourage the use of local variables to hold state that is
    better stored in a class instance, it's not worth adding new
    syntax to make this possible (in Guido's opinion).

    The proposed rules allow programmers to achieve the effect of
    rebinding, albeit awkwardly.  The name that will be effectively
    rebound by enclosed functions is bound to a container object.  In
    place of assignment, the program uses modification of the
    container to achieve the desired effect:

    def bank_account(initial_balance):
        balance = [initial_balance]
        def deposit(amount):
            balance[0] = balance[0] + amount
            return balance
        def withdraw(amount):
            balance[0] = balance[0] - amount
            return balance
        return deposit, withdraw
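
    A short usage sketch of the container workaround (bank_account
    repeated from the text above so the example is self-contained):

```python
def bank_account(initial_balance):
    balance = [initial_balance]      # mutable container holds the state
    def deposit(amount):
        balance[0] = balance[0] + amount
        return balance
    def withdraw(amount):
        balance[0] = balance[0] - amount
        return balance
    return deposit, withdraw

deposit, withdraw = bank_account(100)
r1 = deposit(50)     # balance list is now [150]
r2 = withdraw(30)    # balance list is now [120]
```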

    Support for rebinding in nested scopes would make this code
    clearer.  A class that defines deposit() and withdraw() methods
    and the balance as an instance variable would be clearer still.
    Since classes seem to achieve the same effect in a more
    straightforward manner, they are preferred.

Implementation

    XXX Jeremy, is this still the case?

    The implementation for C Python uses flat closures [1].  Each def
    or lambda expression that is executed will create a closure if the
    body of the function or any contained function has free
    variables.  Using flat closures, the creation of closures is
    somewhat expensive but lookup is cheap.

    The implementation adds several new opcodes and two new kinds of
    names in code objects.  A variable can be either a cell variable
    or a free variable for a particular code object.  A cell variable
    is referenced by contained scopes; as a result, the function
    where it is defined must allocate separate storage for it on each
    invocation.  A free variable is referenced via a function's
    closure. 

    The choice of flat closures was made based on three factors.
    First, nested functions are presumed to be used infrequently,
    deeply nested (several levels of nesting) still less frequently.
    Second, lookup of names in a nested scope should be fast.
    Third, the use of nested scopes, particularly where a function
    that accesses an enclosing scope is returned, should not prevent
    unreferenced objects from being reclaimed by the garbage
    collector. 

    XXX Much more to say here

References

    [1] Luca Cardelli.  Compiling a functional language.  In Proc. of
    the 1984 ACM Conference on Lisp and Functional Programming,
    pp. 208-217, Aug. 1984
        http://citeseer.ist.psu.edu/cardelli84compiling.html

Copyright

    XXX


pep-0228 Reworking Python's Numeric Model

PEP: 228
Title: Reworking Python's Numeric Model
Version: $Revision$
Last-Modified: $Date$
Author: Moshe Zadka <moshez at zadka.site.co.il>, Guido van Rossum <guido at python.org>
Status: Withdrawn
Type: Standards Track
Created: 4-Nov-2000
Python-Version: ??
Post-History: 

Withdrawal

    This PEP has been withdrawn in favor of PEP 3141.


Abstract

    Today, Python's numerical model is similar to the C numeric model:
    there are several unrelated numerical types, and when operations
    between numerical types are requested, coercions happen.  While
    the C rationale for the numerical model is that it is very similar
    to what happens at the hardware level, that rationale does not
    apply to Python.  So, while it is acceptable to C programmers that
    2/3 == 0, it is surprising to many Python programmers.

    NOTE: in the light of recent discussions in the newsgroup, the
    motivation in this PEP (and details) need to be extended.


Rationale

    In usability studies, one of the least usable aspects of Python
    the fact that integer division returns the floor of the division.
    This makes it hard to program correctly, requiring casts to
    float() in various parts through the code.  Python's numerical
    model stems from C, while a model that might be easier to work with
    can be based on the mathematical understanding of numbers.


Other Numerical Models

    Perl's numerical model is that there is one type of numbers --
    floating point numbers.  While it is consistent and superficially
    non-surprising, it tends to have subtle gotchas.  One of these is
    that printing numbers is very tricky, and requires correct
    rounding.  In Perl, there is also a mode where all numbers are
    integers.  This mode also has its share of problems, which arise
    from the fact that there is not even an approximate way of
    dividing numbers and getting meaningful answers.


Suggested Interface For Python's Numerical Model

    While coercion rules will remain for add-on types and classes, the
    built-in type system will have exactly one Python type -- a
    number.  There are several things which can be considered "number
    methods":

    1. isnatural()
    2. isintegral()
    3. isrational()
    4. isreal()
    5. iscomplex()

    a. isexact()

    Obviously, a number which answers true to one of questions 1 to 5
    will also answer true to every following question.  If "isexact()"
    is not true, then any answer might be wrong.
    (But not horribly wrong: it's close to the truth.)

    Now, there are two things the model promises for the field
    operations (+, -, /, *):

    - If both operands satisfy isexact(), the result satisfies
      isexact().

    - All field rules are true, except that for not-isexact() numbers,
      they might be only approximately true.

    One consequence of these two rules is that all exact calculations
    are done as (complex) rationals: since the field laws must hold,
    then

        (a/b)*b == a

    must hold.
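
    This model was never adopted (the PEP was withdrawn in favor of
    PEP 3141), but the later fractions module illustrates the kind of
    exact rational arithmetic the rule requires, in contrast with
    inexact binary floating point:

```python
from fractions import Fraction

# Exact rationals satisfy the field law (a/b)*b == a exactly.
a, b = Fraction(2), Fraction(7)
assert (a / b) * b == a

# Inexact (binary floating point) numbers obey the field rules only
# approximately; the classic illustration of inexactness:
assert 0.1 + 0.2 != 0.3
assert Fraction(1, 10) + Fraction(2, 10) == Fraction(3, 10)
```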

    There is a built-in function, inexact(), which takes a number
    and returns an inexact number which is a good approximation.
    Inexact numbers must be at least as accurate as if they were
    using IEEE-754.

    Several of the classical Python functions will return exact numbers
    even when given inexact numbers: e.g., int().

Coercion

    The number type does not define nb_coerce.  Any numeric operation
    slot, when receiving something other than PyNumber, refuses to
    implement it.

Inexact Operations

    The functions in the "math" module will be allowed to return
    inexact results for exact values.  However, they will never return
    a non-real number.  The functions in the "cmath" module are also
    allowed to return an inexact result for an exact argument, and are
    furthermore allowed to return a complex result for a real
    argument.


Numerical Python Issues

    People who use Numerical Python do so for high-performance vector
    operations.  Therefore, NumPy should keep its hardware based
    numeric model.


Unresolved Issues

    Which number literals will be exact, and which inexact?

    How do we deal with IEEE 754 operations? (probably, isnan/isinf should
    be methods)

    On 64-bit machines, comparisons between ints and floats may be
    broken when the comparison involves conversion to float.  Ditto
    for comparisons between longs and floats.  This can be dealt with
    by avoiding the conversion to float.  (Due to Andrew Koenig.)


Copyright

    This document has been placed in the public domain.



pep-0229 Using Distutils to Build Python

PEP: 229
Title: Using Distutils to Build Python
Version: $Revision$
Last-Modified: $Date$
Author: A.M. Kuchling <amk at amk.ca>
Status: Final
Type: Standards Track
Created: 16-Nov-2000
Post-History: 

Introduction

    The Modules/Setup mechanism has some flaws:

    * People have to remember to uncomment bits of Modules/Setup in
      order to get all the possible modules.

    * Moving Setup to a new version of Python is tedious; new modules
      have been added, so you can't just copy the older version, but
      have to reconcile the two versions.

    * Users have to figure out where the needed libraries, such as
      zlib, are installed.


Proposal

    Use the Distutils to build the modules that come with Python.

    The changes can be broken up into several pieces:

    1. The Distutils needs some Python modules to be able to build
       modules.  Currently I believe the minimal list is posix, _sre,
       and string.

       These modules will have to be built before the Distutils can be
       used, so they'll simply be hardwired into Modules/Makefile and
       be automatically built.

    2. A top-level setup.py script will be written that checks the
       libraries installed on the system and compiles as many modules
       as possible.

    3. Modules/Setup will be kept and settings in it will override
       setup.py's usual behavior, so you can disable a module known
       to be buggy, or specify particular compilation or linker flags.
       However, in the common case where setup.py works correctly,
       everything in Setup will remain commented out.  The other
       Setup.* files become unnecessary, since nothing will be
       generating Setup automatically.

    The patch was checked in for Python 2.1, and has been subsequently 
    modified.


Implementation

    Patch #102588 on SourceForge contains the proposed patch.
    Currently the patch tries to be conservative and to change as few
    files as possible, in order to simplify backing out the patch.
    For example, no attempt is made to rip out the existing build
    mechanisms.  Such simplifications can wait for later in the beta
    cycle, when we're certain the patch will be left in, or they can
    wait for Python 2.2.
    
    The patch makes the following changes:

    * Makes some required changes to distutils/sysconfig (these will
      be checked in separately)

    * In the top-level Makefile.in, the "sharedmods" target simply 
      runs "./python setup.py build", and "sharedinstall" runs
      "./python setup.py install".  The "clobber" target also deletes
      the build/ subdirectory where Distutils puts its output.

    * Modules/Setup.config.in only contains entries for the gc and thread
      modules; the readline, curses, and db modules are removed because 
      it's now setup.py's job to handle them.

    * Modules/Setup.dist now contains entries for only 3 modules --
      _sre, posix, and strop.

    * The configure script builds setup.cfg from setup.cfg.in.  This
      is needed for two reasons: to make building in subdirectories
      work, and to get the configured installation prefix.

    * Adds setup.py to the top directory of the source tree.  setup.py
      is the largest piece of the puzzle, though not the most
      complicated.  setup.py contains a subclass of the BuildExt
      class, and extends it with a detect_modules() method that does
      the work of figuring out when modules can be compiled, and adding 
      them to the 'exts' list.


Unresolved Issues

    Do we need to make it possible to disable the 3 hard-wired modules
    without manually hacking the Makefiles?  [Answer: No.]

    The Distutils always compile modules as shared libraries.  How do
    we support compiling them statically into the resulting Python
    binary?

    [Answer: building a Python binary with the Distutils should be
    feasible, though no one has implemented it yet.  This should be
    done someday, but isn't a pressing priority as messing around with
    the top-level Makefile.pre.in is good enough.]


Copyright

    This document has been placed in the public domain.



pep-0230 Warning Framework

PEP: 230
Title: Warning Framework
Version: $Revision$
Last-Modified: $Date$
Author: Guido van Rossum <guido at python.org>
Status: Final
Type: Standards Track
Created: 
Python-Version: 2.1
Post-History: 05-Nov-2000

Abstract

    This PEP proposes a C and Python level API, as well as command
    line flags, to issue warning messages and control what happens to
    them.  This is mostly based on GvR's proposal posted to python-dev
    on 05-Nov-2000, with some ideas (such as using classes to
    categorize warnings) merged in from Paul Prescod's
    counter-proposal posted on the same date.  Also, an attempt to
    implement the proposal caused several small tweaks.


Motivation

    With Python 3000 looming, it is necessary to start issuing
    warnings about the use of obsolete or deprecated features, in
    addition to errors.  There are also lots of other reasons to be
    able to issue warnings, both from C and from Python code, both at
    compile time and at run time.

    Warnings aren't fatal, and thus it's possible that a program
    triggers the same warning many times during a single execution.
    It would be annoying if a program emitted an endless stream of
    identical warnings.  Therefore, a mechanism is needed that
    suppresses multiple identical warnings.

    It is also desirable to have user control over which warnings are
    printed.  While in general it is useful to see all warnings all
    the time, there may be times where it is impractical to fix the
    code right away in a production program.  In this case, there
    should be a way to suppress warnings.

    It is also useful to be able to suppress specific warnings during
    program development, e.g. when a warning is generated by a piece
    of 3rd party code that cannot be fixed right away, or when there
    is no way to fix the code (possibly a warning message is generated
    for a perfectly fine piece of code).  It would be unwise to offer
    to suppress all warnings in such cases: the developer would miss
    warnings about the rest of the code.

    On the other hand, there are also situations conceivable where
    some or all warnings are better treated as errors.  For example,
    it may be a local coding standard that a particular deprecated
    feature should not be used.  In order to enforce this, it is
    useful to be able to turn the warning about this particular
    feature into an error, raising an exception (without necessarily
    turning all warnings into errors).

    Therefore, I propose to introduce a flexible "warning filter"
    which can filter out warnings or change them into exceptions,
    based on:

    - Where in the code they are generated (per package, module, or
      function)

    - The warning category (warning categories are discussed below)

    - A specific warning message

    The warning filter must be controllable both from the command line
    and from Python code.
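
    The Python-level control described here became the
    warnings.filterwarnings() function.  A sketch of turning one
    category into an error while leaving others alone (the
    catch_warnings context manager used to isolate the filter change
    is a later addition to the module):

```python
import warnings

with warnings.catch_warnings():
    # Prepend a filter: DeprecationWarning becomes an exception.
    warnings.filterwarnings("error", category=DeprecationWarning)
    try:
        warnings.warn("old feature", DeprecationWarning)
        raised = False
    except DeprecationWarning:
        raised = True       # the warning was turned into an error
```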


APIs For Issuing Warnings

    - To issue a warning from Python:

        import warnings
        warnings.warn(message[, category[, stacklevel]])

      The category argument, if given, must be a warning category
      class (see below); it defaults to warnings.UserWarning.  This
      may raise an exception if the particular warning issued is
      changed into an error by the warnings filter.  The stacklevel
      can be used by wrapper functions written in Python, like this:

      def deprecation(message):
          warn(message, DeprecationWarning, stacklevel=2)

      This makes the warning refer to the deprecation()'s caller,
      rather than to the source of deprecation() itself (since the
      latter would defeat the purpose of the warning message).
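
      A runnable sketch of such a wrapper, using the warnings module
      as it was ultimately implemented (the record-mode
      catch_warnings used to capture the warning is a later testing
      convenience, and new_api is an invented name):

```python
import warnings

def deprecation(message):
    # stacklevel=2 attributes the warning to deprecation()'s caller,
    # not to this wrapper itself.
    warnings.warn(message, DeprecationWarning, stacklevel=2)

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    deprecation("use new_api() instead")
# caught now holds one recorded DeprecationWarning
```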

    - To issue a warning from C:

        int PyErr_Warn(PyObject *category, char *message);

      Return 0 normally, 1 if an exception is raised (either because
      the warning was transformed into an exception, or because of a
      malfunction in the implementation, such as running out of
      memory).  The category argument must be a warning category class
      (see below) or NULL, in which case it defaults to
      PyExc_RuntimeWarning.  When PyErr_Warn() returns 1, the
      caller should do normal exception handling.

      The current C implementation of PyErr_Warn() imports the
      warnings module (implemented in Python) and calls its warn()
      function.  This minimizes the amount of C code that needs to be
      added to implement the warning feature.

      [XXX Open Issue: what about issuing warnings during lexing or
      parsing, which don't have the exception machinery available?]


Warnings Categories

    There are a number of built-in exceptions that represent warning
    categories.  This categorization makes it possible to filter out
    groups of warnings.  The following warning category classes
    are currently defined:

    - Warning -- this is the base class of all warning category
      classes and is itself a subclass of Exception

    - UserWarning -- the default category for warnings.warn()

    - DeprecationWarning -- base category for warnings about deprecated
      features

    - SyntaxWarning -- base category for warnings about dubious
      syntactic features

    - RuntimeWarning -- base category for warnings about dubious
      runtime features

    [XXX: Other warning categories may be proposed during the review
    period for this PEP.]

    These standard warning categories are available from C as
    PyExc_Warning, PyExc_UserWarning, etc.  From Python, they are
    available in the __builtin__ module, so no import is necessary.

    User code can define additional warning categories by subclassing
    one of the standard warning categories.  A warning category must
    always be a subclass of the Warning class.
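    A minimal sketch of a user-defined category, as this paragraph
    describes (LocalDeprecationWarning is a hypothetical name for a
    category enforcing a local coding standard):

```python
import warnings

class LocalDeprecationWarning(DeprecationWarning):
    """Hypothetical category for a local coding standard."""

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    warnings.warn("spam() violates local policy", LocalDeprecationWarning)

w = caught[0]
print(w.category.__name__, issubclass(w.category, Warning))
# LocalDeprecationWarning True
```

    Because the new class subclasses DeprecationWarning, any filter
    matching DeprecationWarning also matches it.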


The Warnings Filter

    The warnings filter controls whether warnings are ignored,
    displayed, or turned into errors (raising an exception).

    There are three sides to the warnings filter:

    - The data structures used to efficiently determine the
      disposition of a particular warnings.warn() or PyErr_Warn()
      call.

    - The API to control the filter from Python source code.

    - The command line switches to control the filter.

    The warnings filter works in several stages.  It is optimized for
    the (expected to be common) case where the same warning is issued
    from the same place in the code over and over.

    First, the warning filter collects the module and line number
    where the warning is issued; this information is readily available
    through sys._getframe().

    Conceptually, the warnings filter maintains an ordered list of
    filter specifications; any specific warning is matched against
    each filter specification in the list in turn until a match is
    found; the match determines the disposition of the warning.  Each
    entry is a tuple as follows:

      (category, message, module, lineno, action)

    - category is a class (a subclass of warnings.Warning) of which
      the warning category must be a subclass in order to match

    - message is a compiled regular expression that the warning
      message must match (the match is case-insensitive)

    - module is a compiled regular expression that the module name
      must match

    - lineno is an integer that the line number where the warning
      occurred must match, or 0 to match all line numbers

    - action is one of the following strings:

        - "error" -- turn matching warnings into exceptions

        - "ignore" -- never print matching warnings

        - "always" -- always print matching warnings

        - "default" -- print the first occurrence of matching warnings
          for each location where the warning is issued

        - "module" -- print the first occurrence of matching warnings
          for each module where the warning is issued

        - "once" -- print only the first occurrence of matching
          warnings

    Since the Warning class is derived from the built-in Exception
    class, to turn a warning into an error we simply raise
    category(message).
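    With the module as released (whose filterwarnings() takes the
    action first, a slightly different order than the draft tuple
    above), turning a matching warning into an error looks like this
    sketch:

```python
import warnings

with warnings.catch_warnings():
    # Roughly equivalent to prepending the filter entry
    # (DeprecationWarning, "spam", "", 0, "error") described above.
    warnings.filterwarnings("error", message="spam",
                            category=DeprecationWarning)
    try:
        warnings.warn("spam() is deprecated", DeprecationWarning)
        result = "no exception"
    except DeprecationWarning as exc:
        # The filter raised category(message), as the text describes.
        result = f"raised: {exc}"

print(result)
```

    The message pattern is a regular expression matched against the
    start of the warning text, case-insensitively.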


Warnings Output And Formatting Hooks

    When the warnings filter decides to issue a warning (but not when
    it decides to raise an exception), it passes the information to
    the function warnings.showwarning(message, category, filename,
    lineno).  The default implementation of this function writes the
    warning text to sys.stderr and echoes the source line given by
    filename and lineno.  It has an optional fifth argument which can
    be used to specify a file other than sys.stderr.

    The formatting of warnings is done by a separate function,
    warnings.formatwarning(message, category, filename, lineno).  This
    returns a string (that may contain newlines and ends in a newline)
    that can be printed to get the identical effect of the
    showwarning() function.
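    Both hooks are plain module attributes and can be overridden.  A
    sketch that collects warnings in a list instead of writing to
    sys.stderr (note that the released showwarning() grew optional
    file and line parameters beyond the four arguments named here; an
    IDE could just as well pop up a dialog in this hook):

```python
import warnings

captured = []

def collecting_showwarning(message, category, filename, lineno,
                           file=None, line=None):
    # Format exactly as the default hook would, but keep the text
    # instead of printing it to sys.stderr.
    captured.append(warnings.formatwarning(message, category,
                                           filename, lineno, line))

with warnings.catch_warnings():
    warnings.simplefilter("always")
    warnings.showwarning = collecting_showwarning   # restored on exit
    warnings.warn("look out", UserWarning)

print(captured[0].splitlines()[0])
```
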


API For Manipulating Warning Filters

      warnings.filterwarnings(message, category, module, lineno, action)

    This checks the types of the arguments, compiles the message and
    module regular expressions, and inserts them as a tuple in front
    of the warnings filter.

      warnings.resetwarnings()

    Reset the warnings filter to empty.
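    A brief runnable sketch of these two calls with the module as it
    exists today (the released filterwarnings() takes the action as
    its first argument):

```python
import warnings

with warnings.catch_warnings(record=True) as caught:
    warnings.resetwarnings()                      # empty the filter list
    warnings.filterwarnings("ignore", category=UserWarning)
    warnings.warn("hidden", UserWarning)          # matches the ignore entry
    warnings.warn("visible", RuntimeWarning)      # falls through to default

print([str(w.message) for w in caught])
```

    Only the RuntimeWarning survives: the UserWarning matched the
    "ignore" entry inserted at the front of the filter.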


Command Line Syntax

    There should be command line options to specify the most common
    filtering actions, which I expect to include at least:

    - suppress all warnings

    - suppress a particular warning message everywhere

    - suppress all warnings in a particular module

    - turn all warnings into exceptions

    I propose the following command line option syntax:

    -Waction[:message[:category[:module[:lineno]]]]

    Where:

    - 'action' is an abbreviation of one of the allowed actions
      ("error", "default", "ignore", "always", "once", or "module")

    - 'message' is a message string; matches warnings whose message
      text is an initial substring of 'message' (matching is
      case-insensitive)

    - 'category' is an abbreviation of a standard warning category
      class name *or* a fully-qualified name for a user-defined
      warning category class of the form [package.]module.classname

    - 'module' is a module name (possibly package.module)

    - 'lineno' is an integral line number

    All parts except 'action' may be omitted, where an empty value
    after stripping whitespace is the same as an omitted value.

    The C code that parses the Python command line saves the body of
    all -W options in a list of strings, which is made available to
    the warnings module as sys.warnoptions.  The warnings module
    parses these when it is first imported.  Errors detected during
    the parsing of sys.warnoptions are not fatal; a message is written
    to sys.stderr and processing continues with the next option.

    Examples:

    -Werror
        Turn all warnings into errors

    -Wall
        Show all warnings

    -Wignore
        Ignore all warnings

    -Wi:hello
        Ignore warnings whose message text starts with "hello"

    -We::Deprecation
        Turn deprecation warnings into errors

    -Wi:::spam:10
        Ignore all warnings on line 10 of module spam

    -Wi:::spam -Wd:::spam:10
        Ignore all warnings in module spam except on line 10

    -We::Deprecation -Wd::Deprecation:spam
        Turn deprecation warnings into errors except in module spam
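    Assuming a modern interpreter (which wants the full category class
    name rather than the abbreviations shown above), the effect of -W
    can be demonstrated by spawning child interpreters; this is a
    sketch, not part of the PEP:

```python
import subprocess
import sys

# A child program that triggers one DeprecationWarning.
code = "import warnings; warnings.warn('old', DeprecationWarning)"

# -W error::DeprecationWarning turns the warning into an uncaught
# exception, so the child exits with a nonzero status.
strict = subprocess.run(
    [sys.executable, "-W", "error::DeprecationWarning", "-c", code],
    capture_output=True, text=True)

# -W ignore::DeprecationWarning suppresses it entirely.
lenient = subprocess.run(
    [sys.executable, "-W", "ignore::DeprecationWarning", "-c", code],
    capture_output=True, text=True)

print(strict.returncode, lenient.returncode)
```
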


Open Issues

    Some open issues off the top of my head:

    - What about issuing warnings during lexing or parsing, which
      don't have the exception machinery available?

    - The proposed command line syntax is a bit ugly (although the
      simple cases aren't so bad: -Werror, -Wignore, etc.).  Anybody
      got a better idea?

    - I'm a bit worried that the filter specifications are too
      complex.  Perhaps filtering only on category and module (not on
      message text and line number) would be enough?

    - There's a bit of confusion between module names and file names.
      The reporting uses file names, but the filter specification uses
      module names.  Maybe it should allow filenames as well?

    - I'm not at all convinced that packages are handled right.

    - Do we need more standard warning categories?  Fewer?

    - In order to minimize the start-up overhead, the warnings module
      is imported by the first call to PyErr_Warn().  It does the
      command line parsing for -W options upon import.  Therefore, it
      is possible that warning-free programs will not complain about
      invalid -W options.


Rejected Concerns

    Paul Prescod, Barry Warsaw and Fred Drake have brought up several
    additional concerns that I feel aren't critical.  I address them
    here (the concerns are paraphrased, not exactly their words):

    - Paul: warn() should be a built-in or a statement to make it easily
      available.

      Response: "from warnings import warn" is easy enough.

    - Paul: What if I have a speed-critical module that triggers
      warnings in an inner loop.  It should be possible to disable the
      overhead for detecting the warning (not just suppress the
      warning).

      Response: rewrite the inner loop to avoid triggering the
      warning.

    - Paul: What if I want to see the full context of a warning?

      Response: use -Werror to turn it into an exception.

    - Paul: I prefer ":*:*:" to ":::" for leaving parts of the warning
      spec out.

      Response: I don't.

    - Barry: It would be nice if lineno can be a range specification.

      Response: Too much complexity already.

    - Barry: I'd like to add my own warning action.  Maybe if `action'
      could be a callable as well as a string.  Then in my IDE, I
      could set that to "mygui.popupWarningsDialog".

      Response: For that purpose you would override
      warnings.showwarning().

    - Fred: why do the Warning category classes have to be in
      __builtin__?

      Response: that's the simplest implementation, given that the
      warning categories must be available in C before the first
      PyErr_Warn() call, which imports the warnings module.  I see no
      problem with making them available as built-ins.


Implementation

    Here's a prototype implementation:

  http://sourceforge.net/patch/?func=detailpatch&patch_id=102715&group_id=5470


pep-0231 __findattr__()

PEP: 231
Title: __findattr__()
Version: $Revision$
Last-Modified: $Date$
Author: Barry Warsaw <barry at python.org>
Status: Rejected
Type: Standards Track
Created: 30-Nov-2000
Python-Version: 2.1
Post-History: 

Introduction

    This PEP describes an extension to instance attribute lookup and
    modification machinery, which allows pure-Python implementations
    of many interesting programming models.  This PEP tracks the
    status and ownership of this feature.  It contains a description
    of the feature and outlines changes necessary to support the
    feature.  This PEP summarizes discussions held in mailing list
    forums, and provides URLs for further information, where
    appropriate.  The CVS revision history of this file contains the
    definitive historical record.


Background

    The semantics for Python instances allow the programmer to
    customize some aspects of attribute lookup and attribute
    modification, through the special methods __getattr__() and
    __setattr__() [1].

    However, because of certain restrictions imposed by these methods,
    there are useful programming techniques that cannot be written in
    Python alone, e.g. strict Java Bean-like[2] interfaces and Zope
    style acquisitions[3].  In the latter case, Zope solves this by
    including a C extension called ExtensionClass[5] which modifies
    the standard class semantics, and uses a metaclass hook in
    Python's class model called alternatively the "Don Beaudry Hook"
    or "Don Beaudry Hack"[6].

    While Zope's approach works, it has several disadvantages.  First,
    it requires a C extension.  Second, it employs a very arcane, but
    truck-sized loophole in the Python machinery.  Third, it can be
    difficult for other programmers to use and understand (the
    metaclass has well-known brain exploding properties).  And fourth,
    because ExtensionClass instances aren't "real" Python instances,
    some aspects of the Python runtime system don't work with
    ExtensionClass instances.

    Proposals for fixing this problem have often been lumped under the
    rubric of fixing the "class/type dichotomy"; that is, eliminating
    the difference between built-in types and classes[7].  While a
    laudable goal itself, repairing this rift is not necessary in
    order to achieve the types of programming constructs described
    above.  This proposal provides an 80% solution with a minimum of
    modification to Python's class and instance objects.  It does
    nothing to address the type/class dichotomy.


Proposal

    This proposal adds a new special method called __findattr__() with
    the following semantics:

    * If defined in a class, it will be called on all instance
      attribute resolutions instead of __getattr__() and
      __setattr__().

    * __findattr__() is never called recursively.  That is, when a
      specific instance's __findattr__() is on the call stack, further
      attribute accesses for that instance will use the standard
      __getattr__() and __setattr__() methods.

    * __findattr__() is called for both attribute access (`getting')
      and attribute modification (`setting').  It is not called for
      attribute deletion.

    * When called for getting, it is passed a single argument (not
      counting `self'): the name of the attribute being accessed.

    * When called for setting, it is called with a third argument, which
      is the value to set the attribute to.

    * __findattr__() methods have the same caching semantics as
      __getattr__() and __setattr__(); i.e. if they are present in the
      class at class definition time, they are used, but if they are
      subsequently added to the class, they are not.
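    __findattr__() was never added to Python (see the Rejection
    section below), but its non-recursive semantics can be roughly
    emulated in modern Python with __getattribute__()/__setattr__()
    and an explicit guard flag.  FindAttr, TracingBean, and _findattr
    below are illustrative names, not part of any proposal:

```python
class FindAttr:
    """Sketch of the non-recursive hook: a per-instance flag stands in
    for the recursion guard the PEP proposed to build into the
    interpreter.  Names starting with '_' bypass the hook."""

    def __init__(self):
        object.__setattr__(self, "_in_hook", False)

    def __getattribute__(self, name):
        if name.startswith("_") or object.__getattribute__(self, "_in_hook"):
            return object.__getattribute__(self, name)
        object.__setattr__(self, "_in_hook", True)
        try:
            return object.__getattribute__(self, "_findattr")(name)
        finally:
            object.__setattr__(self, "_in_hook", False)

    def __setattr__(self, name, value):
        if object.__getattribute__(self, "_in_hook"):
            object.__setattr__(self, name, value)
        else:
            object.__setattr__(self, "_in_hook", True)
            try:
                object.__getattribute__(self, "_findattr")(name, value)
            finally:
                object.__setattr__(self, "_in_hook", False)


class TracingBean(FindAttr):
    """Hypothetical example: record every public attribute get/set."""

    def __init__(self):
        super().__init__()
        object.__setattr__(self, "log", [])

    def _findattr(self, name, *args):
        # Inside the hook the guard flag is set, so these ordinary
        # getattr/setattr calls do not re-enter _findattr().
        self.log.append((name,) + args)
        if args:
            setattr(self, name, args[0])
        else:
            return getattr(self, name)


b = TracingBean()
b.x = 3          # routed through _findattr(), once
print(b.x)       # 3
```

    As the Rejection section notes, guarding per instance rather than
    per thread is not thread-safe; this sketch shares that limitation.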


Key Differences with the Existing Protocol

    __findattr__()'s semantics are different from the existing
    protocol in key ways:

    First, __getattr__() is never called if the attribute is found in
    the instance's __dict__.  This is done for efficiency reasons, and
    because otherwise, __setattr__() would have no way to get to the
    instance's attributes.

    Second, __setattr__() cannot use "normal" syntax for setting
    instance attributes, e.g. "self.name = foo" because that would
    cause recursive calls to __setattr__().

    __findattr__() is always called regardless of whether the
    attribute is in __dict__ or not, and a flag in the instance object
    prevents recursive calls to __findattr__().  This gives the class
    a chance to perform some action for every attribute access.  And
    because it is called for both gets and sets, it is easy to write
    similar policy for all attribute access.  Further, efficiency is
    not a problem, because the cost is paid only when the extended
    mechanism is used.


Related Work

    PEP 213 [9] describes a different approach to hooking into
    attribute access and modification.  The semantics proposed in PEP
    213 can be implemented using the __findattr__() hook described
    here, with one caveat.  The current reference implementation of
    __findattr__() does not support hooking on attribute deletion.
    This could be added if it's found desirable.  See example below.


Examples

    One programming style that this proposal allows is a Java
    Bean-like interface to objects, where unadorned attribute access
    and modification is transparently mapped to a functional
    interface.  E.g.

        class Bean:
            def __init__(self, x):
                self.__myfoo = x

            def __findattr__(self, name, *args):
                if name.startswith('_'):
                    # Private names
                    if args: setattr(self, name, args[0])
                    else:    return getattr(self, name)
                else:
                    # Public names
                    if args: name = '_set_' + name
                    else:    name = '_get_' + name
                    return getattr(self, name)(*args)

            def _set_foo(self, x):
                self.__myfoo = x

            def _get_foo(self):
                return self.__myfoo


        b = Bean(3)
        print b.foo
        b.foo = 9
        print b.foo
    

    A second, more elaborate example is the implementation of both
    implicit and explicit acquisition in pure Python:

        import types

        class MethodWrapper:
            def __init__(self, container, method):
                self.__container = container
                self.__method = method

            def __call__(self, *args, **kws):
                return self.__method.im_func(self.__container, *args, **kws)


        class WrapperImplicit:
            def __init__(self, contained, container):
                self.__contained = contained
                self.__container = container

            def __repr__(self):
                return '<Wrapper: [%s | %s]>' % (self.__container,
                                                 self.__contained)

            def __findattr__(self, name, *args):
                # Some things are our own
                if name.startswith('_WrapperImplicit__'):
                    if args: return setattr(self, name, *args)
                    else:    return getattr(self, name)
                # setattr stores the name on the contained object directly
                if args:
                    return setattr(self.__contained, name, args[0])
                # Other special names
                if name == 'aq_parent':
                    return self.__container
                elif name == 'aq_self':
                    return self.__contained
                elif name == 'aq_base':
                    base = self.__contained
                    try:
                        while 1:
                            base = base.aq_self
                    except AttributeError:
                        return base
                # no acquisition for _ names
                if name.startswith('_'):
                    return getattr(self.__contained, name)
                # Everything else gets wrapped
                missing = []
                which = self.__contained
                obj = getattr(which, name, missing)
                if obj is missing:
                    which = self.__container
                    obj = getattr(which, name, missing)
                    if obj is missing:
                        raise AttributeError, name
                of = getattr(obj, '__of__', missing)
                if of is not missing:
                    return of(self)
                elif type(obj) == types.MethodType:
                    return MethodWrapper(self, obj)
                return obj


        class WrapperExplicit:
            def __init__(self, contained, container):
                self.__contained = contained
                self.__container = container

            def __repr__(self):
                return '<Wrapper: [%s | %s]>' % (self.__container,
                                                 self.__contained)

            def __findattr__(self, name, *args):
                # Some things are our own
                if name.startswith('_WrapperExplicit__'):
                    if args: return setattr(self, name, *args)
                    else:    return getattr(self, name)
                # setattr stores the name on the contained object directly
                if args:
                    return setattr(self.__contained, name, args[0])
                # Other special names
                if name == 'aq_parent':
                    return self.__container
                elif name == 'aq_self':
                    return self.__contained
                elif name == 'aq_base':
                    base = self.__contained
                    try:
                        while 1:
                            base = base.aq_self
                    except AttributeError:
                        return base
                elif name == 'aq_acquire':
                    return self.aq_acquire
                # explicit acquisition only
                obj = getattr(self.__contained, name)
                if type(obj) == types.MethodType:
                    return MethodWrapper(self, obj)
                return obj

            def aq_acquire(self, name):
                # Everything else gets wrapped
                missing = []
                which = self.__contained
                obj = getattr(which, name, missing)
                if obj is missing:
                    which = self.__container
                    obj = getattr(which, name, missing)
                    if obj is missing:
                        raise AttributeError, name
                of = getattr(obj, '__of__', missing)
                if of is not missing:
                    return of(self)
                elif type(obj) == types.MethodType:
                    return MethodWrapper(self, obj)
                return obj


        class Implicit:
            def __of__(self, container):
                return WrapperImplicit(self, container)

            def __findattr__(self, name, *args):
                # ignore setattrs
                if args:
                    return setattr(self, name, args[0])
                obj = getattr(self, name)
                missing = []
                of = getattr(obj, '__of__', missing)
                if of is not missing:
                    return of(self)
                return obj


        class Explicit(Implicit):
            def __of__(self, container):
                return WrapperExplicit(self, container)


        # tests
        class C(Implicit):
            color = 'red'

        class A(Implicit):
            def report(self):
                return self.color

        # simple implicit acquisition
        c = C()
        a = A()
        c.a = a
        assert c.a.report() == 'red'

        d = C()
        d.color = 'green'
        d.a = a
        assert d.a.report() == 'green'

        try:
            a.report()
        except AttributeError:
            pass
        else:
            assert 0, 'AttributeError expected'


        # special names
        assert c.a.aq_parent is c
        assert c.a.aq_self is a

        c.a.d = d
        assert c.a.d.aq_base is d
        assert c.a is not a


        # no acquisition on _ names
        class E(Implicit):
            _color = 'purple'

        class F(Implicit):
            def report(self):
                return self._color

        e = E()
        f = F()
        e.f = f
        try:
            e.f.report()
        except AttributeError:
            pass
        else:
            assert 0, 'AttributeError expected'


        # explicit
        class G(Explicit):
            color = 'pink'

        class H(Explicit):
            def report(self):
                return self.aq_acquire('color')

            def barf(self):
                return self.color

        g = G()
        h = H()
        g.h = h
        assert g.h.report() == 'pink'

        i = G()
        i.color = 'cyan'
        i.h = h
        assert i.h.report() == 'cyan'

        try:
            g.h.barf()
        except AttributeError:
            pass
        else:
            assert 0, 'AttributeError expected'
    

    C++-like access control can also be accomplished, although less
    cleanly because of the difficulty of figuring out what method is
    being called from the runtime call stack:

        import sys
        import types

        PUBLIC = 0
        PROTECTED = 1
        PRIVATE = 2

        try:
            getframe = sys._getframe
        except AttributeError:
            def getframe(n):
                try: raise Exception
                except Exception:
                    frame = sys.exc_info()[2].tb_frame
                while n > 0:
                    frame = frame.f_back
                    if frame is None:
                        raise ValueError, 'call stack is not deep enough'
                    n = n - 1
                return frame


        class AccessViolation(Exception):
            pass


        class Access:
            def __findattr__(self, name, *args):
                methcache = self.__dict__.setdefault('__cache__', {})
                missing = []
                obj = getattr(self, name, missing)
                # if obj is missing we better be doing a setattr for
                # the first time
                if obj is not missing and type(obj) == types.MethodType:
                    # Disgusting hack because there's no way to
                    # dynamically figure out what the method being
                    # called is from the stack frame.
                    methcache[obj.im_func.func_code] = obj.im_class
                #
                # What's the access permissions for this name?
                access, klass = getattr(self, '__access__', {}).get(
                    name, (PUBLIC, 0))
                if access is not PUBLIC:
                    # Now try to see which method is calling us
                    frame = getframe(0).f_back
                    if frame is None:
                        raise AccessViolation
                    # Get the class of the method that's accessing
                    # this attribute, by using the code object cache
                    if frame.f_code.co_name == '__init__':
                        # There aren't entries in the cache for ctors,
                        # because the calling mechanism doesn't go
                        # through __findattr__().  Are there other
                        # methods that might have the same behavior?
                        # Since we can't know who's __init__ we're in,
                        # for now we'll assume that only protected and
                        # public attrs can be accessed.
                        if access is PRIVATE:
                            raise AccessViolation
                    else:
                        methclass = self.__cache__.get(frame.f_code)
                        if not methclass:
                            raise AccessViolation
                        if access is PRIVATE and methclass is not klass:
                            raise AccessViolation
                        if access is PROTECTED and not issubclass(methclass,
                                                                  klass):
                            raise AccessViolation
                # If we got here, it must be okay to access the attribute
                if args:
                    return setattr(self, name, *args)
                return obj

        # tests
        class A(Access):
            def __init__(self, foo=0, name='A'):
                self._foo = foo
                # can't set private names in __init__
                self.__initprivate(name)

            def __initprivate(self, name):
                self._name = name

            def getfoo(self):
                return self._foo

            def setfoo(self, newfoo):
                self._foo = newfoo

            def getname(self):
                return self._name

        A.__access__ = {'_foo'      : (PROTECTED, A),
                        '_name'     : (PRIVATE, A),
                        '__dict__'  : (PRIVATE, A),
                        '__access__': (PRIVATE, A),
                        }

        class B(A):
            def setfoo(self, newfoo):
                self._foo = newfoo + 3

            def setname(self, name):
                self._name = name

        b = B(1)
        b.getfoo()

        a = A(1)
        assert a.getfoo() == 1
        a.setfoo(2)
        assert a.getfoo() == 2

        try:
            a._foo
        except AccessViolation:
            pass
        else:
            assert 0, 'AccessViolation expected'

        try:
            a._foo = 3
        except AccessViolation:
            pass
        else:
            assert 0, 'AccessViolation expected'

        try:
            a.__dict__['_foo']
        except AccessViolation:
            pass
        else:
            assert 0, 'AccessViolation expected'


        b = B()
        assert b.getfoo() == 0
        b.setfoo(2)
        assert b.getfoo() == 5
        try:
            b.setname('B')
        except AccessViolation:
            pass
        else:
            assert 0, 'AccessViolation expected'

        assert b.getname() == 'A'


    Here's an implementation of the attribute hook described in PEP
    213 (except that hooking on attribute deletion isn't supported by
    the current reference implementation).

        class Pep213:
            def __findattr__(self, name, *args):
                hookname = '__attr_%s__' % name
                if args:
                    op = 'set'
                else:
                    op = 'get'
                # XXX: op = 'del' currently not supported
                missing = []
                meth = getattr(self, hookname, missing)
                if meth is missing:
                    if op == 'set':
                        return setattr(self, name, *args)
                    else:
                        return getattr(self, name)
                else:
                    return meth(op, *args)


        def computation(i):
            print 'doing computation:', i
            return i + 3


        def rev_computation(i):
            print 'doing rev_computation:', i
            return i - 3


        class X(Pep213):
            def __init__(self, foo=0):
                self.__foo = foo

            def __attr_foo__(self, op, val=None):
                if op == 'get':
                    return computation(self.__foo)
                elif op == 'set':
                    self.__foo = rev_computation(val)
                # XXX: 'del' not yet supported

        x = X()
        fooval = x.foo
        print fooval
        x.foo = fooval + 5
        print x.foo
        # del x.foo


Reference Implementation

   The reference implementation, as a patch to the Python core, can be
   found at this URL:

   http://sourceforge.net/patch/?func=detailpatch&patch_id=102613&group_id=5470


References

    [1] http://docs.python.org/reference/datamodel.html#customizing-attribute-access
    [2] http://www.javasoft.com/products/javabeans/
    [3] http://www.digicool.com/releases/ExtensionClass/Acquisition.html
    [5] http://www.digicool.com/releases/ExtensionClass
    [6] http://www.python.org/doc/essays/metaclasses/
    [7] http://www.foretec.com/python/workshops/1998-11/dd-ascher-sum.html
    [8] http://docs.python.org/howto/regex.html
    [9] PEP 213, Attribute Access Handlers, Prescod
        http://www.python.org/dev/peps/pep-0213/


Rejection

    There are serious problems with the recursion-protection feature.
    As described here it's not thread-safe, and a thread-safe solution
    has other problems.  In general, it's not clear how helpful the
    recursion-protection feature is; it makes it hard to write code
    that needs to be callable inside __findattr__ as well as outside
    it.  But without the recursion-protection, it's hard to implement
    __findattr__ at all (since __findattr__ would invoke itself
    recursively for every attribute it tries to access).  There seems
    to be no good solution here.

    It's also dubious how useful it is to support __findattr__ both
    for getting and for setting attributes -- __setattr__ gets called
    in all cases already.

    The examples can all be implemented using __getattr__ if care is
    taken not to store instance variables under their own names.


Copyright

    This document has been placed in the Public Domain.



pep-0232 Function Attributes

PEP: 232
Title: Function Attributes
Version: $Revision$
Last-Modified: $Date$
Author: Barry Warsaw <barry at python.org>
Status: Final
Type: Standards Track
Created: 02-Dec-2000
Python-Version: 2.1
Post-History: 20-Feb-2001

Introduction

    This PEP describes an extension to Python, adding attribute
    dictionaries to functions and methods.  This PEP tracks the status
    and ownership of this feature.  It contains a description of the
    feature and outlines changes necessary to support the feature.
    This PEP summarizes discussions held in mailing list forums, and
    provides URLs for further information, where appropriate.  The CVS
    revision history of this file contains the definitive historical
    record.


Background

    Functions already have a number of attributes, some of which are
    writable, e.g. func_doc, a.k.a. func.__doc__.  func_doc has the
    interesting property that there is special syntax in function (and
    method) definitions for implicitly setting the attribute.  This
    convenience has been exploited over and over again, overloading
    docstrings with additional semantics.

    For example, John Aycock has written a system where docstrings are
    used to define parsing rules[1].  Zope's ZPublisher ORB[2] uses
    docstrings to signal "publishable" methods, i.e. methods that can
    be called through the web.

    The problem with this approach is that the overloaded semantics
    may conflict with each other; for example, we might want to add a
    doctest unit test to a Zope method that should not be publishable
    through the web.


Proposal

    This proposal adds a new dictionary to function objects, called
    func_dict (a.k.a. __dict__).  This dictionary can be set and get
    using ordinary attribute set and get syntax.

    Methods also gain `getter' syntax, and they currently access the
    attribute through the dictionary of the underlying function
    object.  It is not possible to set attributes on bound or unbound
    methods, except by doing so explicitly on the underlying function
    object.  See the `Future Directions' discussion below for
    approaches in subsequent versions of Python.

    A function object's __dict__ can also be set, but only to a
    dictionary object.  Deleting a function's __dict__, or setting it
    to anything other than a concrete dictionary object results in a
    TypeError.  If no function attributes have ever been set, the
    function's __dict__ will be empty.
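
    These constraints can be checked directly.  The following sketch
    uses modern Python, where the behaviour survives unchanged (the
    attribute is spelled __dict__ only; the func_dict alias is gone):

```python
def f():
    pass

# a function starts with an empty attribute dictionary
assert f.__dict__ == {}

# ordinary attribute syntax populates it
f.publish = 1
assert f.__dict__ == {'publish': 1}

# __dict__ may be replaced, but only by a concrete dict
f.__dict__ = {'publish': 2}
assert f.publish == 2

# anything other than a dict is a TypeError
try:
    f.__dict__ = []
except TypeError:
    pass
else:
    raise AssertionError("expected TypeError")
```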


Examples

    Here are some examples of what you can do with this feature.

        def a():
            pass

        a.publish = 1
        a.unittest = '''...'''

        if a.publish:
            print a()

        if hasattr(a, 'unittest'):
            testframework.execute(a.unittest)

        class C:
            def a(self):
                'just a docstring'
            a.publish = 1

        c = C()
        if c.a.publish:
            publish(c.a())
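
    The method behaviour described under `Proposal' can be observed in
    modern Python as well, with one caveat: unbound methods no longer
    exist, so C.a is the plain function object and setting through it
    succeeds directly.  Reading through a bound method still delegates
    to the underlying function (__func__, the old im_func), and setting
    through a bound method is still disallowed:

```python
class C:
    def a(self):
        'just a docstring'

# C.a is the plain function in modern Python, so this sets the
# attribute on the underlying function object
C.a.publish = 1

c = C()
# the bound-method getter delegates to the underlying function
assert c.a.publish == 1
assert c.a.__func__.publish == 1

# setting through the bound method is disallowed
try:
    c.a.publish = 2
except AttributeError:
    pass
else:
    raise AssertionError("expected AttributeError")
```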


Other Uses

    Paul Prescod enumerated a bunch of other uses:

    http://mail.python.org/pipermail/python-dev/2000-April/003364.html


Future Directions

    Here are a number of future directions to consider.  Any adoption
    of these ideas would require a new PEP, which referenced this one,
    and would have to be targeted at a Python version subsequent to
    the 2.1 release.

    - A previous version of this PEP allowed for both setter and
      getter of attributes on unbound methods, and only getter on
      bound methods.  A number of problems were discovered with this
      policy.

      Because method attributes were stored in the underlying
      function, this caused several potentially surprising results:

      class C:
          def a(self): pass

      c1 = C()
      c2 = C()
      c1.a.publish = 1
      # c2.a.publish would now be == 1 also!

      Because a change to `a' bound c1 also caused a change to `a'
      bound to c2, setting of attributes on bound methods was
      disallowed.  However, even allowing setting of attributes on
      unbound methods has its ambiguities:

      class D(C): pass
      class E(C): pass

      D.a.publish = 1
      # E.a.publish would now be == 1 also!

      For this reason, the current PEP disallows setting attributes on
      either bound or unbound methods, but does allow for getting
      attributes on either -- both return the attribute value on the
      underlying function object.

      A future PEP might propose to implement setting (bound or
      unbound) method attributes by setting attributes on the instance
      or class, using special naming conventions.  I.e.

      class C:
          def a(self): pass

      C.a.publish = 1
      C.__a_publish__ == 1 # true

      c = C()
      c.a.publish = 2
      c.__a_publish__ == 2 # true

      d = C()
      d.__a_publish__ == 1 # true

      Here, a lookup on the instance would look to the instance's
      dictionary first, followed by a lookup on the class's
      dictionary, and finally a lookup on the function object's
      dictionary.

    - Currently, Python supports function attributes only on Python
      functions (i.e. those that are written in Python, not those that
      are built-in).  Should it be worthwhile, a separate patch can be
      crafted that will add function attributes to built-ins.

    - __doc__ is the only function attribute that currently has
      syntactic support for conveniently setting.  It may be
      worthwhile to eventually enhance the language for supporting
      easy function attribute setting.  Here are some syntaxes
      suggested by PEP reviewers:

      def a {
          'publish' : 1,
          'unittest': '''...''',
          }
          (args):
          # ...

      def a(args):
          """The usual docstring."""
          {'publish' : 1,
           'unittest': '''...''',
           # etc.
           }

      def a(args) having (publish = 1):
          # see reference [3]
          pass

      The BDFL is currently against any such special syntactic support
      for setting arbitrary function attributes.  Any syntax proposals
      would have to be outlined in new PEPs.


Dissenting Opinion

    When this was discussed on the python-dev mailing list in April
    2000, a number of dissenting opinions were voiced.  For
    completeness, the discussion thread starts here:

    http://mail.python.org/pipermail/python-dev/2000-April/003361.html

    The dissenting arguments appear to fall under the following
    categories:

    - no clear purpose (what does it buy you?)
    - other ways to do it (e.g. mappings as class attributes)
    - useless until syntactic support is included

    Countering some of these arguments is the observation that with
    vanilla Python 2.0, __doc__ can in fact be set to any type of
    object, so some semblance of writable function attributes are
    already feasible.  But that approach is yet another corruption of
    __doc__.

    And while it is of course possible to add mappings to class
    objects (or in the case of function attributes, to the function's
    module), it is more difficult and less obvious how to extract the
    attribute values for inspection.

    Finally, it may be desirable to add syntactic support, much the
    same way that __doc__ syntactic support exists.  This can be
    considered separately from the ability to actually set and get
    function attributes.


Reference Implementation

    This PEP has been accepted and the implementation has been
    integrated into Python 2.1.


References

    [1] Aycock, "Compiling Little Languages in Python",
    http://www.foretec.com/python/workshops/1998-11/proceedings/papers/aycock-little/aycock-little.html

    [2] http://classic.zope.org:8080/Documentation/Reference/ORB

    [3] Hudson, Michael, SourceForge patch implementing this syntax,
    http://sourceforge.net/tracker/index.php?func=detail&aid=403441&group_id=5470&atid=305470


Copyright

    This document has been placed in the Public Domain.



pep-0233 Python Online Help

PEP: 233
Title: Python Online Help
Version: $Revision$
Last-Modified: $Date$
Author: Paul Prescod <paul at prescod.net>
Status: Deferred
Type: Standards Track
Created: 11-Dec-2000
Python-Version: 2.1
Post-History: 

Abstract

    This PEP describes a command-line driven online help facility for
    Python.  The facility should be able to build on existing
    documentation facilities such as the Python documentation and
    docstrings.  It should also be extensible for new types and
    modules.


Interactive Use

    Simply typing "help" describes the help function (through repr()
    overloading).

    "help" can also be used as a function:

    The function takes the following forms of input:

        help( "string" ) -- built-in topic or global
        help( <ob> ) -- docstring from object or type
        help( "doc:filename" ) -- filename from Python documentation

    If you ask for a global, it can be a fully-qualified name such as
    help("xml.dom").

    You can also use the facility from the command line:

        python --help if

    In either situation, the output is paged, similar to the "more"
    command.


Implementation

    The help function is implemented in an onlinehelp module which is
    demand-loaded.

    There should be options for fetching help information from
    environments other than the command line through the onlinehelp
    module:

        onlinehelp.gethelp(object_or_string) -> string

    It should also be possible to override the help display function
    by assigning to onlinehelp.displayhelp(object_or_string).
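
    The onlinehelp module proposed here never shipped as such; the
    facility eventually arrived as help() built on the stdlib pydoc
    module.  A minimal gethelp-style lookup can nevertheless be
    sketched on top of pydoc (the function name gethelp is taken from
    the proposal above; the implementation is an assumption, not the
    PEP's):

```python
import pydoc

def gethelp(object_or_string):
    """Return a plain-text help document for an object or dotted name."""
    if isinstance(object_or_string, str):
        # resolve dotted-name strings like "xml.dom" to the object
        obj = pydoc.locate(object_or_string)
        if obj is None:
            raise ValueError("no help found for %r" % object_or_string)
    else:
        obj = object_or_string
    # render_doc produces the same text that help() pages
    return pydoc.render_doc(obj)
```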

    The module should be able to extract module information from
    either the HTML or LaTeX versions of the Python documentation.
    Links should be accommodated in a "lynx-like" manner.

    Over time, it should also be able to recognize when docstrings are
    in "special" syntaxes like structured text, HTML and LaTeX and
    decode them appropriately.

    A prototype implementation is available with the Python source
    distribution as nondist/sandbox/doctools/onlinehelp.py.


Built-in Topics

    help( "intro" )  - What is Python? Read this first!
    help( "keywords" )  - What are the keywords?
    help( "syntax" )  - What is the overall syntax?
    help( "operators" )  - What operators are available?
    help( "builtins" )  - What functions, types, etc. are built-in?
    help( "modules" )  - What modules are in the standard library?
    help( "copyright" )  - Who owns Python?
    help( "moreinfo" )  - Where is there more information?
    help( "changes" )  - What changed in Python 2.0?
    help( "extensions" )  - What extensions are installed?
    help( "faq" )  - What questions are frequently asked?
    help( "ack" )  - Who has done work on Python lately?


Security Issues

    This module will attempt to import modules with the same names as
    requested topics.  Don't use the modules if you are not confident
    that everything in your PYTHONPATH is from a trusted source.



pep-0234 Iterators

PEP: 234
Title: Iterators
Version: $Revision$
Last-Modified: $Date$
Author: Ka-Ping Yee <ping at zesty.ca>, Guido van Rossum <guido at python.org>
Status: Final
Type: Standards Track
Created: 30-Jan-2001
Python-Version: 2.1
Post-History: 30-Apr-2001

Abstract

    This document proposes an iteration interface that objects can
    provide to control the behaviour of 'for' loops.  Looping is
    customized by providing a method that produces an iterator object.
    The iterator provides a 'get next value' operation that produces
    the next item in the sequence each time it is called, raising an
    exception when no more items are available.

    In addition, specific iterators over the keys of a dictionary and
    over the lines of a file are proposed, and a proposal is made to
    allow spelling dict.has_key(key) as "key in dict".

    Note: this is an almost complete rewrite of this PEP by the second
    author, describing the actual implementation checked into the
    trunk of the Python 2.2 CVS tree.  It is still open for
    discussion.  Some of the more esoteric proposals in the original
    version of this PEP have been withdrawn for now; these may be the
    subject of a separate PEP in the future.


C API Specification

    A new exception is defined, StopIteration, which can be used to
    signal the end of an iteration.

    A new slot named tp_iter for requesting an iterator is added to
    the type object structure.  This should be a function of one
    PyObject * argument returning a PyObject *, or NULL.  To use this
    slot, a new C API function PyObject_GetIter() is added, with the
    same signature as the tp_iter slot function.

    Another new slot, named tp_iternext, is added to the type
    structure, for obtaining the next value in the iteration.  To use
    this slot, a new C API function PyIter_Next() is added.  The
    signature for both the slot and the API function is as follows,
    although the NULL return conditions differ:  the argument is a
    PyObject * and so is the return value.  When the return value is
    non-NULL, it is the next value in the iteration.  When it is NULL,
    then for the tp_iternext slot there are three possibilities:

    - No exception is set; this implies the end of the iteration.

    - The StopIteration exception (or a derived exception class) is
      set; this implies the end of the iteration.

    - Some other exception is set; this means that an error occurred
      that should be propagated normally.

    The higher-level PyIter_Next() function clears the StopIteration
    exception (or derived exception) when it occurs, so its NULL return
    conditions are simpler:

    - No exception is set; this means iteration has ended.

    - Some exception is set; this means an error occurred, and should
      be propagated normally.

    Iterators implemented in C should *not* implement a next() method
    with similar semantics as the tp_iternext slot!  When the type's
    dictionary is initialized (by PyType_Ready()), the presence of a
    tp_iternext slot causes a method next() wrapping that slot to be
    added to the type's tp_dict.  (Exception: if the type doesn't use
    PyObject_GenericGetAttr() to access instance attributes, the
    next() method in the type's tp_dict may not be seen.)  (Due to a
    misunderstanding in the original text of this PEP, in Python 2.2,
    all iterator types implemented a next() method that was overridden
    by the wrapper; this has been fixed in Python 2.3.)

    To ensure binary backwards compatibility, a new flag
    Py_TPFLAGS_HAVE_ITER is added to the set of flags in the tp_flags
    field, and to the default flags macro.  This flag must be tested
    before accessing the tp_iter or tp_iternext slots.  The macro
    PyIter_Check() tests whether an object has the appropriate flag
    set and has a non-NULL tp_iternext slot.  There is no such macro
    for the tp_iter slot (since the only place where this slot is
    referenced should be PyObject_GetIter(), and this can check for
    the Py_TPFLAGS_HAVE_ITER flag directly).

    (Note: the tp_iter slot can be present on any object; the
    tp_iternext slot should only be present on objects that act as
    iterators.)

    For backwards compatibility, the PyObject_GetIter() function
    implements fallback semantics when its argument is a sequence that
    does not implement a tp_iter function: a lightweight sequence
    iterator object is constructed in that case which iterates over
    the items of the sequence in the natural order.
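
    This fallback can be modelled in Python itself (a sketch of what
    the C code does, not the implementation): index the sequence from
    zero upward until IndexError signals the end.

```python
class SeqIter:
    # minimal model of the fallback sequence iterator built by
    # PyObject_GetIter() for sequences without a tp_iter function
    def __init__(self, seq):
        self._seq = seq
        self._index = 0

    def __iter__(self):
        # iterators return themselves from tp_iter / __iter__
        return self

    def __next__(self):          # spelled next() in the Python 2 era
        try:
            item = self._seq[self._index]
        except IndexError:
            # end of the sequence protocol maps to StopIteration
            raise StopIteration
        self._index += 1
        return item
```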

    The Python bytecode generated for 'for' loops is changed to use
    new opcodes, GET_ITER and FOR_ITER, that use the iterator protocol
    rather than the sequence protocol to get the next value for the
    loop variable.  This makes it possible to use a 'for' loop to loop
    over non-sequence objects that support the tp_iter slot.  Other
    places where the interpreter loops over the values of a sequence
    should also be changed to use iterators.

    Iterators ought to implement the tp_iter slot as returning a
    reference to themselves; this is needed to make it possible to
    use an iterator (as opposed to a sequence) in a for loop.

    Iterator implementations (in C or in Python) should guarantee that
    once the iterator has signalled its exhaustion, subsequent calls
    to tp_iternext or to the next() method will continue to do so.  It
    is not specified whether an iterator should enter the exhausted
    state when an exception (other than StopIteration) is raised.
    Note that Python cannot guarantee that user-defined or 3rd party
    iterators implement this requirement correctly.


Python API Specification

    The StopIteration exception is made visible as one of the
    standard exceptions.  It is derived from Exception.

    A new built-in function is defined, iter(), which can be called in
    two ways:

    - iter(obj) calls PyObject_GetIter(obj).

    - iter(callable, sentinel) returns a special kind of iterator that
      calls the callable to produce a new value, and compares the
      return value to the sentinel value.  If the return value equals
      the sentinel, this signals the end of the iteration and
      StopIteration is raised rather than returning normally; if the
      return value does not equal the sentinel, it is returned as the
      next value from the iterator.  If the callable raises an
      exception, this is propagated normally; in particular, the
      function is allowed to raise StopIteration as an alternative way
      to end the iteration.  (This functionality is available from the
      C API as PyCallIter_New(callable, sentinel).)
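
    The two-argument form can be exercised directly; this sketch
    substitutes an io.StringIO object for a real file:

```python
import io

f = io.StringIO("alpha\nbeta\n")

# call f.readline repeatedly; the sentinel "" (what readline
# returns at end-of-file) ends the iteration
lines = list(iter(f.readline, ""))
assert lines == ["alpha\n", "beta\n"]
```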

    Iterator objects returned by either form of iter() have a next()
    method.  This method either returns the next value in the
    iteration, or raises StopIteration (or a derived exception class)
    to signal the end of the iteration.  Any other exception should be
    considered to signify an error and should be propagated normally,
    not taken to mean the end of the iteration.

    Classes can define how they are iterated over by defining an
    __iter__() method; this should take no additional arguments and
    return a valid iterator object.  A class that wants to be an
    iterator should implement two methods: a next() method that behaves
    as described above, and an __iter__() method that returns self.

    The two methods correspond to two distinct protocols:

    1. An object can be iterated over with "for" if it implements
       __iter__() or __getitem__().

    2. An object can function as an iterator if it implements next().

    Container-like objects usually support protocol 1.  Iterators are
    currently required to support both protocols.  The semantics of
    iteration come only from protocol 2; protocol 1 is present to make
    iterators behave like sequences; in particular so that code
    receiving an iterator can use a for-loop over the iterator.
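
    A class satisfying both protocols looks like this (written with
    the modern __next__ spelling; the PEP-era spelling was next()):

```python
class CountDown:
    # protocol 1 via __iter__; protocol 2 via __next__
    def __init__(self, n):
        self.n = n

    def __iter__(self):
        return self          # an iterator returns itself

    def __next__(self):
        if self.n <= 0:
            raise StopIteration
        self.n -= 1
        return self.n + 1
```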


Dictionary Iterators

    - Dictionaries implement a sq_contains slot that implements the
      same test as the has_key() method.  This means that we can write

          if k in dict: ...

      which is equivalent to

          if dict.has_key(k): ...

    - Dictionaries implement a tp_iter slot that returns an efficient
      iterator that iterates over the keys of the dictionary.  During
      such an iteration, the dictionary should not be modified, except
      that setting the value for an existing key is allowed (deletions
      or additions are not, nor is the update() method).  This means
      that we can write

          for k in dict: ...

      which is equivalent to, but much faster than

          for k in dict.keys(): ...

      as long as the restriction on modifications to the dictionary
      (either by the loop or by another thread) is not violated.

    - Add methods to dictionaries that return different kinds of
      iterators explicitly:

          for key in dict.iterkeys(): ...

          for value in dict.itervalues(): ...

          for key, value in dict.iteritems(): ...

      This means that "for x in dict" is shorthand for "for x in
      dict.iterkeys()".

    Other mappings, if they support iterators at all, should also
    iterate over the keys.  However, this should not be taken as an
    absolute rule; specific applications may have different
    requirements.
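
    In modern Python the iterkeys()/itervalues()/iteritems() methods
    have since become the keys()/values()/items() view objects, but
    the equivalences above read the same:

```python
d = {'a': 1, 'b': 2}

# "for x in dict" iterates over the keys...
assert sorted(d) == ['a', 'b']

# ...which is also what "x in dict" tests
assert ('a' in d) and ('z' not in d)

# key, value and item iteration (iterkeys()/itervalues()/iteritems()
# in the PEP; plain keys()/values()/items() views today)
assert sorted(d.keys()) == ['a', 'b']
assert sorted(d.values()) == [1, 2]
assert sorted(d.items()) == [('a', 1), ('b', 2)]
```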


File Iterators

    The following proposal is useful because it provides us with a
    good answer to the complaint that the common idiom to iterate over
    the lines of a file is ugly and slow.

    - Files implement a tp_iter slot that is equivalent to
      iter(f.readline, "").  This means that we can write

          for line in file:
              ...

      as a shorthand for

          for line in iter(file.readline, ""):
              ...

      which is equivalent to, but faster than

          while 1:
              line = file.readline()
              if not line:
                  break
              ...

    This also shows that some iterators are destructive: they consume
    all the values and a second iterator cannot easily be created that
    iterates independently over the same values.  You could open the
    file for a second time, or seek() to the beginning, but these
    solutions don't work for all file types, e.g. they don't work when
    the open file object really represents a pipe or a stream socket.

    Because the file iterator uses an internal buffer, mixing this
    with other file operations (e.g. file.readline()) doesn't work
    right.  Also, the following code:

      for line in file:
          if line == "\n":
              break
      for line in file:
          print line,

    doesn't work as you might expect, because the iterator created by
    the second for-loop doesn't take the buffer read-ahead by the
    first for-loop into account.  A correct way to write this is:

      it = iter(file)
      for line in it:
          if line == "\n":
              break
      for line in it:
          print line,

    (The rationale for these restrictions is that "for line in file"
    ought to become the recommended, standard way to iterate over the
    lines of a file, and this should be as fast as can be.  The
    iterator version is considerably faster than calling readline(),
    due to the internal buffer in the iterator.)
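
    Modern Python eventually adopted the later proposal (see `Resolved
    Issues' below) that a file object is its own iterator, so iter(f)
    simply returns f and the two-loop version above works as expected.
    A sketch with io.StringIO standing in for a file:

```python
import io

f = io.StringIO("header\n\nbody line\n")
it = iter(f)
# modern file-like objects are their own iterators, so the
# read-ahead problem described above does not arise here
assert it is f

for line in it:
    if line == "\n":
        break              # stop at the blank separator line

rest = list(it)            # the same iterator resumes where we stopped
assert rest == ["body line\n"]
```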


Rationale

    If all the parts of the proposal are included, this addresses many
    concerns in a consistent and flexible fashion.  Among its chief
    virtues are the following four -- no, five -- no, six -- points:

    1. It provides an extensible iterator interface.

    2. It allows performance enhancements to list iteration.

    3. It allows big performance enhancements to dictionary iteration.

    4. It allows one to provide an interface for just iteration
       without pretending to provide random access to elements.

    5. It is backward-compatible with all existing user-defined
       classes and extension objects that emulate sequences and
       mappings, even mappings that only implement a subset of
       {__getitem__, keys, values, items}.

    6. It makes code iterating over non-sequence collections more
       concise and readable.


Resolved Issues

    The following topics have been decided by consensus or BDFL
    pronouncement.

    - Two alternative spellings for next() have been proposed but
      rejected: __next__(), because it corresponds to a type object
      slot (tp_iternext); and __call__(), because this is the only
      operation.

      Arguments against __next__(): while many iterators are used in
      for loops, it is expected that user code will also call next()
      directly, so having to write __next__() is ugly; also, a
      possible extension of the protocol would be to allow for prev(),
      current() and reset() operations; surely we don't want to use
      __prev__(), __current__(), __reset__().

      Arguments against __call__() (the original proposal): taken out
      of context, x() is not very readable, while x.next() is clear;
      there's a danger that every special-purpose object wants to use
      __call__() for its most common operation, causing more confusion
      than clarity.

      (In retrospect, it might have been better to go for __next__()
      and have a new built-in, next(it), which calls it.__next__().
      But alas, it's too late; this has been deployed in Python 2.2
      since December 2001.)

    - Some folks have requested the ability to restart an iterator.
      This should be dealt with by calling iter() on a sequence
      repeatedly, not by the iterator protocol itself.  (See also
      requested extensions below.)

    - It has been questioned whether an exception to signal the end of
      the iteration isn't too expensive.  Several alternatives for the
      StopIteration exception have been proposed: a special value End
      to signal the end, a function end() to test whether the iterator
      is finished, even reusing the IndexError exception.

      - A special value has the problem that if a sequence ever
        contains that special value, a loop over that sequence will
        end prematurely without any warning.  If the experience with
        null-terminated C strings hasn't taught us the problems this
        can cause, imagine the trouble a Python introspection tool
        would have iterating over a list of all built-in names,
        assuming that the special End value was a built-in name!

      - Calling an end() function would require two calls per
        iteration.  Two calls is much more expensive than one call
        plus a test for an exception.  Especially the time-critical
        for loop can test very cheaply for an exception.

      - Reusing IndexError can cause confusion because it can be a
        genuine error, which would be masked by ending the loop
        prematurely.

    - Some have asked for a standard iterator type.  Presumably all
      iterators would have to be derived from this type.  But this is
      not the Python way: dictionaries are mappings because they
      support __getitem__() and a handful other operations, not
      because they are derived from an abstract mapping type.

    - Regarding "if key in dict": there is no doubt that the
      dict.has_key(x) interpretation of "x in dict" is by far the
      most useful interpretation, probably the only useful one.  There
      has been resistance against this because "x in list" checks
      whether x is present among the values, while the proposal makes
      "x in dict" check whether x is present among the keys.  Given
      that the symmetry between lists and dictionaries is very weak,
      this argument does not have much weight.

    - The name iter() is an abbreviation.  Alternatives proposed
      include iterate(), traverse(), but these appear too long.
      Python has a history of using abbreviations for common builtins,
      e.g. repr(), str(), len().

      Resolution: iter() it is.

    - Using the same name for two different operations (getting an
      iterator from an object and making an iterator for a function
      with a sentinel value) is somewhat ugly.  I haven't seen a
      better name for the second operation though, and since they both
      return an iterator, it's easy to remember.

      Resolution: the builtin iter() takes an optional argument, which
      is the sentinel to look for.

    - Once a particular iterator object has raised StopIteration, will
      it also raise StopIteration on all subsequent next() calls?
      Some say that it would be useful to require this, others say
      that it is useful to leave this open to individual iterators.
      Note that this may require an additional state bit for some
      iterator implementations (e.g. function-wrapping iterators).

      Resolution: once StopIteration is raised, calling it.next()
      continues to raise StopIteration.

      Note: this was in fact not implemented in Python 2.2; there are
      many cases where an iterator's next() method can raise
      StopIteration on one call but not on the next.  This has been
      remedied in Python 2.3.
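
      Generators (and, since Python 2.3, the built-in iterators)
      implement the "sticky" exhaustion required by this resolution:

```python
def gen():
    yield 1

it = gen()
assert next(it) == 1

# first exhaustion raises StopIteration
try:
    next(it)
except StopIteration:
    pass

# subsequent calls keep raising StopIteration -- exhaustion sticks
try:
    next(it)
except StopIteration:
    pass
else:
    raise AssertionError("iterator resurrected")
```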

    - It has been proposed that a file object should be its own
      iterator, with a next() method returning the next line.  This
      has certain advantages, and makes it even clearer that this
      iterator is destructive.  The disadvantage is that this would
      make it even more painful to implement the "sticky
      StopIteration" feature proposed in the previous bullet.

      Resolution: tentatively rejected (though there are still people
      arguing for this).

    - Some folks have requested extensions of the iterator protocol,
      e.g. prev() to get the previous item, current() to get the
      current item again, finished() to test whether the iterator is
      finished, and maybe even others, like rewind(), __len__(),
      position().

      While some of these are useful, many of these cannot easily be
      implemented for all iterator types without adding arbitrary
      buffering, and sometimes they can't be implemented at all (or
      not reasonably).  E.g. anything to do with reversing directions
      can't be done when iterating over a file or function.  Maybe a
      separate PEP can be drafted to standardize the names for such
      operations when they are implementable.

      Resolution: rejected.

    - There has been a long discussion about whether

          for x in dict: ...

      should assign x the successive keys, values, or items of the
      dictionary.  The symmetry between "if x in y" and "for x in y"
      suggests that it should iterate over keys.  This symmetry has been
      observed by many independently and has even been used to "explain"
      one using the other.  This is because for sequences, "if x in y"
      iterates over y comparing the iterated values to x.  If we adopt
      both of the above proposals, this will also hold for
      dictionaries.

      The argument against making "for x in dict" iterate over the keys
      comes mostly from a practicality point of view: scans of the
      standard library show that there are about as many uses of "for x
      in dict.items()" as there are of "for x in dict.keys()", with the
      items() version having a small majority.  Presumably many of the
      loops using keys() use the corresponding value anyway, by writing
      dict[x], so (the argument goes) by making both the key and value
      available, we could support the largest number of cases.  While
      this is true, I (Guido) find the correspondence between "for x in
      dict" and "if x in dict" too compelling to break, and there's not
      much overhead in having to write dict[x] to explicitly get the
      value.

      For fast iteration over items, use "for key, value in
      dict.iteritems()".  I've timed the difference between

          for key in dict: dict[key]

      and

          for key, value in dict.iteritems(): pass

      and found that the latter is only about 7% faster.

      Resolution: By BDFL pronouncement, "for x in dict" iterates over
      the keys, and dictionaries have iteritems(), iterkeys(), and
      itervalues() to return the different flavors of dictionary
      iterators.


Mailing Lists

    The iterator protocol has been discussed extensively in a mailing
    list on SourceForge:

        http://lists.sourceforge.net/lists/listinfo/python-iterators

    Initially, some of the discussion was carried out at Yahoo;
    archives are still accessible:

        http://groups.yahoo.com/group/python-iter


Copyright

    This document is in the public domain.



pep-0235 Import on Case-Insensitive Platforms

PEP: 235
Title: Import on Case-Insensitive Platforms
Version: $Revision$
Last-Modified: $Date$
Author: Tim Peters <tim at zope.com>
Status: Final
Type: Standards Track
Created: 
Python-Version: 2.1
Post-History: 16 February 2001

Note

    This is essentially a retroactive PEP: the issue came up too late
    in the 2.1 release process to solicit wide opinion before deciding
    what to do, and can't be put off until 2.2 without also delaying
    the Cygwin and MacOS X ports.


Motivation

    File systems vary across platforms in whether or not they preserve
    the case of filenames, and in whether or not the platform C
    library file-opening functions do or don't insist on
    case-sensitive matches:

                      case-preserving     case-destroying
                     +-------------------+------------------+
    case-sensitive   | most Unix flavors | brrrrrrrrrr      |
                     +-------------------+------------------+
    case-insensitive | Windows           | some unfortunate |
                     | MacOSX HFS+       | network schemes  |
                     | Cygwin            |                  |
                     |                   | OpenVMS          |
                     +-------------------+------------------+

    In the upper left box, if you create "fiLe" it's stored as "fiLe",
    and only open("fiLe") will open it (open("file") will not, nor
    will the 14 other variations on that theme).

    In the lower right box, if you create "fiLe", there's no telling
    what it's stored as -- but most likely as "FILE" -- and any of the
    16 obvious variations on open("FilE") will open it.

    The lower left box is a mix: creating "fiLe" stores "fiLe" in the
    platform directory, but you don't have to match case when opening
    it; any of the 16 obvious variations on open("FILe") work.

    NONE OF THAT IS CHANGING!  Python will continue to follow platform
    conventions w.r.t. whether case is preserved when creating a file,
    and w.r.t. whether open() requires a case-sensitive match.  In
    practice, you should always code as if matches were
    case-sensitive, else your program won't be portable.

    What's proposed is to change the semantics of Python "import"
    statements, and there *only* in the lower left box.


Current Lower-Left Semantics

    Support for MacOSX HFS+, and for Cygwin, is new in 2.1, so nothing
    is changing there.  What's changing is Windows behavior.  Here are
    the current rules for import on Windows:

    1. Although the filesystem is case-insensitive, Python insists
       on a case-sensitive match.  But not in the way the upper left
       box works: if you have two files, FiLe.py and file.py on
       sys.path, and do

           import file

       then if Python finds FiLe.py first, it raises a NameError.  It
       does *not* go on to find file.py; indeed, it's impossible to
       import any but the first case-insensitive match on sys.path,
       and then only if case matches exactly in the first
       case-insensitive match.

    2. An ugly exception: if the first case-insensitive match on
       sys.path is for a file whose name is entirely in upper case
       (FILE.PY or FILE.PYC or FILE.PYO), then the import silently
       grabs that, no matter what mixture of case was used in the
       import statement.  This is apparently to cater to miserable old
       filesystems that really fit in the lower right box.  But this
       exception is unique to Windows, for reasons that may or may not
       exist.

    3. And another exception: if the environment variable PYTHONCASEOK
       exists, Python silently grabs the first case-insensitive match
       of any kind.

    So these Windows rules are pretty complicated, and neither match
    the Unix rules nor provide semantics natural for the native
    filesystem.  That makes them hard to explain to Unix *or* Windows
    users.  Nevertheless, they've worked fine for years, and in
    isolation there's no compelling reason to change them.

    However, that was before the MacOSX HFS+ and Cygwin ports arrived.
    They also have case-preserving case-insensitive filesystems, but
    the people doing the ports despised the Windows rules.  Indeed, a
    patch to make HFS+ act like Unix for imports got past a reviewer
    and into the code base, which incidentally made Cygwin also act
    like Unix (but this met the unbounded approval of the Cygwin
    folks, so they sure didn't complain -- they had patches of their
    own pending to do this, but the reviewer for those balked).

    At a higher level, we want to keep Python consistent, by following
    the same rules on *all* platforms with case-preserving
    case-insensitive filesystems.


Proposed Semantics

    The proposed new semantics for the lower left box:

    A. If the PYTHONCASEOK environment variable exists, same as
       before: silently accept the first case-insensitive match of any
       kind; raise ImportError if none found.

    B. Else search sys.path for the first case-sensitive match; raise
       ImportError if none found.

    #B is the same rule as is used on Unix, so this will improve cross-
    platform portability.  That's good.  #B is also the rule the Mac
    and Cygwin folks want (and wanted enough to implement themselves,
    multiple times, which is a powerful argument in PythonLand).  It
    can't cause any existing non-exceptional Windows import to fail,
    because any existing non-exceptional Windows import finds a
    case-sensitive match first in the path -- and it still will.  An
    exceptional Windows import currently blows up with a NameError or
    ImportError, in which latter case it still will, or in which
    former case will continue searching, and either succeed or blow up
    with an ImportError.

    #A is needed to cater to case-destroying filesystems mounted on Windows,
    and *may* also be used by people so enamored of "natural" Windows
    behavior that they're willing to set an environment variable to
    get it.  I don't intend to implement #A for Unix too, but that's
    just because I'm not clear on how I *could* do so efficiently (I'm
    not going to slow imports under Unix just for theoretical purity).

    The potential damage is here: #2 (matching on ALLCAPS.PY) is
    proposed to be dropped.  Case-destroying filesystems are a
    vanishing breed, and support for them is ugly.  We're already
    supporting (and will continue to support) PYTHONCASEOK for their
    benefit, but they don't deserve multiple hacks in 2001.
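
    The two rules can be sketched as a small lookup function (an
    illustration only, not the real importer: packages, .pyc/.pyo files,
    and sys.path ordering are ignored, and find_module_file is an
    invented name):

```python
import os

def find_module_file(name, entries, env=None):
    """Pick the file to import for `name` from a directory listing.

    Rule A: if PYTHONCASEOK is set, accept the first case-insensitive
    match.  Rule B: otherwise require an exact case-sensitive match.
    """
    if env is None:
        env = os.environ
    target = name + ".py"
    if "PYTHONCASEOK" in env:
        for entry in entries:                       # rule A
            if entry.lower() == target.lower():
                return entry
    else:
        for entry in entries:                       # rule B
            if entry == target:
                return entry
    return None  # caller raises ImportError

assert find_module_file("file", ["FiLe.py", "file.py"], {}) == "file.py"
assert find_module_file("file", ["FiLe.py"], {}) is None
assert find_module_file("file", ["FiLe.py"], {"PYTHONCASEOK": "1"}) == "FiLe.py"
```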



pep-0236 Back to the __future__

PEP: 236
Title: Back to the __future__
Version: $Revision$
Last-Modified: $Date$
Author: Tim Peters <tim at zope.com>
Status: Final
Type: Standards Track
Created: 26-Feb-2001
Python-Version: 2.1
Post-History: 26-Feb-2001

Motivation

    From time to time, Python makes an incompatible change to the
    advertised semantics of core language constructs, or changes their
    accidental (implementation-dependent) behavior in some way.  While this
    is never done capriciously, and is always done with the aim of
    improving the language over the long term, over the short term it's
    contentious and disrupting.

    PEP 5, Guidelines for Language Evolution[1] suggests ways to ease
    the pain, and this PEP introduces some machinery in support of that.

    PEP 227, Statically Nested Scopes[2] is the first application, and
    will be used as an example here.


Intent

    [Note:  This is policy, and so should eventually move into PEP 5 [1]]

    When an incompatible change to core language syntax or semantics is
    being made:

    1. The release C that introduces the change does not change the
       syntax or semantics by default.

    2. A future release R is identified in which the new syntax or semantics
       will be enforced.

    3. The mechanisms described in PEP 230, Warning Framework[3] are
       used to generate warnings, whenever possible, about constructs
       or operations whose meaning may[4] change in release R.

    4. The new future_statement (see below) can be explicitly included in a
       module M to request that the code in module M use the new syntax or
       semantics in the current release C.

    So old code continues to work by default, for at least one release,
    although it may start to generate new warning messages.  Migration to
    the new syntax or semantics can proceed during that time, using the
    future_statement to make modules containing it act as if the new syntax
    or semantics were already being enforced.

    Note that there is no need to involve the future_statement machinery
    in new features unless they can break existing code; fully backward-
    compatible additions can-- and should --be introduced without a
    corresponding future_statement.


Syntax

    A future_statement is simply a from/import statement using the reserved
    module name __future__:

        future_statement: "from" "__future__" "import" feature ["as" name]
                          ("," feature ["as" name])*

        feature: identifier
        name: identifier

    In addition, all future_statements must appear near the top of the
    module.  The only lines that can appear before a future_statement are:

    + The module docstring (if any).
    + Comments.
    + Blank lines.
    + Other future_statements.

    Example:
        """This is a module docstring."""

        # This is a comment, preceded by a blank line and followed by
        # a future_statement.
        from __future__ import nested_scopes

        from math import sin
        from __future__ import alabaster_weenoblobs  # compile-time error!
        # That was an error because preceded by a non-future_statement.


Semantics

    A future_statement is recognized and treated specially at compile time:
    changes to the semantics of core constructs are often implemented by
    generating different code.  It may even be the case that a new feature
    introduces new incompatible syntax (such as a new reserved word), in
    which case the compiler may need to parse the module differently.  Such
    decisions cannot be pushed off until runtime.

    For any given release, the compiler knows which feature names have been
    defined, and raises a compile-time error if a future_statement contains
    a feature not known to it[5].

    The direct runtime semantics are the same as for any import statement:
    there is a standard module __future__.py, described later, and it will
    be imported in the usual way at the time the future_statement is
    executed.

    The *interesting* runtime semantics depend on the specific feature(s)
    "imported" by the future_statement(s) appearing in the module.

    Note that there is nothing special about the statement:

        import __future__ [as name]

    That is not a future_statement; it's an ordinary import statement, with
    no special semantics or syntax restrictions.


Example

    Consider this code, in file scope.py:

        x = 42
        def f():
            x = 666
            def g():
                print "x is", x
            g()
        f()

    Under 2.0, it prints:

        x is 42

    Nested scopes[2] are being introduced in 2.1.  But under 2.1, it still
    prints

        x is 42

    and also generates a warning.

    In 2.2, and also in 2.1 *if* "from __future__ import nested_scopes" is
    included at the top of scope.py, it prints

        x is 666


Standard Module __future__.py

    Lib/__future__.py is a real module, and serves three purposes:

    1. To avoid confusing existing tools that analyze import statements and
       expect to find the modules they're importing.

    2. To ensure that future_statements run under releases prior to 2.1
       at least yield runtime exceptions (the import of __future__ will
       fail, because there was no module of that name prior to 2.1).

    3. To document when incompatible changes were introduced, and when they
       will be-- or were --made mandatory.  This is a form of executable
       documentation, and can be inspected programmatically by importing
       __future__ and examining its contents.

    Each statement in __future__.py is of the form:

        FeatureName = "_Feature(" OptionalRelease "," MandatoryRelease ")"

    where, normally, OptionalRelease <  MandatoryRelease, and both are
    5-tuples of the same form as sys.version_info:

    (PY_MAJOR_VERSION, # the 2 in 2.1.0a3; an int
     PY_MINOR_VERSION, # the 1; an int
     PY_MICRO_VERSION, # the 0; an int
     PY_RELEASE_LEVEL, # "alpha", "beta", "candidate" or "final"; string
     PY_RELEASE_SERIAL # the 3; an int
    )

    OptionalRelease records the first release in which

        from __future__ import FeatureName

    was accepted.

    In the case of MandatoryReleases that have not yet occurred,
    MandatoryRelease predicts the release in which the feature will become
    part of the language.

    Else MandatoryRelease records when the feature became part of the
    language; in releases at or after that, modules no longer need

        from __future__ import FeatureName

    to use the feature in question, but may continue to use such imports.

    MandatoryRelease may also be None, meaning that a planned feature got
    dropped.

    Instances of class _Feature have two corresponding methods,
    .getOptionalRelease() and .getMandatoryRelease().

    No feature line will ever be deleted from __future__.py.

    Example line:

      nested_scopes = _Feature((2, 1, 0, "beta", 1), (2, 2, 0, "final", 0))

    This means that

        from __future__ import nested_scopes

    will work in all releases at or after 2.1b1, and that nested_scopes are
    intended to be enforced starting in release 2.2.
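
    The __future__ module still records features this way in current
    CPython, so the entry can be inspected directly (the
    mandatory-release tuple actually shipped may differ in detail from
    the example line above, so only the optional release is checked
    here):

```python
import __future__

feature = __future__.nested_scopes
# OptionalRelease: first release that accepted the import -- 2.1b1.
assert feature.getOptionalRelease() == (2, 1, 0, "beta", 1)
# Both releases are 5-tuples shaped like sys.version_info.
assert len(feature.getMandatoryRelease()) == 5
```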


Resolved Problem: Runtime Compilation

    Several Python features can compile code during a module's runtime:

    1. The exec statement.
    2. The execfile() function.
    3. The compile() function.
    4. The eval() function.
    5. The input() function.

    Since a module M containing a future_statement naming feature F
    explicitly requests that the current release act like a future release
    with respect to F, any code compiled dynamically from text passed to
    one of these from within M should probably also use the new syntax or
    semantics associated with F.  The 2.1 release does behave this way.

    This isn't always desired, though.  For example, doctest.testmod(M)
    compiles examples taken from strings in M, and those examples should
    use M's choices, not necessarily the doctest module's choices.  In the
    2.1 release, this isn't possible, and no scheme has yet been suggested
    for working around this.  NOTE:  PEP 264 later addressed this in a
    flexible way, by adding optional arguments to compile().

    In any case, a future_statement appearing "near the top" (see Syntax
    above) of text compiled dynamically by an exec, execfile() or compile()
    applies to the code block generated, but has no further effect on the
    module that executes such an exec, execfile() or compile().  This
    can't be used to affect eval() or input(), however, because they only
    allow expression input, and a future_statement is not an expression.
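
    The optional compile() arguments that PEP 264 later added are
    visible in any current CPython: each __future__ feature carries a
    compiler_flag attribute that can be passed to compile() so that
    dynamically compiled text behaves as if it contained the
    corresponding future_statement.  A sketch (in Python 3 true
    division is already the default, so the flag is a no-op here; the
    mechanism is the point):

```python
import __future__

# Compile a code block as if it contained "from __future__ import division".
flags = __future__.division.compiler_flag
code = compile("result = 7 / 2", "<dynamic>", "exec", flags)
ns = {}
exec(code, ns)
assert ns["result"] == 3.5
```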


Resolved Problem: Native Interactive Shells

    There are two ways to get an interactive shell:

    1. By invoking Python from a command line without a script argument.

    2. By invoking Python from a command line with the -i switch and with a
       script argument.

    An interactive shell can be seen as an extreme case of runtime
    compilation (see above):  in effect, each statement typed at an
    interactive shell prompt runs a new instance of exec, compile() or
    execfile().  A future_statement typed at an interactive shell applies to
    the rest of the shell session's life, as if the future_statement had
    appeared at the top of a module.


Resolved Problem: Simulated Interactive Shells

    Interactive shells "built by hand" (by tools such as IDLE and the Emacs
    Python-mode) should behave like native interactive shells (see above).
    However, the machinery used internally by native interactive shells has
    not been exposed, and there isn't a clear way for tools building their
    own interactive shells to achieve the desired behavior.

    NOTE:  PEP 264 later addressed this, by adding intelligence to the
    standard codeop.py.  Simulated shells that don't use the standard
    library shell helpers can get a similar effect by exploiting the
    new optional arguments to compile() added by PEP 264.


Questions and Answers

    Q:  What about a "from __past__" version, to get back *old* behavior?

    A:  Outside the scope of this PEP.  Seems unlikely to the author,
        though.  Write a PEP if you want to pursue it.

    Q:  What about incompatibilities due to changes in the Python virtual
        machine?

    A:  Outside the scope of this PEP, although PEP 5 [1] suggests a grace
        period there too, and the future_statement may also have a role to
        play there.

    Q:  What about incompatibilities due to changes in Python's C API?

    A:  Outside the scope of this PEP.

    Q:  I want to wrap future_statements in try/except blocks, so I can
        use different code depending on which version of Python I'm running.
        Why can't I?

    A:  Sorry!  try/except is a runtime feature; future_statements are
        primarily compile-time gimmicks, and your try/except happens long
        after the compiler is done.  That is, by the time you do
        try/except, the semantics in effect for the module are already a
        done deal.  Since the try/except wouldn't accomplish what it
        *looks* like it should accomplish, it's simply not allowed.  We
        also want to keep these special statements very easy to find and to
        recognize.

        Note that you *can* import __future__ directly, and use the
        information in it, along with sys.version_info, to figure out where
        the release you're running under stands in relation to a given
        feature's status.
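
        That probing pattern looks like this (a sketch, using
        nested_scopes only as a stand-in for whatever feature is being
        checked):

```python
import sys
import __future__

feature = __future__.nested_scopes
# A feature is "always on" once the running release reaches its
# MandatoryRelease (None would mean the planned feature was dropped).
mandatory = feature.getMandatoryRelease()
always_on = mandatory is not None and sys.version_info >= mandatory
assert always_on  # nested scopes have been mandatory since 2.2
```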

    Q:  Going back to the nested_scopes example, what if release 2.2
        comes along and I still haven't changed my code?  How can I
        keep the 2.1 behavior then?

    A:  By continuing to use 2.1, and not moving to 2.2 until you do
        change your code.  The purpose of future_statement is to make
        life easier for people who keep current with the latest release
        in a timely fashion.  We don't hate you if you don't, but your
        problems are much harder to solve, and somebody with those
        problems will need to write a PEP addressing them.
        future_statement is aimed at a different audience.

    Q:  Overloading "import" sucks.  Why not introduce a new statement
        for this?

    A:  Like maybe "lambda lambda nested_scopes"?  That is, unless we
        introduce a new keyword, we can't introduce an entirely new
        statement.  But if we introduce a new keyword, that in itself
        would break old code.  That would be too ironic to bear.  Yes,
        overloading "import" does suck, but not as energetically as the
        alternatives -- as is, future_statements are 100% backward
        compatible.


Copyright

    This document has been placed in the public domain.


References and Footnotes

    [1] PEP 5, Guidelines for Language Evolution, Prescod
        http://www.python.org/dev/peps/pep-0005/

    [2] PEP 227, Statically Nested Scopes, Hylton
        http://www.python.org/dev/peps/pep-0227/

    [3] PEP 230, Warning Framework, Van Rossum
        http://www.python.org/dev/peps/pep-0230/

    [4] Note that this is "may" and not "will":  better safe than sorry.  Of
        course spurious warnings won't be generated when avoidable with
        reasonable cost.

    [5] This ensures that a future_statement run under a release prior to
        the first one in which a given feature is known (but >= 2.1) will
        raise a compile-time error rather than silently do a wrong thing.
        If transported to a release prior to 2.1, a runtime error will be
        raised because of the failure to import __future__ (no such module
        existed in the standard distribution before the 2.1 release, and
        the double underscores make it a reserved name).



pep-0237 Unifying Long Integers and Integers

PEP: 237
Title: Unifying Long Integers and Integers
Version: $Revision$
Last-Modified: $Date$
Author: Moshe Zadka, Guido van Rossum
Status: Final
Type: Standards Track
Created: 11-Mar-2001
Python-Version: 2.2
Post-History: 16-Mar-2001, 14-Aug-2001, 23-Aug-2001

Abstract

    Python currently distinguishes between two kinds of integers
    (ints): regular or short ints, limited by the size of a C long
    (typically 32 or 64 bits), and long ints, which are limited only
    by available memory.  When operations on short ints yield results
    that don't fit in a C long, they raise an error.  There are some
    other distinctions too.  This PEP proposes to do away with most of
    the differences in semantics, unifying the two types from the
    perspective of the Python user.


Rationale

    Many programs find a need to deal with larger numbers after the
    fact, and changing the algorithms later is bothersome.  Switching
    to long ints up front avoids that, but hinders performance in the
    normal case, since all arithmetic is then performed using long
    ints whether or not the extra range is needed.

    Having the machine word size exposed to the language hinders
    portability.  For example, Python source files and .pyc's are not
    portable between 32-bit and 64-bit machines because of this.

    There is also the general desire to hide unnecessary details from
    the Python user when they are irrelevant for most applications.
    An example is memory allocation, which is explicit in C but
    automatic in Python, giving us the convenience of unlimited sizes
    on strings, lists, etc.  It makes sense to extend this convenience
    to numbers.

    It will give new Python programmers (whether they are new to
    programming in general or not) one less thing to learn before they
    can start using the language.


Implementation

    Initially, two alternative implementations were proposed (one by
    each author):

    1. The PyInt type's slot for a C long will be turned into a 

        union {
            long i;
            struct {
                unsigned long length;
                digit digits[1];
            } bignum;
        };

       Only the n-1 lower bits of the long have any meaning; the top
       bit is always set.  This distinguishes the union.  All PyInt
       functions will check this bit before deciding which types of
       operations to use.

    2. The existing short and long int types remain, but operations
       return a long int instead of raising OverflowError when a
       result cannot be represented as a short int.  A new type,
       integer, may be introduced that is an abstract base type of
       which both the int and long implementation types are
       subclassed.  This is useful so that programs can check
       integer-ness with a single test:

           if isinstance(i, integer): ...

    After some consideration, the second implementation plan was
    selected, since it is far easier to implement, is backwards
    compatible at the C API level, and in addition can be implemented
    partially as a transitional measure.
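
    In Python 3 the unification is complete: there is a single
    arbitrary-precision int type, so the "single test" envisioned in
    alternative 2 is just a plain isinstance check:

```python
import sys

# One int type, no overflow: results simply grow past the machine word.
big = sys.maxsize + 1   # would have overflowed the old short int
assert isinstance(big, int)
assert big * big > sys.maxsize

# The old "check integer-ness with a single test" is now simply:
assert all(isinstance(i, int) for i in (0, sys.maxsize, big))
```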


Incompatibilities

    The following operations have (usually subtly) different semantics
    for short and for long integers, and one or the other will have to
    be changed somehow.  This is intended to be an exhaustive list.
    If you know of any other operations that differ in outcome
    depending on whether a short or a long int with the same value is
    passed, please write the second author.

    - Currently, all arithmetic operators on short ints except <<
      raise OverflowError if the result cannot be represented as a
      short int.  This will be changed to return a long int instead.
      The following operators can currently raise OverflowError: x+y,
      x-y, x*y, x**y, divmod(x, y), x/y, x%y, and -x.  (The last four
      can only overflow when the value -sys.maxint-1 is involved.)

    - Currently, x<<n can lose bits for short ints.  This will be
      changed to return a long int containing all the shifted-out
      bits, if returning a short int would lose bits (where changing
      sign is considered a special case of losing bits).

    - Currently, hex and oct literals for short ints may specify
      negative values; for example 0xffffffff == -1 on a 32-bit
      machine.  This will be changed to equal 0xffffffffL (2**32-1).

    - Currently, the '%u', '%x', '%X' and '%o' string formatting
      operators and the hex() and oct() built-in functions behave
      differently for negative numbers: negative short ints are
      formatted as unsigned C long, while negative long ints are
      formatted with a minus sign.  This will be changed to use the
      long int semantics in all cases (but without the trailing 'L'
      that currently distinguishes the output of hex() and oct() for
      long ints).  Note that this means that '%u' becomes an alias for
      '%d'.  It will eventually be removed.

    - Currently, repr() of a long int returns a string ending in 'L'
      while repr() of a short int doesn't.  The 'L' will be dropped;
      but not before Python 3.0.

    - Currently, an operation with long operands will never return a
      short int.  This *may* change, since it allows some
      optimization.  (No changes have been made in this area yet, and
      none are planned.)

    - The expression type(x).__name__ depends on whether x is a short
      or a long int.  Since implementation alternative 2 is chosen,
      this difference will remain.  (In Python 3.0, we *may* be able
      to deploy a trick to hide the difference, because it *is*
      annoying to reveal the difference to user code, and more so as
      the difference between the two types is less visible.)

    - Long and short ints are handled differently by the marshal module,
      and by the pickle and cPickle modules.  This difference will
      remain (at least until Python 3.0).

    - Short ints with small values (typically between -1 and 99
      inclusive) are "interned" -- whenever a result has such a value,
      an existing short int with the same value is returned.  This is
      not done for long ints with the same values.  This difference
      will remain.  (Since there is no guarantee of this interning, it
      is debatable whether this is a semantic difference -- but code
      may exist that uses 'is' for comparisons of short ints and
      happens to work because of this interning.  Such code may fail
      if used with long ints.)
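
    Several of the items above can be verified directly in a
    post-unification Python, where the long int semantics won (note
    the modern 0o prefix in oct() output, which postdates this PEP):

```python
# Hex literals are unsigned: 0xffffffff is 2**32 - 1, not -1.
assert 0xffffffff == 2**32 - 1

# Shifts never lose bits; the result just grows.
assert (1 << 100) == 2**100

# hex()/oct() use a minus sign for negatives, with no trailing 'L'.
assert hex(-1) == "-0x1"
assert oct(-8) == "-0o10"
```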


Literals

    A trailing 'L' at the end of an integer literal will stop having
    any meaning, and will eventually become illegal.  The compiler
    will choose the appropriate type solely based on the value.
    (Until Python 3.0, it will force the literal to be a long; but
    literals without a trailing 'L' may also be long, if they are not
    representable as short ints.)


Built-in Functions

    The function int() will return a short or a long int depending on
    the argument value.  In Python 3.0, the function long() will call
    the function int(); before then, it will continue to force the
    result to be a long int, but otherwise work the same way as int().
    The built-in name 'long' will remain in the language to represent
    the long implementation type (unless it is completely eradicated
    in Python 3.0), but using the int() function is still recommended,
    since it will automatically return a long when needed.


C API

    The C API remains unchanged; C code will still need to be aware of
    the difference between short and long ints.  (The Python 3.0 C API
    will probably be completely incompatible.)

    The PyArg_Parse*() APIs already accept long ints, as long as they
    are within the range representable by C ints or longs, so that
    functions taking C int or long argument won't have to worry about
    dealing with Python longs.


Transition

    There are three major phases to the transition:

    A. Short int operations that currently raise OverflowError return
       a long int value instead.  This is the only change in this
       phase.  Literals will still distinguish between short and long
       ints.  The other semantic differences listed above (including
       the behavior of <<) will remain.  Because this phase only
       changes situations that currently raise OverflowError, it is
       assumed that this won't break existing code.  (Code that
       depends on this exception would have to be too convoluted to be
       concerned about it.)  For those concerned about extreme
       backwards compatibility, a command line option (or a call to
       the warnings module) will allow a warning or an error to be
       issued at this point, but this is off by default.

    B. The remaining semantic differences are addressed.  In all cases
       the long int semantics will prevail.  Since this will introduce
       backwards incompatibilities which will break some old code,
       this phase may require a future statement and/or warnings, and
       a prolonged transition phase.  The trailing 'L' will continue
       to be used for longs as input and by repr().

    C. The trailing 'L' is dropped from repr(), and made illegal on
       input.  (If possible, the 'long' type completely disappears.)
       The trailing 'L' is also dropped from hex() and oct().

    Phase A will be implemented in Python 2.2.

    Phase B will be implemented gradually in Python 2.3 and Python
    2.4.  Envisioned stages of phase B:

    B0. Warnings are enabled about operations that will change their
        numeric outcome in stage B1, in particular hex() and oct(),
        '%u', '%x', '%X' and '%o', hex and oct literals in the
        (inclusive) range [sys.maxint+1, sys.maxint*2+1], and left
        shifts losing bits.

    B1. The new semantics for these operations are implemented.
        Operations that give different results than before will *not*
        issue a warning.

    We propose the following timeline:

    B0. Python 2.3.

    B1. Python 2.4.

    Phase C will be implemented in Python 3.0 (at least two years
    after Python 2.4 is released).


OverflowWarning

    Here are the rules that guide warnings generated in situations
    that currently raise OverflowError.  This applies to transition
    phase A.  Historical note:  even though phase A was completed in
    Python 2.2, and phase B0 in Python 2.3, nobody noticed that
    OverflowWarning was still generated in Python 2.3.  It was finally
    disabled in Python 2.4.  The Python builtin OverflowWarning, and
    the corresponding C API PyExc_OverflowWarning, are no longer
    generated or used in Python 2.4, but will remain for the (unlikely)
    case of user code until Python 2.5.

    - A new warning category is introduced, OverflowWarning.  This is
      a built-in name.

    - If an int result overflows, an OverflowWarning warning is
      issued, with a message argument indicating the operation,
      e.g. "integer addition".  This may or may not cause a warning
      message to be displayed on sys.stderr, or may cause an exception
      to be raised, all under control of the -W command line and the
      warnings module.

    - The OverflowWarning warning is ignored by default.

    - The OverflowWarning warning can be controlled like all warnings,
      via the -W command line option or via the
      warnings.filterwarnings() call.  For example:

        python -Wdefault::OverflowWarning

      causes the OverflowWarning to be displayed the first time it
      occurs at a particular source line, and

        python -Werror::OverflowWarning

      causes the OverflowWarning to be turned into an exception
      whenever it happens.  The following code enables the warning
      from inside the program:

        import warnings
        warnings.filterwarnings("default", "", OverflowWarning)

      See the python man page for the -W option and the warnings
      module documentation for filterwarnings().

    - If the OverflowWarning warning is turned into an error,
      OverflowError is substituted.  This is needed for backwards
      compatibility.

    - Unless the warning is turned into an exception, the result of
      the operation (e.g., x+y) is recomputed after converting the
      arguments to long ints.


Example

    If you pass a long int to a C function or built-in operation that
    takes an integer, it will be treated the same as a short int as
    long as the value fits (by virtue of how PyArg_ParseTuple() is
    implemented).  If the long value doesn't fit, it will still raise
    an OverflowError.  For example:

      def fact(n):
          if n <= 1:
              return 1
          return n*fact(n-1)

      A = "ABCDEFGHIJKLMNOPQ"
      n = input("Gimme an int: ")
      print A[fact(n)%17]

    For n >= 13, this currently raises OverflowError (unless the user
    enters a trailing 'L' as part of their input), even though the
    calculated index would always be in range(17).  With the new
    approach this code will do the right thing: the index will be
    calculated as a long int, but its value will be in range.


Resolved Issues

    These issues, previously open, have been resolved.

    - hex() and oct() applied to longs will continue to produce a
      trailing 'L' until Python 3000.  The original text above wasn't
      clear about this, but since it didn't happen in Python 2.4 it
      was thought better to leave it alone.  BDFL pronouncement here:

          http://mail.python.org/pipermail/python-dev/2006-June/065918.html

    - What to do about sys.maxint?  Leave it in, since it is still
      relevant whenever the distinction between short and long ints is
      still relevant (e.g. when inspecting the type of a value).

    - Should we remove '%u' completely?  Remove it.

    - Should we warn about << not truncating integers?  Yes.

    - Should the overflow warning be on a portable maximum size?  No.


Implementation

    The implementation work for the Python 2.x line is completed;
    phase A was released with Python 2.2, phase B0 with Python 2.3,
    and phase B1 will be released with Python 2.4 (and is already in
    CVS).


Copyright

    This document has been placed in the public domain.



pep-0238 Changing the Division Operator

PEP: 238
Title: Changing the Division Operator
Version: $Revision$
Last-Modified: $Date$
Author: Moshe Zadka <moshez at zadka.site.co.il>, Guido van Rossum <guido at python.org>
Status: Final
Type: Standards Track
Created: 11-Mar-2001
Python-Version: 2.2
Post-History: 16-Mar-2001, 26-Jul-2001, 27-Jul-2001

Abstract

    The current division (/) operator has an ambiguous meaning for
    numerical arguments: it returns the floor of the mathematical
    result of division if the arguments are ints or longs, but it
    returns a reasonable approximation of the division result if the
    arguments are floats or complex.  This makes expressions expecting
    float or complex results error-prone when integers are not
    expected but possible as inputs.

    We propose to fix this by introducing different operators for
    different operations: x/y to return a reasonable approximation of
    the mathematical result of the division ("true division"), x//y to
    return the floor ("floor division").  We call the current, mixed
    meaning of x/y "classic division".

    Because of severe backwards compatibility issues, not to mention a
    major flamewar on c.l.py, we propose the following transitional
    measures (starting with Python 2.2):

    - Classic division will remain the default in the Python 2.x
      series; true division will be standard in Python 3.0.

    - The // operator will be available to request floor division
      unambiguously.

    - The future division statement, spelled "from __future__ import
      division", will change the / operator to mean true division
      throughout the module.

    - A command line option will enable run-time warnings for classic
      division applied to int or long arguments; another command line
      option will make true division the default.

    - The standard library will use the future division statement and
      the // operator when appropriate, so as to completely avoid
      classic division.
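    The three flavors named above can be illustrated directly; this
    runs as-is under Python 3, where true division is already the
    default:

```python
# True division: / returns a close approximation of the mathematical
# result, even for integer operands (the Python 3 default).
assert 7 / 2 == 3.5

# Floor division: // requests the floor of the quotient, unambiguously.
assert 7 // 2 == 3
assert 7.0 // 2.0 == 3.0   # defined for floats as well

# Classic division (Python 2's mixed meaning of /) would instead have
# given 7 / 2 == 3 for ints but 7.0 / 2.0 == 3.5 for floats.
```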


Motivation

    The classic division operator makes it hard to write numerical
    expressions that are supposed to give correct results from
    arbitrary numerical inputs.  For all other operators, one can
    write down a formula such as x*y**2 + z, and the calculated result
    will be close to the mathematical result (within the limits of
    numerical accuracy, of course) for any numerical input type (int,
    long, float, or complex).  But division poses a problem: if the
    expressions for both arguments happen to have an integral type, it
    implements floor division rather than true division.

    The problem is unique to dynamically typed languages: in a
    statically typed language like C, the inputs, typically function
    arguments, would be declared as double or float, and when a call
    passes an integer argument, it is converted to double or float at
    the time of the call.  Python doesn't have argument type
    declarations, so integer arguments can easily find their way into
    an expression.

    The problem is particularly pernicious since ints are perfect
    substitutes for floats in all other circumstances: math.sqrt(2)
    returns the same value as math.sqrt(2.0), 3.14*100 and 3.14*100.0
    return the same value, and so on.  Thus, the author of a numerical
    routine may only use floating point numbers to test his code, and
    believe that it works correctly, and a user may accidentally pass
    in an integer input value and get incorrect results.

    Another way to look at this is that classic division makes it
    difficult to write polymorphic functions that work well with
    either float or int arguments; all other operators already do the
    right thing.  No algorithm that works for both ints and floats has
    a need for truncating division in one case and true division in
    the other.

    The correct work-around is subtle: casting an argument to float()
    is wrong if it could be a complex number; adding 0.0 to an
    argument doesn't preserve the sign of the argument if it was minus
    zero.  The only solution without either downside is multiplying an
    argument (typically the first) by 1.0.  This leaves the value and
    sign unchanged for float and complex, and turns int and long into
    a float with the corresponding value.
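    A short check of the three work-arounds discussed above (the
    behaviors are unchanged in today's Python):

```python
import math

# float(x) is wrong for complex arguments:
try:
    float(3 + 4j)
except TypeError:
    pass  # a complex number cannot be converted to float

# Adding 0.0 destroys the sign of minus zero:
assert math.copysign(1.0, -0.0 + 0.0) == 1.0   # sign lost

# Multiplying by 1.0 preserves value and sign for float and complex,
# and converts int into the corresponding float:
assert math.copysign(1.0, 1.0 * -0.0) == -1.0  # sign preserved
assert 1.0 * (3 + 4j) == 3 + 4j
assert 1.0 * 7 == 7.0 and isinstance(1.0 * 7, float)
```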

    It is the opinion of the authors that this is a real design bug in
    Python, and that it should be fixed sooner rather than later.
    Assuming Python usage will continue to grow, the cost of leaving
    this bug in the language will eventually outweigh the cost of
    fixing old code -- there is an upper bound to the amount of code
    to be fixed, but the amount of code that might be affected by the
    bug in the future is unbounded.

    Another reason for this change is the desire to ultimately unify
    Python's numeric model.  This is the subject of PEP 228[0] (which
    is currently incomplete).  A unified numeric model removes most of
    the user's need to be aware of different numerical types.  This is
    good for beginners, but also takes away concerns about different
    numeric behavior for advanced programmers.  (Of course, it won't
    remove concerns about numerical stability and accuracy.)

    In a unified numeric model, the different types (int, long, float,
    complex, and possibly others, such as a new rational type) serve
    mostly as storage optimizations, and to some extent to indicate
    orthogonal properties such as inexactness or complexity.  In a
    unified model, the integer 1 should be indistinguishable from the
    floating point number 1.0 (except for its inexactness), and both
    should behave the same in all numeric contexts.  Clearly, in a
    unified numeric model, if a==b and c==d, a/c should equal b/d
    (taking some liberties due to rounding for inexact numbers), and
    since everybody agrees that 1.0/2.0 equals 0.5, 1/2 should also
    equal 0.5.  Likewise, since 1//2 equals zero, 1.0//2.0 should also
    equal zero.


Variations

    Aesthetically, x//y doesn't please everyone, and hence several
    variations have been proposed.  They are addressed here:

    - x div y.  This would introduce a new keyword.  Since div is a
      popular identifier, this would break a fair amount of existing
      code, unless the new keyword was only recognized under a future
      division statement.  Since it is expected that the majority of
      code that needs to be converted is dividing integers, this would
      greatly increase the need for the future division statement.
      Even with a future statement, the general sentiment against
      adding new keywords unless absolutely necessary argues against
      this.

    - div(x, y).  This makes the conversion of old code much harder.
      Replacing x/y with x//y or x div y can be done with a simple
      query replace; in most cases the programmer can easily verify
      that a particular module only works with integers so all
      occurrences of x/y can be replaced.  (The query replace is still
      needed to weed out slashes occurring in comments or string
      literals.)  Replacing x/y with div(x, y) would require a much
      more intelligent tool, since the extent of the expressions to
      the left and right of the / must be analyzed before the
      placement of the "div(" and ")" part can be decided.

    - x \ y.  The backslash is already a token, meaning line
      continuation, and in general it suggests an "escape" to Unix
      eyes.  In addition (this due to Terry Reedy) this would make
      things like eval("x\y") harder to get right.


Alternatives

    In order to reduce the amount of old code that needs to be
    converted, several alternative proposals have been put forth.
    Here is a brief discussion of each proposal (or category of
    proposals).  If you know of an alternative that was discussed on
    c.l.py that isn't mentioned here, please mail the second author.

    - Let / keep its classic semantics; introduce // for true
      division.  This still leaves a broken operator in the language,
      and invites use of the broken behavior.  It also shuts off the
      road to a unified numeric model a la PEP 228[0].

    - Let int division return a special "portmanteau" type that
      behaves as an integer in integer context, but like a float in a
      float context.  The problem with this is that after a few
      operations, the int and the float value could be miles apart,
      it's unclear which value should be used in comparisons, and of
      course many contexts (like conversion to string) don't have a
      clear integer or float preference.

    - Use a directive to use specific division semantics in a module,
      rather than a future statement.  This retains classic division
      as a permanent wart in the language, requiring future
      generations of Python programmers to be aware of the problem and
      the remedies.

    - Use "from __past__ import division" to use classic division
      semantics in a module.  This also retains the classic division
      as a permanent wart, or at least for a long time (eventually the
      past division statement could raise an ImportError).

    - Use a directive (or some other way) to specify the Python
      version for which a specific piece of code was developed.  This
      requires future Python interpreters to be able to emulate
      *exactly* several previous versions of Python, and moreover to
      do so for multiple versions within the same interpreter.  This
      is way too much work.  A much simpler solution is to keep
      multiple interpreters installed.  Another argument against this
      is that the version directive is almost always overspecified:
      most code written for Python X.Y works for Python X.(Y-1) and
      X.(Y+1) as well, so specifying X.Y as a version is more
      constraining than it needs to be.  At the same time, there's no
      way to know at which future or past version the code will break.


API Changes

    During the transitional phase, we have to support *three* division
    operators within the same program: classic division (for / in
    modules without a future division statement), true division (for /
    in modules with a future division statement), and floor division
    (for //).  Each operator comes in two flavors: regular, and as an
    augmented assignment operator (/= or //=).

    The names associated with these variations are:

    - Overloaded operator methods:

      __div__(), __floordiv__(), __truediv__();

      __idiv__(), __ifloordiv__(), __itruediv__().

    - Abstract API C functions:

      PyNumber_Divide(), PyNumber_FloorDivide(),
      PyNumber_TrueDivide();

      PyNumber_InPlaceDivide(), PyNumber_InPlaceFloorDivide(),
      PyNumber_InPlaceTrueDivide().

    - Byte code opcodes:

      BINARY_DIVIDE, BINARY_FLOOR_DIVIDE, BINARY_TRUE_DIVIDE;

      INPLACE_DIVIDE, INPLACE_FLOOR_DIVIDE, INPLACE_TRUE_DIVIDE.

    - PyNumberMethod slots:

      nb_divide, nb_floor_divide, nb_true_divide,

      nb_inplace_divide, nb_inplace_floor_divide,
      nb_inplace_true_divide.

    The added PyNumberMethod slots require an additional flag in
    tp_flags; this flag will be named Py_TPFLAGS_HAVE_NEWDIVIDE and
    will be included in Py_TPFLAGS_DEFAULT.

    The true and floor division APIs will look for the corresponding
    slots and call that; when that slot is NULL, they will raise an
    exception.  There is no fallback to the classic divide slot.

    In Python 3.0, the classic division semantics will be removed; the
    classic division APIs will become synonymous with true division.
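    On the Python side, the overloaded operator methods dispatch as one
    would expect.  A minimal sketch with a hypothetical Div class,
    shown under Python 3 semantics where the classic __div__ slot is
    gone:

```python
import operator

class Div:
    """Hypothetical class illustrating the overloaded operator methods."""
    def __truediv__(self, other):
        return "true"
    def __floordiv__(self, other):
        return "floor"

d = Div()
assert d / 1 == "true"     # / dispatches to __truediv__
assert d // 1 == "floor"   # // dispatches to __floordiv__

# The operator module exposes the same abstract operations that
# PyNumber_TrueDivide() and PyNumber_FloorDivide() implement in C:
assert operator.truediv(7, 2) == 3.5
assert operator.floordiv(7, 2) == 3
```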


Command Line Option

    The -Q command line option takes a string argument that can take
    four values: "old", "warn", "warnall", or "new".  The default is
    "old" in Python 2.2 but will change to "warn" in later 2.x
    versions.  The "old" value means the classic division operator
    acts as described.  The "warn" value means the classic division
    operator issues a warning (a DeprecationWarning using the standard
    warning framework) when applied to ints or longs.  The "warnall"
    value also issues warnings for classic division when applied to
    floats or complex; this is for use by the fixdiv.py conversion
    script mentioned below.  The "new" value changes the default
    globally so that the / operator is always interpreted as true
    division.  The "new" option is only intended for use in certain
    educational environments, where true division is required, but
    asking the students to include the future division statement in
    all their code would be a problem.

    This option will not be supported in Python 3.0; Python 3.0 will
    always interpret / as true division.

    (This option was originally proposed as -D, but that turned out to
    be an existing option for Jython, hence the Q -- mnemonic for
    Quotient.  Other names have been proposed, like -Qclassic,
    -Qclassic-warn, -Qtrue, or -Qold_division etc.; these seem more
    verbose to me without much advantage.  After all, the term classic
    division is not used in the language at all (only in the PEP), and
    the term true division is rarely used in the language -- only in
    __truediv__.)


Semantics of Floor Division

    Floor division will be implemented in all the Python numeric
    types, and will have the semantics of

        a // b == floor(a/b)

    except that the result type will be the common type into which a
    and b are coerced before the operation.

    Specifically, if a and b are of the same type, a//b will be of
    that type too.  If the inputs are of different types, they are
    first coerced to a common type using the same rules used for all
    other arithmetic operators.

    In particular, if a and b are both ints or longs, the result has
    the same type and value as for classic division on these types
    (including the case of mixed input types; int//long and long//int
    will both return a long).

    For floating point inputs, the result is a float.  For example:

      3.5//2.0 == 1.0

    For complex numbers, // raises an exception, since floor() of a
    complex number is not allowed.

    For user-defined classes and extension types, all semantics are up
    to the implementation of the class or type.
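    These rules can be checked directly in a modern Python:

```python
# // floors the quotient, rounding towards negative infinity:
assert 7 // 2 == 3
assert -7 // 2 == -4                # floor(-3.5) == -4, not -3

# The result type is the common type of the operands:
assert 3.5 // 2.0 == 1.0            # float // float -> float
assert isinstance(7 // 2, int)
assert isinstance(7.0 // 2, float)  # mixed types coerce to float

# floor() of a complex number is not allowed, so // raises:
try:
    (1 + 2j) // 1
except TypeError:
    pass
```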


Semantics of True Division

    True division for ints and longs will convert the arguments to
    float and then apply a float division.  That is, even 2/1 will
    return a float (2.0), not an int.  For floats and complex, it will
    be the same as classic division.

    The 2.2 implementation of true division acts as if the float type
    had unbounded range, so that overflow doesn't occur unless the
    magnitude of the mathematical *result* is too large to represent
    as a float.  For example, after "x = 1L << 40000", float(x) raises
    OverflowError (note that this is also new in 2.2:  previously the
    outcome was platform-dependent, most commonly a float infinity).  But
    x/x returns 1.0 without exception, while x/1 raises OverflowError.

    Note that for int and long arguments, true division may lose
    information; this is in the nature of true division (as long as
    rationals are not in the language).  Algorithms that consciously
    use longs should consider using //, as true division of longs
    retains no more than 53 bits of precision (on most platforms).

    If and when a rational type is added to Python (see PEP 239[2]),
    true division for ints and longs should probably return a
    rational.  This avoids the problem with true division of ints and
    longs losing information.  But until then, for consistency, float is
    the only choice for true division.
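    A sketch of these rules, including the unbounded-range behavior of
    long true division (all of this still holds for Python 3 ints):

```python
# Even an exact integer quotient comes back as a float:
assert 2 / 1 == 2.0 and isinstance(2 / 1, float)

x = 1 << 40000          # far too large to represent as a float
try:
    float(x)
except OverflowError:
    pass                # the conversion overflows...

assert x / x == 1.0     # ...but true division only overflows when the
                        # *quotient* is too large for a float
try:
    x / 1
except OverflowError:
    pass

# True division of huge integers keeps no more than float precision
# (53 bits on most platforms); // keeps everything:
assert (x + 1) / x == 1.0      # information lost in rounding
assert (x + 1) // x == 1       # exact
```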


The Future Division Statement

    If "from __future__ import division" is present in a module, or if
    -Qnew is used, the / and /= operators are translated to true
    division opcodes; otherwise they are translated to classic
    division (until Python 3.0 comes along, where they are always
    translated to true division).

    The future division statement has no effect on the recognition or
    translation of // and //=.

    See PEP 236[4] for the general rules for future statements.

    (It has been proposed to use a longer phrase, like "true_division"
    or "modern_division".  These don't seem to add much information.)


Open Issues

    We expect that these issues will be resolved over time, as more
    feedback is received or we gather more experience with the initial
    implementation.

    - It has been proposed to call // the quotient operator, and the /
      operator the ratio operator.  I'm not sure about this -- for
      some people quotient is just a synonym for division, and ratio
      suggests rational numbers, which is wrong.  I prefer the
      terminology to be slightly awkward if that avoids ambiguity.
      Also, for some folks "quotient" suggests truncation towards
      zero, not towards infinity as "floor division" says explicitly.

    - It has been argued that a command line option to change the
      default is evil.  It can certainly be dangerous in the wrong
      hands: for example, it would be impossible to combine a 3rd
      party library package that requires -Qnew with another one that
      requires -Qold.  But I believe that the VPython folks need a way
      to enable true division by default, and other educators might
      need the same.  These usually have enough control over the
      library packages available in their environment.

    - For classes to have to support all three of __div__(),
      __floordiv__() and __truediv__() seems painful; and what to do
      in 3.0?  Maybe we only need __div__() and __floordiv__(), or
      maybe at least true division should try __truediv__() first and
      __div__() second.


Resolved Issues

    - Issue:  For very large long integers, the definition of true
      division as returning a float causes problems, since the range of
      Python longs is much larger than that of Python floats.  This
      problem will disappear if and when rational numbers are supported.

      Resolution:  For long true division, Python uses an internal
      float type with native double precision but unbounded range, so
      that OverflowError doesn't occur unless the quotient is too large
      to represent as a native double.

    - Issue:  In the interim, maybe the long-to-float conversion could be
      made to raise OverflowError if the long is out of range.

      Resolution:  This has been implemented, but, as above, the
      magnitude of the inputs to long true division doesn't matter; only
      the magnitude of the quotient matters.

    - Issue:  Tim Peters will make sure that whenever an in-range float
      is returned, decent precision is guaranteed.

      Resolution:  Provided the quotient of long true division is
      representable as a float, it suffers no more than 3 rounding
      errors:  one each for converting the inputs to an internal float
      type with native double precision but unbounded range, and
      one more for the division.  However, note that if the magnitude
      of the quotient is too *small* to represent as a native double,
      0.0 is returned without exception ("silent underflow").


FAQ

    Q. When will Python 3.0 be released?

    A. We don't plan that long ahead, so we can't say for sure.  We
       want to allow at least two years for the transition.  If Python
       3.0 comes out sooner, we'll keep the 2.x line alive for
       backwards compatibility until at least two years from the
       release of Python 2.2.  In practice, you will be able to
       continue to use the Python 2.x line for several years after
       Python 3.0 is released, so you can take your time with the
       transition.  Sites are expected to have both Python 2.x and
       Python 3.x installed simultaneously.

    Q. Why isn't true division called float division?

    A. Because I want to keep the door open to *possibly* introducing
       rationals and making 1/2 return a rational rather than a
       float.  See PEP 239[2].

    Q. Why is there a need for __truediv__ and __itruediv__?

    A. We don't want to make user-defined classes second-class
       citizens.  Certainly not with the type/class unification going
       on.

    Q. How do I write code that works under the classic rules as well
       as under the new rules without using // or a future division
       statement?

    A. Use x*1.0/y for true division, divmod(x, y)[0] for int
       division.  Especially the latter is best hidden inside a
       function.  You may also write float(x)/y for true division if
       you are sure that you don't expect complex numbers.  If you
       know your integers are never negative, you can use int(x/y) --
       while the documentation of int() says that int() can round or
       truncate depending on the C implementation, we know of no C
       implementation that doesn't truncate, and we're going to change
       the spec for int() to promise truncation.  Note that classic
       division (and floor division) round towards negative infinity,
       while int() rounds towards zero, giving different answers for
       negative numbers.
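    The caveat in the last sentence is worth demonstrating under the
    new rules:

```python
# x*1.0/y gives true division under both classic and new rules:
assert 7 * 1.0 / 2 == 3.5

# divmod(x, y)[0] gives floor (integer) division under both:
assert divmod(7, 2)[0] == 3
assert divmod(-7, 2)[0] == -4    # floors towards negative infinity

# int(x/y) is only safe for non-negative operands: int() truncates
# towards zero, while floor division rounds towards negative infinity.
assert int(-7 / 2) == -3         # truncated
assert -7 // 2 == -4             # floored -- a different answer
```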

    Q. How do I specify the division semantics for input(), compile(),
       execfile(), eval() and exec?

    A. They inherit the choice from the invoking module.  PEP 236[4]
       now lists this as a resolved problem, referring to PEP 264[5].

    Q. What about code compiled by the codeop module?

    A. This is dealt with properly; see PEP 264[5].

    Q. Will there be conversion tools or aids?

    A. Certainly.  While these are outside the scope of the PEP, I
       should point out two simple tools that will be released with
       Python 2.2a3: Tools/scripts/finddiv.py finds division operators
       (slightly smarter than "grep /") and Tools/scripts/fixdiv.py
       can produce patches based on run-time analysis.

    Q. Why is my question not answered here?

    A. Because we weren't aware of it.  If it's been discussed on
       c.l.py and you believe the answer is of general interest,
       please notify the second author.  (We don't have the time or
       inclination to answer every question sent in private email,
       hence the requirement that it be discussed on c.l.py first.)


Implementation

    Essentially everything mentioned here is implemented in CVS and
    will be released with Python 2.2a3; most of it was already
    released with Python 2.2a2.


References

    [0] PEP 228, Reworking Python's Numeric Model
        http://www.python.org/dev/peps/pep-0228/

    [1] PEP 237, Unifying Long Integers and Integers, Zadka,
        http://www.python.org/dev/peps/pep-0237/

    [2] PEP 239, Adding a Rational Type to Python, Zadka,
        http://www.python.org/dev/peps/pep-0239/

    [3] PEP 240, Adding a Rational Literal to Python, Zadka,
        http://www.python.org/dev/peps/pep-0240/

    [4] PEP 236, Back to the __future__, Peters,
        http://www.python.org/dev/peps/pep-0236/

    [5] PEP 264, Future statements in simulated shells
        http://www.python.org/dev/peps/pep-0264/


Copyright

    This document has been placed in the public domain.



pep-0239 Adding a Rational Type to Python

PEP: 239
Title: Adding a Rational Type to Python
Version: $Revision$
Last-Modified: $Date$
Author: Christopher A. Craig <python-pep at ccraig.org>, Moshe Zadka <moshez at zadka.site.co.il>
Status: Rejected
Type: Standards Track
Created: 11-Mar-2001
Python-Version: 2.2
Post-History: 16-Mar-2001

Abstract

    Python has no numeric type with the semantics of an unboundedly
    precise rational number.  This proposal explains the semantics of
    such a type, and suggests builtin functions and literals to
    support such a type.  This PEP suggests no literals for rational
    numbers; that is left for another PEP[1].

BDFL Pronouncement

    This PEP is rejected.  The needs outlined in the rationale section
    have been addressed to some extent by the acceptance of PEP 327
    for decimal arithmetic.  Guido also noted, "Rational arithmetic
    was the default 'exact' arithmetic in ABC and it did not work out as
    expected".  See the python-dev discussion on 17 June 2005.

    *Postscript:* With the acceptance of PEP 3141, "A Type Hierarchy
    for Numbers", a 'Rational' numeric abstract base class was added
    with a concrete implementation in the 'fractions' module.

Rationale

    While sometimes slower and more memory intensive (in general,
    unboundedly so) rational arithmetic captures more closely the
    mathematical ideal of numbers, and tends to have behavior which is
    less surprising to newbies.  Though many Python implementations of
    rational numbers have been written, none of these exist in the
    core, or are documented in any way.  This has made them much less
    accessible to people who are less Python-savvy.


RationalType

    There will be a new numeric type added called RationalType.  Its
    unary operators will do the obvious thing.  Binary operators will
    coerce integers and long integers to rationals, and rationals to
    floats and complexes.

    The following attributes will be supported: .numerator and
    .denominator.  The language definition will promise that

        r.denominator * r == r.numerator

    that the GCD of the numerator and the denominator is 1 and that
    the denominator is positive.

    The method r.trim(max_denominator) will return the closest
    rational s to r such that abs(s.denominator) <= max_denominator.
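    As the Postscript above notes, the fractions module later provided
    essentially this type; the sketch below maps the proposed
    semantics onto it (limit_denominator() plays the role of the
    proposed trim(), though its argument is a positive bound rather
    than the proposed max_denominator):

```python
from fractions import Fraction

r = Fraction(6, -8)               # normalized on construction
assert r.numerator == -3          # GCD of 1, denominator positive
assert r.denominator == 4
assert r.denominator * r == r.numerator   # the promised identity

# limit_denominator() is the analogue of the proposed r.trim():
pi_ish = Fraction(355, 113)
assert pi_ish.limit_denominator(10) == Fraction(22, 7)
```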


The rational() Builtin

    This function will have the signature rational(n, d=1).  n and d
    must both be integers, long integers or rationals.  A guarantee is
    made that

        rational(n, d) * d == n


Open Issues

    - Maybe the type should be called rat instead of rational.
      Somebody proposed that we have "abstract" pure mathematical
      types named complex, real, rational, integer, and "concrete"
      representation types with names like float, rat, long, int.

    - Should a rational number with an integer value be allowed as a
      sequence index?  For example, should s[5/3 - 2/3] be equivalent
      to s[1]?

    - Should shift and mask operators be allowed for rational numbers?
      For rational numbers with integer values?

    - Marcin 'Qrczak' Kowalczyk summarized the arguments for and
      against unifying ints with rationals nicely on c.l.py:

      Arguments for unifying ints with rationals:

      - Since 2 == 2/1 and maybe str(2/1) == '2', it reduces surprises
        where objects seem equal but behave differently.

      - / can be freely used for integer division when I *know* that
        there is no remainder (if I am wrong and there is a remainder,
        there will probably be some exception later).

      Arguments against:

      - When I use the result of / as a sequence index, it's usually
        an error which should not be hidden by making the program
        work for some data, since it will break for other data.

      - (this assumes that after unification int and rational would be
        different types:) Types should rarely depend on values. It's
        easier to reason when the type of a variable is known: I know
        how I can use it. I can determine that something is an int and
        expect that other objects used in this place will be ints too.

      - (this assumes the same type for them:) Int is a good type in
        itself, not to be mixed with rationals.  The fact that
        something is an integer should be expressible as a statement
        about its type. Many operations require ints and don't accept
        rationals. It's natural to think about them as about different
        types.


References

    [1] PEP 240, Adding a Rational Literal to Python, Zadka,
        http://www.python.org/dev/peps/pep-0240/


Copyright

    This document has been placed in the public domain.



pep-0240 Adding a Rational Literal to Python

PEP: 240
Title: Adding a Rational Literal to Python
Version: $Revision$
Last-Modified: $Date$
Author: Christopher A. Craig <python-pep at ccraig.org>, Moshe Zadka <moshez at zadka.site.co.il>
Status: Rejected
Type: Standards Track
Created: 11-Mar-2001
Python-Version: 2.2
Post-History: 16-Mar-2001

Abstract

    A different PEP[1] suggests adding a builtin rational type to
    Python.  This PEP suggests changing the ddd.ddd float literal to a
    rational in Python, and modifying non-integer division to return
    it.

BDFL Pronouncement

    This PEP is rejected.  The needs outlined in the rationale section
    have been addressed to some extent by the acceptance of PEP 327
    for decimal arithmetic.  Guido also noted, "Rational arithmetic
    was the default 'exact' arithmetic in ABC and it did not work out as
    expected".  See the python-dev discussion on 17 June 2005.

Rationale

    Rational numbers are useful for exact and unsurprising arithmetic.
    They give the correct results people have been taught in various
    math classes.  Making the "obvious" non-integer type one with more
    predictable semantics will surprise new programmers less than
    using floating point numbers. As quite a few posts on c.l.py and
    on tutor@python.org have shown, people often get bitten by the
    strange semantics of floating point numbers: for example,
    semantics of floating point numbers: for example, round(0.98, 2)
    still gives 0.97999999999999998.
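    The exact semantics the PEP is after can be sketched with the
    fractions module that later entered the standard library (Python
    2.6+); this is not the literal syntax proposed here, only an
    illustration of exact rational arithmetic versus binary floats:

```python
from fractions import Fraction

# Binary floats cannot represent 0.1 exactly, so sums drift:
print(0.1 + 0.2 == 0.3)                                      # False

# Rational arithmetic is exact:
print(Fraction(1, 10) + Fraction(2, 10) == Fraction(3, 10))  # True

# Under the proposal the literal 0.98 would mean the exact
# rational 98/100, which normalizes to 49/50:
print(Fraction(98, 100))                                     # 49/50
```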


Proposal

    Literals conforming to the regular expression '\d*\.\d*' will be
    rational numbers.
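    As a rough illustration (the proposal concerns the tokenizer, but
    the same pattern can be checked with the re module; note that as
    written it also matches a lone dot, since both digit runs may be
    empty):

```python
import re

# The proposed literal pattern, with the dot escaped: an optional
# digit run, a dot, and another optional digit run.
RATIONAL_LITERAL = re.compile(r'\d*\.\d*')

for text in ('3.14', '.5', '2.', '1e0'):
    # 1e0 does not match, so it would remain a float literal.
    print(text, bool(RATIONAL_LITERAL.fullmatch(text)))
```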


Backwards Compatibility

    The only backwards compatibility issue is the type of the literals
    mentioned above.  The following migration is suggested:

    1. The next Python after approval will allow 
       "from __future__ import rational_literals" 
       to cause all such literals to be treated as rational numbers.

    2. Python 3.0 will have a warning, turned on by default, about
       such literals in the absence of a __future__ statement.  The
       warning message will contain information about the __future__
       statement, and indicate that to get floating point literals,
       they should be suffixed with "e0".

    3. Python 3.1 will have the warning turned off by default.  This
       warning will stay in place for 24 months, at which time the
       literals will be rationals and the warning will be removed.


Common Objections

    Rationals are slow and memory intensive!
    (Relax, I'm not taking floats away, I'm just adding two more characters.
    1e0 will still be a float)

    Rationals must present themselves as a decimal float or they will be
    horrible for users expecting decimals (i.e. str(.5) should return '.5' and
    not '1/2').  This means that many rationals must be truncated at some 
    point, which gives us a new loss of precision.
    


References

    [1] PEP 239, Adding a Rational Type to Python, Zadka,
        http://www.python.org/dev/peps/pep-0239/


Copyright

    This document has been placed in the public domain.



pep-0241 Metadata for Python Software Packages

PEP: 241
Title: Metadata for Python Software Packages
Version: $Revision$
Last-Modified: $Date$
Author: A.M. Kuchling <amk at amk.ca>
Status: Final
Type: Standards Track
Created: 12-Mar-2001
Post-History: 19-Mar-2001

Introduction

   This PEP describes a mechanism for adding metadata to Python
   packages.  It includes specifics of the field names, and their
   semantics and usage.


Including Metadata in Packages

    The Distutils 'sdist' command will be modified to extract the
    metadata fields from the arguments and write them to a file in the
    generated zipfile or tarball.  This file will be named PKG-INFO
    and will be placed in the top directory of the source
    distribution (where the README, INSTALL, and other files usually
    go).

    Developers may not provide their own PKG-INFO file.  The "sdist"
    command will, if it detects an existing PKG-INFO file, terminate
    with an appropriate error message.  This should prevent confusion
    caused by the PKG-INFO and setup.py files being out of sync.

    The PKG-INFO file format is a single set of RFC-822 headers
    parseable by the rfc822.py module.  The field names listed in the
    following section are used as the header names.  There's no 
    extension mechanism in this simple format; the Catalog and Distutils
    SIGs will aim at getting a more flexible format ready for Python 2.2.
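    For illustration only: the Python 2 rfc822 module is long gone,
    but the same RFC-822 header format can be parsed with today's
    email package.  A minimal sketch, using field names from the
    following section:

```python
from email.parser import HeaderParser

PKG_INFO = """\
Metadata-Version: 1.0
Name: BeagleVote
Version: 1.0a2
Platform: POSIX, Windows
Summary: A module for collecting votes from beagles.
"""

# HeaderParser reads the RFC-822 header block; fields are then
# accessible by name, case-insensitively.
headers = HeaderParser().parsestr(PKG_INFO)
print(headers['Name'])     # BeagleVote
print(headers['Version'])  # 1.0a2
```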
    

Fields

    This section specifies the names and semantics of each of the
    supported metadata fields.
 
    Fields marked with "(Multiple use)" may be specified multiple
    times in a single PKG-INFO file.  Other fields may only occur
    once in a PKG-INFO file.  Fields marked with "(optional)" are
    not required to appear in a valid PKG-INFO file, all other
    fields must be present.

    Metadata-Version

      Version of the file format; currently "1.0" is the only
      legal value here.  

      Example: 

           Metadata-Version: 1.0

    Name

      The name of the package.  

      Example: 

          Name: BeagleVote
      
    Version

      A string containing the package's version number.  This
      field should be parseable by one of the Version classes
      (StrictVersion or LooseVersion) in the distutils.version
      module.

      Example: 

          Version: 1.0a2
      
    Platform (multiple use)

      A comma-separated list of platform specifications, summarizing
      the operating systems supported by the package.  The major
      supported platforms are listed below, but this list is
      necessarily incomplete.

            POSIX, MacOS, Windows, BeOS, PalmOS.

      Binary distributions will use the Supported-Platform field in
      their metadata to specify the OS and CPU for which the binary
      package was compiled.  The semantics of the Supported-Platform
      are not specified in this PEP.

      Example: 

          Platform: POSIX, Windows
      
    Summary

      A one-line summary of what the package does.

      Example: 

          Summary: A module for collecting votes from beagles.
      
    Description (optional)

      A longer description of the package that can run to several
      paragraphs.  (Software that deals with metadata should not
      assume any maximum size for this field, though one hopes that
      people won't include their instruction manual as the
      long-description.)

      Example: 
      
          Description: This module collects votes from beagles
                       in order to determine their electoral wishes.
                       Do NOT try to use this module with basset hounds;
                       it makes them grumpy.
      
    Keywords (optional)

      A list of additional keywords to be used to assist searching
      for the package in a larger catalog.

      Example: 

          Keywords: dog puppy voting election
      
    Home-page (optional)

      A string containing the URL for the package's home page.

      Example: 

          Home-page: http://www.example.com/~cschultz/bvote/
      
    Author (optional)

      A string containing at a minimum the author's name.  Contact
      information can also be added, separating each line with
      newlines.

      Example: 

          Author: C. Schultz
                  Universal Features Syndicate
                  Los Angeles, CA
      
    Author-email

      A string containing the author's e-mail address.  It can contain
      a name and e-mail address in the legal forms for a RFC-822
      'From:' header.  It's not optional because cataloging systems
      can use the e-mail portion of this field as a unique key
      representing the author.  A catalog might provide authors the
      ability to store their GPG key, personal home page, and other
      additional metadata *about the author*, and optionally the
      ability to associate several e-mail addresses with the same
      person.  Author-related metadata fields are not covered by this
      PEP.  

      Example: 

          Author-email: "C. Schultz" <cschultz@example.com>
      
    License
      
      A string selected from a short list of choices, specifying the
      license covering the package.  Some licenses result in the
      software being freely redistributable, so packagers and
      resellers can automatically know that they're free to
      redistribute the software.  Other licenses will require
      a careful reading by a human to determine how the software can be
      repackaged and resold.

      The choices are:

        Artistic, BSD, DFSG, GNU GPL, GNU LGPL, "MIT", 
        Mozilla PL, "public domain", Python, Qt PL, Zope PL, unknown,
        nocommercial, nosell, nosource, shareware, other

      Definitions of some of the licenses are:

       DFSG           The license conforms to the Debian Free Software
                      Guidelines, but does not use one of the other
                      DFSG conforming licenses listed here. 
                      More information is available at:
                      http://www.debian.org/social_contract#guidelines

       Python         Python 1.6 or higher license.  Version 1.5.2 and 
                      earlier are under the MIT license.

       public domain  Software is public domain, not copyrighted.
       unknown        Status is not known 
       nocommercial   Free private use but commercial use not permitted 
       nosell         Free use but distribution for profit by arrangement 
       nosource       Freely distributable but no source code 
       shareware      Payment is requested if software is used
       other          General category for other non-DFSG licenses 

      Some of these licenses can be interpreted to mean the software is 
      freely redistributable.  The list of redistributable licenses is:

      Artistic, BSD, DFSG, GNU GPL, GNU LGPL, "MIT", 
      Mozilla PL, "public domain", Python, Qt PL, Zope PL, 
      nosource, shareware

      Note that being redistributable does not mean a package
      qualifies as free software, 'nosource' and 'shareware' being
      examples.

      Example: 

          License: MIT
      

Acknowledgements

    Many changes and rewrites to this document were suggested by the
    readers of the Distutils SIG.  In particular, Sean Reifschneider
    often contributed actual text for inclusion in this PEP.
 
    The list of licenses was compiled using the SourceForge license
    list and the CTAN license list compiled by Graham Williams; Carey
    Evans also offered several useful suggestions on this list.


Copyright

    This document has been placed in the public domain.



pep-0242 Numeric Kinds

PEP: 242
Title: Numeric Kinds
Version: $Revision$
Last-Modified: $Date$
Author: Paul F. Dubois <paul at pfdubois.com>
Status: Rejected
Type: Standards Track
Created: 17-Mar-2001
Python-Version: 2.2
Post-History: 17-Apr-2001

Abstract

    This proposal gives the user optional control over the precision
    and range of numeric computations so that a computation can be
    written once and run anywhere with at least the desired precision
    and range.  It is backward compatible with existing code.  The
    meaning of decimal literals is clarified.


Rationale

    Currently, in every language except Fortran 90, it is impossible
    to write a program in a portable way that uses floating point and
    gets roughly the same answer regardless of platform -- or refuses
    to compile if that is not possible.  Python currently has only one
    floating point type, equal to a C double in the C implementation.

    No type exists corresponding to single or quad floats.  It would
    complicate the language to try to introduce such types directly
    and their subsequent use would not be portable.  This proposal is
    similar to the Fortran 90 "kind" solution, adapted to the Python
    environment.  With this facility an entire calculation can be
    switched from one level of precision to another by changing a
    single line.  If the desired precision does not exist on a
    particular machine, the program will fail rather than get the
    wrong answer.  Since coding in this style would involve an early
    call to the routine that will fail, this is the next best thing to
    not compiling.


Supported Kinds of Ints and Floats

    Complex numbers are treated separately below, since Python can be
    built without them.

    Each Python compiler may define as many "kinds" of integer and
    floating point numbers as it likes, except that it must support at
    least two kinds of integer corresponding to the existing int and
    long, and must support at least one kind of floating point number,
    equivalent to the present float.
    
    The range and precision of these required kinds are processor
    dependent, as at present, except for the "long integer" kind,
    which can hold an arbitrary integer.

    The built-in functions int(), long(), and float() convert inputs
    to these default kinds as they do at present.  (Note that a
    Unicode string is actually a different "kind" of string and that a
    sufficiently knowledgeable person might be able to expand this PEP
    to cover that case.)

    Within each type (integer, floating) the compiler supports a
    linearly-ordered set of kinds, with the ordering determined by the
    ability to hold numbers of an increased range and/or precision.


Kind Objects

    Two new standard functions are defined in a module named "kinds".
    They return callable objects called kind objects.  Each int or
    floating kind object f has the signature result = f(x), and each
    complex kind object has the signature result = f(x, y=0.).

    int_kind(n)
        For an integer argument n >= 1, return a callable object whose
        result is an integer kind that will hold an integer number in
        the open interval (-10**n,10**n).  The kind object accepts
        arguments that are integers including longs.  If n == 0,
        returns the kind object corresponding to the Python literal 0.

    float_kind(nd, n)
        For nd >= 0 and n >= 1, return a callable object whose result
        is a floating point kind that will hold a floating-point
        number with at least nd digits of precision and a base-10
        exponent in the closed interval [-n, n].  The kind object
        accepts arguments that are integer or float.

        If nd and n are both zero, returns the kind object
        corresponding to the Python literal 0.0.

    The compiler will return a kind object corresponding to the least
    of its available set of kinds for that type that has the desired
    properties.  If no kind with the desired qualities exists in a
    given implementation an OverflowError exception is thrown.  A kind
    function converts its argument to the target kind, but if the
    result does not fit in the target kind's range, an OverflowError
    exception is thrown.
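    A minimal sketch of the selection rule above, assuming a
    hypothetical implementation on CPython, where sys.float_info
    describes the single available float kind (C double):

```python
import sys

def float_kind(nd, n):
    """Return a converter for the least available float kind with at
    least nd digits of precision and base-10 exponent range [-n, n].
    CPython offers exactly one float kind (C double), so that is the
    only candidate; otherwise raise OverflowError as specified."""
    if nd <= sys.float_info.dig and n <= sys.float_info.max_10_exp:
        def kind(x):
            result = float(x)
            # Converting a value outside the kind's range must fail.
            if result in (float('inf'), float('-inf')):
                raise OverflowError('value out of range for this kind')
            return result
        return kind
    raise OverflowError('no kind with %r digits, exponent range %r' % (nd, n))

double = float_kind(15, 300)
print(double(1e20))   # 1e+20
```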

    Besides their callable behavior, kind objects have attributes
    giving the traits of the kind in question.

    1. name is the name of the kind.  The standard kinds are called
       int, long, double.

    2. typecode is a single-letter string that would be appropriate
       for use with Numeric or module array to form an array of this
       kind.  The standard types' typecodes are 'i', 'O', 'd'
       respectively.

    3. Integer kinds have these additional attributes: MAX, equal to
       the maximum permissible integer of this kind, or None for the
       long kind.  MIN, equal to the most negative permissible integer
       of this kind, or None for the long kind.

    4. Float kinds have these additional attributes whose properties
       are equal to the corresponding value for the corresponding C
       type in the standard header file "float.h".  MAX, MIN, DIG,
       MANT_DIG, EPSILON, MAX_EXP, MAX_10_EXP, MIN_EXP, MIN_10_EXP,
       RADIX, ROUNDS (== FLT_RADIX, FLT_ROUNDS in float.h).  These
       values are of type integer except for MAX, MIN, and EPSILON,
       which are of the Python floating type to which the kind
       corresponds.


Attributes of Module kinds

    int_kinds is a list of the available integer kinds, sorted from lowest
              to highest kind.  By definition, int_kinds[-1] is the
              long kind.

    float_kinds is a list of the available floating point kinds, sorted
                from lowest to highest kind.

    default_int_kind is the kind object corresponding to the Python 
                     literal 0

    default_long_kind is the kind object corresponding to the Python
                      literal 0L

    default_float_kind is the kind object corresponding to the Python
                       literal 0.0


Complex Numbers

    If supported, complex numbers have real and imaginary parts that
    are floating-point numbers with the same kind.  A Python compiler
    must support a complex analog of each floating point kind it
    supports, if it supports complex numbers at all.

    If complex numbers are supported, the following are available in
    module kinds:

    complex_kind(nd, n)
        Return a callable object whose result is a complex kind that
        will hold a complex number each of whose components (.real,
        .imag) is of kind float_kind(nd, n).  The kind object will
        accept one argument that is of any integer, real, or complex
        kind, or two arguments, each integer or real.

    complex_kinds is a list of the available complex kinds, sorted 
                  from lowest to highest kind.

    default_complex_kind is the kind object corresponding to the
                         Python literal 0.0j.  The name of this kind
                         is doublecomplex, and its typecode is 'D'.
                              
    Complex kind objects have this additional attribute:

    floatkind is the kind object of the corresponding float type.


Examples

    In module myprecision.py:

        import kinds
        tinyint = kinds.int_kind(1)
        single = kinds.float_kind(6, 90)
        double = kinds.float_kind(15, 300)
        csingle = kinds.complex_kind(6, 90)
     
    In the rest of my code:

        from myprecision import tinyint, single, double, csingle  
        n = tinyint(3)
        x = double(1.e20)
        z = 1.2
        # builtin float gets you the default float kind, properties unknown
        w = x * float(x)
        # but in the following case we know w has kind "double".
        w = x * double(z)

        u = csingle(x + z * 1.0j)
        u2 = csingle(x+z, 1.0)

    Note how the entire computation can then be switched to a higher
    precision by changing the arguments in myprecision.py.

    Comment: note that you aren't promised that single != double; but
    you are promised that double(1.e20) will hold a number with 15
    decimal digits of precision and a range up to 10**300 or that the
    float_kind call will fail.


Open Issues

    No open issues have been raised at this time.


Rejection

    This PEP has been closed by the author.  The kinds module will not
    be added to the standard library.

    There was no opposition to the proposal but only mild interest in
    using it, not enough to justify adding the module to the standard
    library.  Instead, it will be made available as a separate
    distribution item at the Numerical Python site.  At the next
    release of Numerical Python, it will no longer be a part of the
    Numeric distribution.


Copyright

    This document has been placed in the public domain.



pep-0243 Module Repository Upload Mechanism

PEP: 243
Title: Module Repository Upload Mechanism
Version: $Revision$
Last-Modified: $Date$
Author: Sean Reifschneider <jafo-pep at tummy.com>
Discussions-To:  <distutils-sig at python.org>
Status: Withdrawn
Type: Standards Track
Created: 18-Mar-2001
Python-Version: 2.1
Post-History: 20-Mar-2001, 24-Mar-2001

Abstract

    For a module repository system (such as Perl's CPAN) to be
    successful, it must be as easy as possible for module authors to
    submit their work.  An obvious place for this submit to happen is
    in the Distutils tools after the distribution archive has been
    successfully created.  For example, after a module author has
    tested their software (verifying the results of "setup.py sdist"),
    they might type "setup.py sdist --submit".  This would flag
    Distutils to submit the source distribution to the archive server
    for inclusion and distribution to the mirrors.

    This PEP only deals with the mechanism for submitting the software
    distributions to the archive, and does not deal with the actual
    archive/catalog server.


Upload Process

    The upload will include the Distutils "PKG-INFO" meta-data
    information (as specified in PEP-241 [1]), the actual software
    distribution, and other optional information.  This information
    will be uploaded as a multi-part form encoded the same as a
    regular HTML file upload request.  This form is posted using
    ENCTYPE="multipart/form-data" encoding [2].

    The upload will be made to the host "www.python.org" on port
    80/tcp (POST http://www.python.org:80/pypi).  The form
    will consist of the following fields:

        distribution -- The file containing the module software (for
        example, a .tar.gz or .zip file).

        distmd5sum -- The MD5 hash of the uploaded distribution,
        encoded in ASCII representing the hexadecimal representation
        of the digest ("for byte in digest: s = s + ('%02x' %
        ord(byte))").

        pkginfo (optional) -- The file containing the distribution
        meta-data (as specified in PEP-241 [1]).  Note that if this is
        not included, the distribution file is expected to be in .tar
        format (gzip and bzip2 compression are allowed) or .zip
        format, with a "PKG-INFO" file in the top-level directory it
        extracts ("package-1.00/PKG-INFO").

        infomd5sum (required if pkginfo field is present) -- The MD5 hash
        of the uploaded meta-data, encoded in ASCII representing the
        hexadecimal representation of the digest ("for byte in digest:
        s = s + ('%02x' % ord(byte))").

        platform (optional) -- A string representing the target
        platform for this distribution.  This is only for binary
        distributions.  It is encoded as
        "<os_name>-<os_version>-<platform architecture>-<python
        version>".

        signature (optional) -- A OpenPGP-compatible signature [3] of
        the uploaded distribution as signed by the author.  This may
        be used by the cataloging system to automate acceptance of
        uploads.

        protocol_version -- A string indicating the protocol version that
        the client supports.  This document describes protocol version "1".
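    The per-byte "'%02x'" loop quoted in the digest fields above is
    Python 2 idiom; with today's hashlib the same hex encoding can be
    sketched as follows (the file path is hypothetical):

```python
import hashlib

def dist_md5_hexdigest(path):
    """Hex-encoded MD5 digest of a distribution file, matching the
    per-byte '%02x' encoding described in the field list above."""
    h = hashlib.md5()
    with open(path, 'rb') as f:
        # Read in chunks so large tarballs need not fit in memory.
        for chunk in iter(lambda: f.read(8192), b''):
            h.update(chunk)
    return h.hexdigest()

# e.g. dist_md5_hexdigest('dist/package-1.00.tar.gz')
```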


Return Data

    The status of the upload will be reported using non-standard HTTP
    ("X-*") headers.  The "X-Swalow-Status" header may have the
    following values:

        SUCCESS -- Indicates that the upload has succeeded.

        FAILURE -- The upload is, for some reason, unable to be
        processed.

        TRYAGAIN -- The server is unable to accept the upload at this
        time, but the client should try again at a later time.
        Potential causes of this are resource shortages on the server,
        administrative down-time, etc...

    Optionally, there may be a "X-Swalow-Reason" header which includes a
    human-readable string which provides more detailed information about
    the "X-Swalow-Status".

    If there is no "X-Swalow-Status" header, or it does not contain one of
    the three strings above, it should be treated as a temporary failure.

    Example:

        >>> f = urllib.urlopen('http://www.python.org:80/pypi')
        >>> s = f.headers['x-swalow-status']
        >>> s = s + ': ' + f.headers.get('x-swalow-reason', '<None>')
        >>> print s
        FAILURE: Required field "distribution" missing.


Sample Form

    The upload client must submit the page in the same form as
    Netscape Navigator version 4.76 for Linux produces when presented
    with the following form:

        <H1>Upload file</H1>
        <FORM NAME="fileupload" METHOD="POST" ACTION="pypi"
              ENCTYPE="multipart/form-data">
        <INPUT TYPE="file" NAME="distribution"><BR>
        <INPUT TYPE="text" NAME="distmd5sum"><BR>
        <INPUT TYPE="file" NAME="pkginfo"><BR>
        <INPUT TYPE="text" NAME="infomd5sum"><BR>
        <INPUT TYPE="text" NAME="platform"><BR>
        <INPUT TYPE="text" NAME="signature"><BR>
        <INPUT TYPE="hidden" NAME="protocol_version" VALUE="1"><BR>
        <INPUT TYPE="SUBMIT" VALUE="Upload">
        </FORM>


Platforms

    The following are valid os names:

        aix beos debian dos freebsd hpux mac macos mandrake netbsd
        openbsd qnx redhat solaris suse windows yellowdog

    The above include a number of different types of distributions of
    Linux.  Because of versioning issues these must be split out, and
    it is expected that when it makes sense for one system to use
    distributions made on other similar systems, the download client
    will make the distinction.

    Version is the official version string specified by the vendor for
    the particular release.  For example, "2000" and "nt" (Windows),
    "9.04" (HP-UX), "7.0" (RedHat, Mandrake).

    The following are valid architectures:

        alpha hppa ix86 powerpc sparc ultrasparc


Status

    I currently have a proof-of-concept client and server implemented.
    I plan to have the Distutils patches ready for the 2.1 release.
    Combined with Andrew's PEP-241 [1] for specifying distribution
    meta-data, I hope to have a platform which will allow us to gather
    real-world data for finalizing the catalog system for the 2.2
    release.


References

    [1] Metadata for Python Software Package, Kuchling,
        http://www.python.org/dev/peps/pep-0241/

    [2] RFC 1867, Form-based File Upload in HTML
        http://www.faqs.org/rfcs/rfc1867.html

    [3] RFC 2440, OpenPGP Message Format
        http://www.faqs.org/rfcs/rfc2440.html


Copyright

    This document has been placed in the public domain.



pep-0244 The `directive' statement

PEP: 244
Title: The `directive' statement
Version: $Revision$
Last-Modified: $Date$
Author: Martin von Löwis <martin at v.loewis.de>
Status: Rejected
Type: Standards Track
Created: 20-Mar-2001
Python-Version: 2.1
Post-History: 

Motivation

    From time to time, Python makes an incompatible change to the
    advertised semantics of core language constructs, or changes their
    accidental (implementation-dependent) behavior in some way.  While
    this is never done capriciously, and is always done with the aim
    of improving the language over the long term, over the short term
    it's contentious and disrupting.

    PEP 5, Guidelines for Language Evolution [1] suggests ways to ease
    the pain, and this PEP introduces some machinery in support of
    that.

    PEP 227, Statically Nested Scopes [2] is the first application, and
    will be used as an example here.

    When a new, potentially incompatible language feature is added,
    some modules and libraries may choose to use it, while others may
    not.  This specification introduces a syntax where a module author
    can denote whether a certain language feature is used in the
    module or not.

    In discussion of this PEP, readers commented that there are two
    kinds of "settable" language features:

    - those that are designed to eventually become the only option, at
      which time specifying use of them is not necessary anymore.  The
      features for which the syntax of PEP 236, Back to the
      __future__ [3] was proposed fall into this category.  This PEP
      supports declaring such features, and
      supports phasing out the "old" meaning of constructs whose
      semantics has changed under the new feature.  However, it
      defines no policy as to what features must be phased out
      eventually.

    - those which are designed to stay optional forever, e.g. if they
      change some default setting in the interpreter.  An example for
      such settings might be the request to always emit line-number
      instructions for a certain module; no specific flags of that
      kind are proposed in this specification.

    Since a primary goal of this PEP is to support new language
    constructs without immediately breaking old libraries, special
    care was taken not to break old libraries by introducing the new
    syntax.


Syntax

    A directive_statement is a statement of the form

        directive_statement: 'directive' NAME [atom] [';'] NEWLINE

    The name in the directive indicates the kind of the directive; it
    defines whether the optional atom can be present, and whether
    there are further syntactical or semantical restrictions to the
    atom.  In addition, depending on the name of the directive,
    certain additional syntactical or semantical restrictions may be
    placed on the directive (e.g. placement of the directive in the
    module may be restricted to the top of the module).

    In the directive_statement, 'directive' is a new
    keyword. According to [1], this keyword is initially considered as
    a keyword only when used in a directive statement, see "Backwards
    Compatibility" below.


Semantics

    A directive statement instructs the Python interpreter to process
    a source file in a different way; the specific details of that
    processing depend on the directive name.  The optional atom is
    typically interpreted when the source code is processed; details
    of that interpretation depend on the directive.


Specific Directives: transitional

    If a syntactical or semantical change is added to Python which is
    incompatible, [1] mandates a transitional evolution of the
    language, where the new feature is initially available alongside
    with the old one.  Such a transition is possible by means of the
    transitional directive.

    In a transitional directive, the NAME is 'transitional'. The atom
    MUST be present, and it MUST be a NAME.  The possible values for
    that name are defined when the language change is defined.  One
    example for such a directive is

        directive transitional nested_scopes

    The transitional directive MUST occur before any other
    statement in a module, except for the documentation string
    (i.e. it may appear as the second statement of a module only if
    the first statement is a STRING+).
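    The directive statement was never implemented, but its placement
    rule can be sketched with a hypothetical helper that scans a
    module's source (docstring handling omitted for brevity):

```python
import re

# Pattern for the proposed statement: 'directive' NAME [atom] [';'],
# restricted here to the 'transitional' directive.
TRANSITIONAL = re.compile(r'^directive\s+transitional\s+(\w+)\s*;?\s*$')

def find_transitional(source):
    """Return the feature name if the first non-blank, non-comment
    line of `source` is a transitional directive, else None -- a rough
    approximation of the 'before any other statement' rule."""
    for line in source.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith('#'):
            continue
        m = TRANSITIONAL.match(stripped)
        return m.group(1) if m else None
    return None

print(find_transitional('directive transitional nested_scopes\nx = 1\n'))
```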


Backwards Compatibility

    Introducing 'directive' as a new keyword might cause
    incompatibilities with existing code.  Following the guideline in
    [1], in the initial implementation of this specification,
    directive is a new keyword only if it was used in a valid
    directive_statement (i.e. if it appeared as the first non-string
    token in a module).


Unresolved Problems: directive as the first identifier

    Using directive in a module as

    directive = 1

    (i.e. the name directive appears as the first thing in a module)
    will treat it as a keyword, not as an identifier.  It would be possible
    to classify it as a NAME with an additional look-ahead token, but
    such look-ahead is not available in the Python tokenizer.


Questions and Answers

    Q: It looks like this PEP was written to allow definition of source
       code character sets.  Is that true?

    A: No.  Even though the directive facility can be extended to
       allow source code encodings, no specific directive is proposed.

    Q: Then why was this PEP written at all?

    A: It acts as a counter-proposal to [3], which proposes to
       overload the import statement with a new meaning.  This PEP
       allows the problem to be solved in a more general way.

    Q: But isn't mixing source encodings and language changes like
       mixing apples and oranges?

    A: Perhaps.  To address the difference, the predefined
       "transitional" directive has been defined.


References and Footnotes

    [1] PEP 5, Guidelines for Language Evolution, Prescod
        http://www.python.org/dev/peps/pep-0005/

    [2] PEP 227, Statically Nested Scopes, Hylton
        http://www.python.org/dev/peps/pep-0227/

    [3] PEP 236, Back to the __future__, Peters
        http://www.python.org/dev/peps/pep-0236/


Copyright

    This document has been placed in the public domain.



pep-0245 Python Interface Syntax

PEP: 245
Title: Python Interface Syntax
Version: $Revision$
Last-Modified: $Date$
Author: Michel Pelletier <michel at users.sourceforge.net>
Discussions-To: http://www.zope.org/Wikis/Interfaces
Status: Rejected
Type: Standards Track
Created: 11-Jan-2001
Python-Version: 2.2
Post-History: 21-Mar-2001

Rejection Notice

    I'm rejecting this PEP.  It's been five years now.  While at some
    point I expect that Python will have interfaces, it would be naive
    to expect it to resemble the syntax in this PEP.  Also, PEP 246 is
    being rejected in favor of something completely different; interfaces
    won't play a role in adaptation or whatever will replace it.  GvR.


Introduction

    This PEP describes a proposed syntax for creating interface
    objects in Python.


Overview

    In addition to thinking about adding a static type system to
    Python, the Types-SIG was also charged to devise an interface
    system for Python.  In December of 1998, Jim Fulton released a
    prototype interface system based on discussions from the SIG.
    Many of the issues and background information on this discussion
    and prototype can be found in the SIG archives [1].

    Around the end of 2000, Digital Creations began thinking about
    better component model designs for Zope[2].  Zope's future
    component model relies heavily on interface objects.  This led to
    further development of Jim's "Scarecrow" interfaces prototype.
    Starting with version 2.3, Zope comes with an Interface package as
    standard software.  Zope's Interface package is used as the
    reference implementation for this PEP.

    The syntax proposed by this PEP relies on syntax enhancements
    described in PEP 232 [3] and describes an underlying framework
    which PEP 233 [4] could be based upon.  There is some work being
    done with regard to interface objects and Proxy objects, so for
    those optional parts of this PEP you may want to see [5].


The Problem

    Interfaces are important because they solve a number of problems
    that arise while developing software:

    - There are many implied interfaces in Python, commonly referred
      to as "protocols".  Currently determining those protocols is
      based on implementation introspection, but often that also
      fails.  For example, defining __getitem__ implies both a
      sequence and a mapping (the former with sequential, integer
      keys).  There is no way for the developer to be explicit about
      which protocols the object intends to implement.

    - Python is limited, from the developer's point of view, by the
      split between types and classes.  When types are expected, the
      consumer uses code like 'type(foo) == type("")' to determine if
      'foo' is a string.  When instances of classes are expected, the
      consumer uses 'isinstance(foo, MyString)' to determine if 'foo'
      is an instance of the 'MyString' class.  There is no unified
      model for determining if an object can be used in a certain,
      valid way.

    - Python's dynamic typing is very flexible and powerful, but it
      does not have the advantage of statically typed languages that
      provide type checking.  Statically typed languages provide you
      with much more type safety, but are often overly verbose because
      objects can only be generalized by common subclassing and used
      specifically with casting (for example, in Java).
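
    The ambiguity in the first point can be shown concretely.  This
    sketch (plain illustrative Python, not part of the proposal)
    defines a class whose only protocol hint is __getitem__, so
    introspection cannot tell whether a sequence or a mapping is
    intended:

    ```python
    # A class defining only __getitem__: introspection alone cannot tell
    # whether it is meant as a sequence (integer keys) or a mapping.
    class Ambiguous:
        def __getitem__(self, key):
            return key

    obj = Ambiguous()

    # Both usages "work", so the implied protocol cannot be determined:
    sequence_style = obj[0]        # looks like sequence access
    mapping_style = obj['name']    # looks like mapping access
    ```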

    There are also a number of documentation problems that interfaces
    try to solve.

    - Developers waste a lot of time looking at the source code of
      your system to figure out how objects work.

    - Developers who are new to your system may misunderstand how your
      objects work, causing, and possibly propagating, usage errors.

    - Because a lack of interfaces means usage is inferred from the
      source, developers may end up using methods and attributes that
      are meant for "internal use only".

    - Code inspection can be hard, and very discouraging to novice
      programmers trying to properly understand code written by gurus.

    - A lot of time is wasted when many people try very hard to
      understand obscurity (like undocumented software).  Effort spent
      up front documenting interfaces will save much of this time in
      the end.

    Interfaces try to solve these problems by providing a way for you
    to specify a contractual obligation for your object, documentation
    on how to use an object, and a built-in mechanism for discovering
    the contract and the documentation.

    Python has very useful introspection features.  It is well known
    that this makes exploring concepts in the interactive interpreter
    easier, because Python gives you the ability to look at all kinds
    of information about the objects: the type, doc strings, instance
    dictionaries, base classes, unbound methods and more.

    Many of these features are oriented toward introspecting, using
    and changing the implementation of software, and one of them ("doc
    strings") is oriented toward providing documentation.  This
    proposal describes an extension to this natural introspection
    framework that describes an object's interface.


Overview of the Interface Syntax

    For the most part, the syntax of interfaces is very much like the
    syntax of classes, but future needs, or needs brought up in
    discussion, may define new possibilities for interface syntax.

    A formal BNF description of the syntax is given later in the PEP;
    for the purposes of illustration, here is an example of two
    different interfaces created with the proposed syntax:

        interface CountFishInterface:
            "Fish counting interface"

            def oneFish():
                "Increments the fish count by one"

            def twoFish():
                "Increments the fish count by two"

            def getFishCount():
                "Returns the fish count"

        interface ColorFishInterface:
            "Fish coloring interface"

            def redFish():
                "Sets the current fish color to red"

            def blueFish():
                "Sets the current fish color to blue"

            def getFishColor():
                "This returns the current fish color" 

    This code, when evaluated, will create two interfaces called
    `CountFishInterface' and `ColorFishInterface'. These interfaces
    are defined by the `interface' statement.

    The prose documentation for the interfaces and their methods comes
    from doc strings.  The method signature information comes from the
    signatures of the `def' statements.  Notice how there is no body
    for the def statements.  The interface does not implement a
    service to anything; it merely describes one.  Documentation
    strings on interfaces and interface methods are mandatory; a
    'pass' statement cannot be provided.  The interface equivalent of
    a pass statement is an empty doc string.

    You can also create interfaces that "extend" other interfaces.
    Here, you can see a new type of Interface that extends the
    CountFishInterface and ColorFishInterface:

        interface FishMarketInterface(CountFishInterface, ColorFishInterface):
            "This is the documentation for the FishMarketInterface"

            def getFishMonger():
                "Returns the fish monger you can interact with"

            def hireNewFishMonger(name):
                "Hire a new fish monger"

            def buySomeFish(quantity=1):
                "Buy some fish at the market"

    The FishMarketInterface extends the CountFishInterface and
    ColorFishInterface.


Interface Assertion

    The next step is to put classes and interfaces together by
    creating a concrete Python class that asserts that it implements
    an interface.  Here is an example FishMarket component that might
    do this:

        class FishError(Error):
            pass

        class FishMarket implements FishMarketInterface:
            number = 0
            color = None
            monger_name = 'Crusty Barnacles' 

            def __init__(self, number, color):
                self.number = number
                self.color = color

            def oneFish(self):
                self.number += 1

            def twoFish(self):
                self.number += 2

            def redFish(self):
                self.color = 'red'

            def blueFish(self):
                self.color = 'blue'

            def getFishCount(self):
                return self.number

            def getFishColor(self):
                return self.color

            def getFishMonger(self):
                return self.monger_name

            def hireNewFishMonger(self, name):
                self.monger_name = name

            def buySomeFish(self, quantity=1):
                if quantity > self.number:
                    raise FishError("There's not enough fish")
                self.number -= quantity
                return quantity

    This new class, FishMarket, defines a concrete class which
    implements the FishMarketInterface.  The object following the
    `implements' statement is called an "interface assertion".  An
    interface assertion can be either an interface object, or a tuple
    of interface assertions.

    The interface assertion provided in a `class' statement like this
    is stored in the class's `__implements__' class attribute.  After
    interpreting the above example, you would have a class that can be
    examined like this with an 'implements' built-in function:

        >>> FishMarket
        <class FishMarket at 8140f50>
        >>> FishMarket.__implements__
        (<Interface FishMarketInterface at 81006f0>,)
        >>> f = FishMarket(6, 'red')
        >>> implements(f, FishMarketInterface)
        1
        >>>

    A class can realize more than one interface.  For example, say you
    had an interface called `ItemInterface' that described how an
    object worked as an item in a container object.  If you wanted to
    assert that FishMarket instances realized the ItemInterface
      interface as well as the FishMarketInterface, you could provide
      an interface assertion containing a tuple of interface objects
      to the FishMarket class:

        class FishMarket implements FishMarketInterface, ItemInterface:
            # ...

    Interface assertions can also be used if you want to assert that
    one class implements an interface, and all of the interfaces that
    another class implements:

        class MyFishMarket implements FishMarketInterface, ItemInterface:
            # ...

        class YourFishMarket implements FooInterface, MyFishMarket.__implements__:
            # ...

    This new class, YourFishMarket, asserts that it implements the
    FooInterface, as well as the interfaces implemented by the
    MyFishMarket class.

    It's worth going into a little bit more detail about interface
    assertions.  An interface assertion is either an interface object,
    or a tuple of interface assertions.  For example:

        FooInterface

        FooInterface, (BarInterface, BobInterface)

        FooInterface, (BarInterface, (BobInterface, MyClass.__implements__))

    All of these are valid interface assertions.  When two interfaces
    define the same attributes, the order in which information is
    preferred in the assertion is from top-to-bottom, left-to-right.
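
    Since the `interface' syntax was never adopted, the nesting rule
    can be illustrated with plain tuples.  This hypothetical helper
    (not part of the reference implementation) flattens an assertion
    in the top-to-bottom, left-to-right preference order described
    above:

    ```python
    def flatten_assertion(assertion):
        """Flatten a (possibly nested) interface assertion into a flat
        list, preserving the left-to-right preference order."""
        if not isinstance(assertion, tuple):
            return [assertion]
        result = []
        for item in assertion:
            result.extend(flatten_assertion(item))
        return result

    # Strings stand in for interface objects in this sketch:
    assertion = ('FooInterface', ('BarInterface', ('BobInterface',)))
    print(flatten_assertion(assertion))
    # ['FooInterface', 'BarInterface', 'BobInterface']
    ```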

    There are other interface proposals that, in the interest of
    simplicity, have combined the notion of class and interface to
    provide simple interface enforcement.  Interface objects have a
    `deferred' method that returns a deferred class that implements
    this behavior:

        >>> FM = FishMarketInterface.deferred()
        >>> class MyFM(FM): pass

        >>> f = MyFM()
        >>> f.getFishMonger()
        Traceback (innermost last):
          File "<stdin>", line 1, in ?
        Interface.Exceptions.BrokenImplementation: 
        An object has failed to implement interface FishMarketInterface

                The getFishMonger attribute was not provided.
        >>> 

    This provides for a bit of passive interface enforcement by
    telling you what you forgot to do to implement that interface.
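
    This style of passive enforcement can be roughly approximated in
    present-day Python with the standard `abc' module (a modern
    analogue, not the PEP's mechanism; note that `abc' reports the
    failure at instantiation time rather than when the missing method
    is first called):

    ```python
    import abc

    # Rough modern analogue of `deferred()' using the standard abc module.
    class FishMarketABC(abc.ABC):
        @abc.abstractmethod
        def getFishMonger(self):
            "Returns the fish monger you can interact with"

    class MyFM(FishMarketABC):
        pass   # getFishMonger deliberately not provided

    try:
        MyFM()
        instantiation_failed = False
    except TypeError as err:
        instantiation_failed = True
        print(err)    # the message names the missing abstract method

    print(instantiation_failed)   # True
    ```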


Formal Interface Syntax

    Python syntax is defined in a modified BNF grammar notation
    described in the Python Reference Manual [8].  This section
    describes the proposed interface syntax using this grammar:

        interfacedef:   "interface" interfacename [extends] ":" suite
        extends:        "(" [expression_list] ")"
        interfacename:  identifier

    An interface definition is an executable statement.  It first
    evaluates the extends list, if present.  Each item in the extends
    list should evaluate to an interface object.

    The interface's suite is then executed in a new execution frame
    (see the Python Reference Manual, section 4.1), using a newly
    created local namespace and the original global namespace.  When
    the interface's suite finishes execution, its execution frame is
    discarded but its local namespace is saved as interface elements.
    An interface object is then created using the extends list for the
    base interfaces and the saved interface elements.  The interface
    name is bound to this interface object in the original local
    namespace.

    This PEP also proposes an extension to Python's 'class' statement:

        classdef:    "class" classname [inheritance] [implements] ":" suite
        implements:  "implements" implist
        implist:     expression-list

        classname,
        inheritance,
        suite,
        expression-list:  see the Python Reference Manual

    Before a class' suite is executed, the 'inheritance' and
    'implements' statements are evaluated, if present.  The
    'inheritance' behavior is unchanged as defined in Section 7.6 of
    the Language Reference.

    The 'implements' clause, if present, is evaluated after the
    inheritance list.
    This must evaluate to an interface specification, which is either
    an interface, or a tuple of interface specifications.  If a valid
    interface specification is present, the assertion is assigned to
    the class object's '__implements__' attribute, as a tuple.
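
    Since the `implements' clause is not real Python syntax, its
    effect on the class object can be approximated today with a class
    decorator (a hypothetical helper, not part of this proposal):

    ```python
    def implements_(*assertion):
        """Hypothetical decorator approximating the proposed 'implements'
        clause: store the interface assertion as a tuple on the class's
        __implements__ attribute."""
        def decorate(cls):
            cls.__implements__ = tuple(assertion)
            return cls
        return decorate

    # Strings stand in for interface objects in this sketch:
    @implements_('FishMarketInterface', 'ItemInterface')
    class FishMarket:
        pass

    print(FishMarket.__implements__)
    # ('FishMarketInterface', 'ItemInterface')
    ```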

    This PEP does not propose any changes to the syntax of function
    definitions or assignments.


Classes and Interfaces

    The example interfaces above do not describe any kind of behavior
    for their methods; they just describe an interface that a typical
    FishMarket object would realize.

    You may notice a similarity between interfaces extending from
    other interfaces and classes sub-classing from other classes.
    This is a similar concept.  However it is important to note that
    interfaces extend interfaces and classes subclass classes.  You
    cannot extend a class or subclass an interface.  Classes and
    interfaces are separate.

    The purpose of a class is to share the implementation of how an
    object works.  The purpose of an interface is to document how to
    work with an object, not how the object is implemented.  It is
    possible to have several different classes with very different
    implementations realize the same interface.

    It's also possible to implement one interface with many classes
    that each mix in pieces of the interface's functionality or,
    conversely, it's possible to have one class implement many
    interfaces.  Because of this, interfaces and classes should not be
    confused or intermingled.


Interface-aware built-ins

    A useful extension to Python's list of built-in functions in the
    light of interface objects would be `implements()'.  This builtin
    would expect two arguments, an object and an interface, and return
    a true value if the object implements the interface, false
    otherwise.  For example:

        >>> interface FooInterface: pass
        >>> class Foo implements FooInterface: pass
        >>> f = Foo()
        >>> implements(f, FooInterface)
        1

    Currently, this functionality exists in the reference
    implementation as functions in the `Interface' package, requiring
    an "import Interface" to use it.  Its existence as a built-in
    would be purely a convenience, not necessary for using
    interfaces, and analogous to `isinstance()' for classes.
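
    A minimal sketch of such a built-in, assuming only the
    `__implements__' convention described earlier (with nested tuples
    allowed), might look like:

    ```python
    def implements(obj, interface):
        """Return 1 if obj's class asserts the given interface in its
        __implements__ attribute (searching nested tuples), else 0."""
        def found(assertion):
            if assertion is interface:
                return True
            if isinstance(assertion, tuple):
                return any(found(item) for item in assertion)
            return False
        return 1 if found(getattr(obj.__class__, '__implements__', ())) else 0

    # A plain class stands in for an interface object in this sketch:
    class FooInterface: pass

    class Foo:
        __implements__ = (FooInterface,)

    print(implements(Foo(), FooInterface))     # 1
    print(implements(object(), FooInterface))  # 0
    ```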


Backward Compatibility

    The proposed interface model does not introduce any backward
    compatibility issues in Python.  The proposed syntax, however,
    does.

    Any existing code that uses `interface' as an identifier will
    break, and there may be other kinds of backward incompatibility
    that defining `interface' as a new keyword will introduce.  Apart
    from the new keyword, this extension to Python's syntax does not
    change any existing syntax in any backward incompatible way.

    The new `from __future__' Python syntax [6] and the new warning
    framework [7] are ideal for resolving this backward
    incompatibility.  To use interface syntax now, a developer could
    use the statement:

        from __future__ import interfaces

    In addition, any code that uses the keyword `interface' as an
    identifier will trigger a warning from Python.  After the
    appropriate period of time, the interface syntax would become
    standard, the above import statement would do nothing, and any
    identifiers named `interface' would raise an exception.  This
    period of time is proposed to be 24 months.


Summary of Proposed Changes to Python

    Add a new `interface' keyword and extend the class syntax with
    `implements'.

    Extend the class interface to include __implements__.

    Add an 'implements(obj, interface)' built-in function.


Risks

    This PEP proposes adding one new keyword to the Python language,
    `interface'.  This will break code.


Open Issues

    Goals

    Syntax

    Architecture


Dissenting Opinion

    This PEP has not yet been discussed on python-dev.
        

References

    [1] http://mail.python.org/pipermail/types-sig/1998-December/date.html

    [2] http://www.zope.org

    [3] PEP 232, Function Attributes, Warsaw
        http://www.python.org/dev/peps/pep-0232/

    [4] PEP 233, Python Online Help, Prescod
        http://www.python.org/dev/peps/pep-0233/

    [5] http://www.lemburg.com/files/python/mxProxy.html

    [6] PEP 236, Back to the __future__, Peters
        http://www.python.org/dev/peps/pep-0236/

    [7] PEP 230, Warning Framework, van Rossum
        http://www.python.org/dev/peps/pep-0230/


Copyright

    This document has been placed in the public domain.



pep-0246 Object Adaptation

PEP: 246
Title: Object Adaptation
Version: $Revision$
Last-Modified: $Date$
Author: Alex Martelli <aleaxit at gmail.com>, Clark C. Evans <cce at clarkevans.com>
Status: Rejected
Type: Standards Track
Created: 21-Mar-2001
Python-Version: 2.5
Post-History: 29-Mar-2001, 10-Jan-2005

Rejection Notice

    I'm rejecting this PEP.  Something much better is about to happen;
    it's too early to say exactly what, but it's not going to resemble
    the proposal in this PEP too closely so it's better to start a new
    PEP.  GvR.


Abstract

    This proposal puts forth an extensible cooperative mechanism for
    the adaptation of an incoming object to a context which expects an
    object supporting a specific protocol (say a specific type, class,
    or interface).

    This proposal provides a built-in "adapt" function that, for any
    object X and any protocol Y, can be used to ask the Python
    environment for a version of X compliant with Y.  Behind the
    scenes, the mechanism asks object X: "Are you now, or do you know
    how to wrap yourself to provide, a supporter of protocol Y?".
    And, if this request fails, the function then asks protocol Y:
    "Does object X support you, or do you know how to wrap it to
    obtain such a supporter?"  This duality is important, because
    protocols can be developed after objects are, or vice-versa, and
    this PEP lets either case be supported non-invasively with regard
    to the pre-existing component[s].

    Lastly, if neither the object nor the protocol know about each
    other, the mechanism may check a registry of adapter factories,
    where callables able to adapt certain objects to certain protocols
    can be registered dynamically.  This part of the proposal is
    optional: the same effect could be obtained by ensuring that
    certain kinds of protocols and/or objects can accept dynamic
    registration of adapter factories, for example via suitable custom
    metaclasses.  However, this optional part allows adaptation to be
    made more flexible and powerful in a way that is not invasive to
    either protocols or other objects, thereby gaining for adaptation
    much the same kind of advantage that Python standard library's
    "copy_reg" module offers for serialization and persistence.

    This proposal does not specifically constrain what a protocol
    _is_, what "compliance to a protocol" exactly _means_, nor what
    precisely a wrapper is supposed to do.  These omissions are
    intended to leave this proposal compatible with both existing
    categories of protocols, such as the existing system of types and
    classes, as well as the many concepts for "interfaces" as such
    which have been proposed or implemented for Python, such as the
    one in PEP 245 [1], the one in Zope3 [2], or the ones discussed in
    the BDFL's Artima blog in late 2004 and early 2005 [3].  However,
    some reflections on these subjects, intended to be suggestive and
    not normative, are also included.


Motivation

    Currently there is no standardized mechanism in Python for
    checking if an object supports a particular protocol.  Typically,
    existence of certain methods, particularly special methods such as
    __getitem__, is used as an indicator of support for a particular
    protocol.  This technique works well for a few specific protocols
    blessed by the BDFL (Benevolent Dictator for Life).  The same can
    be said for the alternative technique based on checking
    'isinstance' (the built-in class "basestring" exists specifically
    to let you use 'isinstance' to check if an object "is a [built-in]
    string").  Neither approach is easily and generally extensible to
    other protocols, defined by applications and third party
    frameworks, outside of the standard Python core.

    Even more important than checking if an object already supports a
    given protocol can be the task of obtaining a suitable adapter
    (wrapper or proxy) for the object, if the support is not already
    there.  For example, a string does not support the file protocol,
    but you can wrap it into a StringIO instance to obtain an object
    which does support that protocol and gets its data from the string
    it wraps; that way, you can pass the string (suitably wrapped) to
    subsystems which require as their arguments objects that are
    readable as files.  Unfortunately, there is currently no general,
    standardized way to automate this extremely important kind of
    "adaptation by wrapping" operation.
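
    In present-day Python the StringIO class lives in the `io'
    module; the wrapping described above looks like this:

    ```python
    import io

    data = "first line\nsecond line\n"

    # Wrap the string so that file-expecting code can read it:
    wrapped = io.StringIO(data)

    def count_lines(fileobj):
        """A subsystem that only knows the file-reading protocol."""
        return len(fileobj.readlines())

    print(count_lines(wrapped))  # 2
    ```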

    Typically, today, when you pass objects to a context expecting a
    particular protocol, either the object knows about the context and
    provides its own wrapper or the context knows about the object and
    wraps it appropriately.  The difficulty with these approaches is
    that such adaptations are one-offs: they are not centralized in a
    single place in the user's code, nor executed with a common
    technique.  This lack of standardization increases code
    duplication, with the same adapter occurring in more than one
    place, or encourages classes to be re-written instead of
    adapted.  In
    either case, maintainability suffers.

    It would be very nice to have a standard function that can be
    called upon to verify an object's compliance with a particular
    protocol and provide for a wrapper if one is readily available --
    all without having to hunt through each library's documentation
    for the incantation appropriate to that particular, specific case.


Requirements

    When considering an object's compliance with a protocol, there are
    several cases to be examined:

    a) When the protocol is a type or class, and the object has
       exactly that type or is an instance of exactly that class (not
       a subclass).  In this case, compliance is automatic.

    b) When the object knows about the protocol, and either considers
       itself compliant, or knows how to wrap itself suitably.

    c) When the protocol knows about the object, and either the object
       already complies or the protocol knows how to suitably wrap the
       object.

    d) When the protocol is a type or class, and the object is an
       instance of a subclass.  This is distinct from the first case (a)
       above, since inheritance (unfortunately) does not necessarily
       imply substitutability, and thus must be handled carefully.

    e) When the context knows about the object and the protocol and
       knows how to adapt the object so that the required protocol is
       satisfied.  This could use an adapter registry or similar
       approaches.

    The fourth case above is subtle.  A break of substitutability can
    occur when a subclass changes a method's signature, or restricts
    the domains accepted for a method's argument ("co-variance" on
    argument types), or extends the co-domain to include return
    values which the base class may never produce ("contra-variance"
    on return types).  While compliance based on class inheritance
    _should_ be automatic, this proposal allows an object to signal
    that it is not compliant with a base class protocol.

    If Python gains some standard "official" mechanism for interfaces,
    however, then the "fast-path" case (a) can and should be extended
    to the protocol being an interface, and the object an instance of
    a type or class claiming compliance with that interface.  For
    example, if the "interface" keyword discussed in [3] is adopted
    into Python, the "fast path" of case (a) could be used, since
    instantiable classes implementing an interface would not be
    allowed to break substitutability.


Specification

    This proposal introduces a new built-in function, adapt(), which
    is the basis for supporting these requirements.

    The adapt() function has three parameters:

    - `obj', the object to be adapted

    - `protocol', the protocol requested of the object

    - `alternate', an optional object to return if the object could
      not be adapted

    On success, the adapt() function returns either the passed object
    `obj', if it is already compliant with the protocol, or a
    secondary object, a `wrapper', which provides a view of the
    object compliant with the protocol.  The definition of
    wrapper is deliberately vague, and a wrapper is allowed to be a
    full object with its own state if necessary.  However, the design
    intention is that an adaptation wrapper should hold a reference to
    the original object it wraps, plus (if needed) a minimum of extra
    state which it cannot delegate to the wrapped object.

    An excellent example of an adaptation wrapper is an instance of
    StringIO which adapts an incoming string to be read as if it were
    a text file: the wrapper holds a reference to the string, but
    deals by itself with the "current point of reading" (from _where_
    in the wrapped string the characters for the next, e.g.,
    "readline" call will come), because it cannot delegate it to the
    wrapped object (a string has no concept of "current point of
    reading" nor anything else even remotely related to that concept).

    A failure to adapt the object to the protocol raises an
    AdaptationError (which is a subclass of TypeError), unless the
    alternate parameter is used, in which case the alternate argument
    is returned instead.

    To enable the first case listed in the requirements, the adapt()
    function first checks to see if the object's type or the object's
    class are identical to the protocol.  If so, then the adapt()
    function returns the object directly without further ado.

    To enable the second case, when the object knows about the
    protocol, the object must have a __conform__() method.  This
    optional method takes two arguments:

    - `self', the object being adapted

    - `protocol', the protocol requested

    Just like any other special method in today's Python, __conform__
    is meant to be taken from the object's class, not from the object
    itself (for all objects, except instances of "classic classes" as
    long as we must still support the latter).  This enables a
    possible 'tp_conform' slot to be added to Python's type objects in
    the future, if desired.

    The object may return itself as the result of __conform__ to
    indicate compliance.  Alternatively, the object also has the
    option of returning a wrapper object compliant with the protocol.
    If the object knows it is not compliant although it belongs to a
    type which is a subclass of the protocol, then __conform__ should
    raise a LiskovViolation exception (a subclass of AdaptationError).
    Finally, if the object cannot determine its compliance, it should
    return None to enable the remaining mechanisms.  If __conform__
    raises any other exception, "adapt" just propagates it.
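
    These return conventions can be illustrated with a small
    hypothetical class (the names, and the use of the float type as a
    protocol, are illustrative only):

    ```python
    class Celsius:
        """Hypothetical object demonstrating the __conform__ conventions."""
        def __init__(self, value):
            self.value = value

        def __conform__(self, protocol):
            if protocol is Celsius:
                return self                # already compliant: return self
            if protocol is float:
                return float(self.value)   # compliant view of the data
            return None                    # cannot determine compliance:
                                           # let adapt() try the remaining
                                           # mechanisms

    c = Celsius(21)
    print(c.__conform__(float))   # 21.0
    print(c.__conform__(str))     # None
    ```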

    To enable the third case, when the protocol knows about the
    object, the protocol must have an __adapt__() method.  This
    optional method takes two arguments:

    - `self', the protocol requested

    - `obj', the object being adapted

    If the protocol finds the object to be compliant, it can return
    obj directly.  Alternatively, the method may return a wrapper
    compliant with the protocol.  If the protocol knows the object is
    not compliant although it belongs to a type which is a subclass of
    the protocol, then __adapt__ should raise a LiskovViolation
    exception (a subclass of AdaptationError).  Finally, when
    compliance cannot be determined, this method should return None to
    enable the remaining mechanisms.  If __adapt__ raises any other
    exception, "adapt" just propagates it.

    The fourth case, when the object's class is a sub-class of the
    protocol, is handled by the built-in adapt() function.  Under
    normal circumstances, if "isinstance(object, protocol)" then
    adapt() returns the object directly.  However, if the object is
    not substitutable, either the __conform__() or __adapt__()
    method, as mentioned above, may raise a LiskovViolation (a
    subclass of AdaptationError) to prevent this default behavior.

    If none of the first four mechanisms worked, as a last-ditch
    attempt, 'adapt' falls back to checking a registry of adapter
    factories, indexed by the protocol and the type of `obj', to meet
    the fifth case.  Adapter factories may be dynamically registered
    and removed from that registry to provide "third party adaptation"
    of objects and protocols that have no knowledge of each other, in
    a way that is not invasive to either the object or the protocols.
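
    The five mechanisms can be put together in a simplified sketch of
    the proposed built-in (exception names as given above; the
    registry is a plain dict keyed on (type, protocol), one of
    several possibilities the PEP leaves open, and registration
    helpers are omitted):

    ```python
    class AdaptationError(TypeError):
        pass

    class LiskovViolation(AdaptationError):
        pass

    # Last-ditch registry of adapter factories: (object type, protocol) -> factory
    _adapter_registry = {}

    _MISSING = object()

    def adapt(obj, protocol, alternate=_MISSING):
        # (a) fast path: the object already has exactly the requested type.
        if type(obj) is protocol:
            return obj
        try:
            # (b) the object may know about the protocol ...
            conform = getattr(type(obj), '__conform__', None)
            if conform is not None:
                result = conform(obj, protocol)
                if result is not None:
                    return result
            # (c) ... or the protocol may know about the object.
            adapt_meth = getattr(protocol, '__adapt__', None)
            if adapt_meth is not None:
                result = adapt_meth(obj)
                if result is not None:
                    return result
        except LiskovViolation:
            pass    # explicitly non-substitutable: skip the subclass check
        else:
            # (d) instance of a subclass of the protocol: substitutable
            # by default.
            if isinstance(protocol, type) and isinstance(obj, protocol):
                return obj
        # (e) last-ditch attempt: the adapter-factory registry.
        factory = _adapter_registry.get((type(obj), protocol))
        if factory is not None:
            return factory(obj)
        if alternate is not _MISSING:
            return alternate
        raise AdaptationError("cannot adapt %r to %r" % (obj, protocol))

    # Demonstration: a protocol object that knows how to wrap strings (case c).
    import io

    class FileLikeProtocol:
        def __adapt__(self, obj):
            if isinstance(obj, str):
                return io.StringIO(obj)   # wrapper compliant with the protocol
            return None                   # compliance cannot be determined

    FILE_LIKE = FileLikeProtocol()
    print(adapt("some text\n", FILE_LIKE).readline())
    ```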


Intended Use

    The typical intended use of adapt is in code which has received
    some object X "from the outside", either as an argument or as the
    result of calling some function, and needs to use that object
    according to a certain protocol Y.  A "protocol" such as Y is
    meant to indicate an interface, usually enriched with some
    semantic constraints (such as are typically used in the "design
    by contract" approach), and often also some pragmatic
    expectations (such as "the running time of a certain operation
    should be no worse than O(N)", or the like).  This proposal does
    not specify how protocols are designed, nor how or whether
    compliance to a protocol is checked, nor what the consequences
    may be of claiming compliance without actually delivering it.
    Lack of "syntactic" compliance -- names and signatures of
    methods -- will often lead to exceptions being raised; lack of
    "semantic" compliance may lead to subtle and perhaps occasional
    errors (imagine a method claiming to be threadsafe but in fact
    subject to some subtle race condition, for example); lack of
    "pragmatic" compliance will generally lead to code that runs
    "correctly", but too slowly for practical use, or sometimes to
    exhaustion of resources such as memory or disk space.

    When protocol Y is a concrete type or class, compliance to it is
    intended to mean that an object allows all of the operations that
    could be performed on instances of Y, with "comparable" semantics
    and pragmatics.  For example, a hypothetical object X that is a
    singly-linked list should not claim compliance with protocol
    'list', even if it implements all of list's methods: the fact that
    indexing X[n] takes time O(n), while the same operation would be
    O(1) on a list, makes a difference.  On the other hand, an
    instance of StringIO.StringIO does comply with protocol 'file',
    even though some operations (such as those of module 'marshal')
    may not allow substituting one for the other because they perform
    explicit type-checks: such type-checks are "beyond the pale" from
    the point of view of protocol compliance.

    While this convention makes it feasible to use a concrete type or
    class as a protocol for purposes of this proposal, such use will
    often not be optimal.  Rarely will the code calling 'adapt' need
    ALL of the features of a certain concrete type, particularly for
    such rich types as file, list, dict; rarely can all those features
    be provided by a wrapper with good pragmatics, as well as syntax
    and semantics that are really the same as a concrete type's.

    Rather, once this proposal is accepted, a design effort needs to
    start to identify the essential characteristics of those protocols
    which are currently used in Python, particularly within the
    standard library, and to formalize them using some kind of
    "interface" construct (not necessarily requiring any new syntax: a
    simple custom metaclass would let us get started, and the results
    of the effort could later be migrated to whatever "interface"
    construct is eventually accepted into the Python language).  With
    such a palette of more formally designed protocols, the code using
    'adapt' will be able to ask for, say, adaptation into "a filelike
    object that is readable and seekable", or whatever else it
    specifically needs with some decent level of "granularity", rather
    than too-generically asking for compliance to the 'file' protocol.
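    The "simple custom metaclass" idea mentioned above can be
    sketched in a few lines (current Python 3 syntax; the names
    Interface, ReadSeekable, and the 'required' attribute are
    illustrative assumptions, not part of this proposal):

```python
import io

class Interface(type):
    # an "interface" is a class whose metaclass checks instances
    # for the presence of the required attribute names
    def __instancecheck__(cls, obj):
        return all(hasattr(obj, name) for name in cls.required)

class ReadSeekable(metaclass=Interface):
    required = ('read', 'seek')

# io.StringIO is "a filelike object that is readable and seekable"
assert isinstance(io.StringIO("x"), ReadSeekable)
assert not isinstance(42, ReadSeekable)
```

    Such a construct supplies exactly the finer "granularity"
    discussed above, without any new syntax.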

    Adaptation is NOT "casting".  When object X itself does not
    conform to protocol Y, adapting X to Y means using some kind of
    wrapper object Z, which holds a reference to X, and implements
    whatever operation Y requires, mostly by delegating to X in
    appropriate ways.  For example, if X is a string and Y is 'file',
    the proper way to adapt X to Y is to make a StringIO(X), *NOT* to
    call file(X) [which would try to open a file named by X].

    Numeric types and protocols may need to be an exception to this
    "adaptation is not casting" mantra, however.
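    The string-to-file example above can be made concrete (using the
    modern io.StringIO in place of the StringIO module; the function
    name is illustrative): the wrapper Z holds a reference to X and
    delegates to it, whereas open(X) would look for a file *named* X
    on disk.

```python
import io

def adapt_str_to_filelike(x):
    # Z wraps X; it does not convert or "cast" it
    return io.StringIO(x)

f = adapt_str_to_filelike("hello world")
assert f.read(5) == "hello"
f.seek(0)
assert f.read() == "hello world"
```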


Guido's "Optional Static Typing: Stop the Flames" Blog Entry

    A typical simple use case of adaptation would be:

        def f(X):
            X = adapt(X, Y)
            # continue by using X according to protocol Y

    In [4], the BDFL has proposed introducing the syntax:

        def f(X: Y):
            # continue by using X according to protocol Y

    to be a handy shortcut for exactly this typical use of adapt, and,
    as a basis for experimentation until the parser has been modified
    to accept this new syntax, a semantically equivalent decorator:

        @arguments(Y)
        def f(X):
            # continue by using X according to protocol Y

    These BDFL ideas are fully compatible with this proposal, as are
    Guido's other suggestions in the same blog entry.
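    A runnable sketch of the 'arguments' decorator idea follows
    (Python 3 syntax).  The adapt() used here is a deliberately
    naive stand-in -- an isinstance check with construction as a
    fallback -- not the five-step mechanism of this proposal; all
    names are illustrative.

```python
def adapt(obj, protocol):
    # naive stand-in for the real adapt(): pass through if already
    # compliant, otherwise try constructing the protocol from obj
    if isinstance(obj, protocol):
        return obj
    return protocol(obj)

def arguments(*protocols):
    def decorator(f):
        def wrapper(*args):
            # adapt each positional argument before the body runs
            adapted = [adapt(a, p) for a, p in zip(args, protocols)]
            return f(*adapted)
        return wrapper
    return decorator

@arguments(float)
def f(x):
    # continue by using x according to protocol float
    return x / 2

assert f(5) == 2.5    # the int 5 was adapted to 5.0 first
```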



Reference Implementation and Test Cases

    The following reference implementation does not deal with classic
    classes: it considers only new-style classes.  If classic classes
    need to be supported, the additions should be pretty clear, though
    a bit messy (x.__class__ vs type(x), getting bound methods directly
    from the object rather than from the type, and so on).

    -----------------------------------------------------------------
    adapt.py
    -----------------------------------------------------------------
    class AdaptationError(TypeError):
        pass
    class LiskovViolation(AdaptationError):
        pass

    _adapter_factory_registry = {}

    def registerAdapterFactory(objtype, protocol, factory):
        _adapter_factory_registry[objtype, protocol] = factory

    def unregisterAdapterFactory(objtype, protocol):
        del _adapter_factory_registry[objtype, protocol]

    def _adapt_by_registry(obj, protocol, alternate):
        factory = _adapter_factory_registry.get((type(obj), protocol))
        if factory is None:
            adapter = alternate
        else:
            adapter = factory(obj, protocol, alternate)
        if adapter is AdaptationError:
            raise AdaptationError
        else:
            return adapter


    def adapt(obj, protocol, alternate=AdaptationError):

        t = type(obj)

        # (a) first check to see if object has the exact protocol
        if t is protocol:
            return obj

        try:
            # (b) next check if t.__conform__ exists & likes protocol
            conform = getattr(t, '__conform__', None)
            if conform is not None:
                result = conform(obj, protocol)
                if result is not None:
                    return result

            # (c) then check if protocol.__adapt__ exists & likes obj
            adapt = getattr(type(protocol), '__adapt__', None)
            if adapt is not None:
                result = adapt(protocol, obj)
                if result is not None:
                    return result
        except LiskovViolation:
            pass
        else:
            # (d) check if object is instance of protocol
            if isinstance(obj, protocol):
                return obj

        # (e) last chance: try the registry
        return _adapt_by_registry(obj, protocol, alternate)

    -----------------------------------------------------------------
    test.py
    -----------------------------------------------------------------
    from adapt import AdaptationError, LiskovViolation, adapt
    from adapt import registerAdapterFactory, unregisterAdapterFactory
    import doctest

    class A(object):
        '''
        >>> a = A()
        >>> a is adapt(a, A)   # case (a)
        True
        '''

    class B(A):
        '''
        >>> b = B()
        >>> b is adapt(b, A)   # case (d)
        True
        '''

    class C(object):
        '''
        >>> c = C()
        >>> c is adapt(c, B)   # case (b)
        True
        >>> c is adapt(c, A)   # a failure case
        Traceback (most recent call last):
            ...
        AdaptationError
        '''
        def __conform__(self, protocol):
            if protocol is B:
                return self

    class D(C):
        '''
        >>> d = D()
        >>> d is adapt(d, D)   # case (a)
        True
        >>> d is adapt(d, C)   # case (d) explicitly blocked
        Traceback (most recent call last):
            ...
        AdaptationError
        '''
        def __conform__(self, protocol):
            if protocol is C:
                raise LiskovViolation

    class MetaAdaptingProtocol(type):
        def __adapt__(cls, obj):
            return cls.adapt(obj)

    class AdaptingProtocol:
        __metaclass__ = MetaAdaptingProtocol
        @classmethod
        def adapt(cls, obj):
            pass

    class E(AdaptingProtocol):
        '''
        >>> a = A()
        >>> a is adapt(a, E)   # case (c)
        True
        >>> b = B()
        >>> b is adapt(b, E)   # case (c)
        True
        >>> c = C()
        >>> c is adapt(c, E)   # a failure case
        Traceback (most recent call last):
            ...
        AdaptationError
        '''
        @classmethod
        def adapt(cls, obj):
            if isinstance(obj, A):
                return obj

    class F(object):
        pass

    def adapt_F_to_A(obj, protocol, alternate):
        if isinstance(obj, F) and issubclass(protocol, A):
            return obj
        else:
            return alternate

    def test_registry():
        '''
        >>> f = F()
        >>> f is adapt(f, A)   # a failure case
        Traceback (most recent call last):
            ...
        AdaptationError
        >>> registerAdapterFactory(F, A, adapt_F_to_A)
        >>> f is adapt(f, A)   # case (e)
        True
        >>> unregisterAdapterFactory(F, A)
        >>> f is adapt(f, A)   # a failure case again
        Traceback (most recent call last):
            ...
        AdaptationError
        >>> registerAdapterFactory(F, A, adapt_F_to_A)
        '''

    doctest.testmod()


Relationship To Microsoft's QueryInterface

    Although this proposal has some similarities to Microsoft's (COM)
    QueryInterface, it differs in a number of respects.

    First, adaptation in this proposal is bi-directional, allowing the
    interface (protocol) to be queried as well, which gives more
    dynamic abilities (more Pythonic).  Second, there is no special
    "IUnknown" interface which can be used to check or obtain the
    original unwrapped object identity, although this could be
    proposed as one of those "special" blessed interface protocol
    identifiers.  Third, with QueryInterface, once an object supports
    a particular interface it must always thereafter support that
    interface; this proposal makes no such guarantee, since, in
    particular, adapter factories can be dynamically added to the
    registry and removed again later.

    Fourth, implementations of Microsoft's QueryInterface must support
    a kind of equivalence relation -- they must be reflexive,
    symmetrical, and transitive, in specific senses.  The equivalent
    conditions for protocol adaptation according to this proposal
    would also represent desirable properties:

        # given, to start with, a successful adaptation:
        X_as_Y = adapt(X, Y)

        # reflexive:
        assert adapt(X_as_Y, Y) is X_as_Y

        # transitive:
        X_as_Z = adapt(X, Z, None)
        X_as_Y_as_Z = adapt(X_as_Y, Z, None)
        assert (X_as_Y_as_Z is None) == (X_as_Z is None)

        # symmetrical:
        X_as_Z_as_Y = adapt(X_as_Z, Y, None)
        assert (X_as_Y_as_Z is None) == (X_as_Z_as_Y is None)

    However, while these properties are desirable, it may not be
    possible to guarantee them in all cases.  QueryInterface can
    impose their equivalents because it dictates, to some extent, how
    objects, interfaces, and adapters are to be coded; this proposal
    is meant to be non-invasive, usable to "retrofit" adaptation
    between two frameworks coded in mutual ignorance of each other,
    without having to modify either framework.

    Transitivity of adaptation is in fact somewhat controversial, as
    is the relationship (if any) between adaptation and inheritance.

    The latter would not be controversial if we knew that inheritance
    always implies Liskov substitutability, which, unfortunately we
    don't.  If some special form, such as the interfaces proposed in
    [4], could indeed ensure Liskov substitutability, then for that
    kind of inheritance, only, we could perhaps assert that if X
    conforms to Y and Y inherits from Z then X conforms to Z... but
    only if substitutability was taken in a very strong sense to
    include semantics and pragmatics, which seems doubtful.  (For what
    it's worth: in QueryInterface, inheritance neither requires nor
    implies conformance.)  This proposal does not include any "strong"
    effects of inheritance, beyond the small ones specifically
    detailed above.

    Similarly, transitivity might imply multiple "internal" adaptation
    passes to get the result of adapt(X, Z) via some intermediate Y,
    intrinsically like adapt(adapt(X, Y), Z), for some suitable and
    automatically chosen Y.  Again, this may perhaps be feasible under
    suitably strong constraints, but the practical implications of
    such a scheme are still unclear to this proposal's authors.  Thus,
    this proposal does not include any automatic or implicit
    transitivity of adaptation, under any circumstances.

    For an implementation of the original version of this proposal
    which performs more advanced processing in terms of transitivity,
    and of the effects of inheritance, see Phillip J. Eby's
    PyProtocols [5].  The documentation accompanying PyProtocols is
    well worth studying for its considerations on how adapters should
    be coded and used, and on how adaptation can remove any need for
    typechecking in application code.


Questions and Answers

    Q:  What benefit does this proposal provide?

    A:  The typical Python programmer is an integrator, someone who is
        connecting components from various suppliers.  Often, to
        interface between these components, one needs intermediate
        adapters.  Usually the burden falls upon the programmer to
        study the interface exposed by one component and required by
        another, determine if they are directly compatible, or develop
        an adapter.  Sometimes a supplier may even include the
        appropriate adapter, but even then searching for the adapter
        and figuring out how to deploy the adapter takes time.

        This technique enables suppliers to work with each other
        directly, by implementing __conform__ or __adapt__ as
        necessary.  This frees the integrator from making their own
        adapters.  In essence, this allows the components to have a
        simple dialogue among themselves.  The integrator simply
        connects one component to another; if the types don't match
        automatically, an adapting mechanism is built in.

        Moreover, thanks to the adapter registry, a "fourth party" may
        supply adapters to allow interoperation of frameworks which
        are totally unaware of each other, non-invasively, and without
        requiring the integrator to do anything more than install the
        appropriate adapter factories in the registry at start-up.

        As long as libraries and frameworks cooperate with the
        adaptation infrastructure proposed here (essentially by
        defining and using protocols appropriately, and calling
        'adapt' as needed on arguments received and results of
        call-back factory functions), the integrator's work thereby
        becomes much simpler.

        For example, consider SAX1 and SAX2 interfaces: there is an
        adapter required to switch between them.  Normally, the
        programmer must be aware of this; however, with this
        adaptation proposal in place, this is no longer the case --
        indeed, thanks to the adapter registry, this need may be
        removed even if the framework supplying SAX1 and the one
        requiring SAX2 are unaware of each other.


    Q:  Why does this have to be built-in, can't it be standalone?

    A:  Yes, it does work standalone.  However, if it is built-in, it
        has a greater chance of usage.  The value of this proposal is
        primarily in standardization: having libraries and frameworks
        coming from different suppliers, including the Python standard
        library, use a single approach to adaptation.  Furthermore:

        0.  The mechanism is by its very nature a singleton.

        1.  If used frequently, it will be much faster as a built-in.

        2.  It is extensible and unassuming.

        3.  Once 'adapt' is built-in, it can support syntax extensions
            and even be of some help to a type inference system.


    Q:  Why the verbs __conform__ and __adapt__?

    A:  conform, verb intransitive
            1. To correspond in form or character; be similar.
            2. To act or be in accord or agreement; comply.
            3. To act in accordance with current customs or modes.

        adapt, verb transitive
            1. To make suitable to or fit for a specific use or
               situation.

        Source:  The American Heritage Dictionary of the English
                 Language, Third Edition


Backwards Compatibility

    There should be no problem with backwards compatibility unless
    someone had used the special names __conform__ or __adapt__ in
    other ways, but this seems unlikely, and, in any case, user code
    should never use special names for non-standard purposes.

    This proposal could be implemented and tested without changes to
    the interpreter.


Credits

    This proposal was created in large part by the feedback of the
    talented individuals on the main Python mailing lists and the
    type-sig list.  To name specific contributors (with apologies if
    we missed anyone!), besides the proposal's authors: the main
    suggestions for the proposal's first versions came from Paul
    Prescod, with significant feedback from Robin Thomas, and we also
    borrowed ideas from Marcin 'Qrczak' Kowalczyk and Carlos Ribeiro.

    Other contributors (via comments) include Michel Pelletier, Jeremy
    Hylton, Aahz Maruch, Fredrik Lundh, Rainer Deyke, Timothy Delaney,
    and Huaiyu Zhu.  The current version owes a lot to discussions
    with (among others) Phillip J. Eby, Guido van Rossum, Bruce Eckel,
    Jim Fulton, and Ka-Ping Yee, and to study of, and reflection on,
    their proposals, implementations, and documentation about use and
    adaptation of interfaces and protocols in Python.


References and Footnotes

    [1] PEP 245, Python Interface Syntax, Pelletier
        http://www.python.org/dev/peps/pep-0245/

    [2] http://www.zope.org/Wikis/Interfaces/FrontPage

    [3] http://www.artima.com/weblogs/index.jsp?blogger=guido

    [4] http://www.artima.com/weblogs/viewpost.jsp?thread=87182

    [5] http://peak.telecommunity.com/PyProtocols.html


Copyright

    This document has been placed in the public domain.



pep-0247 API for Cryptographic Hash Functions

PEP: 247
Title: API for Cryptographic Hash Functions
Version: $Revision$
Last-Modified: $Date$
Author: A.M. Kuchling <amk at amk.ca>
Status: Final
Type: Informational
Created: 23-Mar-2001
Post-History: 20-Sep-2001

Abstract

    There are several different modules available that implement
    cryptographic hashing algorithms such as MD5 or SHA.  This
    document specifies a standard API for such algorithms, to make it
    easier to switch between different implementations.


Specification

    All hashing modules should present the same interface.  Additional
    methods or variables can be added, but those described in this
    document should always be present.

    Hash function modules define one function:

    new([string])            (unkeyed hashes)
    new([key], [string])     (keyed hashes)

        Create a new hashing object and return it.  The first form is
        for hashes that are unkeyed, such as MD5 or SHA.  For keyed
        hashes such as HMAC, 'key' is a required parameter containing
        a string giving the key to use.  In both cases, the optional
        'string' parameter, if supplied, will be immediately hashed
        into the object's starting state, as if obj.update(string)
        had been called.
        
        After creating a hashing object, arbitrary strings can be fed
        into the object using its update() method, and the hash value
        can be obtained at any time by calling the object's digest()
        method.

        Arbitrary additional keyword arguments can be added to this
        function, but if they're not supplied, sensible default values
        should be used.  For example, 'rounds' and 'digest_size'
        keywords could be added for a hash function which supports a
        variable number of rounds and several different output sizes,
        and they should default to values believed to be secure.
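    The keyed form of new() can be sketched on top of the standard
    hmac module (MD5 as the underlying digest is an assumption here,
    chosen only to keep digest_size at 16):

```python
import hmac
import hashlib

digest_size = 16    # HMAC-MD5 produces a 16-byte digest

def new(key, string=b''):
    # keyed-hash constructor: 'key' is required, 'string' is
    # immediately hashed into the starting state if supplied
    return hmac.new(key, string, hashlib.md5)

h = new(b'secret')
h.update(b'payload')
assert h.digest_size == 16
# streaming and one-shot hashing agree
assert h.hexdigest() == new(b'secret', b'payload').hexdigest()
```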

    Hash function modules define one variable:

    digest_size

        An integer value; the size of the digest produced by the
        hashing objects created by this module, measured in bytes.
        You could also obtain this value by creating a sample object
        and accessing its 'digest_size' attribute, but it can be
        convenient to have this value available from the module.
        Hashes with a variable output size will set this variable to
        None.

    Hashing objects require a single attribute:

    digest_size

        This attribute is identical to the module-level digest_size
        variable: the size, in bytes, of the digest produced by the
        hashing object.  If the hash has a variable output size, this
        output size must be chosen when the hashing object is created,
        and this attribute must contain the selected size.  Therefore
        None is *not* a legal value for this attribute.
                

    Hashing objects require the following methods:

    copy()

        Return a separate copy of this hashing object.  An update to
        this copy won't affect the original object.

    digest()

        Return the hash value of this hashing object as a string
        containing 8-bit data.  The object is not altered in any way
        by this function; you can continue updating the object after
        calling this function.

    hexdigest()

        Return the hash value of this hashing object as a string
        containing hexadecimal digits.  Lowercase letters should be used 
        for the digits 'a' through 'f'.  Like the .digest() method, this
        method mustn't alter the object.
        
    update(string)

        Hash 'string' into the current state of the hashing object.
        update() can be called any number of times during a hashing
        object's lifetime.

    Hashing modules can define additional module-level functions or 
    object methods and still be compliant with this specification.
    
    Here's an example, using a module named 'MD5':

        >>> from Crypto.Hash import MD5
        >>> m = MD5.new()
        >>> m.digest_size
        16
        >>> m.update('abc')
        >>> m.digest()
        '\x90\x01P\x98<\xd2O\xb0\xd6\x96?}(\xe1\x7fr'    
        >>> m.hexdigest()
        '900150983cd24fb0d6963f7d28e17f72' 
        >>> MD5.new('abc').digest()
        '\x90\x01P\x98<\xd2O\xb0\xd6\x96?}(\xe1\x7fr'    
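    A module conforming to this API can be sketched today on top of
    the standard hashlib (the wrapper class and its internals are
    assumptions; only 'new' and 'digest_size' come from the spec):

```python
import hashlib

digest_size = 16    # MD5 produces 16-byte digests

class _MD5Hash:
    digest_size = 16

    def __init__(self, string=b''):
        self._h = hashlib.md5(string)

    def update(self, string):
        self._h.update(string)

    def digest(self):
        return self._h.digest()

    def hexdigest(self):
        return self._h.hexdigest()

    def copy(self):
        # independent copy: updating it won't affect the original
        c = _MD5Hash()
        c._h = self._h.copy()
        return c

def new(string=b''):
    return _MD5Hash(string)

m = new()
m.update(b'abc')
assert m.hexdigest() == '900150983cd24fb0d6963f7d28e17f72'
assert new(b'abc').digest() == m.digest()
```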


Rationale

    The digest size is measured in bytes, not bits, even though hash
    algorithm sizes are usually quoted in bits; MD5 is a 128-bit
    algorithm and not a 16-byte one, for example.  This is because, in
    the sample code I looked at, the length in bytes is often needed
    (to seek ahead or behind in a file; to compute the length of an
    output string) while the length in bits is rarely used.
    Therefore, the burden will fall on the few people actually needing
    the size in bits, who will have to multiply digest_size by 8.

    It's been suggested that the update() method would be better named
    append().  However, that method is really causing the current
    state of the hashing object to be updated, and update() is already
    used by the md5 and sha modules included with Python, so it seems
    simplest to leave the name update() alone.

    The order of the constructor's arguments for keyed hashes was a
    sticky issue.  It wasn't clear whether the key should come first
    or second.  It's a required parameter, and the usual convention is
    to place required parameters first, but that also means that the
    'string' parameter moves from the first position to the second.
    It would be possible to get confused and pass a single argument to
    a keyed hash, thinking that you're passing an initial string to an 
    unkeyed hash, but it doesn't seem worth making the interface 
    for keyed hashes more obscure to avoid this potential error.


Changes

    2001-09-17: Renamed clear() to reset(); added digest_size attribute
    to objects; added .hexdigest() method.
    2001-09-20: Removed reset() method completely.
    2001-09-28: Set digest_size to None for variable-size hashes.


Acknowledgements

    Thanks to Aahz, Andrew Archibald, Rich Salz, Itamar
    Shtull-Trauring, and the readers of the python-crypto list for
    their comments on this PEP.


Copyright

    This document has been placed in the public domain.



pep-0248 Python Database API Specification v1.0

PEP: 248
Title: Python Database API Specification v1.0
Version: $Revision$
Last-Modified: $Date$
Author: Marc-AndrĂŠ Lemburg <mal at lemburg.com>
Discussions-To:  <db-sig at python.org>
Status: Final
Type: Informational
Created: 
Post-History: 
Superseded-By: 249

Introduction

    This API has been defined to encourage similarity between the
    Python modules that are used to access databases.  By doing this,
    we hope to achieve a consistency leading to more easily understood
    modules, code that is generally more portable across databases,
    and a broader reach of database connectivity from Python.
 
    This interface specification consists of several items:

        * Module Interface
        * Connection Objects
        * Cursor Objects
        * DBI Helper Objects
 
    Comments and questions about this specification may be directed to
    the SIG on Tabular Databases in Python
    (http://www.python.org/sigs/db-sig).

    This specification document was last updated on: April 9, 1996.
    It will be known as Version 1.0 of this specification.


Module Interface

    The database interface modules should typically be named with
    something terminated by 'db'.  Existing examples are: 'oracledb',
    'informixdb', and 'pg95db'.  These modules should export several
    names:
 
        modulename(connection_string)

            Constructor for creating a connection to the database.
            Returns a Connection Object.
 
        error
        
            Exception raised for errors from the database module.


Connection Objects

    Connection Objects should respond to the following methods:
 
        close()

            Close the connection now (rather than whenever __del__ is
            called).  The connection will be unusable from this point
            forward; an exception will be raised if any operation is
            attempted with the connection.
 
        commit()

            Commit any pending transaction to the database.
 
        rollback()

            Roll the database back to the start of any pending
            transaction.
 
        cursor()

            Return a new Cursor Object.  An exception may be thrown if
            the database does not support a cursor concept.
 
        callproc([params])

            (Note: this method is not well-defined yet.)  Call a
            stored database procedure with the given (optional)
            parameters.  Returns the result of the stored procedure.
 
        (all Cursor Object attributes and methods)

            For databases that do not have cursors and for simple
            applications that do not require the complexity of a
            cursor, a Connection Object should respond to each of the
            attributes and methods of the Cursor Object.  Databases
            that have cursors can implement this by using an implicit,
            internal cursor.

 

Cursor Objects

    These objects represent a database cursor, which is used to manage
    the context of a fetch operation.
 
    Cursor Objects should respond to the following methods and
    attributes:
 
        arraysize

            This read/write attribute specifies the number of rows to
            fetch at a time with fetchmany().  This value is also used
            when inserting multiple rows at a time (passing a
            tuple/list of tuples/lists as the params value to
            execute()).  This attribute will default to a single row.
 
            Note that the arraysize is optional and is merely provided
            for higher performance database interactions.
            Implementations should observe it with respect to the
            fetchmany() method, but are free to interact with the
            database a single row at a time.
 
        description

            This read-only attribute is a tuple of 7-tuples.  Each
            7-tuple contains information describing each result
            column: (name, type_code, display_size, internal_size,
            precision, scale, null_ok). This attribute will be None
            for operations that do not return rows or if the cursor
            has not had an operation invoked via the execute() method
            yet.
 
            The 'type_code' is one of the 'dbi' values specified in
            the section below.
 
            Note: this is a bit in flux. Generally, the first two
            items of the 7-tuple will always be present; the others
            may be database specific.
 
        close()

            Close the cursor now (rather than whenever __del__ is
            called).  The cursor will be unusable from this point
            forward; an exception will be raised if any operation is
            attempted with the cursor.
 
        execute(operation [,params])

            Execute (prepare) a database operation (query or command).
            Parameters may be provided (as a sequence
            (e.g. tuple/list)) and will be bound to variables in the
            operation.  Variables are specified in a database-specific
            notation that is based on the index in the parameter tuple
            (position-based rather than name-based).
 
            The parameters may also be specified as a sequence of
            sequences (e.g. a list of tuples) to insert multiple rows
            in a single operation.
 
            A reference to the operation will be retained by the
            cursor.  If the same operation object is passed in again,
            then the cursor can optimize its behavior.  This is most
            effective for algorithms where the same operation is used,
            but different parameters are bound to it (many times).
 
            For maximum efficiency when reusing an operation, it is
            best to use the setinputsizes() method to specify the
            parameter types and sizes ahead of time.  It is legal for
            a parameter to not match the predefined information; the
            implementation should compensate, possibly with a loss of
            efficiency.
 
            Using SQL terminology, these are the possible result
            values from the execute() method:

                If the statement is DDL (e.g. CREATE TABLE), then 1 is
                returned.

                If the statement is DML (e.g. UPDATE or INSERT), then the
                number of rows affected is returned (0 or a positive
                integer).

                If the statement is DQL (e.g. SELECT), None is returned,
                indicating that the statement is not really complete until
                you use one of the 'fetch' methods.

        fetchone()

            Fetch the next row of a query result, returning a single
            tuple.

        fetchmany([size])

            Fetch the next set of rows of a query result, returning
            them as a list of tuples.  An empty list is returned when
            no more rows are available.  The number of rows to fetch
            is specified by the parameter.  If it is None, then the
            cursor's arraysize determines the number of rows to be
            fetched.
 
            Note there are performance considerations involved with
            the size parameter.  For optimal performance, it is
            usually best to use the arraysize attribute.  If the size
            parameter is used, then it is best for it to retain the
            same value from one fetchmany() call to the next.
 
        fetchall()

            Fetch all rows of a query result, returning as a list of
            tuples.  Note that the cursor's arraysize attribute can
            affect the performance of this operation.
 
        setinputsizes(sizes)

            (Note: this method is not well-defined yet.)  This can be
            used before a call to 'execute()' to predefine memory
            areas for the operation's parameters.  sizes is specified
            as a tuple -- one item for each input parameter.  The item
            should be a Type object that corresponds to the input that
            will be used, or it should be an integer specifying the
            maximum length of a string parameter.  If the item is
            'None', then no predefined memory area will be reserved
            for that column (this is useful to avoid predefined areas
            for large inputs).
 
            This method would be used before the execute() method is
            invoked.
 
            Note that this method is optional and is merely provided
            for higher performance database interaction.
            Implementations are free to do nothing and users are free
            to not use it.
 
        setoutputsize(size [,col])

            (Note: this method is not well-defined yet.)

            Set a column buffer size for fetches of large columns
            (e.g. LONG).  The column is specified as an index into the
            result tuple.  Using a column of None will set the default
            size for all large columns in the cursor.
 
            This method would be used before the 'execute()' method is
            invoked.
 
            Note that this method is optional and is merely provided
            for higher performance database interaction.
            Implementations are free to do nothing and users are free
            to not use it.
 

DBI Helper Objects

    Many databases need to have the input in a particular format for
    binding to an operation's input parameters.  For example, if an
    input is destined for a DATE column, then it must be bound to the
    database in a particular string format.  Similar problems exist
    for "Row ID" columns or large binary items (e.g. blobs or RAW
    columns).  This presents problems for Python since the parameters
    to the 'execute()' method are untyped.  When the database module
    sees a Python string object, it doesn't know if it should be bound
    as a simple CHAR column, as a raw binary item, or as a DATE.
 
    To overcome this problem, the 'dbi' module was created.  This
    module specifies some basic database interface types for working
    with databases.  There are two classes: 'dbiDate' and 'dbiRaw'.
    These are simple container classes that wrap up a value.  When
    passed to the database modules, the module can then detect that
    the input parameter is intended as a DATE or a RAW.  For symmetry,
    the database modules will return DATE and RAW columns as instances
    of these classes.
 
    A Cursor Object's 'description' attribute returns information
            about each of the result columns of a query.  The 'type_code' is
    defined to be one of five types exported by this module: 'STRING',
    'RAW', 'NUMBER', 'DATE', or 'ROWID'.
 
    The module exports the following names:
 
        dbiDate(value)

            This function constructs a 'dbiDate' instance that holds a
            date value.  The value should be specified as an integer
            number of seconds since the "epoch" (e.g. time.time()).
 
        dbiRaw(value)

            This function constructs a 'dbiRaw' instance that holds a
            raw (binary) value.  The value should be specified as a
            Python string.

        STRING

            This object is used to describe columns in a database that
            are string-based (e.g. CHAR).
 
        RAW

            This object is used to describe (large) binary columns in
            a database (e.g. LONG RAW, blobs).
 
        NUMBER

            This object is used to describe numeric columns in a
            database.
 
        DATE

            This object is used to describe date columns in a
            database.
 
        ROWID

            This object is used to describe the "Row ID" column in a
            database.

Acknowledgements

    Many thanks go to Andrew Kuchling who converted the Python
    Database API Specification 1.0 from the original HTML format into
    the PEP format.


Copyright

    This document has been placed in the Public Domain.



pep-0249 Python Database API Specification v2.0

PEP:249
Title:Python Database API Specification v2.0
Version:$Revision$
Last-Modified:$Date$
Author:mal at lemburg.com (Marc-André Lemburg)
Discussions-To:db-sig at python.org
Status:Final
Type:Informational
Content-Type:text/x-rst
Created:
Post-History:
Replaces:248

Introduction

This API has been defined to encourage similarity between the Python modules that are used to access databases. By doing this, we hope to achieve a consistency leading to more easily understood modules, code that is generally more portable across databases, and a broader reach of database connectivity from Python.

Comments and questions about this specification may be directed to the SIG for Database Interfacing with Python.

For more information on database interfacing with Python and available packages see the Database Topic Guide.

This document describes the Python Database API Specification 2.0 and a set of common optional extensions. The previous version, 1.0, is still available as a reference in PEP 248. Package writers are encouraged to use this version of the specification as the basis for new interfaces.

Module Interface

Constructors

Access to the database is made available through connection objects. The module must provide the following constructor for these:

connect( parameters... )

Constructor for creating a connection to the database.

Returns a Connection Object. It takes a number of parameters which are database dependent. [1]

Globals

These module globals must be defined:

apilevel

String constant stating the supported DB API level.

Currently only the strings "1.0" and "2.0" are allowed. If not given, a DB-API 1.0 level interface should be assumed.

threadsafety

Integer constant stating the level of thread safety the interface supports. Possible values are:

threadsafety Meaning
0 Threads may not share the module.
1 Threads may share the module, but not connections.
2 Threads may share the module and connections.
3 Threads may share the module, connections and cursors.

Sharing in the above context means that two threads may use a resource without wrapping it using a mutex semaphore to implement resource locking. Note that you cannot always make external resources thread safe by managing access using a mutex: the resource may rely on global variables or other external sources that are beyond your control.
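These globals can be inspected at run time. As a minimal illustration, the standard library's sqlite3 module (which implements this specification) reports its level and thread safety:

```python
import sqlite3

# Before sharing connections between threads, a program can consult
# the module globals of any DB-API 2.0 module.
print(sqlite3.apilevel)                       # '2.0'
print(sqlite3.threadsafety in (0, 1, 2, 3))   # one of the defined levels
```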

paramstyle

String constant stating the type of parameter marker formatting expected by the interface. Possible values are [2]:

paramstyle Meaning
qmark Question mark style, e.g. ...WHERE name=?
numeric Numeric, positional style, e.g. ...WHERE name=:1
named Named style, e.g. ...WHERE name=:name
format ANSI C printf format codes, e.g. ...WHERE name=%s
pyformat Python extended format codes, e.g. ...WHERE name=%(name)s
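A program can branch on paramstyle to build portable parameter bindings. For example, sqlite3 declares the qmark style but also accepts named markers, so both forms below bind values without any string formatting:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE people (name TEXT, age INTEGER)")

# qmark style: positional '?' markers bound from a sequence
cur.execute("INSERT INTO people VALUES (?, ?)", ("Alice", 30))

# named style: ':name' markers bound from a mapping
cur.execute("SELECT age FROM people WHERE name = :who", {"who": "Alice"})
row = cur.fetchone()
print(sqlite3.paramstyle, row)    # qmark (30,)
con.close()
```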

Exceptions

The module should make all error information available through these exceptions or subclasses thereof:

Warning
Exception raised for important warnings like data truncations while inserting, etc. It must be a subclass of the Python StandardError (defined in the module exceptions).
Error
Exception that is the base class of all other error exceptions. You can use this to catch all errors with one single except statement. Warnings are not considered errors and thus should not use this class as base. It must be a subclass of the Python StandardError (defined in the module exceptions).
InterfaceError
Exception raised for errors that are related to the database interface rather than the database itself. It must be a subclass of Error.
DatabaseError
Exception raised for errors that are related to the database. It must be a subclass of Error.
DataError
Exception raised for errors that are due to problems with the processed data like division by zero, numeric value out of range, etc. It must be a subclass of DatabaseError.
OperationalError
Exception raised for errors that are related to the database's operation and not necessarily under the control of the programmer, e.g. an unexpected disconnect occurs, the data source name is not found, a transaction could not be processed, a memory allocation error occurred during processing, etc. It must be a subclass of DatabaseError.
IntegrityError
Exception raised when the relational integrity of the database is affected, e.g. a foreign key check fails. It must be a subclass of DatabaseError.
InternalError
Exception raised when the database encounters an internal error, e.g. the cursor is not valid anymore, the transaction is out of sync, etc. It must be a subclass of DatabaseError.
ProgrammingError
Exception raised for programming errors, e.g. table not found or already exists, syntax error in the SQL statement, wrong number of parameters specified, etc. It must be a subclass of DatabaseError.
NotSupportedError
Exception raised in case a method or database API was used which is not supported by the database, e.g. requesting a .rollback() on a connection that does not support transaction or has transactions turned off. It must be a subclass of DatabaseError.

This is the exception inheritance layout:

StandardError
|__Warning
|__Error
   |__InterfaceError
   |__DatabaseError
      |__DataError
      |__OperationalError
      |__IntegrityError
      |__InternalError
      |__ProgrammingError
      |__NotSupportedError

Note

The values of these exceptions are not defined. They should give the user a fairly good idea of what went wrong, though.

Connection Objects

Connection objects should respond to the following methods.

Connection methods

.close()

Close the connection now (rather than whenever .__del__() is called).

The connection will be unusable from this point forward; an Error (or subclass) exception will be raised if any operation is attempted with the connection. The same applies to all cursor objects trying to use the connection. Note that closing a connection without committing the changes first will cause an implicit rollback to be performed.

.commit()

Commit any pending transaction to the database.

Note that if the database supports an auto-commit feature, this must be initially off. An interface method may be provided to turn it back on.

Database modules that do not support transactions should implement this method with void functionality.

.rollback()

This method is optional since not all databases provide transaction support. [3]

In case a database does provide transactions this method causes the database to roll back to the start of any pending transaction. Closing a connection without committing the changes first will cause an implicit rollback to be performed.

.cursor()

Return a new Cursor Object using the connection.

If the database does not provide a direct cursor concept, the module will have to emulate cursors using other means to the extent needed by this specification. [4]
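A minimal sketch of the connection methods above, using sqlite3 (whose auto-commit is initially off, as the spec requires): uncommitted changes are discarded by .rollback(), committed ones survive.

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE accounts (name TEXT, balance INTEGER)")
cur.execute("INSERT INTO accounts VALUES ('alice', 100)")
con.commit()                       # make the INSERT permanent

cur.execute("UPDATE accounts SET balance = 0 WHERE name = 'alice'")
con.rollback()                     # discard the pending UPDATE

cur.execute("SELECT balance FROM accounts WHERE name = 'alice'")
balance = cur.fetchone()[0]
print(balance)                     # 100 -- the UPDATE was rolled back
con.close()
```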

Cursor Objects

These objects represent a database cursor, which is used to manage the context of a fetch operation. Cursors created from the same connection are not isolated, i.e., any changes done to the database by a cursor are immediately visible to the other cursors. Cursors created from different connections may or may not be isolated, depending on how the transaction support is implemented (see also the connection's .rollback() and .commit() methods).

Cursor Objects should respond to the following methods and attributes.

Cursor attributes

.description

This read-only attribute is a sequence of 7-item sequences.

Each of these sequences contains information describing one result column:

  • name
  • type_code
  • display_size
  • internal_size
  • precision
  • scale
  • null_ok

The first two items (name and type_code) are mandatory, the other five are optional and are set to None if no meaningful values can be provided.

This attribute will be None for operations that do not return rows or if the cursor has not had an operation invoked via the .execute*() method yet.

The type_code can be interpreted by comparing it to the Type Objects specified in the section below.
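As an illustration, sqlite3 supplies the mandatory name field for each result column and leaves the six optional items set to None:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE t (id INTEGER, label TEXT)")
cur.execute("SELECT id, label FROM t")

# Each entry in .description is a 7-item sequence; item 0 is the name.
names = [col[0] for col in cur.description]
print(names)                       # ['id', 'label']
con.close()
```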

.rowcount

This read-only attribute specifies the number of rows that the last .execute*() produced (for DQL statements like SELECT) or affected (for DML statements like UPDATE or INSERT). [9]

The attribute is -1 in case no .execute*() has been performed on the cursor or the rowcount of the last operation cannot be determined by the interface. [7]

Note

Future versions of the DB API specification could redefine the latter case to have the object return None instead of -1.

Cursor methods

.callproc( procname [, parameters ] )

(This method is optional since not all databases provide stored procedures. [3])

Call a stored database procedure with the given name. The sequence of parameters must contain one entry for each argument that the procedure expects. The result of the call is returned as a modified copy of the input sequence. Input parameters are left untouched; output and input/output parameters are replaced with possibly new values.

The procedure may also provide a result set as output. This must then be made available through the standard .fetch*() methods.

.close()

Close the cursor now (rather than whenever __del__ is called).

The cursor will be unusable from this point forward; an Error (or subclass) exception will be raised if any operation is attempted with the cursor.

.execute(operation [, parameters])

Prepare and execute a database operation (query or command).

Parameters may be provided as sequence or mapping and will be bound to variables in the operation. Variables are specified in a database-specific notation (see the module's paramstyle attribute for details). [5]

A reference to the operation will be retained by the cursor. If the same operation object is passed in again, then the cursor can optimize its behavior. This is most effective for algorithms where the same operation is used, but different parameters are bound to it (many times).

For maximum efficiency when reusing an operation, it is best to use the .setinputsizes() method to specify the parameter types and sizes ahead of time. It is legal for a parameter to not match the predefined information; the implementation should compensate, possibly with a loss of efficiency.

The parameters may also be specified as list of tuples to e.g. insert multiple rows in a single operation, but this kind of usage is deprecated: .executemany() should be used instead.

Return values are not defined.

.executemany( operation, seq_of_parameters )

Prepare a database operation (query or command) and then execute it against all parameter sequences or mappings found in the sequence seq_of_parameters.

Modules are free to implement this method using multiple calls to the .execute() method or by using array operations to have the database process the sequence as a whole in one call.

Use of this method for an operation which produces one or more result sets constitutes undefined behavior, and the implementation is permitted (but not required) to raise an exception when it detects that a result set has been created by an invocation of the operation.

The same comments as for .execute() also apply accordingly to this method.

Return values are not defined.
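A short sketch with sqlite3: one operation, prepared once, is run against a sequence of parameter tuples, and .rowcount reflects the total rows affected.

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE points (x INTEGER, y INTEGER)")

# One INSERT operation executed for each parameter tuple in the list.
cur.executemany("INSERT INTO points VALUES (?, ?)",
                [(1, 2), (3, 4), (5, 6)])
count = cur.rowcount
print(count)                       # 3 rows affected in total
con.close()
```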

.fetchone()

Fetch the next row of a query result set, returning a single sequence, or None when no more data is available. [6]

An Error (or subclass) exception is raised if the previous call to .execute*() did not produce any result set or no call was issued yet.

.fetchmany([size=cursor.arraysize])

Fetch the next set of rows of a query result, returning a sequence of sequences (e.g. a list of tuples). An empty sequence is returned when no more rows are available.

The number of rows to fetch per call is specified by the parameter. If it is not given, the cursor's arraysize determines the number of rows to be fetched. The method should try to fetch as many rows as indicated by the size parameter. If this is not possible due to the specified number of rows not being available, fewer rows may be returned.

An Error (or subclass) exception is raised if the previous call to .execute*() did not produce any result set or no call was issued yet.

Note there are performance considerations involved with the size parameter. For optimal performance, it is usually best to use the .arraysize attribute. If the size parameter is used, then it is best for it to retain the same value from one .fetchmany() call to the next.

.fetchall()

Fetch all (remaining) rows of a query result, returning them as a sequence of sequences (e.g. a list of tuples). Note that the cursor's arraysize attribute can affect the performance of this operation.

An Error (or subclass) exception is raised if the previous call to .execute*() did not produce any result set or no call was issued yet.
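The three fetch methods compose naturally; a common batching pattern with sqlite3 loops on .fetchmany() with a fixed size until the empty sequence signals exhaustion:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE n (v INTEGER)")
cur.executemany("INSERT INTO n VALUES (?)", [(i,) for i in range(5)])

cur.execute("SELECT v FROM n ORDER BY v")
batches = []
while True:
    rows = cur.fetchmany(2)        # at most two rows per call
    if not rows:                   # empty sequence: result set exhausted
        break
    batches.append(rows)
print(batches)                     # [[(0,), (1,)], [(2,), (3,)], [(4,)]]
con.close()
```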

.nextset()

(This method is optional since not all databases support multiple result sets. [3])

This method will make the cursor skip to the next available set, discarding any remaining rows from the current set.

If there are no more sets, the method returns None. Otherwise, it returns a true value and subsequent calls to the .fetch*() methods will return rows from the next result set.

An Error (or subclass) exception is raised if the previous call to .execute*() did not produce any result set or no call was issued yet.

.arraysize

This read/write attribute specifies the number of rows to fetch at a time with .fetchmany(). It defaults to 1 meaning to fetch a single row at a time.

Implementations must observe this value with respect to the .fetchmany() method, but are free to interact with the database a single row at a time. It may also be used in the implementation of .executemany().

.setinputsizes(sizes)

This can be used before a call to .execute*() to predefine memory areas for the operation's parameters.

sizes is specified as a sequence — one item for each input parameter. The item should be a Type Object that corresponds to the input that will be used, or it should be an integer specifying the maximum length of a string parameter. If the item is None, then no predefined memory area will be reserved for that column (this is useful to avoid predefined areas for large inputs).

This method would be used before the .execute*() method is invoked.

Implementations are free to have this method do nothing and users are free to not use it.

.setoutputsize(size [, column])

Set a column buffer size for fetches of large columns (e.g. LONGs, BLOBs, etc.). The column is specified as an index into the result sequence. Not specifying the column will set the default size for all large columns in the cursor.

This method would be used before the .execute*() method is invoked.

Implementations are free to have this method do nothing and users are free to not use it.

Type Objects and Constructors

Many databases need to have the input in a particular format for binding to an operation's input parameters. For example, if an input is destined for a DATE column, then it must be bound to the database in a particular string format. Similar problems exist for "Row ID" columns or large binary items (e.g. blobs or RAW columns). This presents problems for Python since the parameters to the .execute*() method are untyped. When the database module sees a Python string object, it doesn't know if it should be bound as a simple CHAR column, as a raw BINARY item, or as a DATE.

To overcome this problem, a module must provide the constructors defined below to create objects that can hold special values. When passed to the cursor methods, the module can then detect the proper type of the input parameter and bind it accordingly.

A Cursor Object's description attribute returns information about each of the result columns of a query. The type_code must compare equal to one of Type Objects defined below. Type Objects may be equal to more than one type code (e.g. DATETIME could be equal to the type codes for date, time and timestamp columns; see the Implementation Hints below for details).

The module exports the following constructors and singletons:

Date(year, month, day)
This function constructs an object holding a date value.
Time(hour, minute, second)
This function constructs an object holding a time value.
Timestamp(year, month, day, hour, minute, second)
This function constructs an object holding a time stamp value.
DateFromTicks(ticks)
This function constructs an object holding a date value from the given ticks value (number of seconds since the epoch; see the documentation of the standard Python time module for details).
TimeFromTicks(ticks)
This function constructs an object holding a time value from the given ticks value (number of seconds since the epoch; see the documentation of the standard Python time module for details).
TimestampFromTicks(ticks)
This function constructs an object holding a time stamp value from the given ticks value (number of seconds since the epoch; see the documentation of the standard Python time module for details).
Binary(string)
This function constructs an object capable of holding a binary (long) string value.
STRING type
This type object is used to describe columns in a database that are string-based (e.g. CHAR).
BINARY type
This type object is used to describe (long) binary columns in a database (e.g. LONG, RAW, BLOBs).
NUMBER type
This type object is used to describe numeric columns in a database.
DATETIME type
This type object is used to describe date/time columns in a database.
ROWID type
This type object is used to describe the "Row ID" column in a database.

SQL NULL values are represented by the Python None singleton on input and output.

Note

Usage of Unix ticks for database interfacing can cause troubles because of the limited date range they cover.
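As an example, sqlite3 exposes the required constructors (Binary wraps a byte string; Date, Time and Timestamp build date/time values), and SQL NULL travels as the Python None singleton in both directions:

```python
import sqlite3

blob = sqlite3.Binary(b"\x00\x01")
print(str(sqlite3.Date(2003, 5, 3)))   # '2003-05-03'

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE files (name TEXT, data BLOB)")
# A NULL name, plus a binary value wrapped by the Binary constructor.
cur.execute("INSERT INTO files VALUES (?, ?)", (None, blob))
cur.execute("SELECT name, data FROM files")
row = cur.fetchone()
print(row)                             # (None, b'\x00\x01')
con.close()
```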

Implementation Hints for Module Authors

  • Date/time objects can be implemented as Python datetime module objects (available since Python 2.3, with a C API since 2.4) or using the mxDateTime package (available for all Python versions since 1.5.2). They both provide all necessary constructors and methods at Python and C level.

  • Here is a sample implementation of the Unix ticks based constructors for date/time delegating work to the generic constructors:

    import time
    
    def DateFromTicks(ticks):
        return Date(*time.localtime(ticks)[:3])
    
    def TimeFromTicks(ticks):
        return Time(*time.localtime(ticks)[3:6])
    
    def TimestampFromTicks(ticks):
        return Timestamp(*time.localtime(ticks)[:6])
    
  • The preferred object type for Binary objects are the buffer types available in standard Python starting with version 1.5.2. Please see the Python documentation for details. For information about the C interface have a look at Include/bufferobject.h and Objects/bufferobject.c in the Python source distribution.

  • This Python class allows implementing the above type objects even though the description type code field yields multiple values for one type object:

    class DBAPITypeObject:
        def __init__(self,*values):
            self.values = values
        def __cmp__(self,other):
            # Compare equal (return 0) to any of the wrapped type
            # codes; the ordering reported otherwise is arbitrary.
            if other in self.values:
                return 0
            if other < self.values:
                return 1
            else:
                return -1
    

    The resulting type object compares equal to all values passed to the constructor.

  • Here is a snippet of Python code that implements the exception hierarchy defined above:

    import exceptions
    
    class Error(exceptions.StandardError):
        pass
    
    class Warning(exceptions.StandardError):
        pass
    
    class InterfaceError(Error):
        pass
    
    class DatabaseError(Error):
        pass
    
    class InternalError(DatabaseError):
        pass
    
    class OperationalError(DatabaseError):
        pass
    
    class ProgrammingError(DatabaseError):
        pass
    
    class IntegrityError(DatabaseError):
        pass
    
    class DataError(DatabaseError):
        pass
    
    class NotSupportedError(DatabaseError):
        pass
    

    In C you can use the PyErr_NewException(fullname, base, NULL) API to create the exception objects.
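The __cmp__ protocol used in the DBAPITypeObject snippet above is Python 2 only; a Python 3 rendition of the same idea replaces it with __eq__. The type codes here are hypothetical, standing in for whatever the backend reports:

```python
class DBAPITypeObject:
    """Compares equal to every type code passed to the constructor."""
    def __init__(self, *values):
        self.values = values
    def __eq__(self, other):
        return other in self.values

# Hypothetical backend whose date, time and timestamp columns all
# map to a single DATETIME type object.
DATETIME = DBAPITypeObject("date", "time", "timestamp")
print(DATETIME == "timestamp")     # True
print(DATETIME == "integer")       # False
```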

Optional DB API Extensions

During the lifetime of DB API 2.0, module authors have often extended their implementations beyond what is required by this DB API specification. To enhance compatibility and to provide a clean upgrade path to possible future versions of the specification, this section defines a set of common extensions to the core DB API 2.0 specification.

As with all DB API optional features, the database module authors are free to not implement these additional attributes and methods (using them will then result in an AttributeError) or to raise a NotSupportedError in case the availability can only be checked at run-time.

It has been proposed to make usage of these extensions optionally visible to the programmer by issuing Python warnings through the Python warning framework. To make this feature useful, the warning messages must be standardized in order to be able to mask them. These standard messages are referred to below as Warning Message.

Cursor.rownumber

This read-only attribute should provide the current 0-based index of the cursor in the result set or None if the index cannot be determined.

The index can be seen as index of the cursor in a sequence (the result set). The next fetch operation will fetch the row indexed by .rownumber in that sequence.

Warning Message: "DB-API extension cursor.rownumber used"

Connection.Error, Connection.ProgrammingError, etc.

All exception classes defined by the DB API standard should be exposed on the Connection objects as attributes (in addition to being available at module scope).

These attributes simplify error handling in multi-connection environments.

Warning Message: "DB-API extension connection.<exception> used"

Cursor.connection

This read-only attribute returns a reference to the Connection object on which the cursor was created.

The attribute simplifies writing polymorphic code in multi-connection environments.

Warning Message: "DB-API extension cursor.connection used"

Cursor.scroll(value [, mode='relative' ])

Scroll the cursor in the result set to a new position according to mode.

If mode is relative (default), value is taken as offset to the current position in the result set, if set to absolute, value states an absolute target position.

An IndexError should be raised in case a scroll operation would leave the result set. In this case, the cursor position is left undefined (ideal would be to not move the cursor at all).

Note

This method should use native scrollable cursors, if available, or revert to an emulation for forward-only scrollable cursors. The method may raise NotSupportedError to signal that a specific operation is not supported by the database (e.g. backward scrolling).

Warning Message: "DB-API extension cursor.scroll() used"

Cursor.messages

This is a Python list object to which the interface appends tuples (exception class, exception value) for all messages which the interfaces receives from the underlying database for this cursor.

The list is cleared automatically by all standard cursor method calls (prior to executing the call), except for the .fetch*() calls, to avoid excessive memory usage. It can also be cleared by executing del cursor.messages[:].

All error and warning messages generated by the database are placed into this list, so checking the list allows the user to verify correct operation of the method calls.

The aim of this attribute is to eliminate the need for a Warning exception which often causes problems (some warnings really only have informational character).

Warning Message: "DB-API extension cursor.messages used"

Connection.messages

Same as Cursor.messages except that the messages in the list are connection oriented.

The list is cleared automatically by all standard connection method calls (prior to executing the call) to avoid excessive memory usage and can also be cleared by executing del connection.messages[:].

Warning Message: "DB-API extension connection.messages used"

Cursor.next()

Return the next row from the currently executing SQL statement using the same semantics as .fetchone(). A StopIteration exception is raised when the result set is exhausted for Python versions 2.2 and later. Previous versions don't have the StopIteration exception and so the method should raise an IndexError instead.

Warning Message: "DB-API extension cursor.next() used"

Cursor.__iter__()

Return self to make cursors compatible to the iteration protocol [8].

Warning Message: "DB-API extension cursor.__iter__() used"
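Together, .next() / .__iter__() let a cursor be consumed directly in a for loop. sqlite3 implements this extension:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE t (v INTEGER)")
cur.executemany("INSERT INT" "O t VALUES (?)", [(1,), (2,), (3,)])

cur.execute("SELECT v FROM t ORDER BY v")
total = 0
for (v,) in cur:                   # each iteration acts like .fetchone()
    total += v
print(total)                       # 6
con.close()
```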

Cursor.lastrowid

This read-only attribute provides the rowid of the last modified row (most databases return a rowid only when a single INSERT operation is performed). If the operation does not set a rowid or if the database does not support rowids, this attribute should be set to None.

The semantics of .lastrowid are undefined in case the last executed statement modified more than one row, e.g. when using INSERT with .executemany().

Warning Message: "DB-API extension cursor.lastrowid used"
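For instance, sqlite3 sets .lastrowid after a single-row INSERT:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, v TEXT)")
cur.execute("INSERT INTO t (v) VALUES ('a')")
rowid = cur.lastrowid
print(rowid)                       # 1 -- rowid of the single INSERT
con.close()
```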

Optional Error Handling Extensions

The core DB API specification only introduces a set of exceptions which can be raised to report errors to the user. In some cases, exceptions may be too disruptive for the flow of a program or even render execution impossible.

For these cases, and in order to simplify error handling when dealing with databases, database module authors may choose to implement user-definable error handlers. This section describes a standard way of defining these error handlers.

Connection.errorhandler, Cursor.errorhandler

Read/write attribute which references an error handler to call in case an error condition is met.

The handler must be a Python callable taking the following arguments:

errorhandler(connection, cursor, errorclass, errorvalue)

where connection is a reference to the connection on which the cursor operates, cursor is a reference to the cursor (or None in case the error does not apply to a cursor), and errorclass is an error class to instantiate using errorvalue as the construction argument.

The standard error handler should add the error information to the appropriate .messages attribute (Connection.messages or Cursor.messages) and raise the exception defined by the given errorclass and errorvalue parameters.

If no .errorhandler is set (the attribute is None), the standard error handling scheme as outlined above should be applied.

Warning Message: "DB-API extension .errorhandler used"

Cursors should inherit the .errorhandler setting from their connection objects at cursor creation time.
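A sketch of the standard scheme described above, with a stand-in cursor class; the handler name and the _FakeCursor class are illustrative, not part of any real module:

```python
class _FakeCursor:
    """Minimal stand-in exposing only the .messages attribute."""
    def __init__(self):
        self.messages = []

def errorhandler(connection, cursor, errorclass, errorvalue):
    # Append to the appropriate .messages list, then raise the
    # exception built from errorclass and errorvalue.
    target = cursor if cursor is not None else connection
    target.messages.append((errorclass, errorvalue))
    raise errorclass(errorvalue)

cur = _FakeCursor()
caught = False
try:
    errorhandler(None, cur, ValueError, "bad value")
except ValueError:
    caught = True
print(caught, cur.messages[0][0] is ValueError)   # True True
```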

Optional Two-Phase Commit Extensions

Many databases have support for two-phase commit (TPC) which allows managing transactions across multiple database connections and other resources.

If a database backend provides support for two-phase commit and the database module author wishes to expose this support, the following API should be implemented. NotSupportedError should be raised if the database backend support for two-phase commit can only be checked at run-time.

TPC Transaction IDs

As many databases follow the XA specification, transaction IDs are formed from three components:

  • a format ID
  • a global transaction ID
  • a branch qualifier

For a particular global transaction, the first two components should be the same for all resources. Each resource in the global transaction should be assigned a different branch qualifier.

The various components must satisfy the following criteria:

  • format ID: a non-negative 32-bit integer.
  • global transaction ID and branch qualifier: byte strings no longer than 64 characters.

Transaction IDs are created with the .xid() Connection method:

.xid(format_id, global_transaction_id, branch_qualifier)

Returns a transaction ID object suitable for passing to the .tpc_*() methods of this connection.

If the database connection does not support TPC, a NotSupportedError is raised.

The type of the object returned by .xid() is not defined, but it must provide sequence behaviour, allowing access to the three components. A conforming database module could choose to represent transaction IDs with tuples rather than a custom object.
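A minimal sketch of a tuple-based .xid(), with the validation implied by the criteria above (the function body is illustrative, not mandated by the specification):

```python
def xid(format_id, global_transaction_id, branch_qualifier):
    # Enforce the XA criteria: a non-negative 32-bit format ID and
    # component strings of at most 64 bytes each.
    if not (0 <= format_id < 2 ** 32):
        raise ValueError("format_id must be a non-negative 32-bit integer")
    for part in (global_transaction_id, branch_qualifier):
        if len(part) > 64:
            raise ValueError("transaction ID components are limited to 64 bytes")
    # A plain tuple already provides the required sequence behaviour.
    return (format_id, global_transaction_id, branch_qualifier)

example = xid(0, b"gtrid", b"bqual")
```

Because the result is an ordinary sequence, callers can unpack the three components directly.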

TPC Connection Methods

.tpc_begin(xid)

Begins a TPC transaction with the given transaction ID xid.

This method should be called outside of a transaction (i.e. nothing may have executed since the last .commit() or .rollback()).

Furthermore, it is an error to call .commit() or .rollback() within the TPC transaction. A ProgrammingError is raised if the application calls .commit() or .rollback() during an active TPC transaction.

If the database connection does not support TPC, a NotSupportedError is raised.

.tpc_prepare()

Performs the first phase of a transaction started with .tpc_begin(). A ProgrammingError should be raised if this method is called outside of a TPC transaction.

After calling .tpc_prepare(), no statements can be executed until .tpc_commit() or .tpc_rollback() have been called.

.tpc_commit([ xid ])

When called with no arguments, .tpc_commit() commits a TPC transaction previously prepared with .tpc_prepare().

If .tpc_commit() is called prior to .tpc_prepare(), a single phase commit is performed. A transaction manager may choose to do this if only a single resource is participating in the global transaction.

When called with a transaction ID xid, the database commits the given transaction. If an invalid transaction ID is provided, a ProgrammingError will be raised. This form should be called outside of a transaction, and is intended for use in recovery.

On return, the TPC transaction is ended.

.tpc_rollback([ xid ])

When called with no arguments, .tpc_rollback() rolls back a TPC transaction. It may be called before or after .tpc_prepare().

When called with a transaction ID xid, it rolls back the given transaction. If an invalid transaction ID is provided, a ProgrammingError is raised. This form should be called outside of a transaction, and is intended for use in recovery.

On return, the TPC transaction is ended.

.tpc_recover()

Returns a list of pending transaction IDs suitable for use with .tpc_commit(xid) or .tpc_rollback(xid).

If the database does not support transaction recovery, it may return an empty list or raise NotSupportedError.
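The expected call sequence can be illustrated with a toy in-memory connection; this is only a sketch of the required state transitions, not a real driver:

```python
class ProgrammingError(Exception):
    pass

class ToyTPCConnection:
    """Illustrates the TPC state machine only; does no real database work."""

    def __init__(self):
        self._xid = None          # active TPC transaction, if any
        self._prepared = False
        self.committed = []       # records committed transaction IDs

    def xid(self, format_id, gtrid, bqual):
        return (format_id, gtrid, bqual)

    def tpc_begin(self, xid):
        if self._xid is not None:
            raise ProgrammingError("a TPC transaction is already active")
        self._xid = xid

    def tpc_prepare(self):
        if self._xid is None:
            raise ProgrammingError("tpc_prepare() outside a TPC transaction")
        self._prepared = True     # phase one is done

    def tpc_commit(self, xid=None):
        if xid is not None:
            # Recovery form: commit a previously prepared transaction.
            self.committed.append(xid)
            return
        if self._xid is None:
            raise ProgrammingError("no active TPC transaction")
        # One-phase commit is allowed when tpc_prepare() was never called.
        self.committed.append(self._xid)
        self._xid, self._prepared = None, False

conn = ToyTPCConnection()
x = conn.xid(1, b"transaction-1", b"branch-a")
conn.tpc_begin(x)
conn.tpc_prepare()
conn.tpc_commit()
```

In a real global transaction, a transaction manager would run .tpc_prepare() on every participating connection before issuing any .tpc_commit().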

Frequently Asked Questions

The database SIG often sees recurring questions about the DB API specification. This section covers some of the issues people sometimes have with the specification.

Question:

How can I construct a dictionary out of the tuples returned by .fetch*()?

Answer:

There are several existing tools available which provide helpers for this task. Most of them use the column names defined in the cursor attribute .description as the basis for the keys in the row dictionary.

Note that the reason for not extending the DB API specification to also support dictionary return values for the .fetch*() methods is that this approach has several drawbacks:

  • Some databases don't support case-sensitive column names or auto-convert them to all lowercase or all uppercase characters.
  • Columns in the result set which are generated by the query (e.g. using SQL functions) don't map to table column names and databases usually generate names for these columns in a very database specific way.

As a result, accessing the columns through dictionary keys varies between databases and makes writing portable code impossible.
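One way such a helper can be written, shown here with the standard library's sqlite3 module (a DB API 2.0 interface):

```python
import sqlite3

def fetchall_dicts(cursor):
    # The first item of each .description entry is the column name.
    names = [d[0] for d in cursor.description]
    return [dict(zip(names, row)) for row in cursor.fetchall()]

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE person (id INTEGER, name TEXT)")
cur.execute("INSERT INTO person VALUES (1, 'guido')")
cur.execute("SELECT id, name FROM person")
rows = fetchall_dicts(cur)
```

The caveats above still apply: the keys depend on how the particular database reports column names, so code like this is not fully portable across modules.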

Major Changes from Version 1.0 to Version 2.0

The Python Database API 2.0 introduces a few major changes compared to the 1.0 version. Because some of these changes will cause existing DB API 1.0 based scripts to break, the major version number was adjusted to reflect this change.

These are the most important changes from 1.0 to 2.0:

  • The need for a separate dbi module was dropped and the functionality merged into the module interface itself.
  • New constructors and Type Objects were added for date/time values, the RAW Type Object was renamed to BINARY. The resulting set should cover all basic data types commonly found in modern SQL databases.
  • New constants (apilevel, threadsafety, paramstyle) and methods (.executemany(), .nextset()) were added to provide better database bindings.
  • The semantics of .callproc() needed to call stored procedures are now clearly defined.
  • The definition of the .execute() return value changed. Previously, the return value was based on the SQL statement type (which was hard to implement right) — it is undefined now; use the more flexible .rowcount attribute instead. Modules are free to return the old style return values, but these are no longer mandated by the specification and should be considered database interface dependent.
  • Class based exceptions were incorporated into the specification. Module implementors are free to extend the exception layout defined in this specification by subclassing the defined exception classes.

Post-publishing additions to the DB API 2.0 specification:

  • Additional optional DB API extensions to the set of core functionality were specified.

Open Issues

Although the version 2.0 specification clarifies a lot of questions that were left open in the 1.0 version, there are still some remaining issues which should be addressed in future versions:

  • Define a useful return value for .nextset() for the case where a new result set is available.
  • Integrate the decimal module Decimal object for use as loss-less monetary and decimal interchange format.

Footnotes

[1]

As a guideline, the connection constructor parameters should be implemented as keyword parameters for more intuitive use and follow this order of parameters:

Parameter   Meaning
dsn         Data source name as string
user        User name as string (optional)
password    Password as string (optional)
host        Hostname (optional)
database    Database name (optional)

E.g. a connect could look like this:

connect(dsn='myhost:MYDB', user='guido', password='234$')
[2] Module implementors should prefer numeric, named or pyformat over the other formats because these offer more clarity and flexibility.
[3]

If the database does not support the functionality required by the method, the interface should raise an exception in case the method is used.

The preferred approach is to not implement the method and thus have Python generate an AttributeError in case the method is requested. This allows the programmer to check for database capabilities using the standard hasattr() function.

For some dynamically configured interfaces it may not be appropriate to require dynamically making the method available. These interfaces should then raise a NotSupportedError, when the method is invoked, to indicate that the requested operation is not supported.
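For example, sqlite3 omits the optional two-phase commit methods entirely, so the hasattr() capability check works as described:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Core methods are always present on a conforming connection object.
core_ok = hasattr(conn, "commit") and hasattr(conn, "rollback")

# Optional extensions may simply not exist; probe before using them.
has_tpc = hasattr(conn, "tpc_begin")
```

An application can then branch on has_tpc rather than catching AttributeError at call time.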

[4] A database interface may choose to support named cursors by allowing a string argument to the method. This feature is not part of the specification, since it complicates semantics of the .fetch*() methods.
[5]

The module will use the __getitem__ method of the parameters object to map either positions (integers) or names (strings) to parameter values. This allows for both sequences and mappings to be used as input.

The term bound refers to the process of binding an input value to a database execution buffer. In practical terms, this means that the input value is directly used as a value in the operation. The client should not be required to "escape" the value so that it can be used — the value should be equal to the actual database value.
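The standard library's sqlite3 module illustrates both input forms: a sequence for its qmark style and a mapping for its named style:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE t (x INTEGER)")

# Sequence input: positions map via __getitem__(0), __getitem__(1), ...
cur.execute("INSERT INTO t VALUES (?)", (1,))

# Mapping input: names map via __getitem__('x')
cur.execute("INSERT INTO t VALUES (:x)", {"x": 2})

cur.execute("SELECT SUM(x) FROM t")
total = cur.fetchone()[0]
```

In both cases the values are bound by the module, so the application never needs to escape them itself.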

[6] Note that the interface may implement row fetching using arrays and other optimizations. It is not guaranteed that a call to this method will only move the associated cursor forward by one row.
[7] The rowcount attribute may be coded in a way that updates its value dynamically. This can be useful for databases that return usable rowcount values only after the first call to a .fetch*() method.
[8] Implementation Note: Python C extensions will have to implement the tp_iter slot on the cursor object instead of the .__iter__() method.
[9] The term number of affected rows generally refers to the number of rows deleted, updated or inserted by the last statement run on the database cursor. Most databases will return the total number of rows that were found by the corresponding WHERE clause of the statement. Some databases use a different interpretation for UPDATEs and only return the number of rows that were changed by the UPDATE, even though the WHERE clause of the statement may have found more matching rows. Database module authors should try to implement the more common interpretation of returning the total number of rows found by the WHERE clause, or clearly document a different interpretation of the .rowcount attribute.

Acknowledgements

Many thanks go to Andrew Kuchling who converted the Python Database API Specification 2.0 from the original HTML format into the PEP format.

Many thanks to James Henstridge for leading the discussion which led to the standardization of the two-phase commit API extensions.

Many thanks to Daniele Varrazzo for converting the specification from text PEP format to ReST PEP format, which allows linking to various parts.

pep-0250 Using site-packages on Windows

PEP: 250
Title: Using site-packages on Windows
Version: $Revision$
Last-Modified: $Date$
Author: Paul Moore <p.f.moore at gmail.com>
Status: Final
Type: Standards Track
Created: 30-Mar-2001
Python-Version: 2.2
Post-History: 30-Mar-2001

Abstract

    The standard Python distribution includes a directory
    Lib/site-packages, which is used on Unix platforms to hold
    locally-installed modules and packages.  The site.py module
    distributed with Python includes support for locating other
    modules in the site-packages directory.

    This PEP proposes that the site-packages directory should be used
    on the Windows platform in a similar manner.


Motivation

    On Windows platforms, the default setting for sys.path does not
    include a directory suitable for users to install locally
    developed modules.  The "expected" location appears to be the
    directory containing the Python executable itself.  This is also
    the location where distutils (and distutils-generated installers)
    installs packages.  Including locally developed code in the same
    directory as installed executables is not good practice.

    Clearly, users can manipulate sys.path, either in a locally
    modified site.py, or in a suitable sitecustomize.py, or even via
    .pth files.  However, there should be a standard location for such
    files, rather than relying on every individual site having to set
    their own policy.

    In addition, with distutils becoming more prevalent as a means of
    distributing modules, the need for a standard install location for
    distributed modules will become more common.  It would be better
    to define such a standard now, rather than later when more
    distutils-based packages exist which will need rebuilding.

    It is relevant to note that prior to Python 2.1, the site-packages
    directory was not included in sys.path for Macintosh platforms.
    This was changed in 2.1, so that the Macintosh now includes
    site-packages in sys.path, leaving Windows as the only major
    platform with no site-specific modules directory.


Implementation

    The implementation of this feature is fairly trivial.  All that
    would be required is a change to site.py, to change the section
    setting sitedirs.  The Python 2.1 version has

        if os.sep == '/':
            sitedirs = [makepath(prefix,
                                 "lib",
                                 "python" + sys.version[:3],
                                 "site-packages"),
                        makepath(prefix, "lib", "site-python")]
        elif os.sep == ':':
            sitedirs = [makepath(prefix, "lib", "site-packages")]
        else:
            sitedirs = [prefix]

    A suitable change would be to simply replace the last 4 lines with

        else:
            sitedirs = [prefix, makepath(prefix, "lib", "site-packages")]

    Changes would also be required to distutils, to reflect this change
    in policy. A patch is available on Sourceforge, patch ID 445744,
    which implements this change.  Note that the patch checks the Python
    version and only invokes the new behaviour for Python versions from
    2.2 onwards. This is to ensure that distutils remains compatible
    with earlier versions of Python.

    Finally, the executable code which implements the Windows installer
    used by the bdist_wininst command will need changing to use the new
    location.  A separate patch is available for this, currently
    maintained by Thomas Heller.


Notes

    - This change does not preclude packages using the current
      location -- the change only adds a directory to sys.path, it
      does not remove anything.

    - Both the current location (sys.prefix) and the new directory
      (site-packages) are included in sitedirs, so that .pth files
      will be recognised in either location.

    - This proposal adds a single additional site-packages directory
      to sitedirs. On Unix platforms, two directories are added, one
      for version-independent files (Python code) and one for
      version-dependent code (C extensions). This is necessary on
      Unix, as the sitedirs include a common (across Python versions)
      package location, in /usr/local by default. As there is no such
      common location available on Windows, there is also no need for
      having two separate package directories.

    - If users want to keep DLLs in a single location on Windows, rather
      than keeping them in the package directory, the DLLs subdirectory
      of the Python install directory is already available for that
      purpose. Adding an extra directory solely for DLLs should not be
      necessary.


Open Issues

    - Comments from Unix users indicate that there may be issues with
      the current setup on the Unix platform.  Rather than become
      involved in cross-platform issues, this PEP specifically limits
      itself to the Windows platform, leaving changes for other platforms
      to be covered in other PEPs.

    - There could be issues with applications which embed Python. To the
      author's knowledge, there should be no problem as a result of this
      change. There have been no comments (supportive or otherwise) from
      users who embed Python.


Copyright

    This document has been placed in the public domain.



pep-0251 Python 2.2 Release Schedule

PEP: 251
Title: Python 2.2 Release Schedule
Version: $Revision$
Last-Modified: $Date$
Author: Barry Warsaw <barry at python.org>, Guido van Rossum <guido at python.org>
Status: Final
Type: Informational
Created: 17-Apr-2001
Python-Version: 2.2
Post-History: 14-Aug-2001

Abstract

    This document describes the Python 2.2 development and release
    schedule.  The schedule primarily concerns itself with PEP-sized
    items.  Small bug fixes and changes will occur up until the first
    beta release.

    The schedule below represents the actual release dates of Python
    2.2.  Note that any subsequent maintenance releases of Python 2.2
    should be covered by separate PEPs.


Release Schedule

    The dates below are the actual release dates.  Note that the
    schedule slipped compared to the one posted around the release of
    2.2a1.

    21-Dec-2001: 2.2   [Released] (final release)
    14-Dec-2001: 2.2c1 [Released]
    14-Nov-2001: 2.2b2 [Released]
    19-Oct-2001: 2.2b1 [Released]
    28-Sep-2001: 2.2a4 [Released]
     7-Sep-2001: 2.2a3 [Released]
    22-Aug-2001: 2.2a2 [Released]
    18-Jul-2001: 2.2a1 [Released]


Release Manager

    Barry Warsaw was the Python 2.2 release manager.


Release Mechanics

    We experimented with a new mechanism for releases: a week before
    every alpha, beta or other release, we forked off a branch which
    became the release.  Changes to the branch are limited to the
    release manager and his designated 'bots.  This experiment was
    deemed a success and should be observed for future releases.  See
    PEP 101 for the actual release mechanics[1].


New features for Python 2.2

    The following new features are introduced in Python 2.2.  For a
    more detailed account, see Misc/NEWS[2] in the Python
    distribution, or Andrew Kuchling's "What's New in Python 2.2"
    document[3].

    - iterators (PEP 234)
    - generators (PEP 255)
    - unifying long ints and plain ints (PEP 237)
    - division (PEP 238)
    - unification of types and classes (PEP 252, PEP 253)


References

    [1] PEP 101, Doing Python Releases 101
        http://www.python.org/dev/peps/pep-0101/

    [2] Misc/NEWS file from CVS
        http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/python/python/dist/src/Misc/NEWS?rev=1.337.2.4&content-type=text/vnd.viewcvs-markup

    [3] Andrew Kuchling, What's New in Python 2.2
        http://www.python.org/doc/2.2.1/whatsnew/whatsnew22.html


Copyright

    This document has been placed in the public domain.



pep-0252 Making Types Look More Like Classes

PEP: 252
Title: Making Types Look More Like Classes
Version: $Revision$
Last-Modified: $Date$
Author: Guido van Rossum <guido at python.org>
Status: Final
Type: Standards Track
Created: 19-Apr-2001
Python-Version: 2.2
Post-History: 

Abstract

    This PEP proposes changes to the introspection API for types that
    makes them look more like classes, and their instances more like
    class instances.  For example, type(x) will be equivalent to
    x.__class__ for most built-in types.  When C is x.__class__,
    x.meth(a) will generally be equivalent to C.meth(x, a), and
    C.__dict__ contains x's methods and other attributes.

    This PEP also introduces a new approach to specifying attributes,
    using attribute descriptors, or descriptors for short.
    Descriptors unify and generalize several different common
    mechanisms used for describing attributes: a descriptor can
    describe a method, a typed field in the object structure, or a
    generalized attribute represented by getter and setter functions.

    Based on the generalized descriptor API, this PEP also introduces
    a way to declare class methods and static methods.

    [Editor's note: the ideas described in this PEP have been incorporated
     into Python.  The PEP no longer accurately describes the implementation.]


Introduction

    One of Python's oldest language warts is the difference between
    classes and types.  For example, you can't directly subclass the
    dictionary type, and the introspection interface for finding out
    what methods and instance variables an object has is different for
    types and for classes.

    Healing the class/type split is a big effort, because it affects
    many aspects of how Python is implemented.  This PEP concerns
    itself with making the introspection API for types look the same
    as that for classes.  Other PEPs will propose making classes look
    more like types, and subclassing from built-in types; these topics
    are not on the table for this PEP.


Introspection APIs

    Introspection concerns itself with finding out what attributes an
    object has.  Python's very general getattr/setattr API makes it
    impossible to guarantee that there always is a way to get a list
    of all attributes supported by a specific object, but in practice
    two conventions have appeared that together work for almost all
    objects.  I'll call them the class-based introspection API and the
    type-based introspection API; class API and type API for short.

    The class-based introspection API is used primarily for class
    instances; it is also used by Jim Fulton's ExtensionClasses.  It
    assumes that all data attributes of an object x are stored in the
    dictionary x.__dict__, and that all methods and class variables
    can be found by inspection of x's class, written as x.__class__.
    Classes have a __dict__ attribute, which yields a dictionary
    containing methods and class variables defined by the class
    itself, and a __bases__ attribute, which is a tuple of base
    classes that must be inspected recursively.  Some assumptions here
    are:

    - attributes defined in the instance dict override attributes
      defined by the object's class;

    - attributes defined in a derived class override attributes
      defined in a base class;

    - attributes in an earlier base class (meaning occurring earlier
      in __bases__) override attributes in a later base class.

    (The last two rules together are often summarized as the
    left-to-right, depth-first rule for attribute search.  This is the
    classic Python attribute lookup rule.  Note that PEP 253 will
    propose to change the attribute lookup order, and if accepted,
    this PEP will follow suit.)
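    A sketch of the classic lookup rule, using a diamond hierarchy to
    show where it differs from the new-style order (Python's current
    method resolution order resolves W.attr differently than the
    classic depth-first search would):

```python
def classic_lookup(klass, name):
    # Classic rule: check the class itself, then its bases left to
    # right, descending depth-first into each base before moving on.
    if name in klass.__dict__:
        return klass.__dict__[name]
    for base in klass.__bases__:
        try:
            return classic_lookup(base, name)
        except AttributeError:
            pass
    raise AttributeError(name)

class X:
    attr = "X"

class Y(X):
    pass

class Z(X):
    attr = "Z"

class W(Y, Z):
    pass

classic_search = classic_lookup(W, "attr")   # depth-first reaches X via Y
new_style = W.attr                           # the C3 MRO reaches Z first
```

    The divergence on diamond hierarchies is exactly what the method
    resolution order discussion in PEP 253 addresses.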

    The type-based introspection API is supported in one form or
    another by most built-in objects.  It uses two special attributes,
    __members__ and __methods__.  The __methods__ attribute, if
    present, is a list of method names supported by the object.  The
    __members__ attribute, if present, is a list of data attribute
    names supported by the object.

    The type API is sometimes combined with a __dict__ that works the
    same as for instances (for example for function objects in
    Python 2.1, f.__dict__ contains f's dynamic attributes, while
    f.__members__ lists the names of f's statically defined
    attributes).

    Some caution must be exercised: some objects don't list their
    "intrinsic" attributes (like __dict__ and __doc__) in __members__,
    while others do; sometimes attribute names occur both in
    __members__ or __methods__ and as keys in __dict__, in which case
    it's anybody's guess whether the value found in __dict__ is used
    or not.

    The type API has never been carefully specified.  It is part of
    Python folklore, and most third party extensions support it
    because they follow examples that support it.  Also, any type that
    uses Py_FindMethod() and/or PyMember_Get() in its tp_getattr
    handler supports it, because these two functions special-case the
    attribute names __methods__ and __members__, respectively.

    Jim Fulton's ExtensionClasses ignore the type API, and instead
    emulate the class API, which is more powerful.  In this PEP, I
    propose to phase out the type API in favor of supporting the class
    API for all types.

    One argument in favor of the class API is that it doesn't require
    you to create an instance in order to find out which attributes a
    type supports; this in turn is useful for documentation
    processors.  For example, the socket module exports the SocketType
    object, but this currently doesn't tell us what methods are
    defined on socket objects.  Using the class API, SocketType would
    show exactly what the methods for socket objects are, and we can
    even extract their docstrings, without creating a socket.  (Since
    this is a C extension module, the source-scanning approach to
    docstring extraction isn't feasible in this case.)


Specification of the class-based introspection API

    Objects may have two kinds of attributes: static and dynamic.  The
    names and sometimes other properties of static attributes are
    knowable by inspection of the object's type or class, which is
    accessible through obj.__class__ or type(obj).  (I'm using type
    and class interchangeably; a clumsy but descriptive term that fits
    both is "meta-object".)

    (XXX static and dynamic are not great terms to use here, because
    "static" attributes may actually behave quite dynamically, and
    because they have nothing to do with static class members in C++
    or Java.  Barry suggests to use immutable and mutable instead, but
    those words already have precise and different meanings in
    slightly different contexts, so I think that would still be
    confusing.)

    Examples of dynamic attributes are instance variables of class
    instances, module attributes, etc.  Examples of static attributes
    are the methods of built-in objects like lists and dictionaries,
    and the attributes of frame and code objects (f.f_code,
    c.co_filename, etc.).  When an object with dynamic attributes
    exposes these through its __dict__ attribute, __dict__ is a static
    attribute.

    The names and values of dynamic properties are typically stored in
    a dictionary, and this dictionary is typically accessible as
    obj.__dict__.  The rest of this specification is more concerned
    with discovering the names and properties of static attributes
    than with dynamic attributes; the latter are easily discovered by
    inspection of obj.__dict__.

    In the discussion below, I distinguish two kinds of objects:
    regular objects (like lists, ints, functions) and meta-objects.
    Types and classes are meta-objects.  Meta-objects are also regular
    objects, but we're mostly interested in them because they are
    referenced by the __class__ attribute of regular objects (or by
    the __bases__ attribute of other meta-objects).

    The class introspection API consists of the following elements:

    - the __class__ and __dict__ attributes on regular objects;

    - the __bases__ and __dict__ attributes on meta-objects;

    - precedence rules;

    - attribute descriptors.

    Together, these not only tell us about *all* attributes defined by
    a meta-object, but they also help us calculate the value of a
    specific attribute of a given object.

    1. The __dict__ attribute on regular objects

       A regular object may have a __dict__ attribute.  If it does,
       this should be a mapping (not necessarily a dictionary)
       supporting at least __getitem__(), keys(), and has_key().  This
       gives the dynamic attributes of the object.  The keys in the
       mapping give attribute names, and the corresponding values give
       their values.

       Typically, the value of an attribute with a given name is the
       same object as the value corresponding to that name as a key in
       the __dict__.  In other words, obj.__dict__['spam'] is obj.spam.
       (But see the precedence rules below; a static attribute with
       the same name *may* override the dictionary item.)

    2. The __class__ attribute on regular objects

       A regular object usually has a __class__ attribute.  If it
       does, this references a meta-object.  A meta-object can define
       static attributes for the regular object whose __class__ it
       is.  This is normally done through the following mechanism:

    3. The __dict__ attribute on meta-objects

       A meta-object may have a __dict__ attribute, of the same form
       as the __dict__ attribute for regular objects (a mapping but
       not necessarily a dictionary).  If it does, the keys of the
       meta-object's __dict__ are names of static attributes for the
       corresponding regular object.  The values are attribute
       descriptors; we'll explain these later.  An unbound method is a
       special case of an attribute descriptor.

       Because a meta-object is also a regular object, the items in a
       meta-object's __dict__ correspond to attributes of the
       meta-object; however, some transformation may be applied, and
       bases (see below) may define additional dynamic attributes.  In
       other words, mobj.spam is not always mobj.__dict__['spam'].
       (This rule contains a loophole because for classes, if
       C.__dict__['spam'] is a function, C.spam is an unbound method
       object.)

    4. The __bases__ attribute on meta-objects

       A meta-object may have a __bases__ attribute.  If it does, this
       should be a sequence (not necessarily a tuple) of other
       meta-objects, the bases.  An absent __bases__ is equivalent to
       an empty sequence of bases.  There must never be a cycle in the
       relationship between meta-objects defined by __bases__
       attributes; in other words, the __bases__ attributes define a
       directed acyclic graph, with arcs pointing from derived
       meta-objects to their base meta-objects.  (It is not
       necessarily a tree, since multiple classes can have the same
       base class.)  The __dict__ attributes of a meta-object in the
       inheritance graph supply attribute descriptors for the regular
       object whose __class__ attribute points to the root of the
       inheritance tree (which is not the same as the root of the
       inheritance hierarchy -- rather more the opposite, at the
       bottom given how inheritance trees are typically drawn).
       Descriptors are first searched in the dictionary of the root
       meta-object, then in its bases, according to a precedence rule
       (see the next paragraph).

    5. Precedence rules

       When two meta-objects in the inheritance graph for a given
       regular object both define an attribute descriptor with the
       same name, the search order is up to the meta-object.  This
       allows different meta-objects to define different search
       orders.  In particular, classic classes use the old
       left-to-right depth-first rule, while new-style classes use a
       more advanced rule (see the section on method resolution order
       in PEP 253).

       When a dynamic attribute (one defined in a regular object's
       __dict__) has the same name as a static attribute (one defined
       by a meta-object in the inheritance graph rooted at the regular
       object's __class__), the static attribute has precedence if it
       is a descriptor that defines a __set__ method (see below);
       otherwise (if there is no __set__ method) the dynamic attribute
       has precedence.  In other words, for data attributes (those
       with a __set__ method), the static definition overrides the
       dynamic definition, but for other attributes, dynamic overrides
       static.

       Rationale: we can't have a simple rule like "static overrides
       dynamic" or "dynamic overrides static", because some static
       attributes indeed override dynamic attributes; for example, a
       key '__class__' in an instance's __dict__ is ignored in favor
       of the statically defined __class__ pointer, but on the other
       hand most keys in inst.__dict__ override attributes defined in
       inst.__class__.  Presence of a __set__ method on a descriptor
       indicates that this is a data descriptor.  (Even read-only data
       descriptors have a __set__ method: it always raises an
       exception.)  Absence of a __set__ method on a descriptor
       indicates that the descriptor isn't interested in intercepting
       assignment, and then the classic rule applies: an instance
       variable with the same name as a method hides the method until
       it is deleted.
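       This precedence can be demonstrated with two minimal
       descriptors, one defining __set__ and one not:

```python
class DataDescr:
    # A __set__ method makes this a data descriptor: it takes
    # precedence over a same-named key in the instance __dict__.
    def __get__(self, obj, objtype=None):
        return "from descriptor"
    def __set__(self, obj, value):
        raise AttributeError("read-only")

class NonDataDescr:
    # No __set__: an instance __dict__ entry shadows it.
    def __get__(self, obj, objtype=None):
        return "from non-data descriptor"

class C:
    data = DataDescr()
    nondata = NonDataDescr()

c = C()
# Writing to __dict__ directly bypasses DataDescr.__set__ entirely.
c.__dict__["data"] = "from instance"
c.__dict__["nondata"] = "from instance"

via_data = c.data        # the data descriptor still wins
via_nondata = c.nondata  # the dynamic attribute wins
```

       This is the same mechanism that lets an instance variable hide
       a method (methods lack __set__) while __class__ itself cannot
       be hidden by an instance dictionary entry.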

    6. Attribute descriptors

       This is where it gets interesting -- and messy.  Attribute
       descriptors (descriptors for short) are stored in the
       meta-object's __dict__ (or in the __dict__ of one of its
       ancestors), and have two uses: a descriptor can be used to get
       or set the corresponding attribute value on the (regular,
       non-meta) object, and it has an additional interface that
       describes the attribute for documentation and introspection
       purposes.

       There is little prior art in Python for designing the
       descriptor's interface, neither for getting/setting the value
       nor for describing the attribute otherwise, except some trivial
       properties (it's reasonable to assume that __name__ and __doc__
       should be the attribute's name and docstring).  I will propose
       such an API below.

       If an object found in the meta-object's __dict__ is not an
       attribute descriptor, backward compatibility dictates certain
       minimal semantics.  This basically means that if it is a Python
       function or an unbound method, the attribute is a method;
       otherwise, it is the default value for a dynamic data
       attribute.  Backwards compatibility also dictates that (in the
       absence of a __setattr__ method) it is legal to assign to an
       attribute corresponding to a method, and that this creates a
       data attribute shadowing the method for this particular
       instance.  However, these semantics are only required for
       backwards compatibility with regular classes.

    The introspection API is a read-only API.  We don't define the
    effect of assignment to any of the special attributes (__dict__,
    __class__ and __bases__), nor the effect of assignment to the
    items of a __dict__.  Generally, such assignments should be
    considered off-limits.  A future PEP may define some semantics for
    some such assignments.  (Especially because currently instances
    support assignment to __class__ and __dict__, and classes support
    assignment to __bases__ and __dict__.)


Specification of the attribute descriptor API

    Attribute descriptors may have the following attributes.  In the
    examples, x is an object, C is x.__class__, x.meth() is a method,
    and x.ivar is a data attribute or instance variable.  All
    attributes are optional -- a specific attribute may or may not be
    present on a given descriptor.  An absent attribute means that the
    corresponding information is not available or the corresponding
    functionality is not implemented.

    - __name__: the attribute name.  Because of aliasing and renaming,
      the attribute may (additionally or exclusively) be known under a
      different name, but this is the name under which it was born.
      Example: C.meth.__name__ == 'meth'.

    - __doc__: the attribute's documentation string.  This may be
      None.

    - __objclass__: the class that declared this attribute.  The
      descriptor only applies to objects that are instances of this
      class (this includes instances of its subclasses).  Example:
      C.meth.__objclass__ is C.

    - __get__(): a function callable with one or two arguments that
      retrieves the attribute value from an object.  This is also
      referred to as a "binding" operation, because it may return a
      "bound method" object in the case of method descriptors.  The
      first argument, X, is the object from which the attribute must
      be retrieved or to which it must be bound.  When X is None, the
      optional second argument, T, should be the meta-object, and the
      binding operation may return an *unbound* method restricted to
      instances of T.  When both X and T are specified, X should be an
      instance of T.  Exactly what is returned by the binding
      operation depends on the semantics of the descriptor; for
      example, static methods and class methods (see below) ignore the
      instance and bind to the type instead.

    - __set__(): a function of two arguments that sets the attribute
      value on the object.  If the attribute is read-only, this method
      may raise a TypeError or AttributeError exception (both are
      allowed, because both are historically found for undefined or
      unsettable attributes).  Example:
      C.ivar.__set__(x, y) ~~ x.ivar = y.
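    A minimal sketch of such a descriptor (modern Python syntax; the
    class and attribute names are illustrative, not mandated by this
    proposal).  Note that even the read-only case defines __set__ --
    it simply always raises:

```python
class ReadOnly:
    def __init__(self, value, name, doc=None):
        self.value = value
        self.__name__ = name      # the name under which it was "born"
        self.__doc__ = doc        # the docstring; may be None
    def __get__(self, obj, objtype=None):
        # The binding operation; here it just returns the stored value.
        return self.value
    def __set__(self, obj, value):
        # Read-only data descriptor: __set__ exists but always raises.
        raise AttributeError("attribute '%s' is read-only" % self.__name__)

class C:
    ivar = ReadOnly(42, 'ivar', "An example read-only data attribute.")

x = C()
print(x.ivar)        # 42
try:
    x.ivar = 1
except AttributeError as e:
    print(e)         # attribute 'ivar' is read-only
```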


Static methods and class methods

    The descriptor API makes it possible to add static methods and
    class methods.  Static methods are easy to describe: they behave
    pretty much like static methods in C++ or Java.  Here's an
    example:

      class C:

          def foo(x, y):
              print "staticmethod", x, y
          foo = staticmethod(foo)

      C.foo(1, 2)
      c = C()
      c.foo(1, 2)

    Both the call C.foo(1, 2) and the call c.foo(1, 2) call foo() with
    two arguments, and print "staticmethod 1 2".  No "self" is declared in
    the definition of foo(), and no instance is required in the call.

    The line "foo = staticmethod(foo)" in the class statement is the
    crucial element: this makes foo() a static method.  The built-in
    staticmethod() wraps its function argument in a special kind of
    descriptor whose __get__() method returns the original function
    unchanged.  Without this, the __get__() method of standard
    function objects would have created a bound method object for
    'c.foo' and an unbound method object for 'C.foo'.

    (XXX Barry suggests to use "sharedmethod" instead of
    "staticmethod", because the word static is being overloaded in so
    many ways already.  But I'm not sure if shared conveys the right
    meaning.)

    Class methods use a similar pattern to declare methods that
    receive an implicit first argument that is the *class* for which
    they are invoked.  This has no C++ or Java equivalent, and is not
    quite the same as what class methods are in Smalltalk, but may
    serve a similar purpose.  According to Armin Rigo, they are
    similar to "virtual class methods" in Borland Pascal dialect
    Delphi.  (Python also has real metaclasses, and perhaps methods
    defined in a metaclass have more right to the name "class method";
    but I expect that most programmers won't be using metaclasses.)
    Here's an example:

      class C:

          def foo(cls, y):
              print "classmethod", cls, y
          foo = classmethod(foo)

      C.foo(1)
      c = C()
      c.foo(1)

    Both the call C.foo(1) and the call c.foo(1) end up calling foo()
    with *two* arguments, and print "classmethod __main__.C 1".  The
    first argument of foo() is implied, and it is the class, even if
    the method was invoked via an instance.  Now let's continue the
    example:

      class D(C):
          pass

      D.foo(1)
      d = D()
      d.foo(1)

    This prints "classmethod __main__.D 1" both times; in other words,
    the class passed as the first argument of foo() is the class
    involved in the call, not the class involved in the definition of
    foo().

    But notice this:

      class E(C):
          def foo(cls, y): # override C.foo
              print "E.foo() called"
              C.foo(y)
          foo = classmethod(foo)

      E.foo(1)
      e = E()
      e.foo(1)

    In this example, the call to C.foo() from E.foo() will see class C
    as its first argument, not class E.  This is to be expected, since
    the call specifies the class C.  But it stresses the difference
    between these class methods and methods defined in metaclasses,
    where an upcall to a metamethod would pass the target class as an
    explicit first argument.  (If you don't understand this, don't
    worry, you're not alone.)  Note that calling cls.foo(y) would be a
    mistake -- it would cause infinite recursion.  Also note that you
    can't specify an explicit 'cls' argument to a class method.  If
    you want this (e.g. the __new__ method in PEP 253 requires this),
    use a static method with a class as its explicit first argument
    instead.
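    Both built-ins can be sketched in pure Python using nothing but
    the __get__() protocol described above (modern syntax; an
    illustration only -- the real implementations are in C):

```python
class StaticMethod:
    def __init__(self, func):
        self.func = func
    def __get__(self, obj, objtype=None):
        return self.func            # return the function unchanged

class ClassMethod:
    def __init__(self, func):
        self.func = func
    def __get__(self, obj, objtype=None):
        if objtype is None:
            objtype = type(obj)
        def bound(*args, **kwds):   # bind the *class*, not the instance
            return self.func(objtype, *args, **kwds)
        return bound

class C:
    def s(x, y):
        return ("static", x, y)
    s = StaticMethod(s)
    def c(cls, y):
        return ("class", cls.__name__, y)
    c = ClassMethod(c)

print(C.s(1, 2))    # ('static', 1, 2)
print(C().c(1))     # ('class', 'C', 1)
```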


C API

    XXX The following is VERY rough text that I wrote with a different
    audience in mind; I'll have to go through this to edit it more.
    XXX It also doesn't go into enough detail for the C API.

    A built-in type can declare special data attributes in two ways:
    using a struct memberlist (defined in structmember.h) or a struct
    getsetlist (defined in descrobject.h).  The struct memberlist is
    an old mechanism put to new use: each attribute has a descriptor
    record including its name, an enum giving its type (various C
    types are supported as well as PyObject *), an offset from the
    start of the instance, and a read-only flag.

    The struct getsetlist mechanism is new, and intended for cases
    that don't fit in that mold, because they either require
    additional checking, or are plain calculated attributes.  Each
    attribute here has a name, a getter C function pointer, a setter C
    function pointer, and a context pointer.  The function pointers
    are optional, so that for example setting the setter function
    pointer to NULL makes a read-only attribute.  The context pointer
    is intended to pass auxiliary information to generic getter/setter
    functions, but I haven't found a need for this yet.

    Note that there is also a similar mechanism to declare built-in
    methods: these are PyMethodDef structures, which contain a name
    and a C function pointer (and some flags for the calling
    convention).

    Traditionally, built-in types have had to define their own
    tp_getattro and tp_setattro slot functions to make these attribute
    definitions work (PyMethodDef and struct memberlist are quite
    old).  There are convenience functions that take an array of
    PyMethodDef or memberlist structures, an object, and an attribute
    name, and return or set the attribute if found in the list, or
    raise an exception if not found.  But these convenience functions
    had to be explicitly called by the tp_getattro or tp_setattro
    method of the specific type, and they did a linear search of the
    array using strcmp() to find the array element describing the
    requested attribute.

    I now have a brand spanking new generic mechanism that improves
    this situation substantially.

    - Pointers to arrays of PyMethodDef, memberlist, getsetlist
      structures are part of the new type object (tp_methods,
      tp_members, tp_getset).

    - At type initialization time (in PyType_InitDict()), for each
      entry in those three arrays, a descriptor object is created and
      placed in a dictionary that belongs to the type (tp_dict).

    - Descriptors are very lean objects that mostly point to the
      corresponding structure.  An implementation detail is that all
      descriptors share the same object type, and a discriminator
      field tells what kind of descriptor it is (method, member, or
      getset).

    - As explained in PEP 252, descriptors have a get() method that
      takes an object argument and returns that object's attribute;
      descriptors for writable attributes also have a set() method
      that takes an object and a value and sets that object's
      attribute.  Note that the get() method also serves as a bind()
      operation for methods, binding the unbound method implementation
      to the object.

    - Instead of providing their own tp_getattro and tp_setattro
      implementation, almost all built-in objects now place
      PyObject_GenericGetAttr and (if they have any writable
      attributes) PyObject_GenericSetAttr in their tp_getattro and
      tp_setattro slots.  (Or, they can leave these NULL, and inherit
      them from the default base object, if they arrange for an
      explicit call to PyType_InitDict() for the type before the first
      instance is created.)

    - In the simplest case, PyObject_GenericGetAttr() does exactly one
      dictionary lookup: it looks up the attribute name in the type's
      dictionary (obj->ob_type->tp_dict).  Upon success, there are two
      possibilities: the descriptor has a get method, or it doesn't.
      For speed, the get and set methods are type slots: tp_descr_get
      and tp_descr_set.  If the tp_descr_get slot is non-NULL, it is
      called, passing the object as its only argument, and the return
      value from this call is the result of the getattr operation.  If
      the tp_descr_get slot is NULL, as a fallback the descriptor
      itself is returned (compare class attributes that are not
      methods but simple values).

    - PyObject_GenericSetAttr() works very similarly but uses the
      tp_descr_set slot and calls it with the object and the new
      attribute value; if the tp_descr_set slot is NULL, an
      AttributeError is raised.

    - But now for a more complicated case.  The approach described
      above is suitable for most built-in objects such as lists,
      strings, numbers.  However, some object types have a dictionary
      in each instance that can store arbitrary attributes.  In fact,
      when you use a class statement to subtype an existing built-in
      type, you automatically get such a dictionary (unless you
      explicitly turn it off, using another advanced feature,
      __slots__).  Let's call this the instance dict, to distinguish
      it from the type dict.

    - In the more complicated case, there's a conflict between names
      stored in the instance dict and names stored in the type dict.
      If both dicts have an entry with the same key, which one should
      we return?  Looking at classic Python for guidance, I find
      conflicting rules: for class instances, the instance dict
      overrides the class dict, *except* for the special attributes
      (like __dict__ and __class__), which have priority over the
      instance dict.

    - I resolved this with the following set of rules, implemented in
      PyObject_GenericGetAttr():

      1. Look in the type dict.  If you find a *data* descriptor, use
         its get() method to produce the result.  This takes care of
         special attributes like __dict__ and __class__.

      2. Look in the instance dict.  If you find anything, that's it.
         (This takes care of the requirement that normally the
         instance dict overrides the class dict.)

      3. Look in the type dict again (in reality this uses the saved
         result from step 1, of course).  If you find a descriptor,
         use its get() method; if you find something else, that's it;
         if it's not there, raise AttributeError.

      This requires a classification of descriptors as data and
      nondata descriptors.  The current implementation quite sensibly
      classifies member and getset descriptors as data (even if they
      are read-only!)  and method descriptors as nondata.
      Non-descriptors (like function pointers or plain values) are
      also classified as non-data (!).

    - This scheme has one drawback: in what I assume to be the most
      common case, referencing an instance variable stored in the
      instance dict, it does *two* dictionary lookups, whereas the
      classic scheme did a quick test for attributes starting with two
      underscores plus a single dictionary lookup.  (Although the
      implementation is sadly structured as instance_getattr() calling
      instance_getattr1() calling instance_getattr2() which finally
      calls PyDict_GetItem(), and the underscore test calls
      PyString_AsString() rather than inlining this.  I wonder if
      optimizing the snot out of this might not be a good idea to
      speed up Python 2.2, if we weren't going to rip it all out. :-)

    - A benchmark verifies that in fact this is as fast as classic
      instance variable lookup, so I'm no longer worried.

    - Modification for dynamic types: steps 1 and 3 look in the
      dictionaries of the type and all its base classes (in MRO
      sequence, of course).
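    The three lookup rules above can be sketched in Python (an
    approximation for illustration; the real PyObject_GenericGetAttr()
    is C code operating on type slots):

```python
def generic_getattr(obj, name):
    # The type-dict lookup shared by steps 1 and 3: search the type
    # and its bases (for dynamic types, this is the MRO walk).
    meta_attr = None
    for klass in type(obj).__mro__:
        if name in klass.__dict__:
            meta_attr = klass.__dict__[name]
            break
    # Step 1: a *data* descriptor in the type dict wins outright.
    if meta_attr is not None and hasattr(type(meta_attr), '__set__'):
        return type(meta_attr).__get__(meta_attr, obj, type(obj))
    # Step 2: anything in the instance dict is returned as-is.
    if name in getattr(obj, '__dict__', {}):
        return obj.__dict__[name]
    # Step 3: fall back to the saved type-dict result.
    if meta_attr is not None:
        if hasattr(type(meta_attr), '__get__'):
            return type(meta_attr).__get__(meta_attr, obj, type(obj))
        return meta_attr
    raise AttributeError(name)

class C:
    greeting = "hello"          # plain value: non-data, found in step 3
    def meth(self):
        return "method"

c = C()
c.x = 42
print(generic_getattr(c, 'x'))         # 42 (instance dict, step 2)
print(generic_getattr(c, 'greeting'))  # hello (step 3, plain value)
print(generic_getattr(c, 'meth')())    # method (step 3, binds via __get__)
```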


Discussion

    XXX


Examples

    Let's look at lists.  In classic Python, the method names of
    lists were available as the __methods__ attribute of list objects:

      >>> [].__methods__
      ['append', 'count', 'extend', 'index', 'insert', 'pop',
      'remove', 'reverse', 'sort']
      >>>

    Under the new proposal, the __methods__ attribute no longer exists:

      >>> [].__methods__
      Traceback (most recent call last):
        File "<stdin>", line 1, in ?
      AttributeError: 'list' object has no attribute '__methods__'
      >>>

    Instead, you can get the same information from the list type:

      >>> T = [].__class__
      >>> T
      <type 'list'>
      >>> dir(T)                # like T.__dict__.keys(), but sorted
      ['__add__', '__class__', '__contains__', '__eq__', '__ge__',
      '__getattr__', '__getitem__', '__getslice__', '__gt__',
      '__iadd__', '__imul__', '__init__', '__le__', '__len__',
      '__lt__', '__mul__', '__ne__', '__new__', '__radd__',
      '__repr__', '__rmul__', '__setitem__', '__setslice__', 'append',
      'count', 'extend', 'index', 'insert', 'pop', 'remove',
      'reverse', 'sort']
      >>>

    The new introspection API gives more information than the old one:
    in addition to the regular methods, it also shows the methods that
    are normally invoked through special notations, e.g.  __iadd__
    (+=), __len__ (len), __ne__ (!=).  You can invoke any method from
    this list directly:

      >>> a = ['tic', 'tac']
      >>> T.__len__(a)          # same as len(a)
      2
      >>> T.append(a, 'toe')    # same as a.append('toe')
      >>> a
      ['tic', 'tac', 'toe']
      >>>

    This is just like it is for user-defined classes.

    Notice a familiar yet surprising name in the list: __init__.  This
    is the domain of PEP 253.


Backwards compatibility

    XXX


Warnings and Errors

    XXX


Implementation

    A partial implementation of this PEP is available from CVS as a
    branch named "descr-branch".  To experiment with this
    implementation, proceed to check out Python from CVS according to
    the instructions at http://sourceforge.net/cvs/?group_id=5470 but
    add the arguments "-r descr-branch" to the cvs checkout command.
    (You can also start with an existing checkout and do "cvs update
    -r descr-branch".)  For some examples of the features described
    here, see the file Lib/test/test_descr.py.

    Note: the code in this branch goes way beyond this PEP; it is also
    the experimentation area for PEP 253 (Subtyping Built-in Types).


References

    XXX


Copyright

    This document has been placed in the public domain.


pep-0253 Subtyping Built-in Types

PEP: 253
Title: Subtyping Built-in Types
Version: $Revision$
Last-Modified: $Date$
Author: Guido van Rossum <guido at python.org>
Status: Final
Type: Standards Track
Created: 14-May-2001
Python-Version: 2.2
Post-History: 

Abstract

    This PEP proposes additions to the type object API that will allow
    the creation of subtypes of built-in types, in C and in Python.

    [Editor's note: the ideas described in this PEP have been incorporated
     into Python.  The PEP no longer accurately describes the implementation.]


Introduction

    Traditionally, types in Python have been created statically, by
    declaring a global variable of type PyTypeObject and initializing
    it with a static initializer.  The slots in the type object
    describe all aspects of a Python type that are relevant to the
    Python interpreter.  A few slots contain dimensional information
    (like the basic allocation size of instances), others contain
    various flags, but most slots are pointers to functions to
    implement various kinds of behaviors.  A NULL pointer means that
    the type does not implement the specific behavior; in that case
    the system may provide a default behavior or raise an exception
    when the behavior is invoked for an instance of the type.  Some
    collections of function pointers that are usually defined together
    are obtained indirectly via a pointer to an additional structure
    containing more function pointers.

    While the details of initializing a PyTypeObject structure haven't
    been documented as such, they are easily gleaned from the examples
    in the source code, and I am assuming that the reader is
    sufficiently familiar with the traditional way of creating new
    Python types in C.

    This PEP will introduce the following features:

      - a type can be a factory function for its instances

      - types can be subtyped in C

      - types can be subtyped in Python with the class statement

      - multiple inheritance from types is supported (insofar as
        practical -- you still can't multiply inherit from list and
        dictionary)

      - the standard coercion functions (int, tuple, str etc.) will
        be redefined to be the corresponding type objects, which serve
        as their own factory functions

      - a class statement can contain a __metaclass__ declaration,
        specifying the metaclass to be used to create the new class

      - a class statement can contain a __slots__ declaration,
        specifying the specific names of the instance variables
        supported
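    For illustration, here is what subtyping a built-in type and the
    __slots__ declaration look like from Python (modern syntax; this
    sketch is not part of the specification):

```python
class DefaultList(list):            # subtyping a built-in type
    def __getitem__(self, i):
        try:
            return list.__getitem__(self, i)
        except IndexError:
            return None             # out-of-range reads yield None

class Point(object):
    __slots__ = ('x', 'y')          # only these instance variables exist
    def __init__(self, x, y):
        self.x, self.y = x, y

d = DefaultList([1, 2, 3])
print(d[1], d[99])                  # 2 None
print(int("42") + 1)                # the coercion functions are now types
print(type(int) is type)            # True: int is a type object
```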

    This PEP builds on PEP 252, which adds standard introspection to
    types; for example, when a particular type object initializes the
    tp_hash slot, that type object has a __hash__ method when
    introspected.  PEP 252 also adds a dictionary to type objects
    which contains all methods.  At the Python level, this dictionary
    is read-only for built-in types; at the C level, it is accessible
    directly (but it should not be modified except as part of
    initialization).

    For binary compatibility, a flag bit in the tp_flags slot
    indicates the existence of the various new slots in the type
    object introduced below.  Types that don't have the
    Py_TPFLAGS_HAVE_CLASS bit set in their tp_flags slot are assumed
    to have NULL values for all the subtyping slots.  (Warning: the
    current implementation prototype is not yet consistent in its
    checking of this flag bit.  This should be fixed before the final
    release.)

    In current Python, a distinction is made between types and
    classes.  This PEP together with PEP 254 will remove that
    distinction.  However, for backwards compatibility the distinction
    will probably remain for years to come, and without PEP 254, the
    distinction is still large: types ultimately have a built-in type
    as a base class, while classes ultimately derive from a
    user-defined class.  Therefore, in the rest of this PEP, I will
    use the word type whenever I can -- including base type or
    supertype, derived type or subtype, and metatype.  However,
    sometimes the terminology necessarily blends, for example an
    object's type is given by its __class__ attribute, and subtyping
    in Python is spelled with a class statement.  If further
    distinction is necessary, user-defined classes can be referred to
    as "classic" classes.


About metatypes

    Inevitably the discussion comes to metatypes (or metaclasses).
    Metatypes are nothing new in Python: Python has always been able
    to talk about the type of a type:

    >>> a = 0
    >>> type(a)
    <type 'int'>
    >>> type(type(a))
    <type 'type'>
    >>> type(type(type(a)))
    <type 'type'>
    >>>

    In this example, type(a) is a "regular" type, and type(type(a)) is
    a metatype.  While as distributed all types have the same metatype
    (PyType_Type, which is also its own metatype), this is not a
    requirement, and in fact a useful and relevant 3rd party extension
    (ExtensionClasses by Jim Fulton) creates an additional metatype.
    The type of classic classes, known as types.ClassType, can also be
    considered a distinct metatype.

    A feature closely connected to metatypes is the "Don Beaudry
    hook", which says that if a metatype is callable, its instances
    (which are regular types) can be subclassed (really subtyped)
    using a Python class statement.  I will use this rule to support
    subtyping of built-in types, and in fact it greatly simplifies the
    logic of class creation to always simply call the metatype.  When
    no base class is specified, a default metatype is called -- the
    default metatype is the "ClassType" object, so the class statement
    will behave as before in the normal case.  (This default can be
    changed per module by setting the global variable __metaclass__.)

    Python uses the concept of metatypes or metaclasses in a different
    way than Smalltalk.  In Smalltalk-80, there is a hierarchy of
    metaclasses that mirrors the hierarchy of regular classes,
    metaclasses map 1-1 to classes (except for some funny business at
    the root of the hierarchy), and each class statement creates both
    a regular class and its metaclass, putting class methods in the
    metaclass and instance methods in the regular class.

    Nice though this may be in the context of Smalltalk, it's not
    compatible with the traditional use of metatypes in Python, and I
    prefer to continue in the Python way.  This means that Python
    metatypes are typically written in C, and may be shared between
    many regular types. (It will be possible to subtype metatypes in
    Python, so it won't be absolutely necessary to write C to use
    metatypes; but the power of Python metatypes will be limited.  For
    example, Python code will never be allowed to allocate raw memory
    and initialize it at will.)

    Metatypes determine various *policies* for types, such as what
    happens when a type is called, how dynamic types are (whether a
    type's __dict__ can be modified after it is created), what the
    method resolution order is, how instance attributes are looked
    up, and so on.

    I'll argue that left-to-right depth-first is not the best
    solution when you want to get the most use from multiple
    inheritance.

    I'll argue that with multiple inheritance, the metatype of the
    subtype must be a descendant of the metatypes of all base types.

    I'll come back to metatypes later.


Making a type a factory for its instances

    Traditionally, for each type there is at least one C factory
    function that creates instances of the type (PyTuple_New(),
    PyInt_FromLong() and so on).  These factory functions take care of
    both allocating memory for the object and initializing that
    memory.  As of Python 2.0, they also have to interface with the
    garbage collection subsystem, if the type chooses to participate
    in garbage collection (which is optional, but strongly recommended
    for so-called "container" types: types that may contain references
    to other objects, and hence may participate in reference cycles).

    In this proposal, type objects can be factory functions for their
    instances, making the types directly callable from Python.  This
    mimics the way classes are instantiated.  The C APIs for creating
    instances of various built-in types will remain valid and in some
    cases more efficient.  Not all types will become their own factory
    functions.

    The type object has a new slot, tp_new, which can act as a factory
    for instances of the type.  Types are now callable, because the
    tp_call slot is set in PyType_Type (the metatype); the function
    looks for the tp_new slot of the type that is being called.

    Explanation: the tp_call slot of a regular type object (such as
    PyInt_Type or PyList_Type) defines what happens when *instances*
    of that type are called; in particular, the tp_call slot in the
    function type, PyFunction_Type, is the key to making functions
    callable.  As another example, PyInt_Type.tp_call is NULL, because
    integers are not callable.  The new paradigm makes *type objects*
    callable.  Since type objects are instances of their metatype
    (PyType_Type), the metatype's tp_call slot (PyType_Type.tp_call)
    points to a function that is invoked when any type object is
    called.  Now, since each type has to do something different to
    create an instance of itself, PyType_Type.tp_call immediately
    defers to the tp_new slot of the type that is being called.
    PyType_Type itself is also callable: its tp_new slot creates a new
    type.  This is used by the class statement (formalizing the Don
    Beaudry hook, see above).  And what makes PyType_Type callable?
    The tp_call slot of *its* metatype -- but since it is its own
    metatype, that is its own tp_call slot!
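    The chain can be observed from Python (modern syntax; an
    illustration of the mechanism, not its implementation):

```python
n = int("42")                   # the metatype's tp_call -> int's tp_new
print(n, type(n) is int)        # 42 True

# PyType_Type's own tp_new creates a *new type*; this is what the
# class statement ultimately does (the Don Beaudry hook formalized).
C = type('C', (object,), {'greet': lambda self: "hi"})
print(type(C) is type)          # True: type is its own metatype
print(C().greet())              # hi
```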

    If the type's tp_new slot is NULL, an exception is raised.
    Otherwise, the tp_new slot is called.  The signature for the
    tp_new slot is

        PyObject *tp_new(PyTypeObject *type,
                         PyObject *args,
                         PyObject *kwds)

    where 'type' is the type whose tp_new slot is called, and 'args'
    and 'kwds' are the sequential and keyword arguments to the call,
    passed unchanged from tp_call.  (The 'type' argument is used in
    combination with inheritance, see below.)

    There are no constraints on the object type that is returned,
    although by convention it should be an instance of the given
    type.  It is not necessary that a new object is returned; a
    reference to an existing object is fine too.  The return value
    should always be a new reference, owned by the caller.

    Once the tp_new slot has returned an object, further initialization
    is attempted by calling the tp_init() slot of the resulting
    object's type, if not NULL.  This has the following signature:

         int tp_init(PyObject *self,
                     PyObject *args,
                     PyObject *kwds)

    It corresponds more closely to the __init__() method of classic
    classes, and in fact is mapped to that by the slot/special-method
    correspondence rules.  The difference in responsibilities between
    the tp_new() slot and the tp_init() slot lies in the invariants
    they ensure.  The tp_new() slot should ensure only the most
    essential invariants, without which the C code that implements the
    objects would break.  The tp_init() slot should be used for
    overridable user-specific initializations.  Take for example the
    dictionary type.  The implementation has an internal pointer to a
    hash table which should never be NULL.  This invariant is taken
    care of by the tp_new() slot for dictionaries.  The dictionary
    tp_init() slot, on the other hand, could be used to give the
    dictionary an initial set of keys and values based on the
    arguments passed in.

    Note that for immutable object types, the initialization cannot be
    done by the tp_init() slot: this would provide the Python user
    with a way to change the initialization.  Therefore, immutable
    objects typically have an empty tp_init() implementation and do
    all their initialization in their tp_new() slot.
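    At the Python level this division of labor maps onto __new__ and
    __init__ (which correspond to tp_new and tp_init respectively).  A
    sketch with an immutable subtype, which must do its real work in
    __new__ (the class name is illustrative):

```python
class Inches(int):
    def __new__(cls, feet):
        # tp_new analogue: establish the essential, immutable value.
        return int.__new__(cls, feet * 12)
    def __init__(self, feet):
        # tp_init analogue: only overridable, non-essential setup here.
        self.feet = feet

i = Inches(3)
print(i, i.feet)    # 36 3
```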

    You may wonder why the tp_new() slot shouldn't call the tp_init()
    slot itself.  The reason is that in certain circumstances (like
    support for persistent objects), it is important to be able to
    create an object of a particular type without initializing it any
    further than necessary.  This may conveniently be done by calling
    the tp_new() slot without calling tp_init().  It is also possible
    that tp_init() is not called, or called more than once -- its
    operation should be robust even in these anomalous cases.

    For some objects, tp_new() may return an existing object.  For
    example, the factory function for integers caches the integers -1
    through 99.  This is permissible only when the type argument to
    tp_new() is the type that defined the tp_new() function (in the
    example, if type == &PyInt_Type), and when the tp_init() slot for
    this type does nothing.  If the type argument differs, the
    tp_new() call is initiated by a derived type's tp_new() to
    create the object and initialize the base type portion of the
    object; in this case tp_new() should always return a new object
    (or raise an exception).

    Both tp_new() and tp_init() should receive exactly the same 'args'
    and 'kwds' arguments, and both should check that the arguments are
    acceptable, because they may be called independently.

    There's a third slot related to object creation: tp_alloc().  Its
    responsibility is to allocate the memory for the object,
    initialize the reference count (ob_refcnt) and the type pointer
    (ob_type), and initialize the rest of the object to all zeros.  It
    should also register the object with the garbage collection
    subsystem if the type supports garbage collection.  This slot
    exists so that derived types can override the memory allocation
    policy (like which heap is being used) separately from the
    initialization code.  The signature is:

        PyObject *tp_alloc(PyTypeObject *type, int nitems)

    The type argument is the type of the new object.  The nitems
    argument is normally zero, except for objects with a variable
    allocation size (basically strings, tuples, and longs).  The
    allocation size is given by the following expression:

        type->tp_basicsize  +  nitems * type->tp_itemsize
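
    In modern CPython the two struct members are exposed to Python as
    __basicsize__ and __itemsize__, so the formula can be checked
    directly (a sketch assuming CPython's object layout):

```python
# The allocation size of a variable-size object is
#     tp_basicsize + nitems * tp_itemsize.
# In modern CPython, __sizeof__() reports exactly this quantity for
# tuples, and the two slot values are visible from Python.
tup = (1, 2, 3)
expected = type(tup).__basicsize__ + len(tup) * type(tup).__itemsize__
assert tup.__sizeof__() == expected
```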

    The tp_alloc slot is only used for subclassable types.  The tp_new()
    function of the base class must call the tp_alloc() slot of the
    type passed in as its first argument.  It is the tp_new()
    function's responsibility to calculate the number of items.  The
    tp_alloc() slot will set the ob_size member of the new object if
    the type->tp_itemsize member is nonzero.

    (Note: in certain debugging compilation modes, the type structure
    already contained members named tp_alloc and tp_free, used as
    counters for the number of allocations and deallocations.  These
    counters have been renamed to tp_allocs and tp_deallocs.)

    Standard implementations for tp_alloc() and tp_new() are
    available.  PyType_GenericAlloc() allocates an object from the
    standard heap and initializes it properly.  It uses the above
    formula to determine the amount of memory to allocate, and takes
    care of GC registration.  The only reason not to use this
    implementation would be to allocate objects from a different heap
    (as is done by some very small frequently used objects like ints
    and tuples).  PyType_GenericNew() adds very little: it just calls
    the type's tp_alloc() slot with zero for nitems.  But for mutable
    types that do all their initialization in their tp_init() slot,
    this may be just the ticket.


Preparing a type for subtyping

    The idea behind subtyping is very similar to that of single
    inheritance in C++.  A base type is described by a structure
    declaration (similar to the C++ class declaration) plus a type
    object (similar to the C++ vtable).  A derived type can extend the
    structure (but must leave the names, order and type of the members
    of the base structure unchanged) and can override certain slots in
    the type object, leaving others the same.  (Unlike C++ vtables,
    all Python type objects have the same memory layout.)

    The base type must do the following:

      - Add the flag value Py_TPFLAGS_BASETYPE to tp_flags.

      - Declare and use tp_new(), tp_alloc() and optional tp_init()
        slots.

      - Declare and use tp_dealloc() and tp_free().

      - Export its object structure declaration.

      - Export a subtyping-aware type-checking macro.

    The requirements and signatures for tp_new(), tp_alloc() and
    tp_init() have already been discussed above: tp_alloc() should
    allocate the memory and initialize it to mostly zeros; tp_new()
    should call the tp_alloc() slot and then proceed to do the
    minimally required initialization; tp_init() should be used for
    more extensive initialization of mutable objects.

    It should come as no surprise that there are similar conventions
    at the end of an object's lifetime.  The slots involved are
    tp_dealloc() (familiar to all who have ever implemented a Python
    extension type) and tp_free(), the new kid on the block.  (The
    names aren't quite symmetric; tp_free() corresponds to tp_alloc(),
    which is fine, but tp_dealloc() corresponds to tp_new().  Maybe
    the tp_dealloc slot should be renamed?)

    The tp_free() slot should be used to free the memory and
    unregister the object with the garbage collection subsystem, and
    can be overridden by a derived class; tp_dealloc() should
    deinitialize the object (usually by calling Py_XDECREF() for
    various sub-objects) and then call tp_free() to deallocate the
    memory.  The signature for tp_dealloc() is the same as it always
    was:

        void tp_dealloc(PyObject *object)

    The signature for tp_free() is the same:

        void tp_free(PyObject *object)

    (In a previous version of this PEP, there was also a role reserved
    for the tp_clear() slot.  This turned out to be a bad idea.)

    To be usefully subtyped in C, a type must export the structure
    declaration for its instances through a header file, as it is
    needed to derive a subtype.  The type object for the base type
    must also be exported.

    If the base type has a type-checking macro (like PyDict_Check()),
    this macro should be made to recognize subtypes.  This can be done
    by using the new PyObject_TypeCheck(object, type) macro, which
    calls a function that follows the base class links.

    The PyObject_TypeCheck() macro contains a slight optimization: it
    first compares object->ob_type directly to the type argument, and
    if this is a match, bypasses the function call.  This should make
    it fast enough for most situations.

    Note that this change in the type-checking macro means that C
    functions that require an instance of the base type may be invoked
    with instances of the derived type.  Before enabling subtyping of
    a particular type, its code should be checked to make sure that
    this won't break anything.  It has proved useful in the prototype
    to add another type-checking macro for the built-in Python object
    types, to check for exact type match too (for example,
    PyDict_Check(x) is true if x is an instance of dictionary or of a
    dictionary subclass, while PyDict_CheckExact(x) is true only if x
    is a dictionary).
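
    The contrast between the two macros corresponds to isinstance()
    versus an exact type test at the Python level; a sketch of the
    same distinction:

```python
# PyDict_Check(x)      ~ isinstance(x, dict)   -- accepts subclasses
# PyDict_CheckExact(x) ~ type(x) is dict       -- exact match only
class MyDict(dict):
    pass

d = MyDict()
assert isinstance(d, dict)        # the "PyDict_Check" test succeeds
assert type(d) is not dict        # the "PyDict_CheckExact" test fails
assert type({}) is dict           # a real dict passes both tests
```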


Creating a subtype of a built-in type in C

    The simplest form of subtyping is subtyping in C.  It is the
    simplest form because we can require the C code to be aware of
    some of the problems, and it's acceptable for C code that doesn't
    follow the rules to dump core.  For added simplicity, it is
    limited to single inheritance.

    Let's assume we're deriving from a mutable base type whose
    tp_itemsize is zero.  The subtype code is not GC-aware, although
    it may inherit GC-awareness from the base type (this is
    automatic).  The base type's allocation uses the standard heap.

    The derived type begins by declaring a type structure which
    contains the base type's structure.  For example, here's the type
    structure for a subtype of the built-in list type:

    typedef struct {
        PyListObject list;
        int state;
    } spamlistobject;

    Note that the base type structure member (here PyListObject) must
    be the first member of the structure; any following members are
    additions.  Also note that the base type is not referenced via a
    pointer; the actual contents of its structure must be included!
    (The goal is for the memory layout of the beginning of the
    subtype instance to be the same as that of the base type
    instance.)

    Next, the derived type must declare a type object and initialize
    it.  Most of the slots in the type object may be initialized to
    zero, which is a signal that the base type slot must be copied
    into it.  Some slots, however, must be initialized properly:

      - The object header must be filled in as usual; the type should
        be &PyType_Type.

      - The tp_basicsize slot must be set to the size of the subtype
        instance struct (in the above example:
        sizeof(spamlistobject)).

      - The tp_base slot must be set to the address of the base type's
        type object.

      - If the derived type defines any pointer members, the
        tp_dealloc slot function requires special attention, see
        below; otherwise, it can be set to zero, to inherit the base
        type's deallocation function.

      - The tp_flags slot must be set to the usual Py_TPFLAGS_DEFAULT
        value.

      - The tp_name slot must be set; it is recommended to set tp_doc
        as well (these are not inherited).

    If the subtype defines no additional structure members (it only
    defines new behavior, no new data), the tp_basicsize and the
    tp_dealloc slots may be left set to zero.

    The subtype's tp_dealloc slot deserves special attention.  If the
    derived type defines no additional pointer members that need to be
    DECREF'ed or freed when the object is deallocated, it can be set
    to zero.  Otherwise, the subtype's tp_dealloc() function must call
    Py_XDECREF() for any PyObject * members and the correct memory
    freeing function for any other pointers it owns, and then call the
    base class's tp_dealloc() slot.  This call has to be made via the
    base type's type structure, for example, when deriving from the
    standard list type:

        PyList_Type.tp_dealloc(self);

    If the subtype wants to use a different allocation heap than the
    base type, the subtype must override both the tp_alloc() and the
    tp_free() slots.  These will be called by the base class's
    tp_new() and tp_dealloc() slots, respectively.

    To complete the initialization of the type, PyType_InitDict() must
    be called.  This replaces slots initialized to zero in the subtype
    with the value of the corresponding base type slots.  (It also
    fills in tp_dict, the type's dictionary, and does various other
    initializations necessary for type objects.)

    A subtype is not usable until PyType_InitDict() is called for it;
    this is best done during module initialization, assuming the
    subtype belongs to a module.  An alternative for subtypes added to
    the Python core (which don't live in a particular module) would be
    to initialize the subtype in their constructor function.  It is
    allowed to call PyType_InitDict() more than once; the second and
    further calls have no effect.  To avoid unnecessary calls, a test
    for tp_dict==NULL can be made.

    (During initialization of the Python interpreter, some types are
    actually used before they are initialized.  As long as the slots
    that are actually needed are initialized, especially tp_dealloc,
    this works, but it is fragile and not recommended as a general
    practice.)

    To create a subtype instance, the subtype's tp_new() slot is
    called.  This should first call the base type's tp_new() slot and
    then initialize the subtype's additional data members.  To further
    initialize the instance, the tp_init() slot is typically called.
    Note that the tp_new() slot should *not* call the tp_init() slot;
    this is up to tp_new()'s caller (typically a factory function).
    There are circumstances where it is appropriate not to call
    tp_init().
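
    At the Python level, the caller that performs both steps is the
    type's call operation (the factory); invoking __new__ directly
    creates an instance without running __init__, just as calling
    tp_new() without tp_init() does.  A sketch in modern Python:

```python
class Record:
    def __init__(self, value):
        self.value = value

# Normal construction: the type's call runs __new__ and then __init__.
r1 = Record(42)
assert r1.value == 42

# Creating without initializing (as for persistent-object support):
# call __new__ directly and skip __init__ entirely.
r2 = Record.__new__(Record)
assert not hasattr(r2, "value")
```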

    If a subtype defines a tp_init() slot, the tp_init() slot should
    normally first call the base type's tp_init() slot.

    (XXX There should be a paragraph or two about argument passing
    here.)


Subtyping in Python

    The next step is to allow subtyping of selected built-in types
    through a class statement in Python.  Limiting ourselves to single
    inheritance for now, here is what happens for a simple class
    statement:

    class C(B):
        var1 = 1
        def method1(self): pass
        # etc.

    The body of the class statement is executed in a fresh environment
    (basically, a new dictionary used as local namespace), and then C
    is created.  The following explains how C is created.

    Assume B is a type object.  Since type objects are objects, and
    every object has a type, B has a type.  Since B is itself a type,
    we also call its type its metatype.  B's metatype is accessible
    via type(B) or B.__class__ (the latter notation is new for types;
    it is introduced in PEP 252).  Let's say this metatype is M (for
    Metatype).  The class statement will create a new type, C.  Since
    C will be a type object just like B, we view the creation of C as
    an instantiation of the metatype, M.  The information that needs
    to be provided for the creation of a subclass is:

      - its name (in this example the string "C");

      - its bases (a singleton tuple containing B);

      - the results of executing the class body, in the form of a
        dictionary (for example {"var1": 1, "method1": <function
        method1 at ...>, ...}).

    The class statement will result in the following call:

        C = M("C", (B,), dict)

    where dict is the dictionary resulting from execution of the
    class body.  In other words, the metatype (M) is called.

    Note that even though the example has only one base, we still pass
    in a (singleton) sequence of bases; this makes the interface
    uniform with the multiple-inheritance case.
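
    In modern Python the default metatype is the built-in 'type', and
    the call spelled out above can be written by hand; a sketch:

```python
# The class statement is equivalent to an explicit metatype call:
#     C = M("C", (B,), dict)
class B:
    def greet(self):
        return "hello"

namespace = {"var1": 1, "method1": lambda self: None}
C = type("C", (B,), namespace)    # the metatype call, written out

c = C()
assert C.__name__ == "C"
assert C.__bases__ == (B,)
assert c.var1 == 1 and c.greet() == "hello"
```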

    In current Python, this is called the "Don Beaudry hook" after its
    inventor; it is an exceptional case that is only invoked when a
    base class is not a regular class.  For a regular base class (or
    when no base class is specified), current Python calls
    PyClass_New(), the C level factory function for classes, directly.

    Under the new system this is changed so that Python *always*
    determines a metatype and calls it as given above.  When one or
    more bases are given, the type of the first base is used as the
    metatype; when no base is given, a default metatype is chosen.  By
    setting the default metatype to PyClass_Type, the metatype of
    "classic" classes, the classic behavior of the class statement is
    retained.  This default can be changed per module by setting the
    global variable __metaclass__.

    There are two further refinements here.  First, a useful feature
    is to be able to specify a metatype directly.  If the class
    suite defines a variable __metaclass__, that is the metatype
    to call.  (Note that setting __metaclass__ at the module level
    only affects class statements without a base class and without an
    explicit __metaclass__ declaration; but setting __metaclass__ in a
    class suite overrides the default metatype unconditionally.)

    Second, with multiple bases, not all bases need to have the same
    metatype.  This is called a metaclass conflict [1].  Some
    metaclass conflicts can be resolved by searching through the set
    of bases for a metatype that derives from all other given
    metatypes.  If such a metatype cannot be found, an exception is
    raised and the class statement fails.
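
    This resolution rule (use a metatype that derives from all the
    others, or fail) can be observed in modern Python, which spells
    the metatype with the later metaclass= keyword syntax rather than
    __metaclass__; a sketch:

```python
# Two metatypes, one deriving from the other: the more derived one
# is a valid metatype for a class whose bases use either.
class MetaA(type):
    pass

class MetaB(MetaA):          # MetaB derives from MetaA
    pass

class A(metaclass=MetaA):
    pass

class B(metaclass=MetaB):
    pass

class C(A, B):               # no conflict: MetaB derives from MetaA
    pass

assert type(C) is MetaB      # the most derived metatype is chosen

# Had neither metatype derived from the other, the class statement
# would instead fail with a metaclass-conflict TypeError.
```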

    This conflict resolution can be implemented by the metatype
    constructors: the class statement just calls the metatype of the first
    base (or that specified by the __metaclass__ variable), and this
    metatype's constructor looks for the most derived metatype.  If
    that is itself, it proceeds; otherwise, it calls that metatype's
    constructor.  (Ultimate flexibility: another metatype might choose
    to require that all bases have the same metatype, or that there's
    only one base class, or whatever.)

    (In [1], a new metaclass is automatically derived that is a
    subclass of all given metaclasses.  But since it is questionable
    in Python how conflicting method definitions of the various
    metaclasses should be merged, I don't think this is feasible.
    Should the need arise, the user can derive such a metaclass
    manually and specify it using the __metaclass__ variable.  It is
    also possible to have a new metaclass that does this.)

    Note that calling M requires that M itself has a type: the
    meta-metatype.  And the meta-metatype has a type, the
    meta-meta-metatype.  And so on.  This is normally cut short at
    some level by making a metatype be its own metatype.  This is
    indeed what happens in Python: the ob_type reference in
    PyType_Type is set to &PyType_Type.  In the absence of third party
    metatypes, PyType_Type is the only metatype in the Python
    interpreter.
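
    This short-circuit is directly observable in modern CPython,
    where 'type' is its own type:

```python
# PyType_Type's ob_type points back at PyType_Type itself, so the
# metatype tower stops after one step.
assert type(type) is type
assert type(int) is type     # built-in types share the one metatype
```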

    (In a previous version of this PEP, there was one additional
    meta-level, and there was a meta-metatype called "turtle".  This
    turned out to be unnecessary.)

    In any case, the work for creating C is done by M's tp_new() slot.
    It allocates space for an "extended" type structure, containing:
    the type object; the auxiliary structures (as_sequence etc.); the
    string object containing the type name (to ensure that this object
    isn't deallocated while the type object is still referencing it); and
    some auxiliary storage (to be described later).  It initializes this
    storage to zeros except for a few crucial slots (for example, tp_name
    is set to point to the type name) and then sets the tp_base slot to
    point to B.  Then PyType_InitDict() is called to inherit B's slots.
    Finally, C's tp_dict slot is updated with the contents of the
    namespace dictionary (the third argument to the call to M).


Multiple inheritance

    The Python class statement supports multiple inheritance, and we
    will also support multiple inheritance involving built-in types.

    However, there are some restrictions.  The C runtime architecture
    doesn't make it feasible to have a meaningful subtype of two
    different built-in types except in a few degenerate cases.
    Changing the C runtime to support fully general multiple
    inheritance would be too much of an upheaval of the code base.

    The main problem with multiple inheritance from different built-in
    types stems from the fact that the C implementation of built-in
    types accesses structure members directly; the C compiler
    generates an offset relative to the object pointer and that's
    that.  For example, the list and dictionary type structures each
    declare a number of different but overlapping structure members.
    A C function accessing an object expecting a list won't work when
    passed a dictionary, and vice versa, and there's not much we could
    do about this without rewriting all code that accesses lists and
    dictionaries.  This would be too much work, so we won't do this.

    The problem with multiple inheritance is caused by conflicting
    structure member allocations.  Classes defined in Python normally
    don't store their instance variables in structure members: they
    are stored in an instance dictionary.  This is the key to a
    partial solution.  Suppose we have the following two classes:

      class A(dictionary):
          def foo(self): pass

      class B(dictionary):
          def bar(self): pass

      class C(A, B): pass

    (Here, 'dictionary' is the type of built-in dictionary objects,
    a.k.a. type({}) or {}.__class__ or types.DictType.)  If we look at
    the structure layout, we find that an A instance has the layout
    of a dictionary followed by the __dict__ pointer, and a B instance
    has the same layout; since there are no structure member layout
    conflicts, this is okay.

    Here's another example:

      class X(object):
          def foo(self): pass

      class Y(dictionary):
          def bar(self): pass

      class Z(X, Y): pass

    (Here, 'object' is the base for all built-in types; its structure
    layout only contains the ob_refcnt and ob_type members.)  This
    example is more complicated, because the __dict__ pointer for X
    instances has a different offset than that for Y instances.  Where
    is the __dict__ pointer for Z instances?  The answer is that the
    offset for the __dict__ pointer is not hardcoded, it is stored in
    the type object.

    Suppose on a particular machine an 'object' structure is 8 bytes
    long, and a 'dictionary' struct is 60 bytes, and an object pointer
    is 4 bytes.  Then an X structure is 12 bytes (an object structure
    followed by a __dict__ pointer), and a Y structure is 64 bytes (a
    dictionary structure followed by a __dict__ pointer).  The Z
    structure has the same layout as the Y structure in this example.
    Each type object (X, Y and Z) has a "__dict__ offset" which is
    used to find the __dict__ pointer.  Thus, the recipe for looking
    up an instance variable is:

      1. get the type of the instance
      2. get the __dict__ offset from the type object
      3. add the __dict__ offset to the instance pointer
      4. look in the resulting address to find a dictionary reference
      5. look up the instance variable name in that dictionary

    Of course, this recipe can only be implemented in C, and I have
    left out some details.  But this allows us to use multiple
    inheritance patterns similar to the ones we can use with classic
    classes.
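
    The per-type offset described in step 2 is visible from Python in
    modern CPython as the read-only __dictoffset__ attribute (its
    exact value and sign vary between CPython versions, so the sketch
    below only checks whether it is zero):

```python
class X(object):
    def foo(self): pass

class Y(dict):
    def bar(self): pass

# 'object' itself stores no instance __dict__, so its offset is zero.
assert object.__dictoffset__ == 0

# X and Y instances both carry a __dict__ pointer, found via a
# nonzero offset stored in their type objects.
assert X.__dictoffset__ != 0
assert Y.__dictoffset__ != 0
```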

    XXX I should write up the complete algorithm here to determine
    base class compatibility, but I can't be bothered right now.  Look
    at best_base() in typeobject.c in the implementation mentioned
    below.


MRO: Method resolution order (the lookup rule)

    With multiple inheritance comes the question of method resolution
    order: the order in which a class or type and its bases are
    searched looking for a method of a given name.

    In classic Python, the rule is given by the following recursive
    function, also known as the left-to-right depth-first rule:

      def classic_lookup(cls, name):
          if cls.__dict__.has_key(name):
              return cls.__dict__[name]
          for base in cls.__bases__:
              try:
                  return classic_lookup(base, name)
              except AttributeError:
                  pass
          raise AttributeError, name

    The problem with this becomes apparent when we consider a "diamond
    diagram":

                class A:
                  ^ ^  def save(self): ...
                 /   \
                /     \
               /       \
              /         \
          class B     class C:
              ^         ^  def save(self): ...
               \       /
                \     /
                 \   /
                  \ /
                class D

    Arrows point from a subtype to its base type(s).  This particular
    diagram means B and C derive from A, and D derives from B and C
    (and hence also, indirectly, from A).

    Assume that C overrides the method save(), which is defined in the
    base A.  (C.save() probably calls A.save() and then saves some of
    its own state.)  B and D don't override save().  When we invoke
    save() on a D instance, which method is called?  According to the
    classic lookup rule, A.save() is called, ignoring C.save()!

    This is not good.  It probably breaks C (its state doesn't get
    saved), defeating the whole purpose of inheriting from C in the
    first place.

    Why was this not a problem in classic Python?  Diamond diagrams
    are rarely found in classic Python class hierarchies.  Most class
    hierarchies use single inheritance, and multiple inheritance is
    usually confined to mix-in classes.  In fact, the problem shown
    here is probably the reason why multiple inheritance is unpopular
    in classic Python.

    Why will this be a problem in the new system?  The 'object' type
    at the top of the type hierarchy defines a number of methods that
    can usefully be extended by subtypes, for example __getattr__().

    (Aside: in classic Python, the __getattr__() method is not really
    the implementation for the get-attribute operation; it is a hook
    that only gets invoked when an attribute cannot be found by normal
    means.  This has often been cited as a shortcoming -- some class
    designs have a legitimate need for a __getattr__() method that
    gets called for *all* attribute references.  But then of course
    this method has to be able to invoke the default implementation
    directly.  The most natural way is to make the default
    implementation available as object.__getattr__(self, name).)

    Thus, a classic class hierarchy like this:

          class B     class C:
              ^         ^  def __getattr__(self, name): ...
               \       /
                \     /
                 \   /
                  \ /
                class D

    will change into a diamond diagram under the new system:

                object:
                  ^ ^  __getattr__()
                 /   \
                /     \
               /       \
              /         \
          class B     class C:
              ^         ^  def __getattr__(self, name): ...
               \       /
                \     /
                 \   /
                  \ /
                class D

    and while in the original diagram C.__getattr__() is invoked,
    under the new system with the classic lookup rule,
    object.__getattr__() would be invoked!

    Fortunately, there's a lookup rule that's better.  It's a bit
    difficult to explain, but it does the right thing in the diamond
    diagram, and it is the same as the classic lookup rule when there
    are no diamonds in the inheritance graph (when it is a tree).

    The new lookup rule constructs a list of all classes in the
    inheritance diagram in the order in which they will be searched.
    This construction is done at class definition time to save time.
    To explain the new lookup rule, let's first consider what such a
    list would look like for the classic lookup rule.  Note that in
    the presence of diamonds the classic lookup visits some classes
    multiple times.  For example, in the ABCD diamond diagram above,
    the classic lookup rule visits the classes in this order:

      D, B, A, C, A

    Note how A occurs twice in the list.  The second occurrence is
    redundant, since anything that could be found there would already
    have been found when searching the first occurrence.

    We use this observation to explain our new lookup rule.  Using the
    classic lookup rule, construct the list of classes that would be
    searched, including duplicates.  Now for each class that occurs in
    the list multiple times, remove all occurrences except for the
    last.  The resulting list contains each ancestor class exactly
    once (including the most derived class, D in the example).

    Searching for methods in this order will do the right thing for
    the diamond diagram.  Because of the way the list is constructed,
    it does not change the search order in situations where no diamond
    is involved.
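
    The construction can be sketched directly: list the classic
    depth-first visit order, then keep only the last occurrence of
    each class.  (The sketch uses modern Python syntax, in which
    'object' is an implicit root of every hierarchy and so appears in
    the lists alongside the PEP's A, B, C and D.)

```python
def classic_order(cls):
    # Left-to-right depth-first visit order, duplicates included.
    order = [cls]
    for base in cls.__bases__:
        order.extend(classic_order(base))
    return order

def new_order(cls):
    # Keep only the LAST occurrence of each visited class.
    visits = classic_order(cls)
    return [c for i, c in enumerate(visits) if c not in visits[i + 1:]]

class A: pass
class B(A): pass
class C(A): pass
class D(B, C): pass

# The classic rule visits A twice (and object, the implicit root,
# twice as well):
assert classic_order(D) == [D, B, A, object, C, A, object]
# Keeping only the last occurrences yields the new search order:
assert new_order(D) == [D, B, C, A, object]
```

    For this diamond the result agrees with the C3 linearization that
    later Python versions adopted as the method resolution order.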

    Isn't this backwards incompatible?  Won't it break existing code?
    It would, if we changed the method resolution order for all
    classes.  However, in Python 2.2, the new lookup rule will only be
    applied to types derived from built-in types, which is a new
    feature.  Class statements without a base class create "classic
    classes", and so do class statements whose base classes are
    themselves classic classes.  For classic classes the classic
    lookup rule will be used. (To experiment with the new lookup rule
    for classic classes, you will be able to specify a different
    metaclass explicitly.)  We'll also provide a tool that analyzes a
    class hierarchy looking for methods that would be affected by a
    change in method resolution order.

    XXX Another way to explain the motivation for the new MRO, due to
    Damian Conway: you never use the method defined in a base class if
    it is defined in a derived class that you haven't explored yet
    (using the old search order).


XXX To be done

    Additional topics to be discussed in this PEP:

      - backwards compatibility issues!!!

      - class methods and static methods

      - cooperative methods and super()

      - mapping between type object slots (tp_foo) and special methods
        (__foo__) (actually, this may belong in PEP 252)

      - built-in names for built-in types (object, int, str, list etc.)

      - __dict__ and __dictoffset__

      - __slots__

      - the HEAPTYPE flag bit

      - GC support

      - API docs for all the new functions

      - how to use __new__

      - writing metaclasses (using mro() etc.)

      - high level user overview

      - open issues:

        - do we need __del__?

        - assignment to __dict__, __bases__

        - inconsistent naming
          (e.g. tp_dealloc/tp_new/tp_init/tp_alloc/tp_free)

        - add builtin alias 'dict' for 'dictionary'?

        - when subclasses of dict/list etc. are passed to system
          functions, the __getitem__ overrides (etc.) aren't always
          used


Implementation

    A prototype implementation of this PEP (and for PEP 252) is
    available from CVS, and in the series of Python 2.2 alpha and beta
    releases.  For some examples of the features described here, see
    the file Lib/test/test_descr.py and the extension module
    Modules/xxsubtype.c.


References

    [1] "Putting Metaclasses to Work", by Ira R. Forman and Scott
        H. Danforth, Addison-Wesley 1999.
        (http://www.aw.com/product/0,2627,0201433052,00.html)


Copyright

    This document has been placed in the public domain.


pep-0254 Making Classes Look More Like Types

PEP: 254
Title: Making Classes Look More Like Types
Version: $Revision$
Last-Modified: $Date$
Author: Guido van Rossum <guido at python.org>
Status: Rejected
Type: Standards Track
Created: 18-Jun-2001
Python-Version: 2.2
Post-History: 

Abstract

    This PEP has not been written yet.  Watch this space!

Status

    This PEP was a stub entry and was eventually abandoned without
    having been filled out.  Most of the intended functionality was
    substantially implemented in Py2.2 with new-style types and
    classes.


Copyright

    This document has been placed in the public domain.


pep-0255 Simple Generators

PEP: 255
Title: Simple Generators
Version: $Revision$
Last-Modified: $Date$
Author: Neil Schemenauer <nas at arctrix.com>, Tim Peters <tim at zope.com>, Magnus Lie Hetland <magnus at hetland.org>
Discussions-To:  <python-iterators at lists.sourceforge.net>
Status: Final
Type: Standards Track
Requires: 234
Created: 18-May-2001
Python-Version: 2.2
Post-History: 14-Jun-2001, 23-Jun-2001

Abstract

    This PEP introduces the concept of generators to Python, as well
    as a new statement used in conjunction with them, the "yield"
    statement.


Motivation

    When a producer function has a hard enough job that it requires
    maintaining state between values produced, most programming languages
    offer no pleasant and efficient solution beyond adding a callback
    function to the producer's argument list, to be called with each value
    produced.

    For example, tokenize.py in the standard library takes this approach:
    the caller must pass a "tokeneater" function to tokenize(), called
    whenever tokenize() finds the next token.  This allows tokenize to be
    coded in a natural way, but programs calling tokenize are typically
    convoluted by the need to remember between callbacks which token(s)
    were seen last.  The tokeneater function in tabnanny.py is a good
    example of that, maintaining a state machine in global variables, to
    remember across callbacks what it has already seen and what it hopes to
    see next.  This was difficult to get working correctly, and is still
    difficult for people to understand.  Unfortunately, that's typical of
    this approach.

    An alternative would have been for tokenize to produce an entire parse
    of the Python program at once, in a large list.  Then tokenize clients
    could be written in a natural way, using local variables and local
    control flow (such as loops and nested if statements) to keep track of
    their state.  But this isn't practical:  programs can be very large, so
    no a priori bound can be placed on the memory needed to materialize the
    whole parse; and some tokenize clients only want to see whether
    something specific appears early in the program (e.g., a future
    statement, or, as is done in IDLE, just the first indented statement),
    and then parsing the whole program first is a severe waste of time.

    Another alternative would be to make tokenize an iterator[1],
    delivering the next token whenever its .next() method is invoked.  This
    is pleasant for the caller in the same way a large list of results
    would be, but without the memory and "what if I want to get out early?"
    drawbacks.  However, this shifts the burden onto tokenize to remember
    *its* state between .next() invocations, and the reader need only
    glance at tokenize.tokenize_loop() to realize what a horrid chore that
    would be.  Or picture a recursive algorithm for producing the nodes of
    a general tree structure:  to cast that into an iterator framework
    requires removing the recursion manually and maintaining the state of
    the traversal by hand.

    A fourth option is to run the producer and consumer in separate
    threads.  This allows both to maintain their states in natural ways,
    and so is pleasant for both.  Indeed, Demo/threads/Generator.py in the
    Python source distribution provides a usable synchronized-communication
    class for doing that in a general way.  This doesn't work on platforms
    without threads, though, and is very slow on platforms that do
    (compared to what is achievable without threads).

    A final option is to use the Stackless[2][3] variant implementation of
    Python instead, which supports lightweight coroutines.  This has much
    the same programmatic benefits as the thread option, but is much more
    efficient.  However, Stackless is a controversial rethinking of the
    Python core, and it may not be possible for Jython to implement the
    same semantics.  This PEP isn't the place to debate that, so suffice it
    to say here that generators provide a useful subset of Stackless
    functionality in a way that fits easily into the current CPython
    implementation, and is believed to be relatively straightforward for
    other Python implementations.

    That exhausts the current alternatives.  Some other high-level
    languages provide pleasant solutions, notably iterators in Sather[4],
    which were inspired by iterators in CLU; and generators in Icon[5], a
    novel language where every expression "is a generator".  There are
    differences among these, but the basic idea is the same:  provide a
    kind of function that can return an intermediate result ("the next
    value") to its caller, while maintaining the function's local state so
    that the function can be resumed again right where it left off.  A
    very simple example:

        def fib():
            a, b = 0, 1
            while 1:
                yield b
                a, b = b, a+b

    When fib() is first invoked, it sets a to 0 and b to 1, then yields b
    back to its caller.  The caller sees 1.  When fib is resumed, from its
    point of view the yield statement is really the same as, say, a print
    statement:  fib continues after the yield with all local state intact.
    a and b then become 1 and 1, and fib loops back to the yield, yielding
    1 to its invoker.  And so on.  From fib's point of view it's just
    delivering a sequence of results, as if via callback.  But from its
    caller's point of view, the fib invocation is an iterable object that
    can be resumed at will.  As in the thread approach, this allows both
    sides to be coded in the most natural ways; but unlike the thread
    approach, this can be done efficiently and on all platforms.  Indeed,
    resuming a generator should be no more expensive than a function call.
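    The caller's side of this exchange can be sketched as follows; the
    sketch uses the modern builtin next() where this PEP's text says
    .next():

```python
def fib():
    a, b = 0, 1
    while 1:
        yield b
        a, b = b, a + b

gen = fib()                # an iterable object, resumable at will
first_six = [next(gen) for _ in range(6)]
print(first_six)           # -> [1, 1, 2, 3, 5, 8]
```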

    The same kind of approach applies to many producer/consumer functions.
    For example, tokenize.py could yield the next token instead of invoking
    a callback function with it as argument, and tokenize clients could
    iterate over the tokens in a natural way:  a Python generator is a kind
    of Python iterator[1], but of an especially powerful kind.
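    To make the tokenize idea concrete, here is a hypothetical,
    much-simplified token producer (the function name and the token
    classification are illustrative only, not the real tokenize.py
    interface):

```python
import re

def tokens(source):
    # Toy tokenizer: yields (kind, text) pairs one at a time instead
    # of invoking a callback for each token, so the caller keeps its
    # own state in ordinary local variables.
    for match in re.finditer(r"\d+|[A-Za-z_]\w*|\S", source):
        text = match.group()
        if text.isdigit():
            yield ("NUMBER", text)
        elif text[0].isalpha() or text[0] == "_":
            yield ("NAME", text)
        else:
            yield ("OP", text)

# The client iterates naturally, and may stop as early as it likes.
for kind, text in tokens("x = 42"):
    print(kind, text)
```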


Specification: Yield

    A new statement is introduced:

        yield_stmt:    "yield" expression_list

    "yield" is a new keyword, so a future statement[8] is needed to phase
    this in:  in the initial release, a module desiring to use generators
    must include the line

        from __future__ import generators

    near the top (see PEP 236[8] for details).  Modules using the
    identifier "yield" without a future statement will trigger warnings.
    In the following release, yield will be a language keyword and the
    future statement will no longer be needed.

    The yield statement may only be used inside functions.  A function that
    contains a yield statement is called a generator function.  A generator
    function is an ordinary function object in all respects, but has the
    new CO_GENERATOR flag set in the code object's co_flags member.

    When a generator function is called, the actual arguments are bound to
    function-local formal argument names in the usual way, but no code in
    the body of the function is executed.  Instead a generator-iterator
    object is returned; this conforms to the iterator protocol[6], so in
    particular can be used in for-loops in a natural way.  Note that when
    the intent is clear from context, the unqualified name "generator" may
    be used to refer either to a generator-function or a
    generator-iterator.

    Each time the .next() method of a generator-iterator is invoked, the
    code in the body of the generator-function is executed until a yield
    or return statement (see below) is encountered, or until the end of
    the body is reached.

    If a yield statement is encountered, the state of the function is
    frozen, and the value of expression_list is returned to .next()'s
    caller.  By "frozen" we mean that all local state is retained,
    including the current bindings of local variables, the instruction
    pointer, and the internal evaluation stack:  enough information is
    saved so that the next time .next() is invoked, the function can
    proceed exactly as if the yield statement were just another external
    call.
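    Both rules -- no body code runs at call time, and all local state is
    frozen across a yield -- can be observed directly.  A minimal sketch,
    using the modern spelling next(it) for what this PEP calls it.next():

```python
started = []

def g():
    started.append(True)   # runs only when the body first executes
    a = 1
    yield a                # frozen here: locals, instruction pointer
    a += 1                 # resumes with the binding of `a` intact
    yield a

it = g()                   # returns a generator-iterator; body not entered
assert started == []       # proof: nothing in the body has run yet
assert next(it) == 1       # executes up to the first yield
assert next(it) == 2       # resumes exactly where it left off
```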

    Restriction:  A yield statement is not allowed in the try clause of a
    try/finally construct.  The difficulty is that there's no guarantee
    the generator will ever be resumed, hence no guarantee that the finally
    block will ever get executed; that's too much a violation of finally's
    purpose to bear.

    Restriction:  A generator cannot be resumed while it is actively
    running:

        >>> def g():
        ...     i = me.next()
        ...     yield i
        >>> me = g()
        >>> me.next()
        Traceback (most recent call last):
         ...
          File "<string>", line 2, in g
        ValueError: generator already executing


Specification: Return

    A generator function can also contain return statements of the form:

        "return"

    Note that an expression_list is not allowed on return statements
    in the body of a generator (although, of course, they may appear in
    the bodies of non-generator functions nested within the generator).

    When a return statement is encountered, control proceeds as in any
    function return, executing the appropriate finally clauses (if any
    exist).  Then a StopIteration exception is raised, signalling that the
    iterator is exhausted.  A StopIteration exception is also raised if
    control flows off the end of the generator without an explicit return.

    Note that return means "I'm done, and have nothing interesting to
    return", for both generator functions and non-generator functions.

    Note that return isn't always equivalent to raising StopIteration:  the
    difference lies in how enclosing try/except constructs are treated.
    For example,

        >>> def f1():
        ...     try:
        ...         return
        ...     except:
        ...        yield 1
        >>> print list(f1())
        []

    because, as in any function, return simply exits, but

        >>> def f2():
        ...     try:
        ...         raise StopIteration
        ...     except:
        ...         yield 42
        >>> print list(f2())
        [42]

    because StopIteration is captured by a bare "except", as is any
    exception.


Specification: Generators and Exception Propagation

    If an unhandled exception-- including, but not limited to,
    StopIteration --is raised by, or passes through, a generator function,
    then the exception is passed on to the caller in the usual way, and
    subsequent attempts to resume the generator function raise
    StopIteration.  In other words, an unhandled exception terminates a
    generator's useful life.

    Example (not idiomatic but to illustrate the point):

    >>> def f():
    ...     return 1/0
    >>> def g():
    ...     yield f()  # the zero division exception propagates
    ...     yield 42   # and we'll never get here
    >>> k = g()
    >>> k.next()
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
      File "<stdin>", line 2, in g
      File "<stdin>", line 2, in f
    ZeroDivisionError: integer division or modulo by zero
    >>> k.next()  # and the generator cannot be resumed
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
    StopIteration
    >>>


Specification: Try/Except/Finally

    As noted earlier, yield is not allowed in the try clause of a try/
    finally construct.  A consequence is that generators should allocate
    critical resources with great care.  There is no restriction on yield
    otherwise appearing in finally clauses, except clauses, or in the try
    clause of a try/except construct:

    >>> def f():
    ...     try:
    ...         yield 1
    ...         try:
    ...             yield 2
    ...             1/0
    ...             yield 3  # never get here
    ...         except ZeroDivisionError:
    ...             yield 4
    ...             yield 5
    ...             raise
    ...         except:
    ...             yield 6
    ...         yield 7     # the "raise" above stops this
    ...     except:
    ...         yield 8
    ...     yield 9
    ...     try:
    ...         x = 12
    ...     finally:
    ...         yield 10
    ...     yield 11
    >>> print list(f())
    [1, 2, 4, 5, 8, 9, 10, 11]
    >>>


Example

        # A binary tree class.
        class Tree:

            def __init__(self, label, left=None, right=None):
                self.label = label
                self.left = left
                self.right = right

            def __repr__(self, level=0, indent="    "):
                s = level*indent + `self.label`
                if self.left:
                    s = s + "\n" + self.left.__repr__(level+1, indent)
                if self.right:
                    s = s + "\n" + self.right.__repr__(level+1, indent)
                return s

            def __iter__(self):
                return inorder(self)

        # Create a Tree from a list.
        def tree(list):
            n = len(list)
            if n == 0:
                return []
            i = n / 2
            return Tree(list[i], tree(list[:i]), tree(list[i+1:]))

        # A recursive generator that generates Tree labels in in-order.
        def inorder(t):
            if t:
                for x in inorder(t.left):
                    yield x
                yield t.label
                for x in inorder(t.right):
                    yield x

        # Show it off: create a tree.
        t = tree("ABCDEFGHIJKLMNOPQRSTUVWXYZ")
        # Print the nodes of the tree in in-order.
        for x in t:
            print x,
        print

        # A non-recursive generator.
        def inorder(node):
            stack = []
            while node:
                while node.left:
                    stack.append(node)
                    node = node.left
                yield node.label
                while not node.right:
                    try:
                        node = stack.pop()
                    except IndexError:
                        return
                    yield node.label
                node = node.right

        # Exercise the non-recursive generator.
        for x in t:
            print x,
        print

    Both output blocks display:

        A B C D E F G H I J K L M N O P Q R S T U V W X Y Z


Q & A

    Q. Why not a new keyword instead of reusing "def"?

    A. See BDFL Pronouncements section below.

    Q. Why a new keyword for "yield"?  Why not a builtin function instead?

    A. Control flow is much better expressed via keyword in Python, and
       yield is a control construct.  It's also believed that efficient
       implementation in Jython requires that the compiler be able to
       determine potential suspension points at compile-time, and a new
       keyword makes that easy.  The CPython reference implementation
       also exploits it heavily, to detect which functions *are*
       generator-functions (although a new keyword in place of "def"
       would solve that for CPython -- but people asking the "why a new
       keyword?" question don't want any new keyword).

    Q: Then why not some other special syntax without a new keyword?  For
       example, one of these instead of "yield 3":

       return 3 and continue
       return and continue 3
       return generating 3
       continue return 3
       return >> , 3
       from generator return 3
       return >> 3
       return << 3
       >> 3
       << 3
       * 3

    A: Did I miss one <wink>?  Out of hundreds of messages, I counted three
       suggesting such an alternative, and extracted the above from them.
       It would be nice not to need a new keyword, but nicer to make yield
       very clear -- I don't want to have to *deduce* that a yield is
       occurring from making sense of a previously senseless sequence of
       keywords or operators.  Still, if this attracts enough interest,
       proponents should settle on a single consensus suggestion, and Guido
       will Pronounce on it.

    Q. Why allow "return" at all?  Why not force termination to be spelled
       "raise StopIteration"?

    A. The mechanics of StopIteration are low-level details, much like the
       mechanics of IndexError in Python 2.1:  the implementation needs to
       do *something* well-defined under the covers, and Python exposes
       these mechanisms for advanced users.  That's not an argument for
       forcing everyone to work at that level, though.  "return" means "I'm
       done" in any kind of function, and that's easy to explain and to use.
       Note that "return" isn't always equivalent to "raise StopIteration"
       in a try/except construct, either (see the "Specification: Return"
       section).

    Q. Then why not allow an expression on "return" too?

    A. Perhaps we will someday.  In Icon, "return expr" means both "I'm
       done", and "but I have one final useful value to return too, and
       this is it".  At the start, and in the absence of compelling uses
       for "return expr", it's simply cleaner to use "yield" exclusively
       for delivering values.


BDFL Pronouncements

    Issue:  Introduce another new keyword (say, "gen" or "generator") in
    place of "def", or otherwise alter the syntax, to distinguish
    generator-functions from non-generator functions.

    Con:  In practice (how you think about them), generators *are*
    functions, but with the twist that they're resumable.  The mechanics of
    how they're set up is a comparatively minor technical issue, and
    introducing a new keyword would unhelpfully overemphasize the
    mechanics of how generators get started (a vital but tiny part of a
    generator's life).

    Pro:  In reality (how you think about them), generator-functions are
    actually factory functions that produce generator-iterators as if by
    magic.  In this respect they're radically different from non-generator
    functions, acting more like a constructor than a function, so reusing
    "def" is at best confusing.  A "yield" statement buried in the body is
    not enough warning that the semantics are so different.

    BDFL:  "def" it stays.  No argument on either side is totally
    convincing, so I have consulted my language designer's intuition.  It
    tells me that the syntax proposed in the PEP is exactly right - not too
    hot, not too cold.  But, like the Oracle at Delphi in Greek mythology,
    it doesn't tell me why, so I don't have a rebuttal for the arguments
    against the PEP syntax.  The best I can come up with (apart from
    agreeing with the rebuttals ... already made) is "FUD".  If this had
    been part of the language from day one, I very much doubt it would have
    made Andrew Kuchling's "Python Warts" page.


Reference Implementation

    The current implementation, in a preliminary state (no docs, but well
    tested and solid), is part of Python's CVS development tree[9].  Using
    this requires that you build Python from source.

    This was derived from an earlier patch by Neil Schemenauer[7].


Footnotes and References

    [1] PEP 234, Iterators, Yee, Van Rossum
        http://www.python.org/dev/peps/pep-0234/

    [2] http://www.stackless.com/

    [3] PEP 219, Stackless Python, McMillan
        http://www.python.org/dev/peps/pep-0219/

    [4] "Iteration Abstraction in Sather"
        Murer, Omohundro, Stoutamire and Szyperski
        http://www.icsi.berkeley.edu/~sather/Publications/toplas.html

    [5] http://www.cs.arizona.edu/icon/

    [6] The concept of iterators is described in PEP 234.  See [1] above.

    [7] http://python.ca/nas/python/generator.diff

    [8] PEP 236, Back to the __future__, Peters
        http://www.python.org/dev/peps/pep-0236/

    [9] To experiment with this implementation, check out Python from CVS
        according to the instructions at
            http://sf.net/cvs/?group_id=5470
        Note that the std test Lib/test/test_generators.py contains many
        examples, including all those in this PEP.


Copyright

    This document has been placed in the public domain.



pep-0256 Docstring Processing System Framework

PEP:256
Title:Docstring Processing System Framework
Version:$Revision$
Last-Modified:$Date$
Author:David Goodger <goodger at python.org>
Discussions-To:<doc-sig at python.org>
Status:Rejected
Type:Standards Track
Content-Type:text/x-rst
Created:01-Jun-2001
Post-History:13-Jun-2001

Rejection Notice

This proposal seems to have run out of steam.

Abstract

Python lends itself to inline documentation. With its built-in docstring syntax, a limited form of Literate Programming [4] is easy to do in Python. However, there are no satisfactory standard tools for extracting and processing Python docstrings. The lack of a standard toolset is a significant gap in Python's infrastructure; this PEP aims to fill the gap.

The issues surrounding docstring processing have been contentious and difficult to resolve. This PEP proposes a generic Docstring Processing System (DPS) framework, which separates out the components (program and conceptual), enabling the resolution of individual issues either through consensus (one solution) or through divergence (many). It promotes standard interfaces which will allow a variety of plug-in components (input context readers, markup parsers, and output format writers) to be used.

The concepts of a DPS framework are presented independently of implementation details.

Road Map to the Docstring PEPs

There are many aspects to docstring processing. The "Docstring PEPs" have broken up the issues in order to deal with each of them in isolation, or as close as possible. The individual aspects and associated PEPs are as follows:

  • Docstring syntax. PEP 287, "reStructuredText Docstring Format" [1], proposes a syntax for Python docstrings, PEPs, and other uses.
  • Docstring semantics consist of at least two aspects:
    • Conventions: the high-level structure of docstrings. Dealt with in PEP 257, "Docstring Conventions" [2].
    • Methodology: rules for the informational content of docstrings. Not addressed.
  • Processing mechanisms. This PEP (PEP 256) outlines the high-level issues and specification of an abstract docstring processing system (DPS). PEP 258, "Docutils Design Specification" [3], is an overview of the design and implementation of one DPS under development.
  • Output styles: developers want the documentation generated from their source code to look good, and there are many different ideas about what that means. PEP 258 touches on "Stylist Transforms". This aspect of docstring processing has yet to be fully explored.

By separating out the issues, we can form consensus more easily (smaller fights ;-), and accept divergence more readily.

Rationale

There are standard inline documentation systems for some other languages. For example, Perl has POD [5] ("Plain Old Documentation") and Java has Javadoc [6], but neither of these mesh with the Pythonic way. POD syntax is very explicit, but takes after Perl in terms of readability. Javadoc is HTML-centric; except for "@field" tags, raw HTML is used for markup. There are also general tools such as Autoduck [7] and Web [8] (Tangle & Weave), useful for multiple languages.

There have been many attempts to write auto-documentation systems for Python (not an exhaustive list):

These systems, each with different goals, have had varying degrees of success. A problem with many of the above systems was over-ambition combined with inflexibility. They provided a self-contained set of components: a docstring extraction system, a markup parser, an internal processing system and one or more output format writers with a fixed style. Inevitably, one or more aspects of each system had serious shortcomings, and they were not easily extended or modified, preventing them from being adopted as standard tools.

It has become clear (to this author, at least) that the "all or nothing" approach cannot succeed, since no monolithic self-contained system could possibly be agreed upon by all interested parties. A modular component approach designed for extension, where components may be multiply implemented, may be the only chance for success. Standard inter-component APIs will make the DPS components comprehensible without requiring detailed knowledge of the whole, lowering the barrier for contributions, and ultimately resulting in a rich and varied system.

Each of the components of a docstring processing system should be developed independently. A "best of breed" system should be chosen, merged from existing systems and/or developed anew. This system should be included in Python's standard library.

PyDoc & Other Existing Systems

PyDoc became part of the Python standard library as of release 2.1. It extracts and displays docstrings from within the Python interactive interpreter, from the shell command line, and from a GUI window into a web browser (HTML). Although a very useful tool, PyDoc has several deficiencies, including:

  • In the case of the GUI/HTML, except for some heuristic hyperlinking of identifier names, no formatting of the docstrings is done. They are presented within <p><small><tt> tags to avoid unwanted line wrapping. Unfortunately, the result is not attractive.
  • PyDoc extracts docstrings and structural information (class identifiers, method signatures, etc.) from imported module objects. There are security issues involved with importing untrusted code. Also, information from the source is lost when importing, such as comments, "additional docstrings" (string literals in non-docstring contexts; see PEP 258 [3]), and the order of definitions.

The functionality proposed in this PEP could be added to or used by PyDoc when serving HTML pages. The proposed docstring processing system's functionality is much more than PyDoc needs in its current form. Either an independent tool will be developed (which PyDoc may or may not use), or PyDoc could be expanded to encompass this functionality and become the docstring processing system (or one such system). That decision is beyond the scope of this PEP.

Similarly for other existing docstring processing systems, their authors may or may not choose compatibility with this framework. However, if this framework is accepted and adopted as the Python standard, compatibility will become an important consideration in these systems' future.

Specification

The docstring processing system framework is broken up as follows:

  1. Docstring conventions. Documents issues such as:

    • What should be documented where.
    • First line is a one-line synopsis.

    PEP 257 [2] documents some of these issues.

  2. Docstring processing system design specification. Documents issues such as:

    • High-level spec: what a DPS does.
    • Command-line interface for executable script.
    • System Python API.
    • Docstring extraction rules.
    • Readers, which encapsulate the input context.
    • Parsers.
    • Document tree: the intermediate internal data structure. The output of the Reader and Parser and the input to the Writer all share this same data structure.
    • Transforms, which modify the document tree.
    • Writers for output formats.
    • Distributors, which handle output management (one file, many files, or objects in memory).

    These issues are applicable to any docstring processing system implementation. PEP 258 [3] documents these issues.

  3. Docstring processing system implementation.

  4. Input markup specifications: docstring syntax. PEP 287 [1] proposes a standard syntax.

  5. Input parser implementations.

  6. Input context readers ("modes": Python source code, PEP, standalone text file, email, etc.) and implementations.

  7. Stylists: certain input context readers may have associated stylists which allow for a variety of output document styles.

  8. Output formats (HTML, XML, TeX, DocBook, info, etc.) and writer implementations.

Components 1, 2/3/5, and 4 are the subject of individual companion PEPs. If there is another implementation of the framework or syntax/parser, additional PEPs may be required. Multiple implementations of each of components 6 and 7 will be required; the PEP mechanism may be overkill for these components.

Project Web Site

A SourceForge project has been set up for this work at http://docutils.sourceforge.net/.

Acknowledgements

This document borrows ideas from the archives of the Python Doc-SIG [16]. Thanks to all members past & present.

pep-0257 Docstring Conventions

PEP:257
Title:Docstring Conventions
Version:$Revision$
Last-Modified:$Date$
Author:David Goodger <goodger at python.org>, Guido van Rossum <guido at python.org>
Discussions-To:doc-sig at python.org
Status:Active
Type:Informational
Content-Type:text/x-rst
Created:29-May-2001
Post-History:13-Jun-2001

Abstract

This PEP documents the semantics and conventions associated with Python docstrings.

Rationale

The aim of this PEP is to standardize the high-level structure of docstrings: what they should contain, and how to say it (without touching on any markup syntax within docstrings). The PEP contains conventions, not laws or syntax.

"A universal convention supplies all of maintainability, clarity, consistency, and a foundation for good programming habits too. What it doesn't do is insist that you follow it against your will. That's Python!"

—Tim Peters on comp.lang.python, 2001-06-16

If you violate these conventions, the worst you'll get is some dirty looks. But some software (such as the Docutils [3] docstring processing system [1] [2]) will be aware of the conventions, so following them will get you the best results.

Specification

What is a Docstring?

A docstring is a string literal that occurs as the first statement in a module, function, class, or method definition. Such a docstring becomes the __doc__ special attribute of that object.

All modules should normally have docstrings, and all functions and classes exported by a module should also have docstrings. Public methods (including the __init__ constructor) should also have docstrings. A package may be documented in the module docstring of the __init__.py file in the package directory.

String literals occurring elsewhere in Python code may also act as documentation. They are not recognized by the Python bytecode compiler and are not accessible as runtime object attributes (i.e. not assigned to __doc__), but two types of extra docstrings may be extracted by software tools:

  1. String literals occurring immediately after a simple assignment at the top level of a module, class, or __init__ method are called "attribute docstrings".
  2. String literals occurring immediately after another docstring are called "additional docstrings".

Please see PEP 258, "Docutils Design Specification" [2], for a detailed description of attribute and additional docstrings.
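A minimal sketch of both extra docstring types (the class and names are hypothetical). The string literals are legal Python, but the compiler discards them, so only source-reading tools can recover them:

```python
class Point:
    """Conventional class docstring; becomes Point.__doc__."""

    x = 0.0
    """Attribute docstring for x (seen only by source-reading tools)."""

def dist():
    """Conventional function docstring."""
    """Additional docstring: legal, discarded at runtime, but
    extractable by source-reading tools such as those PEP 258 describes."""
    return 0

# Only the conventional docstrings survive as __doc__:
assert Point.__doc__.startswith("Conventional class")
assert dist.__doc__ == "Conventional function docstring."
```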

XXX Mention docstrings of 2.2 properties.

For consistency, always use """triple double quotes""" around docstrings. Use r"""raw triple double quotes""" if you use any backslashes in your docstrings. For Unicode docstrings, use u"""Unicode triple-quoted strings""".

There are two forms of docstrings: one-liners and multi-line docstrings.

One-line Docstrings

One-liners are for really obvious cases. They should really fit on one line. For example:

def kos_root():
    """Return the pathname of the KOS root directory."""
    global _kos_root
    if _kos_root: return _kos_root
    ...

Notes:

  • Triple quotes are used even though the string fits on one line. This makes it easy to later expand it.

  • The closing quotes are on the same line as the opening quotes. This looks better for one-liners.

  • There's no blank line either before or after the docstring.

  • The docstring is a phrase ending in a period. It prescribes the function or method's effect as a command ("Do this", "Return that"), not as a description; e.g. don't write "Returns the pathname ...".

  • The one-line docstring should NOT be a "signature" reiterating the function/method parameters (which can be obtained by introspection). Don't do:

    def function(a, b):
        """function(a, b) -> list"""
    

    This type of docstring is only appropriate for C functions (such as built-ins), where introspection is not possible. However, the nature of the return value cannot be determined by introspection, so it should be mentioned. The preferred form for such a docstring would be something like:

    def function(a, b):
        """Do X and return a list."""
    

    (Of course "Do X" should be replaced by a useful description!)

Multi-line Docstrings

Multi-line docstrings consist of a summary line just like a one-line docstring, followed by a blank line, followed by a more elaborate description. The summary line may be used by automatic indexing tools; it is important that it fits on one line and is separated from the rest of the docstring by a blank line. The summary line may be on the same line as the opening quotes or on the next line. The entire docstring is indented the same as the quotes at its first line (see example below).

Insert a blank line after all docstrings (one-line or multi-line) that document a class -- generally speaking, the class's methods are separated from each other by a single blank line, and the docstring needs to be offset from the first method by a blank line.

The docstring of a script (a stand-alone program) should be usable as its "usage" message, printed when the script is invoked with incorrect or missing arguments (or perhaps with a "-h" option, for "help"). Such a docstring should document the script's function and command line syntax, environment variables, and files. Usage messages can be fairly elaborate (several screens full) and should be sufficient for a new user to use the command properly, as well as a complete quick reference to all options and arguments for the sophisticated user.

The docstring for a module should generally list the classes, exceptions and functions (and any other objects) that are exported by the module, with a one-line summary of each. (These summaries generally give less detail than the summary line in the object's docstring.) The docstring for a package (i.e., the docstring of the package's __init__.py module) should also list the modules and subpackages exported by the package.

The docstring for a function or method should summarize its behavior and document its arguments, return value(s), side effects, exceptions raised, and restrictions on when it can be called (all if applicable). Optional arguments should be indicated. It should be documented whether keyword arguments are part of the interface.

The docstring for a class should summarize its behavior and list the public methods and instance variables. If the class is intended to be subclassed, and has an additional interface for subclasses, this interface should be listed separately (in the docstring). The class constructor should be documented in the docstring for its __init__ method. Individual methods should be documented by their own docstring.

If a class subclasses another class and its behavior is mostly inherited from that class, its docstring should mention this and summarize the differences. Use the verb "override" to indicate that a subclass method replaces a superclass method and does not call the superclass method; use the verb "extend" to indicate that a subclass method calls the superclass method (in addition to its own behavior).

Do not use the Emacs convention of mentioning the arguments of functions or methods in upper case in running text. Python is case sensitive and the argument names can be used for keyword arguments, so the docstring should document the correct argument names. It is best to list each argument on a separate line. For example:

def complex(real=0.0, imag=0.0):
    """Form a complex number.

    Keyword arguments:
    real -- the real part (default 0.0)
    imag -- the imaginary part (default 0.0)
    """
    if imag == 0.0 and real == 0.0:
        return complex_zero
    ...

Unless the entire docstring fits on a line, place the closing quotes on a line by themselves. This way, Emacs' fill-paragraph command can be used on it.

Handling Docstring Indentation

Docstring processing tools will strip a uniform amount of indentation from the second and further lines of the docstring, equal to the minimum indentation of all non-blank lines after the first line. Any indentation in the first line of the docstring (i.e., up to the first newline) is insignificant and removed. Relative indentation of later lines in the docstring is retained. Blank lines should be removed from the beginning and end of the docstring.

Since code is much more precise than words, here is an implementation of the algorithm:

import sys

def trim(docstring):
    if not docstring:
        return ''
    # Convert tabs to spaces (following the normal Python rules)
    # and split into a list of lines:
    lines = docstring.expandtabs().splitlines()
    # Determine minimum indentation (first line doesn't count):
    indent = sys.maxint
    for line in lines[1:]:
        stripped = line.lstrip()
        if stripped:
            indent = min(indent, len(line) - len(stripped))
    # Remove indentation (first line is special):
    trimmed = [lines[0].strip()]
    if indent < sys.maxint:
        for line in lines[1:]:
            trimmed.append(line[indent:].rstrip())
    # Strip off trailing and leading blank lines:
    while trimmed and not trimmed[-1]:
        trimmed.pop()
    while trimmed and not trimmed[0]:
        trimmed.pop(0)
    # Return a single string:
    return '\n'.join(trimmed)

The docstring in this example contains two newline characters and is therefore 3 lines long. The first and last lines are blank:

def foo():
    """
    This is the second line of the docstring.
    """

To illustrate:

>>> print repr(foo.__doc__)
'\n    This is the second line of the docstring.\n    '
>>> foo.__doc__.splitlines()
['', '    This is the second line of the docstring.', '    ']
>>> trim(foo.__doc__)
'This is the second line of the docstring.'

Once trimmed, these docstrings are equivalent:

def foo():
    """A multi-line
    docstring.
    """

def bar():
    """
    A multi-line
    docstring.
    """

Acknowledgements

The "Specification" text comes mostly verbatim from the Python Style Guide [4] essay by Guido van Rossum.

This document borrows ideas from the archives of the Python Doc-SIG [5]. Thanks to all members past and present.

pep-0258 Docutils Design Specification

PEP:258
Title:Docutils Design Specification
Version:$Revision$
Last-Modified:$Date$
Author:David Goodger <goodger at python.org>
Discussions-To:<doc-sig at python.org>
Status:Rejected
Type:Standards Track
Content-Type:text/x-rst
Requires:256, 257
Created:31-May-2001
Post-History:13-Jun-2001

Rejection Notice

While this may serve as an interesting design document for the now-independent docutils, it is no longer slated for inclusion in the standard library.

Abstract

This PEP documents design issues and implementation details for Docutils, a Python Docstring Processing System (DPS). The rationale and high-level concepts of a DPS are documented in PEP 256, "Docstring Processing System Framework" [1]. Also see PEP 256 for a "Road Map to the Docstring PEPs".

Docutils is being designed modularly so that any of its components can be replaced easily. In addition, Docutils is not limited to the processing of Python docstrings; it processes standalone documents as well, in several contexts.

No changes to the core Python language are required by this PEP. Its deliverables consist of a package for the standard library and its documentation.

Specification

Docutils Project Model

Project components and data flow:

                 +---------------------------+
                 |        Docutils:          |
                 | docutils.core.Publisher,  |
                 | docutils.core.publish_*() |
                 +---------------------------+
                  /            |            \
                 /             |             \
        1,3,5   /        6     |              \ 7
       +--------+       +-------------+       +--------+
       | READER | ----> | TRANSFORMER | ====> | WRITER |
       +--------+       +-------------+       +--------+
        /     \\                                  |
       /       \\                                 |
 2    /      4  \\                             8  |
+-------+   +--------+                        +--------+
| INPUT |   | PARSER |                        | OUTPUT |
+-------+   +--------+                        +--------+

The numbers above each component indicate the path a document's data takes. Double-width lines between Reader & Parser and between Transformer & Writer indicate that data sent along these paths should be standard (pure & unextended) Docutils doc trees. Single-width lines signify that internal tree extensions or completely unrelated representations are possible, but they must be supported at both ends.

Publisher

The docutils.core module contains a "Publisher" facade class and several convenience functions: "publish_cmdline()" (for command-line front ends), "publish_file()" (for programmatic use with file-like I/O), and "publish_string()" (for programmatic use with string I/O). The Publisher class encapsulates the high-level logic of a Docutils system. The Publisher class has overall responsibility for processing, controlled by the Publisher.publish() method:

  1. Set up internal settings (may include config files & command-line options) and I/O objects.
  2. Call the Reader object to read data from the source Input object and parse the data with the Parser object. A document object is returned.
  3. Set up and apply transforms via the Transformer object attached to the document.
  4. Call the Writer object which translates the document to the final output format and writes the formatted data to the destination Output object. Depending on the Output object, the output may be returned from the Writer, and then from the publish() method.

Calling the "publish" function (or instantiating a "Publisher" object) with component names will result in default behavior. For custom behavior (customizing component settings), create custom component objects first, and pass them to the Publisher or publish_* convenience functions.

Readers

Readers understand the input context (where the data is coming from), send the whole input or discrete "chunks" to the parser, and provide the context to bind the chunks together back into a cohesive whole.

Each reader is a module or package exporting a "Reader" class with a "read" method. The base "Reader" class can be found in the docutils/readers/__init__.py module.

Most Readers will have to be told what parser to use. So far (see the list of examples below), only the Python Source Reader ("PySource"; still incomplete) will be able to determine the parser on its own.

Responsibilities:

  • Get input text from the source I/O.
  • Pass the input text to the parser, along with a fresh document tree root.

Examples:

  • Standalone (Raw/Plain): Just read a text file and process it. The reader needs to be told which parser to use.

    The "Standalone Reader" has been implemented in module docutils.readers.standalone.

  • Python Source: See Python Source Reader below. This Reader is currently in development in the Docutils sandbox.

  • Email: RFC-822 headers, quoted excerpts, signatures, MIME parts.

  • PEP: RFC-822 headers, "PEP xxxx" and "RFC xxxx" conversion to URIs. The "PEP Reader" has been implemented in module docutils.readers.pep; see PEP 287 and PEP 12.

  • Wiki: Global reference lookups of "wiki links" incorporated into transforms. (CamelCase only or unrestricted?) Lazy indentation?

  • Web Page: As standalone, but recognize meta fields as meta tags. Support for templates of some sort? (After <body>, before </body>?)

  • FAQ: Structured "question & answer(s)" constructs.

  • Compound document: Merge chapters into a book. Master manifest file?

Parsers

Parsers analyze their input and produce a Docutils document tree. They don't know or care anything about the source or destination of the data.

Each input parser is a module or package exporting a "Parser" class with a "parse" method. The base "Parser" class can be found in the docutils/parsers/__init__.py module.

Responsibilities: Given raw input text and a doctree root node, populate the doctree by parsing the input text.

Example: The only parser implemented so far is for the reStructuredText markup. It is implemented in the docutils/parsers/rst/ package.

The development and integration of other parsers is possible and encouraged.

Transformer

The Transformer class, in docutils/transforms/__init__.py, stores transforms and applies them to documents. A transformer object is attached to every new document tree. The Publisher calls Transformer.apply_transforms() to apply all stored transforms to the document tree. Transforms change the document tree from one form to another, add to the tree, or prune it. Transforms resolve references and footnote numbers, process interpreted text, and do other context-sensitive processing.

Some transforms are specific to components (Readers, Parsers, Writers, Input, Output). Standard component-specific transforms are specified in the default_transforms attribute of component classes. After the Reader has finished processing, the Publisher calls Transformer.populate_from_components() with a list of components and all default transforms are stored.

Each transform is a class in a module in the docutils/transforms/ package, a subclass of docutils.transforms.Transform. Transform classes each have a default_priority attribute which is used by the Transformer to apply transforms in order (low to high). The default priority can be overridden when adding transforms to the Transformer object.

Transformer responsibilities:

  • Apply transforms to the document tree, in priority order.
  • Store a mapping of component type name ('reader', 'writer', etc.) to component objects. These are used by certain transforms (such as "components.Filter") to determine suitability.

Transform responsibilities:

  • Modify a doctree in-place, either purely transforming one structure into another, or adding new structures based on the doctree and/or external data.

Examples of transforms (in the docutils/transforms/ package):

  • frontmatter.DocInfo: Conversion of document metadata (bibliographic information).
  • references.AnonymousHyperlinks: Resolution of anonymous references to corresponding targets.
  • parts.Contents: Generates a table of contents for a document.
  • document.Merger: Combining multiple populated doctrees into one. (Not yet implemented or fully understood.)
  • document.Splitter: Splits a document into a tree-structure of subdocuments, perhaps by section. It will have to transform references appropriately. (Neither implemented nor remotely understood.)
  • components.Filter: Includes or excludes elements which depend on a specific Docutils component.
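The priority-ordered application of transforms can be sketched as follows. This is a toy model, not the Docutils implementation; it only illustrates how default_priority attributes (low to high) control application order:

```python
class Transformer:
    """Store (priority, transform) pairs and apply them low to high."""
    def __init__(self, document):
        self.document = document
        self.transforms = []

    def add_transform(self, transform, priority=None):
        # The transform's default_priority may be overridden here.
        if priority is None:
            priority = transform.default_priority
        self.transforms.append((priority, transform))

    def apply_transforms(self):
        for priority, transform in sorted(self.transforms,
                                          key=lambda pair: pair[0]):
            transform.apply(self.document)

class Upper:
    default_priority = 500
    def apply(self, document):
        document["body"] = document["body"].upper()

class Exclaim:
    default_priority = 900       # higher number, so runs after Upper
    def apply(self, document):
        document["body"] += "!"

doc = {"body": "hello"}
transformer = Transformer(doc)
transformer.add_transform(Exclaim())   # added first, but applied second
transformer.add_transform(Upper())
transformer.apply_transforms()
```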

Writers

Writers produce the final output (HTML, XML, TeX, etc.). Writers translate the internal document tree structure into the final data format, possibly running Writer-specific transforms first.

By the time the document gets to the Writer, it should be in final form. The Writer's job is simply (and only) to translate from the Docutils doctree structure to the target format. Some small transforms may be required, but they should be local and format-specific.

Each writer is a module or package exporting a "Writer" class with a "write" method. The base "Writer" class can be found in the docutils/writers/__init__.py module.

Responsibilities:

  • Translate doctree(s) into specific output formats.
    • Transform references into format-native forms.
  • Write the translated output to the destination I/O.

Examples:

  • XML: Various forms, such as:
    • Docutils XML (an expression of the internal document tree, implemented as docutils.writers.docutils_xml).
    • DocBook (being implemented in the Docutils sandbox).
  • HTML (XHTML implemented as docutils.writers.html4css1).
  • PDF (a ReportLab interface is being developed in the Docutils sandbox).
  • TeX (a LaTeX Writer is being implemented in the sandbox).
  • Docutils-native pseudo-XML (implemented as docutils.writers.pseudoxml, used for testing).
  • Plain text
  • reStructuredText?

Input/Output

I/O classes provide a uniform API for low-level input and output. Subclasses will exist for a variety of input/output mechanisms. However, they can be considered an implementation detail. Most applications should be satisfied using one of the convenience functions associated with the Publisher.

I/O classes are currently in the preliminary stages; there's a lot of work yet to be done. Issues:

  • How to represent multi-file input (files & directories) in the API?
  • How to represent multi-file output? Perhaps "Writer" variants, one for each output distribution type? Or Output objects with associated transforms?

Responsibilities:

  • Read data from the input source (Input objects) or write data to the output destination (Output objects).

Examples of input sources:

  • A single file on disk or a stream (implemented as docutils.io.FileInput).
  • Multiple files on disk (MultiFileInput?).
  • Python source files: modules and packages.
  • Python strings, as received from a client application (implemented as docutils.io.StringInput).

Examples of output destinations:

  • A single file on disk or a stream (implemented as docutils.io.FileOutput).
  • A tree of directories and files on disk.
  • A Python string, returned to a client application (implemented as docutils.io.StringOutput).
  • No output; useful for programmatic applications where only a portion of the normal output is to be used (implemented as docutils.io.NullOutput).
  • A single tree-shaped data structure in memory.
  • Some other set of data structures in memory.

Docutils Package Structure

  • Package "docutils".

    • Module "__init__.py" contains: class "Component", a base class for Docutils components; class "SettingsSpec", a base class for specifying runtime settings (used by docutils.frontend); and class "TransformSpec", a base class for specifying transforms.

    • Module "docutils.core" contains facade class "Publisher" and convenience functions. See Publisher above.

    • Module "docutils.frontend" provides runtime settings support, for programmatic use and front-end tools (including configuration file support, and command-line argument and option processing).

    • Module "docutils.io" provides a uniform API for low-level input and output. See Input/Output above.

    • Module "docutils.nodes" contains the Docutils document tree element class library plus tree-traversal Visitor pattern base classes. See Document Tree below.

    • Module "docutils.statemachine" contains a finite state machine specialized for regular-expression-based text filters and parsers. The reStructuredText parser implementation is based on this module.

    • Module "docutils.urischemes" contains a mapping of known URI schemes ("http", "ftp", "mail", etc.).

    • Module "docutils.utils" contains utility functions and classes, including a logger class ("Reporter"; see Error Handling below).

    • Package "docutils.parsers": markup parsers.

      • Function "get_parser_class(parser_name)" returns a parser module by name. Class "Parser" is the base class of specific parsers. (docutils/parsers/__init__.py)
      • Package "docutils.parsers.rst": the reStructuredText parser.
      • Alternate markup parsers may be added.

      See Parsers above.

    • Package "docutils.readers": context-aware input readers.

      • Function "get_reader_class(reader_name)" returns a reader module by name or alias. Class "Reader" is the base class of specific readers. (docutils/readers/__init__.py)
      • Module "docutils.readers.standalone" reads independent document files.
      • Module "docutils.readers.pep" reads PEPs (Python Enhancement Proposals).
      • Readers to be added for: Python source code (structure & docstrings), email, FAQ, and perhaps Wiki and others.

      See Readers above.

    • Package "docutils.writers": output format writers.

      • Function "get_writer_class(writer_name)" returns a writer module by name. Class "Writer" is the base class of specific writers. (docutils/writers/__init__.py)
      • Module "docutils.writers.html4css1" is a simple HyperText Markup Language document tree writer for HTML 4.01 and CSS1.
      • Module "docutils.writers.docutils_xml" writes the internal document tree in XML form.
      • Module "docutils.writers.pseudoxml" is a simple internal document tree writer; it writes indented pseudo-XML.
      • Writers to be added: HTML 3.2 or 4.01-loose, XML (various forms, such as DocBook), PDF, TeX, plaintext, reStructuredText, and perhaps others.

      See Writers above.

    • Package "docutils.transforms": tree transform classes.

      • Class "Transformer" stores transforms and applies them to document trees. (docutils/transforms/__init__.py)
      • Class "Transform" is the base class of specific transforms. (docutils/transforms/__init__.py)
      • Each module contains related transform classes.

      See Transforms above.

    • Package "docutils.languages": Language modules contain language-dependent strings and mappings. They are named for their language identifier (as defined in Choice of Docstring Format below), converting dashes to underscores.

      • Function "get_language(language_code)", returns matching language module. (docutils/languages/__init__.py)
      • Modules: en.py (English), de.py (German), fr.py (French), it.py (Italian), sk.py (Slovak), sv.py (Swedish).
      • Other languages to be added.
  • Third-party modules: "extras" directory. These modules are installed only if they're not already present in the Python installation.

    • extras/optparse.py and extras/textwrap.py provide option parsing and command-line help; from Greg Ward's http://optik.sf.net/ project, included for convenience.
    • extras/roman.py contains Roman numeral conversion routines.

Front-End Tools

The tools/ directory contains several front ends for common Docutils processing. See Docutils Front-End Tools [4] for details.

Document Tree

A single intermediate data structure is used internally by Docutils, in the interfaces between components; it is defined in the docutils.nodes module. It is not required that this data structure be used internally by any of the components, just between components as outlined in the diagram in the Docutils Project Model above.

Custom node types are allowed, provided that either (a) a transform converts them to standard Docutils nodes before they reach the Writer proper, or (b) the custom node is explicitly supported by certain Writers, and is wrapped in a filtered "pending" node. An example of condition (a) is the Python Source Reader (see below), where a "stylist" transform converts custom nodes. The HTML <meta> tag is an example of condition (b); it is supported by the HTML Writer but not by others. The reStructuredText "meta" directive creates a "pending" node, which contains knowledge that the embedded "meta" node can only be handled by HTML-compatible writers. The "pending" node is resolved by the docutils.transforms.components.Filter transform, which checks that the calling writer supports HTML; if it doesn't, the "pending" node (and enclosed "meta" node) is removed from the document.

The document tree data structure is similar to a DOM tree, but with specific node names (classes) instead of DOM's generic nodes. The schema is documented in an XML DTD (eXtensible Markup Language Document Type Definition), which comes in two parts.

The DTD defines a rich set of elements, suitable for many input and output formats. The DTD retains all information necessary to reconstruct the original input text, or a reasonable facsimile thereof.

See The Docutils Document Tree [7] for details (incomplete).

Error Handling

When the parser encounters an error in markup, it inserts a system message (DTD element "system_message"). There are five levels of system messages:

  • Level-0, "DEBUG": an internal reporting issue. There is no effect on the processing. Level-0 system messages are handled separately from the others.
  • Level-1, "INFO": a minor issue that can be ignored. There is little or no effect on the processing. Typically level-1 system messages are not reported.
  • Level-2, "WARNING": an issue that should be addressed. If ignored, there may be minor problems with the output. Typically level-2 system messages are reported but do not halt processing
  • Level-3, "ERROR": a major issue that should be addressed. If ignored, the output will contain unpredictable errors. Typically level-3 system messages are reported but do not halt processing
  • Level-4, "SEVERE": a critical error that must be addressed. Typically level-4 system messages are turned into exceptions which halt processing. If ignored, the output will contain severe errors.

Although the initial message levels were devised independently, they have a strong correspondence to VMS error condition severity levels [8]; the names in quotes for levels 1 through 4 were borrowed from VMS. Error handling has since been influenced by the log4j project [9].
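The level semantics above map naturally onto a thresholded reporter. The sketch below is an illustration only (the real Reporter class lives in docutils.utils and has a different interface): messages at or above a report level are collected, and messages at or above a halt level stop processing.

```python
LEVELS = {0: "DEBUG", 1: "INFO", 2: "WARNING", 3: "ERROR", 4: "SEVERE"}

class MiniReporter:
    """Collect messages at or above report_level; raise at halt_level."""
    def __init__(self, report_level=2, halt_level=4):
        self.report_level = report_level
        self.halt_level = halt_level
        self.messages = []

    def system_message(self, level, text):
        if level >= self.report_level:
            self.messages.append("%s: %s" % (LEVELS[level], text))
        if level >= self.halt_level:
            raise RuntimeError("halting: %s" % text)

reporter = MiniReporter()
reporter.system_message(1, "minor issue")      # below threshold: not reported
reporter.system_message(3, "bad reference")    # reported, processing continues
```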

Python Source Reader

The Python Source Reader ("PySource") is the Docutils component that reads Python source files, extracts docstrings in context, then parses, links, and assembles the docstrings into a cohesive whole. It is a major and non-trivial component, currently under experimental development in the Docutils sandbox. High-level design issues are presented here.

Processing Model

This model will evolve over time, incorporating experience and discoveries.

  1. The PySource Reader uses an Input class to read in Python packages and modules, into a tree of strings.
  2. The Python modules are parsed, converting the tree of strings into a tree of abstract syntax trees with docstring nodes.
  3. The abstract syntax trees are converted into an internal representation of the packages/modules. Docstrings are extracted, as well as code structure details. See AST Mining below. Namespaces are constructed for lookup in step 6.
  4. One at a time, the docstrings are parsed, producing standard Docutils doctrees.
  5. PySource assembles all the individual docstrings' doctrees into a Python-specific custom Docutils tree paralleling the package/module/class structure; this is a custom Reader-specific internal representation (see the Docutils Python Source DTD [10]). Namespaces must be merged: Python identifiers, hyperlink targets.
  6. Cross-references from docstrings (interpreted text) to Python identifiers are resolved according to the Python namespace lookup rules. See Identifier Cross-References below.
  7. A "Stylist" transform is applied to the custom doctree (by the Transformer), custom nodes are rendered using standard nodes as primitives, and a standard document tree is emitted. See Stylist Transforms below.
  8. Other transforms are applied to the standard doctree by the Transformer.
  9. The standard doctree is sent to a Writer, which translates the document into a concrete format (HTML, PDF, etc.).
  10. The Writer uses an Output class to write the resulting data to its destination (disk file, directories and files, etc.).

AST Mining

Abstract Syntax Tree mining code will be written (or adapted) that scans a parsed Python module, and returns an ordered tree containing the names, docstrings (including attribute and additional docstrings; see below), and additional info (in parentheses below) of all of the following objects:

  • packages
  • modules
  • module attributes (+ initial values)
  • classes (+ inheritance)
  • class attributes (+ initial values)
  • instance attributes (+ initial values)
  • methods (+ parameters & defaults)
  • functions (+ parameters & defaults)

(Extract comments too? For example, comments at the start of a module would be a good place for bibliographic field lists.)

In order to evaluate interpreted text cross-references, namespaces for each of the above will also be required.

See the python-dev/docstring-develop thread "AST mining", started on 2001-08-14.
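A minimal sketch of such mining, using the stdlib ast module as a present-day stand-in for the tools discussed in that thread: it walks a parsed module and pairs the module and each class and function name with its docstring, without importing anything.

```python
import ast

SOURCE = '''
"""Module docstring."""

class AClass:
    """AClass docstring."""
    def method(self):
        """method docstring."""

def func(a, b=1):
    """func docstring."""
'''

def mine(source):
    """Return {name: docstring} for the module and its definitions."""
    tree = ast.parse(source)
    found = {"<module>": ast.get_docstring(tree)}
    for node in ast.walk(tree):
        if isinstance(node, (ast.ClassDef, ast.FunctionDef)):
            found[node.name] = ast.get_docstring(node)
    return found

docstrings = mine(SOURCE)
```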

Docstring Extraction Rules

  1. What to examine:

    1. If the "__all__" variable is present in the module being documented, only identifiers listed in "__all__" are examined for docstrings.
    2. In the absence of "__all__", all identifiers are examined, except those whose names are private (names begin with "_" but don't begin and end with "__").
    3. 1a and 1b can be overridden by runtime settings.
  2. Where:

    Docstrings are string literal expressions, and are recognized in the following places within Python modules:

    a. At the beginning of a module, function definition, class definition, or method definition, after any comments. This is the standard for Python __doc__ attributes.
    b. Immediately following a simple assignment at the top level of a module, class definition, or __init__ method definition, after any comments. See Attribute Docstrings below.
    c. Additional string literals found immediately after the docstrings in (a) and (b) will be recognized, extracted, and concatenated. See Additional Docstrings below.
    d. @@@ 2.2-style "properties" with attribute docstrings? Wait for syntax?
  3. How:

    Whenever possible, Python modules should be parsed by Docutils, not imported. There are several reasons:

    • Importing untrusted code is inherently insecure.
    • Information from the source is lost when using introspection to examine an imported module, such as comments and the order of definitions.
    • Docstrings are to be recognized in places where the byte-code compiler ignores string literal expressions (2b and 2c above), meaning importing the module will lose these docstrings.

    Of course, standard Python parsing tools such as the "parser" library module should be used.

    When the Python source code for a module is not available (i.e., only the .pyc file exists) or for C extension modules, the module can only be imported to access its docstrings, and the limitations of introspection described above must be accepted.

Since attribute docstrings and additional docstrings are ignored by the Python byte-code compiler, no namespace pollution or runtime bloat will result from their use. They are not assigned to __doc__ or to any other attribute. The initial parsing of a module may take a slight performance hit.
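Rule 1 above ("what to examine") can be sketched as a simple filter. This is illustrative code, not the PySource implementation:

```python
def names_to_examine(module_names, all_variable=None):
    """Apply extraction rule 1: honor __all__ when present (rule 1a),
    otherwise skip private names, i.e. names beginning with "_" that
    don't both begin and end with "__" (rule 1b)."""
    if all_variable is not None:
        return list(all_variable)

    def is_private(name):
        return name.startswith("_") and not (
            name.startswith("__") and name.endswith("__"))

    return [name for name in module_names if not is_private(name)]
```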

Attribute Docstrings

(This is a simplified version of PEP 224 [2].)

A string literal immediately following an assignment statement is interpreted by the docstring extraction machinery as the docstring of the target of the assignment statement, under the following conditions:

  1. The assignment must be in one of the following contexts:

    a. At the top level of a module (i.e., not nested inside a compound statement such as a loop or conditional): a module attribute.
    b. At the top level of a class definition: a class attribute.
    c. At the top level of the "__init__" method definition of a class: an instance attribute. Instance attributes assigned in other methods are assumed to be implementation details. (@@@ __new__ methods?)
    d. A function attribute assignment at the top level of a module or class definition.

    Since each of the above contexts are at the top level (i.e., in the outermost suite of a definition), it may be necessary to place dummy assignments for attributes assigned conditionally or in a loop.

  2. The assignment must be to a single target, not to a list or a tuple of targets.

  3. The form of the target:

    a. For contexts 1a and 1b above, the target must be a simple identifier (not a dotted identifier, a subscripted expression, or a sliced expression).
    b. For context 1c above, the target must be of the form "self.attrib", where "self" matches the "__init__" method's first parameter (the instance parameter) and "attrib" is a simple identifier as in 3a.
    c. For context 1d above, the target must be of the form "name.attrib", where "name" matches an already-defined function or method name and "attrib" is a simple identifier as in 3a.

Blank lines may be used after attribute docstrings to emphasize the connection between the assignment and the docstring.

Examples:

g = 'module attribute (module-global variable)'
"""This is g's docstring."""

class AClass:

    c = 'class attribute'
    """This is AClass.c's docstring."""

    def __init__(self):
        """Method __init__'s docstring."""

        self.i = 'instance attribute'
        """This is self.i's docstring."""

def f(x):
    """Function f's docstring."""
    return x**2

f.a = 1
"""Function attribute f.a's docstring."""

Additional Docstrings

(This idea was adapted from PEP 216 [3].)

Many programmers would like to make extensive use of docstrings for API documentation. However, docstrings do take up space in the running program, so some programmers are reluctant to "bloat up" their code. Also, not all API documentation is applicable to interactive environments, where __doc__ would be displayed.

Docutils' docstring extraction tools will concatenate all string literal expressions which appear at the beginning of a definition or after a simple assignment. Only the first strings in definitions will be available as __doc__, and can be used for brief usage text suitable for interactive sessions; subsequent string literals and all attribute docstrings are ignored by the Python byte-code compiler and may contain more extensive API information.

Example:

def function(arg):
    """This is __doc__, function's docstring."""
    """
    This is an additional docstring, ignored by the byte-code
    compiler, but extracted by Docutils.
    """
    pass

Issue: from __future__ import

Multiple module docstrings (main docstring plus additional docstring(s)) would break "from __future__ import" statements introduced in Python 2.1. The Python Reference Manual specifies:

A future statement must appear near the top of the module. The only lines that can appear before a future statement are:

  • the module docstring (if any),
  • comments,
  • blank lines, and
  • other future statements.

Resolution?

  1. Should we search for docstrings after a __future__ statement? Very ugly.
  2. Redefine __future__ statements to allow multiple preceding string literals?
  3. Or should we not even worry about this? There probably shouldn't be __future__ statements in production code, after all. Perhaps modules with __future__ statements will simply have to put up with the single-docstring limitation.

Choice of Docstring Format

Rather than force everyone to use a single docstring format, multiple input formats are allowed by the processing system. A special variable, __docformat__, may appear at the top level of a module before any function or class definitions. Over time or through decree, a standard format or set of formats should emerge.

A module's __docformat__ variable only applies to the objects defined in the module's file. In particular, the __docformat__ variable in a package's __init__.py file does not apply to objects defined in subpackages and submodules.

The __docformat__ variable is a string containing the name of the format being used, a case-insensitive string matching the input parser's module or package name (i.e., the same name as required to "import" the module or package), or a registered alias. If no __docformat__ is specified, the default format is "plaintext" for now; this may be changed to the standard format if one is ever established.

The __docformat__ string may contain an optional second field, separated from the format name (first field) by a single space: a case-insensitive language identifier as defined in RFC 1766. A typical language identifier consists of a 2-letter language code from ISO 639 [11] (3-letter codes used only if no 2-letter code exists; RFC 1766 is currently being revised to allow 3-letter codes). If no language identifier is specified, the default is "en" for English. The language identifier is passed to the parser and can be used for language-dependent markup features.
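
Splitting a __docformat__ value into its two fields is straightforward; this minimal sketch applies the rules above (case-insensitive matching, "en" as the default language):

```python
# Split a __docformat__ string into (format_name, language), both
# lowercased; the language identifier defaults to "en" when absent.
def parse_docformat(value):
    parts = value.strip().split(None, 1)
    format_name = parts[0].lower()
    language = parts[1].lower() if len(parts) > 1 else "en"
    return format_name, language

print(parse_docformat("reStructuredText"))     # ('restructuredtext', 'en')
print(parse_docformat("restructuredtext de"))  # ('restructuredtext', 'de')
```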

Identifier Cross-References

In Python docstrings, interpreted text is used to classify and mark up program identifiers, such as the names of variables, functions, classes, and modules. If the identifier alone is given, its role is inferred implicitly according to the Python namespace lookup rules. For functions and methods (even when dynamically assigned), parentheses ('()') may be included:

This function uses `another()` to do its work.

For class, instance and module attributes, dotted identifiers are used when necessary. For example (using reStructuredText markup):

class Keeper(Storer):

    """
    Extend `Storer`.  Class attribute `instances` keeps track
    of the number of `Keeper` objects instantiated.
    """

    instances = 0
    """How many `Keeper` objects are there?"""

    def __init__(self):
        """
        Extend `Storer.__init__()` to keep track of instances.

        Keep count in `Keeper.instances`, data in `self.data`.
        """
        Storer.__init__(self)
        Keeper.instances += 1

        self.data = []
        """Store data in a list, most recent last."""

    def store_data(self, data):
        """
        Extend `Storer.store_data()`; append new `data` to a
        list (in `self.data`).
        """
        self.data = data

Each of the identifiers quoted with backquotes ("`") will become references to the definitions of the identifiers themselves.

Stylist Transforms

Stylist transforms are specialized transforms specific to the PySource Reader. The PySource Reader doesn't have to make any decisions as to style; it just produces a logically constructed document tree, parsed and linked, including custom node types. Stylist transforms understand the custom nodes created by the Reader and convert them into standard Docutils nodes.

Multiple Stylist transforms may be implemented and one can be chosen at runtime (through a "--style" or "--stylist" command-line option). Each Stylist transform implements a different layout or style; thus the name. They decouple the context-understanding part of the Reader from the layout-generating part of processing, resulting in a more flexible and robust system. This also serves to "separate style from content", the SGML/XML ideal.

By keeping the piece of code that does the styling small and modular, it becomes much easier for people to roll their own styles. The "barrier to entry" is too high with existing tools; extracting the stylist code will lower the barrier considerably.

Project Web Site

A SourceForge project has been set up for this work at http://docutils.sourceforge.net/.

Acknowledgements

This document borrows ideas from the archives of the Python Doc-SIG [12]. Thanks to all members past & present.

pep-0259 Omit printing newline after newline

PEP: 259
Title: Omit printing newline after newline
Version: $Revision$
Last-Modified: $Date$
Author: Guido van Rossum <guido at python.org>
Status: Rejected
Type: Standards Track
Created: 11-Jun-2001
Python-Version: 2.2
Post-History: 11-Jun-2001

Abstract

    Currently, the print statement always appends a newline, unless a
    trailing comma is used.  This means that if we want to print data
    that already ends in a newline, we get two newlines, unless
    special precautions are taken.

    I propose to skip printing the newline when it follows a newline
    that came from data.

    In order to avoid having to add yet another magic variable to file
    objects, I propose to give the existing 'softspace' variable an
    extra meaning: a negative value will mean "the last data written
    ended in a newline so no space *or* newline is required."


Problem

    When printing data that resembles the lines read from a file using
    a simple loop, double-spacing occurs unless special care is taken:

        >>> for line in open("/etc/passwd").readlines():
        ...     print line
        ...
        root:x:0:0:root:/root:/bin/bash

        bin:x:1:1:bin:/bin:

        daemon:x:2:2:daemon:/sbin:

        (etc.)

        >>>

    While there are easy work-arounds, this is often noticed only
    during testing and requires an extra edit-test roundtrip; the
    fixed code is uglier and harder to maintain.


Proposed Solution

    In the PRINT_ITEM opcode in ceval.c, when a string object is
    printed, a check is already made that looks at the last character
    of that string.  Currently, if that last character is a whitespace
    character other than space, the softspace flag is reset to zero;
    this suppresses the space between two items if the first item is a
    string ending in newline, tab, etc. (but not when it ends in a
    space).  Otherwise the softspace flag is set to one.

    The proposal changes this test slightly so that softspace is set
    to:

        -1 -- if the last object written is a string ending in a
              newline

         0 -- if the last object written is a string ending in a
              whitespace character that's neither space nor newline

         1 -- in all other cases (including the case when the last
              object written is an empty string or not a string)

    Then, in the PRINT_NEWLINE opcode, printing of the newline is
    suppressed if the value of softspace is negative; in any case the
    softspace flag is reset to zero.
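
    The effect of the proposal, reduced to the single-string case, can
    be sketched in modern Python as follows (the real change was to the
    PRINT_ITEM/PRINT_NEWLINE opcodes in ceval.c, not a wrapper
    function like this):

```python
# Sketch of the proposed rule: skip the trailing newline whenever the
# data just written already ended in one (the softspace == -1 case).
import io

def soft_print(out, text):
    out.write(text)
    if not text.endswith("\n"):   # softspace >= 0: newline still needed
        out.write("\n")

buf = io.StringIO()
soft_print(buf, "root:x:0:0:root:/root:/bin/bash\n")  # newline suppressed
soft_print(buf, "done")                               # newline appended
print(repr(buf.getvalue()))
```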


Scope

    This only affects printing of 8-bit strings.  It doesn't affect
    Unicode, although that could be considered a bug in the Unicode
    implementation.  It doesn't affect other objects whose string
    representation happens to end in a newline character.


Risks

    This change breaks some existing code.  For example:

        print "Subject: PEP 259\n"
        print message_body

    In current Python, this produces a blank line separating the
    subject from the message body; with the proposed change, the body
    begins immediately below the subject.  This is not very robust
    code anyway; it is better written as

        print "Subject: PEP 259"
        print
        print message_body

    In the test suite, only test_StringIO (which explicitly tests for
    this feature) breaks.


Implementation

    A patch relative to current CVS is here:

        http://sourceforge.net/tracker/index.php?func=detail&aid=432183&group_id=5470&atid=305470


Rejected

    The user community unanimously rejected this, so I won't pursue
    this idea any further.  Frequently heard arguments against
    included:

    - It is likely to break thousands of CGI scripts.

    - Enough magic already (also: no more tinkering with 'print'
      please).


Copyright

    This document has been placed in the public domain.


pep-0260 Simplify xrange()

PEP: 260
Title: Simplify xrange()
Version: $Revision$
Last-Modified: $Date$
Author: Guido van Rossum <guido at python.org>
Status: Final
Type: Standards Track
Created: 26-Jun-2001
Python-Version: 2.2
Post-History: 26-Jun-2001

Abstract

    This PEP proposes to strip the xrange() object from some rarely
    used behavior like x[i:j] and x*n.


Problem

    The xrange() function has one idiomatic use:

        for i in xrange(...): ...

    However, the xrange() object has a bunch of rarely used behaviors
    that attempt to make it more sequence-like.  These are so rarely
    used that historically they have had serious bugs (e.g. off-by-one
    errors) that went undetected for several releases.

    I claim that it's better to drop these unused features.  This will
    simplify the implementation, testing, and documentation, and
    reduce maintenance and code size.


Proposed Solution

    I propose to strip the xrange() object to the bare minimum.  The
    only retained sequence behaviors are x[i], len(x), and repr(x).
    In particular, these behaviors will be dropped:

        x[i:j] (slicing)
        x*n, n*x (sequence-repeat)
        cmp(x1, x2) (comparisons)
        i in x (containment test)
        x.tolist() method
        x.start, x.stop, x.step attributes

    I also propose to change the signature of the PyRange_New() C API
    to remove the 4th argument (the repetition count).

    By implementing a custom iterator type, we could speed up the
    common use, but this is optional (the default sequence iterator
    does just fine).
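
    The retained interface, and the fact that the default sequence
    iterator suffices, can be sketched in modern Python (the class
    name and details are illustrative, not part of the proposal):

```python
# A minimal xrange-like object keeping only x[i], len(x), and repr(x).
# Iteration still works via the default sequence-iteration protocol,
# which repeatedly calls __getitem__ until IndexError is raised.
class MinimalRange:
    def __init__(self, stop):
        self._stop = stop

    def __len__(self):
        return self._stop

    def __getitem__(self, i):
        if not 0 <= i < self._stop:
            raise IndexError(i)
        return i

    def __repr__(self):
        return "MinimalRange(%d)" % self._stop

r = MinimalRange(5)
print(len(r), r[3], list(iter(r)))
```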


Scope

    This PEP affects the xrange() built-in function and the
    PyRange_New() C API.


Risks

    Somebody's code could be relying on the extended code, and this
    code would break.  However, given that historically bugs in the
    extended code have gone undetected for so long, it's unlikely that
    much code is affected.


Transition

    For backwards compatibility, the existing functionality will still
    be present in Python 2.2, but will trigger a warning.  A year
    after Python 2.2 final is released (probably in 2.4) the
    functionality will be ripped out.


Copyright

    This document has been placed in the public domain.


pep-0261 Support for "wide" Unicode characters

PEP: 261
Title: Support for "wide" Unicode characters
Version: $Revision$
Last-Modified: $Date$
Author: Paul Prescod <paul at prescod.net>
Status: Final
Type: Standards Track
Created: 27-Jun-2001
Python-Version: 2.2
Post-History: 27-Jun-2001

Abstract

    Python 2.1 unicode characters can have ordinals only up to 2**16 - 1.
    This range corresponds to a range in Unicode known as the Basic
    Multilingual Plane. There are now characters in Unicode that live
    on other "planes". The largest addressable character in Unicode
    has the ordinal 17 * 2**16 - 1 (0x10ffff). For readability, we
    will call this TOPCHAR and call characters in this range "wide 
    characters".


Glossary

    Character 
        
        Used by itself, means the addressable units of a Python 
        Unicode string.

    Code point

        A code point is an integer between 0 and TOPCHAR.
        If you imagine Unicode as a mapping from integers to
        characters, each integer is a code point. But the 
        integers between 0 and TOPCHAR that do not map to
        characters are also code points. Some will someday 
        be used for characters. Some are guaranteed never 
        to be used for characters.

    Codec

        A set of functions for translating between physical
        encodings (e.g. on disk or coming in from a network)
        into logical Python objects.

    Encoding

        Mechanism for representing abstract characters in terms of
        physical bits and bytes. Encodings allow us to store
        Unicode characters on disk and transmit them over networks
        in a manner that is compatible with other Unicode software.

    Surrogate pair

        Two physical characters that represent a single logical
        character. Part of a convention for representing 32-bit
        code points in terms of two 16-bit code points.
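
    The convention is simple arithmetic: the code point's offset above
    0x10000 is split into two 10-bit halves.

```python
# Split a code point above the BMP into a UTF-16 surrogate pair:
# high surrogates occupy 0xD800-0xDBFF, low surrogates 0xDC00-0xDFFF.
def to_surrogate_pair(code_point):
    assert 0x10000 <= code_point <= 0x10FFFF
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits
    return high, low

print([hex(u) for u in to_surrogate_pair(0x10FFFF)])  # ['0xdbff', '0xdfff']
```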

    Unicode string

          A Python type representing a sequence of code points with
          "string semantics" (e.g. case conversions, regular
          expression compatibility, etc.) Constructed with the 
          unicode() function.


Proposed Solution

    One solution would be to merely increase the maximum ordinal 
    to a larger value. Unfortunately the only straightforward
    implementation of this idea is to use 4 bytes per character.
    This has the effect of doubling the size of most Unicode 
    strings. In order to avoid imposing this cost on every
    user, Python 2.2 will allow the 4-byte implementation as a
    build-time option. Users can choose whether they care about
    wide characters or prefer to preserve memory.

    The 4-byte option is called "wide Py_UNICODE". The 2-byte option
    is called "narrow Py_UNICODE".

    Most things will behave identically in the wide and narrow worlds.

    * unichr(i) for 0 <= i < 2**16 (0x10000) always returns a
      length-one string.

    * unichr(i) for 2**16 <= i <= TOPCHAR will return a
      length-one string on wide Python builds. On narrow builds it will 
      raise ValueError.

        ISSUE 

            Python currently allows \U literals that cannot be
            represented as a single Python character. It generates two
            Python characters known as a "surrogate pair". Should this
            be disallowed on future narrow Python builds?

        Pro:

            Python already allows the construction of a surrogate pair
            for a large unicode literal character escape sequence.
            This is basically designed as a simple way to construct
            "wide characters" even in a narrow Python build. It is also
            somewhat logical considering that the Unicode-literal syntax
            is basically a short-form way of invoking the unicode-escape
            codec.

        Con:

            Surrogates could be easily created this way but the user
            still needs to be careful about slicing, indexing, printing 
            etc. Therefore some have suggested that Unicode
            literals should not support surrogates.


        ISSUE 

            Should Python allow the construction of characters that do
            not correspond to Unicode code points?  Unassigned Unicode 
            code points should obviously be legal (because they could 
            be assigned at any time). But code points above TOPCHAR are 
            guaranteed never to be used by Unicode. Should we allow access 
            to them anyhow?

        Pro:

            If a Python user thinks they know what they're doing why
            should we try to prevent them from violating the Unicode
            spec? After all, we don't stop 8-bit strings from
            containing non-ASCII characters.

        Con:

            Codecs and other Unicode-consuming code will have to be
            careful of these characters which are disallowed by the
            Unicode specification.

    * ord() is always the inverse of unichr()

    * There is an integer value in the sys module that describes the
      largest ordinal for a character in a Unicode string on the current
      interpreter. sys.maxunicode is 2**16-1 (0xffff) on narrow builds
      of Python and TOPCHAR on wide builds.

        ISSUE: Should there be distinct constants for accessing
               TOPCHAR and the real upper bound for the domain of 
               unichr (if they differ)? There has also been a
               suggestion of sys.unicodewidth which can take the 
               values 'wide' and 'narrow'.
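
        (For reference: the narrow/wide build distinction was later
        removed entirely by PEP 393 in Python 3.3, so on any modern
        interpreter sys.maxunicode is always TOPCHAR.)

```python
# On Python 3.3 and later there is a single string implementation,
# so sys.maxunicode is fixed at TOPCHAR.
import sys

print(hex(sys.maxunicode))  # 0x10ffff
```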

    * every Python Unicode character represents exactly one Unicode code 
      point (i.e. Python Unicode Character = Abstract Unicode character).

    * codecs will be upgraded to support "wide characters"
      (represented directly in UCS-4, and as variable-length sequences
      in UTF-8 and UTF-16). This is the main part of the implementation 
      left to be done.

    * There is a convention in the Unicode world for encoding a 32-bit
      code point in terms of two 16-bit code points. These are known
      as "surrogate pairs". Python's codecs will adopt this convention
      and encode 32-bit code points as surrogate pairs on narrow Python
      builds. 

        ISSUE 

            Should there be a way to tell codecs not to generate
            surrogates and instead treat wide characters as 
            errors?

        Pro:

            I might want to write code that works only with
            fixed-width characters and does not have to worry about
            surrogates.


        Con:

            No clear proposal of how to communicate this to codecs.
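
    The codec behavior described above can be observed directly in a
    modern Python, where UTF-16 still encodes a wide character as a
    surrogate pair:

```python
# Encoding the first code point beyond the BMP in UTF-16 (big-endian)
# yields exactly the surrogate pair D800 DC00.
ch = "\U00010000"
units = ch.encode("utf-16-be")
high = int.from_bytes(units[:2], "big")
low = int.from_bytes(units[2:], "big")
print(hex(high), hex(low))  # 0xd800 0xdc00
```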

    * there are no restrictions on constructing strings that use 
      code points "reserved for surrogates" improperly. These are
      called "isolated surrogates". The codecs should disallow reading
      these from files, but you could construct them using string 
      literals or unichr().


Implementation

    There is a new define:

        #define Py_UNICODE_SIZE 2

    To test whether UCS2 or UCS4 is in use, the derived macro
    Py_UNICODE_WIDE should be used, which is defined when UCS-4 is in
    use.

    There is a new configure option:

        --enable-unicode=ucs2 configures a narrow Py_UNICODE, and uses
                              wchar_t if it fits
        --enable-unicode=ucs4 configures a wide Py_UNICODE, and uses
                              wchar_t if it fits
        --enable-unicode      same as "=ucs2"
        --disable-unicode     entirely remove the Unicode functionality.

    It is also proposed that one day --enable-unicode will just
    default to the width of your platform's wchar_t.

    Windows builds will be narrow for a while, based on the facts that
    there have been few requests for wide characters, that those
    requests come mostly from hard-core programmers with the ability
    to buy their own Python, and that Windows itself is strongly
    biased towards 16-bit characters.


Notes

    This PEP does NOT imply that people using Unicode need to use a
    4-byte encoding for their files on disk or sent over the network. 
    It only allows them to do so. For example, ASCII is still a 
    legitimate (7-bit) Unicode-encoding.

    It has been proposed that there should be a module that handles
    surrogates in narrow Python builds for programmers. If someone 
    wants to implement that, it will be another PEP. It might also be 
    combined with features that allow other kinds of character-, 
    word- and line- based indexing.


Rejected Suggestions

    More or less the status-quo

        We could officially say that Python characters are 16-bit and
        require programmers to implement wide characters in their
        application logic by combining surrogate pairs. This is a heavy 
        burden because emulating 32-bit characters is likely to be
        very inefficient if it is coded entirely in Python. Plus these
        abstracted pseudo-strings would not be legal as input to the
        regular expression engine.

    "Space-efficient Unicode" type

        Another class of solution is to use some efficient storage
        internally but present an abstraction of wide characters to
        the programmer. Any of these would require a much more complex
        implementation than the accepted solution. For instance consider
        the impact on the regular expression engine. In theory, we could
        move to this implementation in the future without breaking Python
        code. A future Python could "emulate" wide Python semantics on
        narrow Python. Guido is not willing to undertake the
        implementation right now.

    Two types

        We could introduce a 32-bit Unicode type alongside the 16-bit
        type. There is a lot of code that expects there to be only a 
        single Unicode type.

    This PEP represents the least-effort solution. Over the next
    several years, 32-bit Unicode characters will become more common
    and that may either convince us that we need a more sophisticated 
    solution or (on the other hand) convince us that simply 
    mandating wide Unicode characters is an appropriate solution.
    Right now the two options on the table are do nothing or do
    this.


References

    Unicode Glossary: http://www.unicode.org/glossary/


Copyright

    This document has been placed in the public domain.


pep-0262 A Database of Installed Python Packages

PEP: 262
Title: A Database of Installed Python Packages
Version: $Revision$
Last-Modified: $Date$
Author: A.M. Kuchling <amk at amk.ca>
Status: Deferred
Type: Standards Track
Created: 08-Jul-2001
Post-History: 27-Mar-2002

Introduction

    This PEP describes a format for a database of the Python software
    installed on a system.

    (In this document, the term "distribution" is used to mean a set 
    of code that's developed and distributed together.  A "distribution"
    is the same as a Red Hat or Debian package, but the term "package"
    already has a meaning in Python terminology, meaning "a directory
    with an __init__.py file in it.")


Requirements

    We need a way to figure out what distributions, and what versions of
    those distributions, are installed on a system.  We want to provide
    features similar to CPAN, APT, or RPM.  Required use cases that
    should be supported are:
 
        * Is distribution X on a system?  
        * What version of distribution X is installed?
        * Where can the new version of distribution X be found?  (This can
          be defined as either "a home page where the user can go and
          find a download link", or "a place where a program can find
          the newest version".  Both should probably be supported.)
        * What files did distribution X put on my system?
        * What distribution did the file x/y/z.py come from?
        * Has anyone modified x/y/z.py locally?
        * What other distributions does this software need?
        * What Python modules does this distribution provide?


Database Location

    The database lives in a bunch of files under
    <prefix>/lib/python<version>/install-db/.  This location will be
    called INSTALLDB through the remainder of this PEP.

    The structure of the database is deliberately kept simple; each
    file in this directory or its subdirectories (if any) describes a
    single distribution.  Binary packagings of Python software such as
    RPMs can then update Python's database by just installing the
    corresponding file into the INSTALLDB directory.

    The rationale for scanning subdirectories is that we can move to a
    directory-based indexing scheme if the database directory contains
    too many entries.  For example, this would let us transparently
    switch from INSTALLDB/Numeric to INSTALLDB/N/Nu/Numeric or some
    similar hashing scheme.
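
    Such a fan-out could be computed from the distribution name alone;
    this sketch is hypothetical, with only the INSTALLDB/N/Nu/Numeric
    example coming from the text above:

```python
# Hypothetical two-level fan-out for a database entry: first letter,
# first two letters, then the full distribution name.
def hashed_entry_path(name):
    return "%s/%s/%s" % (name[0], name[:2], name)

print(hashed_entry_path("Numeric"))  # N/Nu/Numeric
```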


Database Contents

    Each file in INSTALLDB or its subdirectories describes a single
    distribution, and has the following contents:

        An initial line listing the sections in this file, separated
        by whitespace.  Currently this will always be 'PKG-INFO FILES
        REQUIRES PROVIDES'.  This is for future-proofing; if we add a
        new section, for example to list documentation files, then
        we'd add a DOCS section and list it in the contents.  Sections
        are always separated by blank lines.
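
    Reading such an entry is then a matter of splitting on blank lines
    and pairing the blocks with the names on the contents line.  The
    entry text in this sketch is a made-up example:

```python
# Parse one database entry: a contents line naming the sections,
# followed by the sections themselves, separated by blank lines.
entry = """\
PKG-INFO FILES REQUIRES PROVIDES

Metadata-Version: 1.0
Name: example

/usr/lib/python/example.py\t120\t644\troot\troot\tdeadbeef

python-stdlib

example
"""

blocks = entry.split("\n\n")
names = blocks[0].split()
sections = dict(zip(names, blocks[1:]))
print(sorted(sections))  # ['FILES', 'PKG-INFO', 'PROVIDES', 'REQUIRES']
```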

    A distribution that uses the Distutils for installation should
    automatically update the database.  Distributions that roll their
    own installation will have to use the database's API to manually
    add or update their own entry.  System package managers
    such as RPM or pkgadd can just create the new file in the
    INSTALLDB directory.

    Each section of the file is used for a different purpose.

    PKG-INFO section

        An initial set of RFC-822 headers containing the distribution
        information for a file, as described in PEP 241, "Metadata for
        Python Software Packages".

    FILES section 
   
        An entry for each file installed by the
        distribution. Generated files such as .pyc and .pyo files are
        on this list as well as the original .py files installed by a
        distribution; their checksums won't be stored or checked,
        though.

        Each file's entry is a single tab-delimited line that contains
        the following fields: 

            * The file's full path, as installed on the system.  

            * The file's size

            * The file's permissions.  On Windows, this field will always be 
              'unknown'

            * The owner and group of the file, separated by a tab.
              On Windows, these fields will both be 'unknown'.

            * A SHA1 digest of the file, encoded in hex.  For generated files
              such as *.pyc files, this field must contain the string "-",
              which indicates that the file's checksum should not be verified.
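
    Producing one such line might look as follows; the path and file
    contents here are invented for illustration:

```python
# Build a tab-delimited FILES entry: path, size, permissions, owner,
# group, and hex-encoded SHA1 digest of the file's contents.
import hashlib

def files_entry(path, data, perms, owner, group):
    digest = hashlib.sha1(data).hexdigest()
    fields = [path, str(len(data)), perms, owner, group, digest]
    return "\t".join(fields)

line = files_entry("/usr/lib/python/example.py", b"print('hi')\n",
                   "644", "root", "root")
print(line.split("\t")[1])  # the size field: '12'
```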


    REQUIRES section

    This section is a list of strings giving the services required for
    this module distribution to run properly.  This list includes the
    distribution name ("python-stdlib") and module names ("rfc822",
    "htmllib", "email", "email.Charset").  It will be specified 
    by an extra 'requires' argument to the distutils.core.setup()
    function.  For example:

        setup(..., requires=['xml.utils.iso8601', 

    Eventually there may be automated tools that look through all of
    the code and produce a list of requirements, but it's unlikely
    that these tools can handle all possible cases; a manual 
    way to specify requirements will always be necessary.


    PROVIDES section 

    This section is a list of strings giving the services provided by
    an installed distribution.  This list includes the distribution name
    ("python-stdlib") and module names ("rfc822", "htmllib", "email",
    "email.Charset").

    XXX should files be listed?  e.g. $PREFIX/lib/color-table.txt,
    to pick up data files, required scripts, etc.

    Eventually there may be an option to let module developers add
    their own strings to this section.  For example, you might add
    "XML parser" to this section, and other module distributions could
    then list "XML parser" as one of their dependencies to indicate
    that multiple different XML parsers can be used.  For now this
    ability isn't supported because it raises too many issues: do we
    need a central registry of legal strings, or just let people put
    whatever they like?  Etc., etc...  


API Description

    There's a single fundamental class, InstallationDatabase.  The
    code for it lives in distutils/install_db.py.  (XXX any
    suggestions for alternate locations in the standard library, or an
    alternate module name?)

    The InstallationDatabase returns instances of Distribution that contain
    all the information about an installed distribution.

    XXX Several of the fields in Distribution are duplicates of ones in
    distutils.dist.Distribution.  Probably they should be factored out
    into the Distribution class proposed here, but can this be done in a
    backward-compatible way?
    
    InstallationDatabase has the following interface:

class InstallationDatabase:

    def __init__ (self, path=None):
        """InstallationDatabase(path:string)
        Read the installation database rooted at the specified path.
        If path is None, INSTALLDB is used as the default.    
        """

    def get_distribution (self, distribution_name):
        """get_distribution(distribution_name:string) : Distribution
        Get the object corresponding to a single distribution.
        """

    def list_distributions (self):
        """list_distributions() : [Distribution]
        Return a list of all distributions installed on the system, 
        enumerated in no particular order.
        """

    def find_distribution (self, path):
        """find_file(path:string) : Distribution
        Search and return the distribution containing the file 'path'.  
        Returns None if the file doesn't belong to any distribution
        that the InstallationDatabase knows about.
        XXX should this work for directories?
        """

class Distribution:

    """Instance attributes:
    name : string
      Distribution name
    files : {string : (size:int, perms:int, owner:string, group:string,
                       digest:string)}
       Dictionary mapping the path of a file installed by this distribution 
       to information about the file.

    The following fields all come from PEP 241.

    version : distutils.version.Version
      Version of this distribution
    platform : [string]
    summary : string
    description : string
    keywords : string
    home_page : string    
    author : string
    author_email : string
    license : string
    """

    def add_file (self, path):
        """add_file(path:string):None
        Record the size, ownership, &c., information for an installed file.
        XXX as written, this would stat() the file.  Should the size/perms/
        checksum all be provided as parameters to this method instead?
        """

    def has_file (self, path):
        """has_file(path:string) : Boolean
        Returns true if the specified path belongs to a file in this
        distribution.
        """

    def check_file (self, path):
        """check_file(path:string) : Boolean
        Checks whether the file's size, checksum, and ownership match,
        returning true if they do.
        """
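
    A hypothetical usage sketch of this API follows.  Neither
    distutils/install_db.py nor these classes exist yet, so stub
    objects stand in for the real database here; only the call
    pattern is meaningful:

```python
# Stub objects standing in for the proposed InstallationDatabase and
# Distribution classes, used to illustrate the intended call pattern.
class StubDistribution:
    name = "example"
    files = {"/usr/lib/python/example.py": None}

    def check_file(self, path):
        return False            # pretend the file was modified locally

class StubDatabase:
    def list_distributions(self):
        return [StubDistribution()]

def audit(db):
    """Return (distribution, path) pairs for locally modified files."""
    return [(dist.name, path)
            for dist in db.list_distributions()
            for path in dist.files
            if not dist.check_file(path)]

print(audit(StubDatabase()))  # [('example', '/usr/lib/python/example.py')]
```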

Deliverables

    A description of the database API, to be added to this PEP.
  
    Patches to the Distutils that 1) implement an InstallationDatabase
    class, 2) update the database when a new distribution is
    installed, and 3) add a simple package management tool, features
    to be added to this PEP.  (Or should that be a separate PEP?)
    See [2] for the current patch.


Open Issues

    PJE suggests the installation database "be potentially present on
    every directory in sys.path, with the contents merged in sys.path
    order.  This would allow home-directory or other
    alternate-location installs to work, and ease the process of a
    distutils install command writing the file." Nice feature: it does
    mean that package manager tools can take into account Python
    packages that a user has privately installed.

    AMK wonders: what does setup.py do if it's told to install
    packages to a directory not on sys.path?  Does it write an
    install-db directory to the directory it's told to write to, or
    does it do nothing?

    Should the package-database file itself be included in the files
    list?  (PJE would think yes, but of course it can't contain its
    own checksum.  AMK can't think of a use case where including the
    DB file matters.)

    PJE wonders about writing the package DB file
    *first*, before installing any other files, so that failed partial
    installations can both be backed out and recognized as broken.
    This PEP may have to specify some algorithm for recognizing this
    situation.

    Should we guarantee the format of installation databases remains
    compatible across Python versions, or is it subject to arbitrary
    change?  Probably we need to guarantee compatibility.

    

Rejected Suggestions

    Instead of using one text file per distribution, one large text
    file or an anydbm file could be used.  This has been rejected for
    a few reasons.  First, performance is probably not an extremely
    pressing concern as the database is only used when installing or
    removing software, a relatively infrequent task.  Scalability also
    likely isn't a problem, as people may have hundreds of Python
    packages installed, but thousands or tens of thousands seems
    unlikely.  Finally, individual text files are compatible with
    installers such as RPM or DPKG because a binary packager can just
    drop the new database file into the database directory.  If one
    large text file or a binary file were used, the Python database
    would then have to be updated by running a postinstall script.

    On Windows, the permissions and owner/group of a file aren't
    stored.  Windows does in fact support ownership and access
    permissions, but reading and setting them requires the win32all
    extensions, and they aren't present in the basic Python installer
    for Windows.
  

References

    [1] Michael Muller's patch (posted to the Distutils-SIG around 28
        Dec 1999) generates a list of installed files.

    [2] A patch to implement this PEP will be tracked as 
        patch #562100 on SourceForge.  
        http://www.python.org/sf/562100 .
        Code implementing the installation database is currently in 
        Python CVS in the nondist/sandbox/pep262 directory.


Acknowledgements

    Ideas for this PEP originally came from postings by Greg Ward,
    Fred L. Drake Jr., Thomas Heller, Mats Wichmann, Phillip J. Eby,
    and others.

    Many changes and rewrites to this document were suggested by the
    readers of the Distutils SIG.   


Copyright

    This document has been placed in the public domain.


pep-0263 Defining Python Source Code Encodings

PEP: 0263
Title: Defining Python Source Code Encodings
Version: $Revision$
Last-Modified: $Date$
Author: Marc-AndrĂŠ Lemburg <mal at lemburg.com>, Martin von LĂświs <martin at v.loewis.de>
Status: Final
Type: Standards Track
Created: 06-Jun-2001
Python-Version: 2.3
Post-History: 

Abstract

    This PEP proposes to introduce a syntax to declare the encoding of
    a Python source file. The encoding information is then used by the
    Python parser to interpret the file using the given encoding. Most
    notably this enhances the interpretation of Unicode literals in
    the source code and makes it possible to write Unicode literals
    using e.g. UTF-8 directly in a Unicode-aware editor.

Problem

    In Python 2.1, Unicode literals can only be written using the
    Latin-1 based encoding "unicode-escape". This makes the
    programming environment rather unfriendly to Python users who live
    and work in non-Latin-1 locales such as many Asian
    countries. Programmers can write their 8-bit strings using their
    favorite encoding, but are bound to the "unicode-escape" encoding
    for Unicode literals.

Proposed Solution

    I propose to make the Python source code encoding both visible and
    changeable on a per-source file basis by using a special comment
    at the top of the file to declare the encoding.

    To make Python aware of this encoding declaration a number of
    concept changes are necessary with respect to the handling of
    Python source code data.

Defining the Encoding

    Python will default to ASCII as standard encoding if no other
    encoding hints are given.

    To define a source code encoding, a magic comment must
    be placed into the source files either as first or second
    line in the file, such as:

          # coding=<encoding name>

    or (using formats recognized by popular editors)

          #!/usr/bin/python
          # -*- coding: <encoding name> -*-

    or

          #!/usr/bin/python
          # vim: set fileencoding=<encoding name> :

    More precisely, the first or second line must match the regular
    expression "coding[:=]\s*([-\w.]+)". The first group of this
    expression is then interpreted as encoding name. If the encoding
    is unknown to Python, an error is raised during compilation. There
    must not be any Python statement on the line that contains the
    encoding declaration.
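    The matching rule can be sketched directly from that regular
    expression (a hedged illustration of the detection step only, not
    the parser's actual code; the real rule also excludes lines that
    contain statements):

```python
import re

# PEP 263's pattern, applied to the first two lines of a source file.
CODING_RE = re.compile(r"coding[:=]\s*([-\w.]+)")

def declared_encoding(source_bytes):
    for line in source_bytes.splitlines()[:2]:
        match = CODING_RE.search(line.decode("ascii", "replace"))
        if match:
            return match.group(1)
    return None
```

    All three comment styles shown above satisfy this one pattern, since
    each contains "coding" followed by ":" or "=" and the encoding name.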

    To aid with platforms such as Windows, which add Unicode BOM marks
    to the beginning of Unicode files, the UTF-8 signature
    '\xef\xbb\xbf' will be interpreted as 'utf-8' encoding as well
    (even if no magic encoding comment is given).

    If a source file uses both the UTF-8 BOM mark signature and a
    magic encoding comment, the only allowed encoding for the comment
    is 'utf-8'.  Any other encoding will cause an error.
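    In modern CPython this whole algorithm -- cookie regex plus BOM
    handling -- is exposed as tokenize.detect_encoding, which can be used
    to observe the behaviour described above:

```python
import io
import tokenize

# The cookie is found on line 2; tokenize normalizes the spelling
# ("latin-1" and friends come back as "iso-8859-1").
src = b"#!/usr/bin/python\n# -*- coding: latin-1 -*-\nimport os\n"
enc, _ = tokenize.detect_encoding(io.BytesIO(src).readline)

# A UTF-8 BOM is honoured even without a magic comment.
bom_src = b"\xef\xbb\xbfprint(1)\n"
bom_enc, _ = tokenize.detect_encoding(io.BytesIO(bom_src).readline)
```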

Examples

    These are some examples to clarify the different styles for
    defining the source code encoding at the top of a Python source
    file:

    1. With interpreter binary and using Emacs style file encoding
       comment:

          #!/usr/bin/python
          # -*- coding: latin-1 -*-
          import os, sys
          ...

          #!/usr/bin/python
          # -*- coding: iso-8859-15 -*-
          import os, sys
          ...

          #!/usr/bin/python
          # -*- coding: ascii -*-
          import os, sys
          ...

    2. Without interpreter line, using plain text:

          # This Python file uses the following encoding: utf-8
          import os, sys
          ...

    3. Text editors might have different ways of defining the file's
       encoding, e.g.

          #!/usr/local/bin/python
          # coding: latin-1
          import os, sys
          ...

    4. Without encoding comment, Python's parser will assume ASCII
       text:

          #!/usr/local/bin/python
          import os, sys
          ...

    5. Encoding comments which don't work:

       Missing "coding:" prefix:

          #!/usr/local/bin/python
          # latin-1
          import os, sys
          ...

       Encoding comment not on line 1 or 2:

          #!/usr/local/bin/python
          #
          # -*- coding: latin-1 -*-
          import os, sys
          ...

       Unsupported encoding:

          #!/usr/local/bin/python
          # -*- coding: utf-42 -*-
          import os, sys
          ...

Concepts

    The PEP is based on the following concepts which would have to be
    implemented to enable usage of such a magic comment:

    1. The complete Python source file should use a single encoding.
       Embedding of differently encoded data is not allowed and will
       result in a decoding error during compilation of the Python
       source code.

       Any encoding which allows processing the first two lines in the
       way indicated above is allowed as source code encoding; this
       includes ASCII-compatible encodings as well as certain
       multi-byte encodings such as Shift_JIS. It does not include
       encodings which use two or more bytes for all characters like
       e.g. UTF-16. The reason for this is to keep the encoding
       detection algorithm in the tokenizer simple.

    2. Handling of escape sequences should continue to work as it does
       now, but with all possible source code encodings; that is,
       standard string literals (both 8-bit and Unicode) are subject to
       escape sequence expansion while raw string literals only expand
       a very small subset of escape sequences.

    3. Python's tokenizer/compiler combo will need to be updated to
       work as follows:

       1. read the file

       2. decode it into Unicode assuming a fixed per-file encoding

       3. convert it into a UTF-8 byte string

       4. tokenize the UTF-8 content

       5. compile it, creating Unicode objects from the given Unicode data
          and creating string objects from the Unicode literal data
          by first reencoding the UTF-8 data into 8-bit string data
          using the given file encoding

       Note that Python identifiers are restricted to the ASCII
       subset of the encoding, and thus need no further conversion
       after step 4.
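       Steps 1-3 amount to a decode/re-encode pass over the file.  A toy
       rendition (not the real tokenizer, which also strips the cookie
       and BOM), showing that the round trip through UTF-8 is lossless:

```python
# A Latin-1 source file containing a non-ASCII string literal.
raw = b"# -*- coding: latin-1 -*-\ns = 'caf\xe9'\n"

text = raw.decode("latin-1")     # step 2: decode with the declared encoding
utf8 = text.encode("utf-8")      # step 3: hand the tokenizer UTF-8 bytes
# Step 5 re-encodes 8-bit literal data back to the file encoding,
# so the original bytes must be recoverable:
assert utf8.decode("utf-8").encode("latin-1") == raw
```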

Implementation

    For backwards-compatibility with existing code which currently
    uses non-ASCII in string literals without declaring an encoding,
    the implementation will be introduced in two phases:

    1. Allow non-ASCII in string literals and comments, by internally
       treating a missing encoding declaration as a declaration of
       "iso-8859-1". This will cause arbitrary byte strings to
       correctly round-trip between step 2 and step 5 of the
       processing, and provide compatibility with Python 2.2 for
       Unicode literals that contain non-ASCII bytes.

       A warning will be issued if non-ASCII bytes are found in the
       input, once per improperly encoded input file.

    2. Remove the warning, and change the default encoding to "ascii".

    The builtin compile() API will be enhanced to accept Unicode as
    input. 8-bit string input is subject to the standard procedure for
    encoding detection as described above.

    If a Unicode string with a coding declaration is passed to compile(),
    a SyntaxError will be raised.

    SUZUKI Hisao is working on a patch; see [2] for details. A patch
    implementing only phase 1 is available at [1].

Phases

    Implementation of steps 1 and 2 above was completed in 2.3,
    except for changing the default encoding to "ascii".

    The default encoding was set to "ascii" in version 2.5.
   

Scope

    This PEP intends to provide an upgrade path from the current
    (more-or-less) undefined source code encoding situation to a more
    robust and portable definition.

References

    [1] Phase 1 implementation:
        http://python.org/sf/526840
    [2] Phase 2 implementation:
        http://python.org/sf/534304

History

    1.10 and above: see CVS history
    1.8: Added '.' to the coding RE.
    1.7: Added warnings to phase 1 implementation. Replaced the
         Latin-1 default encoding with the interpreter's default
         encoding. Added tweaks to compile().
    1.4 - 1.6: Minor tweaks
    1.3: Worked in comments by Martin v. Loewis: 
         UTF-8 BOM mark detection, Emacs style magic comment,
         two phase approach to the implementation

Copyright

    This document has been placed in the public domain.


pep-0264 Future statements in simulated shells

PEP: 264
Title: Future statements in simulated shells
Version: $Revision$
Last-Modified: $Date$
Author: Michael Hudson <mwh at python.net>
Status: Final
Type: Standards Track
Requires: 236
Created: 30-Jul-2001
Python-Version: 2.2
Post-History: 30-Jul-2001

Abstract

    As noted in PEP 236, there is no clear way for "simulated
    interactive shells" to simulate the behaviour of __future__
    statements in "real" interactive shells, i.e. have __future__
    statements' effects last the life of the shell.

    The PEP also takes the opportunity to clean up the other
    unresolved issue mentioned in PEP 236, the inability to stop
    compile() inheriting the effect of future statements affecting the
    code calling compile().

    This PEP proposes to address the first problem by adding an
    optional fourth argument to the builtin function "compile", adding
    information to the _Feature instances defined in __future__.py and
    adding machinery to the standard library modules "codeop" and
    "code" to make the construction of such shells easy.

    The second problem is dealt with by simply adding *another*
    optional argument to compile(), which if non-zero suppresses the
    inheriting of future statements' effects.


Specification

    I propose adding a fourth, optional, "flags" argument to the
    builtin "compile" function.  If this argument is omitted,
    there will be no change in behaviour from that of Python 2.1.

    If it is present it is expected to be an integer, representing
    various possible compile time options as a bitfield.  The
    bitfields will have the same values as the CO_* flags already used
    by the C part of the Python interpreter to refer to future statements.

    compile() shall raise a ValueError exception if it does not
    recognize any of the bits set in the supplied flags.

    The flags supplied will be bitwise-"or"ed with the flags that
    would be set anyway, unless the new fifth optional argument is a
    non-zero integer, in which case the flags supplied will be exactly
    the set used.

    The above-mentioned flags are not currently exposed to Python.  I
    propose adding .compiler_flag attributes to the _Feature objects
    in __future__.py that contain the necessary bits, so one might
    write code such as:

        import __future__
        def compile_generator(func_def):
            return compile(func_def, "<input>", "suite",
                           __future__.generators.compiler_flag)

    A recent change means that these same bits can be used to tell if
    a code object was compiled with a given feature; for instance

        codeob.co_flags & __future__.generators.compiler_flag

    will be non-zero if and only if the code object "codeob" was
    compiled in an environment where generators were allowed.

    I will also add a .all_feature_flags attribute to the __future__
    module, giving a low-effort way of enumerating all the __future__
    options supported by the running interpreter.

    I also propose adding a pair of classes to the standard library
    module codeop.

    One - Compile - will sport a __call__ method which will act much
    like the builtin "compile" of 2.1 with the difference that after
    it has compiled a __future__ statement, it "remembers" it and
    compiles all subsequent code with the __future__ option in effect.

    It will do this by using the new features of the __future__ module
    mentioned above.

    Objects of the other class added to codeop - CommandCompiler -
    will do the job of the existing codeop.compile_command function,
    but in a __future__-aware way.

    Finally, I propose to modify the class InteractiveInterpreter in
    the standard library module code to use a CommandCompiler to
    emulate still more closely the behaviour of the default Python
    shell.
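    codeop.CommandCompiler did land in the standard library and behaves
    as described: once it compiles a __future__ statement, later
    compilations inherit the feature.  A small demonstration, using the
    annotations future (which postdates this PEP):

```python
import codeop

cc = codeop.CommandCompiler()

# Compiling a __future__ statement makes the compiler "remember" it.
cc("from __future__ import annotations", "<input>", "single")

# Later compilations inherit the feature: under the annotations future,
# annotations are not evaluated at def time, so an undefined name in an
# annotation no longer raises NameError when the def is executed.
code = cc("def f(x: some_undefined_name): pass", "<input>", "exec")
ns = {}
exec(code, ns)
```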


Backward Compatibility

    There should be very few compatibility issues, if any; the changes
    to compile will make no difference to existing code, nor will adding
    new functions or classes to codeop.  Existing code using
    code.InteractiveInterpreter may change in behaviour, but only for
    the better, in that the "real" Python shell will be impersonated
    more closely.


Forward Compatibility

    The fiddling that needs to be done to Lib/__future__.py when
    adding a __future__ feature will be a touch more complicated.
    Everything else should just work.


Issues

    I hope the above interface is not too disruptive to implement for
    Jython.


Implementation

    A series of preliminary implementations are at:

        http://sourceforge.net/tracker/?func=detail&atid=305470&aid=449043&group_id=5470

    After light massaging by Tim Peters, they have now been checked in.


Copyright

    This document has been placed in the public domain.



pep-0265 Sorting Dictionaries by Value

PEP: 265
Title: Sorting Dictionaries by Value
Version: $Revision$
Last-Modified: $Date$
Author: Grant Griffin <g2 at iowegian.com>
Status: Rejected
Type: Standards Track
Created: 8-Aug-2001
Python-Version: 2.2
Post-History: 

Abstract

    This PEP suggests a "sort by value" operation for dictionaries.
    The primary benefit would be in terms of "batteries included"
    support for a common Python idiom which, in its current form, is
    both difficult for beginners to understand and cumbersome for all
    to implement.

BDFL Pronouncement

    This PEP is rejected because the need for it has been largely
    fulfilled by Py2.4's sorted() builtin function:

        >>> sorted(d.iteritems(), key=itemgetter(1), reverse=True)
        [('b', 23), ('d', 17), ('c', 5), ('a', 2), ('e', 1)]

    or for just the keys:

        >>> sorted(d, key=d.__getitem__, reverse=True)
        ['b', 'd', 'c', 'a', 'e']

    Also, Python 2.5's heapq.nlargest() function addresses the common use
    case of finding only a few of the highest valued items:

        >>> nlargest(2, d.iteritems(), itemgetter(1))
        [('b', 23), ('d', 17)]
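    In Python 3 the same snippets read as follows (items() instead of
    iteritems(), with the itemgetter import shown explicitly):

```python
from heapq import nlargest
from operator import itemgetter

d = {'a': 2, 'b': 23, 'c': 5, 'd': 17, 'e': 1}

by_value = sorted(d.items(), key=itemgetter(1), reverse=True)
keys_by_value = sorted(d, key=d.__getitem__, reverse=True)
top_two = nlargest(2, d.items(), key=itemgetter(1))
```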


Motivation

    A common use of dictionaries is to count occurrences by setting
    the value of d[key] to 1 on its first occurrence, then increment
    the value on each subsequent occurrence.  This can be done several
    different ways, but the get() method is the most succinct:

            d[key] = d.get(key, 0) + 1

    Once all occurrences have been counted, a common use of the
    resulting dictionary is to print the occurrences in
    occurrence-sorted order, often with the largest value first.

    This leads to a need to sort a dictionary's items by value.  The
    canonical method of doing so in Python is to first use d.items()
    to get a list of the dictionary's items, then invert the ordering
    of each item's tuple from (key, value) into (value, key), then
    sort the list; since Python sorts the list based on the first item
    of the tuple, the list of (inverted) items is therefore sorted by
    value.  If desired, the list can then be reversed, and the tuples
    can be re-inverted back to (key, value).  (However, in my
    experience, the inverted tuple ordering is fine for most purposes,
    e.g. printing out the list.)

    For example, given an occurrence count of:

        >>> d = {'a':2, 'b':23, 'c':5, 'd':17, 'e':1}

    we might do:

        >>> items = [(v, k) for k, v in d.items()]
        >>> items.sort()
        >>> items.reverse()             # so largest is first
        >>> items = [(k, v) for v, k in items]

    resulting in:

        >>> items
        [('b', 23), ('d', 17), ('c', 5), ('a', 2), ('e', 1)]

    which shows the list in by-value order, largest first.  (In this
    case, 'b' was found to have the most occurrences.)

    This works fine, but is "hard to use" in two aspects.  First,
    although this idiom is known to veteran Pythoneers, it is not at
    all obvious to newbies -- either in terms of its algorithm
    (inverting the ordering of item tuples) or its implementation
    (using list comprehensions -- which are an advanced Python
    feature.)  Second, it requires having to repeatedly type a lot of
    "grunge", resulting in both tedium and mistakes.

    We therefore would rather Python provide a method of sorting
    dictionaries by value which would be both easy for newbies to
    understand (or, better yet, not to _have to_ understand) and
    easier for all to use.
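    For what it's worth, later Pythons shipped exactly this battery:
    collections.Counter (added in 2.7/3.1) folds the counting loop and
    the sort-by-value step into a single most_common() call:

```python
from collections import Counter

# Count occurrences and rank them by value, largest first.
counts = Counter("a b a c a b".split())
ranked = counts.most_common()    # (key, value) pairs, largest value first
```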


Rationale

    As Tim Peters has pointed out, this sort of thing brings on the
    problem of trying to be all things to all people.  Therefore, we
    will limit its scope to try to hit "the sweet spot".  Unusual
    cases (e.g. sorting via a custom comparison function) can, of
    course, be handled "manually" using present methods.

    Here are some simple possibilities:

    The items() method of dictionaries can be augmented with new
    parameters having default values that provide for full
    backwards-compatibility:

        (1) items(sort_by_values=0, reversed=0)

    or maybe just:

        (2) items(sort_by_values=0)

    since reversing a list is easy enough.

    Alternatively, items() could simply let us control the (key, value) 
    order:

        (3) items(values_first=0)

    Again, this is fully backwards-compatible.  It does less work than
    the others, but it at least eases the most complicated/tricky part
    of the sort-by-value problem: inverting the order of item tuples.
    Using this is very simple:

        items = d.items(1)
        items.sort()
        items.reverse()         # (if desired)

    The primary drawback of the preceding three approaches is the
    additional overhead for the parameter-less "items()" case, due to
    having to process default parameters.  (However, if one assumes
    that items() gets used primarily for creating sort-by-value lists,
    this is not really a drawback in practice.)

    Alternatively, we might add a new dictionary method which somehow
    embodies "sorting".  This approach offers two advantages.  First,
    it avoids adding overhead to the items() method.  Second, it is
    perhaps more accessible to newbies: when they go looking for a
    method for sorting dictionaries, they hopefully run into this one,
    and they will not have to understand the finer points of tuple
    inversion and list sorting to achieve sort-by-value.

    To allow the four basic possibilities of sorting by key/value and in 
    forward/reverse order, we could add this method:

        (4) sorted_items(by_value=0, reversed=0)

    I believe the most common case would actually be "by_value=1,
    reversed=1", but the defaults values given here might lead to
    fewer surprises by users: sorted_items() would be the same as
    items() followed by sort().

    Finally (as a last resort), we could use:

        (5) items_sorted_by_value(reversed=0)


Implementation

    The proposed dictionary methods would necessarily be implemented
    in C.  Presumably, the implementation would be fairly simple since
    it involves just adding a few calls to Python's existing
    machinery.


Concerns

    Aside from the run-time overhead already addressed in
    possibilities 1 through 3, concerns with this proposal probably
    will fall into the categories of "feature bloat" and/or "code
    bloat".  However, I believe that several of the suggestions made
    here will result in quite minimal bloat, resulting in a good
    tradeoff between bloat and "value added".

    Tim Peters has noted that implementing this in C might not be
    significantly faster than implementing it in Python today.
    However, the major benefits intended here are "accessibility" and
    "ease of use", not "speed".  Therefore, as long as it is not
    noticeably slower (in the case of plain items(), speed need not be
    a consideration.


References

    A related thread called "counting occurrences" appeared on
    comp.lang.python in August, 2001.  This included examples of
    approaches to systematizing the sort-by-value problem by
    implementing it as reusable Python functions and classes.


Copyright

    This document has been placed in the public domain.



pep-0266 Optimizing Global Variable/Attribute Access

PEP: 266
Title: Optimizing Global Variable/Attribute Access
Version: $Revision$
Last-Modified: $Date$
Author: Skip Montanaro <skip at pobox.com>
Status: Withdrawn
Type: Standards Track
Created: 13-Aug-2001
Python-Version: 2.3
Post-History: 

Abstract

    The bindings for most global variables and attributes of other
    modules typically never change during the execution of a Python
    program, but because of Python's dynamic nature, code which
    accesses such global objects must run through a full lookup each
    time the object is needed.  This PEP proposes a mechanism that
    allows code that accesses most global objects to treat them as
    local objects and places the burden of updating references on the
    code that changes the name bindings of such objects.


Introduction

    Consider the workhorse function sre_compile._compile.  It is the
    internal compilation function for the sre module.  It consists
    almost entirely of a loop over the elements of the pattern being
    compiled, comparing opcodes with known constant values and
    appending tokens to an output list.  Most of the comparisons are
    with constants imported from the sre_constants module.  This means
    there are lots of LOAD_GLOBAL bytecodes in the compiled output of
    this module.  Just by reading the code it's apparent that the
    author intended LITERAL, NOT_LITERAL, OPCODES and many other
    symbols to be constants.  Still, each time they are involved in an
    expression, they must be looked up anew.

    Most global accesses are actually to objects that are "almost
    constants".  This includes global variables in the current module
    as well as the attributes of other imported modules.  Since they
    rarely change, it seems reasonable to place the burden of updating
    references to such objects on the code that changes the name
    bindings.  If sre_constants.LITERAL is changed to refer to another
    object, perhaps it would be worthwhile for the code that modifies
    the sre_constants module dict to correct any active references to
    that object.  By doing so, in many cases global variables and the
    attributes of many objects could be cached as local variables.  If
    the bindings between the names given to the objects and the
    objects themselves change rarely, the cost of keeping track of
    such objects should be low and the potential payoff fairly large.

    In an attempt to gauge the effect of this proposal, I modified the
    Pystone benchmark program included in the Python distribution to
    cache global functions.  Its main function, Proc0, makes calls to
    ten different functions inside its for loop.  In addition, Func2
    calls Func1 repeatedly inside a loop.  If local copies of these 11
    global identifiers are made before the functions' loops are entered,
    performance on this particular benchmark improves by about two per
    cent (from 5561 pystones to 5685 on my laptop).  It gives some
    indication that performance would be improved by caching most
    global variable access.  Note also that the pystone benchmark
    makes essentially no accesses of global module attributes, an
    anticipated area of improvement for this PEP.

Proposed Change

    I propose that the Python virtual machine be modified to include
    TRACK_OBJECT and UNTRACK_OBJECT opcodes.  TRACK_OBJECT would
    associate a global name or attribute of a global name with a slot
    in the local variable array and perform an initial lookup of the
    associated object to fill in the slot with a valid value.  The
    association it creates would be noted by the code responsible for
    changing the name-to-object binding to cause the associated local
    variable to be updated.  The UNTRACK_OBJECT opcode would delete
    any association between the name and the local variable slot.


Threads

    Operation of this code in threaded programs will be no different
    than in unthreaded programs.  If you need to lock an object to
    access it, you would have had to do that before TRACK_OBJECT would
    have been executed and retain that lock until after you stop using
    it.

    FIXME: I suspect I need more here.


Rationale

    Global variables and attributes rarely change.  For example, once
    a function imports the math module, the binding between the name
    "math" and the module it refers to aren't likely to change.
    Similarly, if the function that uses the math module refers to its
    "sin" attribute, it's unlikely to change.  Still, every time the
    module wants to call the math.sin function, it must first execute
    a pair of instructions:

        LOAD_GLOBAL     math
        LOAD_ATTR       sin

    If the client module always assumed that math.sin was a local
    constant and it was the responsibility of "external forces"
    outside the function to keep the reference correct, we might have
    code like this:

        TRACK_OBJECT       math.sin
        ...
        LOAD_FAST          math.sin
        ...
        UNTRACK_OBJECT     math.sin

    If the LOAD_FAST was in a loop the payoff in reduced global loads
    and attribute lookups could be significant.

    This technique could, in theory, be applied to any global variable
    access or attribute lookup.  Consider this code:

        l = []
        for i in range(10):
            l.append(math.sin(i))
        return l

    Even though l is a local variable, you still pay the cost of
    loading l.append ten times in the loop.  The compiler (or an
    optimizer) could recognize that both math.sin and l.append are
    being called in the loop and decide to generate the tracked local
    code, avoiding it for the builtin range() function because it's
    only called once during loop setup.  Performance issues related to
    accessing local variables make tracking l.append less attractive
    than tracking globals such as math.sin.

    According to a post to python-dev by Marc-Andre Lemburg [1],
    LOAD_GLOBAL opcodes account for over 7% of all instructions
    executed by the Python virtual machine.  This can be a very
    expensive instruction, at least relative to a LOAD_FAST
    instruction, which is a simple array index and requires no extra
    function calls by the virtual machine.  I believe many LOAD_GLOBAL
    instructions and LOAD_GLOBAL/LOAD_ATTR pairs could be converted to
    LOAD_FAST instructions.

    Code that uses global variables heavily often resorts to various
    tricks to avoid global variable and attribute lookup.  The
    aforementioned sre_compile._compile function caches the append
    method of the growing output list.  Many people commonly abuse
    functions' default argument feature to cache global variable
    lookups.  Both of these schemes are hackish and rarely address all
    the available opportunities for optimization.  (For example,
    sre_compile._compile does not cache the two globals that it uses
    most frequently: the builtin len function and the global OPCODES
    array that it imports from sre_constants.py.)
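    Both tricks look like this in practice; a hand-written version of
    what TRACK_OBJECT would automate (the function here is illustrative,
    not taken from sre_compile):

```python
import math

def sins(xs, sin=math.sin):     # default-argument trick: global cached at def time
    out = []
    append = out.append         # bound-method trick: attribute cached per call
    for x in xs:
        append(sin(x))          # LOAD_FAST instead of LOAD_GLOBAL/LOAD_ATTR
    return out
```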


Questions

    Q.  What about threads?  What if math.sin changes while in cache?

    A.  I believe the global interpreter lock will protect values from
        being corrupted.  In any case, the situation would be no worse
        than it is today.  If one thread modified math.sin after another
        thread had already executed "LOAD_GLOBAL math", but before it
        executed "LOAD_ATTR sin", the client thread would see the old
        value of math.sin.

        The idea is this.  I use a multi-attribute load below as an
        example, not because it would happen very often, but because by
        demonstrating the recursive nature with an extra call hopefully
        it will become clearer what I have in mind.  Suppose a function
        defined in module foo wants to access spam.eggs.ham and that
        spam is a module imported at the module level in foo:

            import spam
            ...
            def somefunc():
                ...
                x = spam.eggs.ham

        Upon entry to somefunc, a TRACK_GLOBAL instruction will be
        executed:

            TRACK_GLOBAL spam.eggs.ham n

        "spam.eggs.ham" is a string literal stored in the function's
        constants array.  "n" is a fastlocals index.  "&fastlocals[n]"
        is a reference to slot "n" in the executing frame's fastlocals
        array, the location in which the spam.eggs.ham reference will
        be stored.  Here's what I envision happening:

        1. The TRACK_GLOBAL instruction locates the object referred to
           by the name "spam" and finds it in its module scope.  It
           then executes a C function like

               _PyObject_TrackName(m, "spam.eggs.ham", &fastlocals[n])

           where "m" is the module object with an attribute "spam".

        2. The module object strips the leading "spam." and stores the
           necessary information ("eggs.ham" and &fastlocals[n]) in
           case its binding for the name "eggs" changes.  It then
           locates the object referred to by the key "eggs" in its
           dict and recursively calls

               _PyObject_TrackName(eggs, "eggs.ham", &fastlocals[n])

        3. The eggs object strips the leading "eggs.", stores the
           ("ham", &fastlocals[n]) info, locates the object in its
           namespace called "ham" and calls _PyObject_TrackName once
           again:

               _PyObject_TrackName(ham, "ham", &fastlocals[n])

        4. The "ham" object strips the leading string (no "." this
           time, but that's a minor point), sees that the result is
           empty, then uses its own value (self, probably) to update
           the location it was handed:

               Py_XDECREF(fastlocals[n]);
               fastlocals[n] = self;
               Py_INCREF(fastlocals[n]);

        At this point, each object involved in resolving
        "spam.eggs.ham" knows which entry in its namespace needs to be
        tracked and what location to update if that name changes.
        Furthermore, if the one name it is tracking in its local
        storage changes, it can call _PyObject_TrackName using the new
        object once the change has been made.  At the bottom end of
        the food chain, the last object will always strip a name, see
        the empty string and know that its value should be stuffed
        into the location it's been passed.
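
    The walk above can be modelled in a few lines of Python (a toy
    sketch only: the real proposal is C code inside the interpreter,
    and the names NS, track_name and rebind are invented here; a
    one-element list stands in for &fastlocals[n]):

```python
class NS:
    """A namespace object that can watch its own attribute bindings."""
    def __init__(self, **attrs):
        self.__dict__["_watchers"] = []     # (head, rest, slot) triples
        self.__dict__.update(attrs)

def track_name(container, dotted, slot):
    """Register trackers along `dotted` and leave the final object in
    `slot` (a one-element list standing in for &fastlocals[n])."""
    head, _, rest = dotted.partition(".")
    # If `head` is ever rebound on `container`, redo tracking for `rest`.
    container._watchers.append((head, rest, slot))
    child = getattr(container, head)
    if rest:
        track_name(child, rest, slot)
    else:
        slot[0] = child            # bottom of the chain: store the value

def rebind(container, name, value):
    """Rebinding an attribute re-runs tracking for every watcher."""
    setattr(container, name, value)
    for head, rest, slot in container._watchers:
        if head == name:
            if rest:
                track_name(value, rest, slot)
            else:
                slot[0] = value
```

    With foo = NS(spam=NS(eggs=NS(ham=1))) and slot = [None], calling
    track_name(foo, "spam.eggs.ham", slot) leaves 1 in the slot, and a
    later rebind of "eggs" on the spam object updates the slot
    automatically.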

        When the object referred to by the dotted expression
        "spam.eggs.ham" is going to go out of scope, an
        "UNTRACK_GLOBAL spam.eggs.ham n" instruction is executed.  It
        has the effect of deleting all the tracking information that
        TRACK_GLOBAL established.

        The tracking operation may seem expensive, but recall that the
        objects being tracked are assumed to be "almost constant", so
        the setup cost will be traded off against hopefully multiple
        local instead of global loads.  For globals with attributes
        the tracking setup cost grows but is offset by avoiding the
        extra LOAD_ATTR cost.  The TRACK_GLOBAL instruction needs to
        perform a PyDict_GetItemString for the first name in the chain
        to determine where the top-level object resides.  Each object
        in the chain has to store a string and an address somewhere,
        probably in a dict that uses storage locations as keys
        (e.g. the &fastlocals[n]) and strings as values.  (This dict
        could possibly be a central dict of dicts whose keys are
        object addresses instead of a per-object dict.)  It shouldn't
        be the other way around because multiple active frames may
        want to track "spam.eggs.ham", but only one frame will want to
        associate that name with one of its fast locals slots.
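
    The central-dict variant of that bookkeeping might look like this
    (a sketch only; trackers, add_tracking and drop_tracking are names
    invented here, and object addresses are modelled with id()):

```python
# Central registry: watched object's address -> {slot address: name}.
# Each inner dict maps a storage location to the one name that
# location is tracking, as described above.
trackers = {}

def add_tracking(obj, slot, name):
    trackers.setdefault(id(obj), {})[id(slot)] = name

def drop_tracking(obj, slot):
    # UNTRACK_GLOBAL would tear these entries down again.
    trackers.get(id(obj), {}).pop(id(slot), None)
```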


Unresolved Issues

    Threading -

    What about this (dumb) code?

        l = []
        lock = threading.Lock()
        ...
        def fill_l():
            for i in range(1000):
                lock.acquire()
                l.append(math.sin(i))
                lock.release()
        ...
        def consume_l():
            while 1:
                lock.acquire()
                if l:
                    elt = l.pop()
                lock.release()
                fiddle(elt)

    It's not clear from a static analysis of the code what the lock is
    protecting.  (You can't tell at compile-time that threads are even
    involved, can you?)  Would or should it affect attempts to track
    "l.append" or "math.sin" in the fill_l function?

    If we annotate the code with mythical track_object and untrack_object
    builtins (I'm not proposing this, just illustrating where stuff would
    go!), we get

        l = []
        lock = threading.Lock()
        ...
        def fill_l():
            track_object("l.append", append)
            track_object("math.sin", sin)
            for i in range(1000):
                lock.acquire()
                append(sin(i))
                lock.release()
            untrack_object("math.sin", sin)
            untrack_object("l.append", append)
        ...
        def consume_l():
            while 1:
                lock.acquire()
                if l:
                    elt = l.pop()
                lock.release()
                fiddle(elt)

    Is that correct both with and without threads (or at least equally
    incorrect with and without threads)?

    Nested Scopes -

    The presence of nested scopes will affect where TRACK_GLOBAL finds
    a global variable, but shouldn't affect anything after that.  (I
    think.)

    Missing Attributes -

    Suppose I am tracking the object referred to by "spam.eggs.ham"
    and "spam.eggs" is rebound to an object that does not have a "ham"
    attribute.  It's clear this will be an AttributeError if the
    programmer attempts to resolve "spam.eggs.ham" in the current
    Python virtual machine, but suppose the programmer has anticipated
    this case:

        if hasattr(spam.eggs, "ham"):
            print spam.eggs.ham
        elif hasattr(spam.eggs, "bacon"):
            print spam.eggs.bacon
        else:
            print "what? no meat?"

    You can't raise an AttributeError when the tracking information is
    recalculated.  If it does not raise AttributeError and instead
    lets the tracking stand, it may be setting the programmer up for a
    very subtle error.

    One solution to this problem would be to track the shortest
    possible root of each dotted expression the function refers to
    directly.  In the above example, "spam.eggs" would be tracked, but
    "spam.eggs.ham" and "spam.eggs.bacon" would not.

    Who does the dirty work? -

    In the Questions section I postulated the existence of a
    _PyObject_TrackName function.  While the API is fairly easy to
    specify, the implementation behind-the-scenes is not so obvious.
    A central dictionary could be used to track the name/location
    mappings, but it appears that all setattr functions might need to
    be modified to accommodate this new functionality.

    If all types used the PyObject_GenericSetAttr function to set
    attributes that would localize the update code somewhat.  They
    don't however (which is not too surprising), so it seems that all
    setattrfunc and setattrofunc functions will have to be updated.
    In addition, this would place an absolute requirement on C
    extension module authors to call some function when an attribute
    changes value (PyObject_TrackUpdate?).

    Finally, it's quite possible that some attributes will be set by
    side effect and not by any direct call to a setattr method of some
    sort.  Consider a device interface module that has an interrupt
    routine that copies the contents of a device register into a slot
    in the object's struct whenever it changes.  In these situations,
    more extensive modifications would have to be made by the module
    author.  To identify such situations at compile time would be
    impossible.  I think an extra slot could be added to PyTypeObjects
    to indicate if an object's code is safe for global tracking.  It
    would have a default value of 0 (Py_TRACKING_NOT_SAFE).  If an
    extension module author has implemented the necessary tracking
    support, that field could be initialized to 1 (Py_TRACKING_SAFE).
    _PyObject_TrackName could check that field and issue a warning if
    it is asked to track an object that the author has not explicitly
    said was safe for tracking.

Discussion

    Jeremy Hylton has an alternate proposal on the table [2].  His
    proposal seeks to create a hybrid dictionary/list object for use
    in global name lookups that would make global variable access look
    more like local variable access.  While there is no C code
    available to examine, the Python implementation given in his
    proposal still appears to require dictionary key lookup.  It
    doesn't appear that his proposal could speed local variable
    attribute lookup, which might be worthwhile in some situations if
    potential performance burdens could be addressed.


Backwards Compatibility

    I don't believe there will be any serious issues of backward
    compatibility.  Obviously, Python bytecode that contains
    TRACK_OBJECT opcodes could not be executed by earlier versions of
    the interpreter, but breakage at the bytecode level is often
    assumed between versions.


Implementation

    TBD.  This is where I need help.  I believe there should be either
    a central name/location registry or the code that modifies object
    attributes should be modified, but I'm not sure the best way to go
    about this.  If you look at the code that implements the
    STORE_GLOBAL and STORE_ATTR opcodes, it seems likely that some
    changes will be required to PyDict_SetItem and PyObject_SetAttr or
    their String variants.  Ideally, there'd be a fairly central place
    to localize these changes.  If you begin considering tracking
    attributes of local variables you get into issues of modifying
    STORE_FAST as well, which could be a problem, since the name
    bindings for local variables are changed much more frequently.  (I
    think an optimizer could avoid inserting the tracking code for the
    attributes for any local variables where the variable's name
    binding changes.)


Performance

    I believe (though I have no code to prove it at this point), that
    implementing TRACK_OBJECT will generally not be much more
    expensive than a single LOAD_GLOBAL instruction or a
    LOAD_GLOBAL/LOAD_ATTR pair.  An optimizer should be able to avoid
    converting LOAD_GLOBAL and LOAD_GLOBAL/LOAD_ATTR to the new scheme
    unless the object access occurred within a loop.  Further down the
    line, a register-oriented replacement for the current Python
    virtual machine [3] could conceivably eliminate most of the
    LOAD_FAST instructions as well.

    The number of tracked objects should be relatively small.  All
    active frames of all active threads could conceivably be tracking
    objects, but this seems small compared to the number of functions
    defined in a given application.


References

    [1] http://mail.python.org/pipermail/python-dev/2000-July/007609.html

    [2] http://www.zope.org/Members/jeremy/CurrentAndFutureProjects/FastGlobalsPEP

    [3] http://www.musi-cal.com/~skip/python/rattlesnake20010813.tar.gz


Copyright

    This document has been placed in the public domain.



pep-0267 Optimized Access to Module Namespaces

PEP: 267
Title: Optimized Access to Module Namespaces
Version: $Revision$
Last-Modified: $Date$
Author: Jeremy Hylton <jeremy at alum.mit.edu>
Status: Deferred
Type: Standards Track
Created: 23-May-2001
Python-Version: 2.2
Post-History: 

Deferral

    While this PEP is a nice idea, no-one has yet emerged to do the work of
    hashing out the differences between this PEP, PEP 266 and PEP 280.
    Hence, it is being deferred.


Abstract

    This PEP proposes a new implementation of global module namespaces
    and the builtin namespace that speeds name resolution.  The
    implementation would use an array of object pointers for most
    operations in these namespaces.  The compiler would assign indices
    for global variables and module attributes at compile time.

    The current implementation represents these namespaces as
    dictionaries.  A global name incurs a dictionary lookup each time
    it is used; a builtin name incurs two dictionary lookups, a failed
    lookup in the global namespace and a second lookup in the builtin
    namespace.

    This implementation should speed Python code that uses
    module-level functions and variables.  It should also eliminate
    awkward coding styles that have evolved to speed access to these
    names.

    The implementation is complicated because the global and builtin
    namespaces can be modified dynamically in ways that are impossible
    for the compiler to detect.  (Example: A module's namespace is
    modified by a script after the module is imported.)  As a result,
    the implementation must maintain several auxiliary data structures
    to preserve these dynamic features.


Introduction

    This PEP proposes a new implementation of attribute access for
    module objects that optimizes access to module variables known at
    compile time.  The module will store these variables in an array
    and provide an interface to lookup attributes using array offsets.
    For globals, builtins, and attributes of imported modules, the
    compiler will generate code that uses the array offsets for fast
    access.

    [describe the key parts of the design: dlict, compiler support,
    stupid name trick workarounds, optimization of other module's
    globals]

    The implementation will preserve existing semantics for module
    namespaces, including the ability to modify module namespaces at
    runtime in ways that affect the visibility of builtin names.


DLict design

    The namespaces are implemented using a data structure that has
    sometimes gone under the name dlict.  It is a dictionary that has
    numbered slots for some dictionary entries.  The type must be
    implemented in C to achieve acceptable performance.  The new
    type-class unification work should make this fairly easy.  The
    DLict will presumably be a subclass of dictionary with an
    alternate storage module for some keys.

    A Python implementation is included here to illustrate the basic
    design:

        """A dictionary-list hybrid"""

        import types

        class DLict:
            def __init__(self, names):
                assert isinstance(names, types.DictType)
                self.names = names
                size = len(names)
                self.list = [None] * size
                self.empty = [1] * size
                self.dict = {}
                self.size = 0

            def __getitem__(self, name):
                i = self.names.get(name)
                if i is None:
                    return self.dict[name]
                if self.empty[i] is not None:
                    raise KeyError, name
                return self.list[i]

            def __setitem__(self, name, val):
                i = self.names.get(name)
                if i is None:
                    self.dict[name] = val
                else:
                    if self.empty[i] is not None:
                        # only count a slot the first time it is filled
                        self.size += 1
                    self.empty[i] = None
                    self.list[i] = val

            def __delitem__(self, name):
                i = self.names.get(name)
                if i is None:
                    del self.dict[name]
                else:
                    if self.empty[i] is not None:
                        raise KeyError, name
                    self.empty[i] = 1
                    self.list[i] = None
                    self.size -= 1

            def keys(self):
                if self.dict:
                    return self.names.keys() + self.dict.keys()
                else:
                    return self.names.keys()

            def values(self):
                if self.dict:
                    return self.names.values() + self.dict.values()
                else:
                    return self.names.values()

            def items(self):
                if self.dict:
                    return self.names.items() + self.dict.items()
                else:
                    return self.names.items()

            def __len__(self):
                return self.size + len(self.dict)

            def __cmp__(self, dlict):
                c = cmp(self.names, dlict.names)
                if c != 0:
                    return c
                c = cmp(self.size, dlict.size)
                if c != 0:
                    return c
                for i in range(len(self.names)):
                    c = cmp(self.empty[i], dlict.empty[i])
                    if c != 0:
                        return c
                    if self.empty[i] is None:
                        c = cmp(self.list[i], dlict.list[i])
                        if c != 0:
                            return c
                return cmp(self.dict, dlict.dict)

            def clear(self):
                self.dict.clear()
                for i in range(len(self.names)):
                    if self.empty[i] is None:
                        self.empty[i] = 1
                        self.list[i] = None

            def update(self):
                pass

            def load(self, index):
                """dlict-special method to support indexed access"""
                if self.empty[index] is None:
                    return self.list[index]
                else:
                    raise KeyError, index # XXX might want reverse mapping

            def store(self, index, val):
                """dlict-special method to support indexed access"""
                self.empty[index] = None
                self.list[index] = val

            def delete(self, index):
                """dlict-special method to support indexed access"""
                self.empty[index] = 1
                self.list[index] = None
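
    The core of the design can be condensed into a short modern-Python
    sketch (simplified; MiniDLict and _MISSING are names invented here,
    and error behaviour is reduced to the essentials):

```python
_MISSING = object()          # marks an empty array slot

class MiniDLict:
    """Names known at compile time get array slots; everything else
    falls back to an ordinary dict."""
    def __init__(self, names):
        self.names = {n: i for i, n in enumerate(names)}   # name -> slot
        self.slots = [_MISSING] * len(names)
        self.extra = {}                                    # dynamic names

    def __getitem__(self, name):
        i = self.names.get(name)
        if i is None:
            return self.extra[name]
        val = self.slots[i]
        if val is _MISSING:
            raise KeyError(name)
        return val

    def __setitem__(self, name, val):
        i = self.names.get(name)
        if i is None:
            self.extra[name] = val                 # dynamic addition
        else:
            self.slots[i] = val

    def load(self, i):
        # The fast path the proposed opcodes would use: no hashing,
        # just an index assigned by the compiler.
        val = self.slots[i]
        if val is _MISSING:
            raise KeyError(i)
        return val
```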


Compiler issues

    The compiler currently collects the names of all global variables
    in a module.  These are names bound at the module level or bound
    in a class or function body that declares them to be global.

    The compiler would assign indices for each global name and add the
    names and indices of the globals to the module's code object.
    Each code object would then be bound irrevocably to the module it
    was defined in.  (Not sure if there are some subtle problems with
    this.)

    For attributes of imported modules, the module will store an
    indirection record.  Internally, the module will store a pointer
    to the defining module and the offset of the attribute in the
    defining module's global variable array.  The offset would be
    initialized the first time the name is looked up.


Runtime model

    The PythonVM will be extended with new opcodes to access globals
    and module attributes via a module-level array.

    A function object would need to point to the module that defined
    it in order to provide access to the module-level global array.

    For module attributes stored in the dlict (call them static
    attributes), the get/delattr implementation would need to track
    access to these attributes using the old by-name interface.  If a
    static attribute is updated dynamically, e.g.

        mod.__dict__["foo"] = 2

    the implementation would need to update the array slot instead of
    the backup dict.


Backwards compatibility

    The dlict will need to maintain meta-information about whether a
    slot is currently used or not.  It will also need to maintain a
    pointer to the builtin namespace.  When a name is not currently
    used in the global namespace, the lookup will have to fail over to
    the builtin namespace.

    In the reverse case, each module may need a special accessor
    function for the builtin namespace that checks to see if a global
    shadowing the builtin has been added dynamically.  This check
    would only occur if there was a dynamic change to the module's
    dlict, i.e. when a name is bound that wasn't discovered at
    compile-time.
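
    Both directions of the failover can be sketched as follows (a
    simplified model, not the proposed C code: `static` holds names
    discovered at compile time, with None marking an unbound slot, and
    `dynamic` holds names bound later; the function names are invented):

```python
def load_global(static, dynamic, builtins, name):
    # A global that is not currently bound fails over to the builtins.
    val = static.get(name)
    if val is not None:
        return val
    if name in dynamic:
        return dynamic[name]
    return builtins[name]

def load_builtin(static, dynamic, builtins, name):
    # The reverse case: the accessor must first check whether a
    # dynamically added global now shadows the builtin.
    if name in dynamic:
        return dynamic[name]
    if name in static and static[name] is not None:
        return static[name]
    return builtins[name]
```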

    These mechanisms would have little if any cost for the common case
    where a module's global namespace is not modified in strange
    ways at runtime.  They would add overhead for modules that did
    unusual things with global names, but this is an uncommon practice
    and probably one worth discouraging.

    It may be desirable to disable dynamic additions to the global
    namespace in some future version of Python.  If so, the new
    implementation could provide warnings.
    

Related PEPs

    PEP 266, Optimizing Global Variable/Attribute Access, proposes a
    different mechanism for optimizing access to global variables as
    well as attributes of objects.  The mechanism uses two new opcodes
    TRACK_OBJECT and UNTRACK_OBJECT to create a slot in the local
    variables array that aliases the global or object attribute.  If
    the object being aliased is rebound, the rebind operation is
    responsible for updating the aliases.

    The object tracking approach applies to a wider range of
    objects than just modules.  It may also have a higher runtime cost,
    because each function that uses a global or object attribute must
    execute extra opcodes to register its interest in an object and
    unregister on exit; the cost of registration is unclear, but
    presumably involves a dynamically resizable data structure to hold
    a list of callbacks.

    The implementation proposed here avoids the need for registration,
    because it does not create aliases.  Instead it allows functions
    that reference a global variable or module attribute to retain a
    pointer to the location where the original binding is stored.  A
    second advantage is that the initial lookup is performed once per
    module rather than once per function call.


Copyright

    This document has been placed in the public domain.



pep-0268 Extended HTTP functionality and WebDAV

PEP: 268
Title: Extended HTTP functionality and WebDAV
Version: $Revision$
Last-Modified: $Date$
Author: gstein at lyra.org (Greg Stein)
Status: Rejected
Type: Standards Track
Content-Type: text/x-rst
Created: 20-Aug-2001
Python-Version: 2.x
Post-History: 21-Aug-2001

Rejection Notice

This PEP has been rejected. It has failed to generate sufficient community support in the six years since its proposal.

Abstract

This PEP discusses new modules and extended functionality for Python's HTTP support. Notably, the addition of authenticated requests, proxy support, authenticated proxy usage, and WebDAV [1] capabilities.

Rationale

Python has been quite popular as a result of its "batteries included" positioning. One of the most heavily used protocols, HTTP (see RFC 2616), has been included with Python for years (httplib). However, this support has not kept up with the full needs and requirements of many HTTP-based applications and systems. In addition, new protocols based on HTTP, such as WebDAV and XML-RPC, are becoming useful and are seeing increasing usage. Supplying this functionality meets Python's "batteries included" role and also keeps Python at the leading edge of new technologies.

While authentication and proxy support are two very notable features missing from Python's core HTTP processing, they are minimally handled as part of Python's URL handling (urllib and urllib2). However, applications that need fine-grained or sophisticated HTTP handling cannot make use of the features while they reside in urllib. Refactoring these features into a location where they can be directly associated with an HTTP connection will improve their utility for both urllib and for sophisticated applications.

The motivation for this PEP was from several people requesting these features directly, and from a number of feature requests on SourceForge. Since the exact form of the modules to be provided and the classes/architecture used could be subject to debate, this PEP was created to provide a focal point for those discussions.

Specification

Two modules will be added to the standard library: httpx (HTTP extended functionality), and davlib (WebDAV library).

[ suggestions for module names are welcome; davlib has some precedence, but something like webdav might be desirable ]

HTTP Authentication

The httpx module will provide a mixin for performing HTTP authentication (for both proxy and origin server authentication). This mixin (httpx.HandleAuthentication) can be combined with the HTTPConnection and the HTTPSConnection classes (the mixin may possibly work with the HTTP and HTTPS compatibility classes, but that is not a requirement).

The mixin will delegate the authentication process to one or more "authenticator" objects, allowing multiple connections to share authenticators. The use of a separate object allows for a long term connection to an authentication system (e.g. LDAP). An authenticator for the Basic and Digest mechanisms (see RFC 2617) will be provided. User-supplied authenticator subclasses can be registered and used by the connections.
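
For the Basic scheme, what such an authenticator must ultimately produce is a single header value (a standalone illustration per RFC 2617; basic_auth_header is an invented helper, not part of the proposed API):

```python
import base64

def basic_auth_header(username, password):
    # RFC 2617 Basic: base64 of "username:password", sent verbatim in
    # the Authorization (or Proxy-Authorization) header.
    token = base64.b64encode(("%s:%s" % (username, password)).encode()).decode()
    return "Basic " + token
```

The RFC's own example, user "Aladdin" with password "open sesame", yields "Basic QWxhZGRpbjpvcGVuIHNlc2FtZQ==".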

A "credentials" object (httpx.Credentials) is also associated with the mixin, and stores the credentials (e.g. username and password) needed by the authenticators. Subclasses of Credentials can be created to hold additional information (e.g. NT domain).

The mixin overrides the getresponse() method to detect 401 (Unauthorized) and 407 (Proxy Authentication Required) responses. When this is found, the response object, the connection, and the credentials are passed to the authenticator corresponding with the authentication scheme specified in the response (multiple authenticators are tried in decreasing order of security if multiple schemes are in the response). Each authenticator can examine the response headers and decide whether and how to resend the request with the correct authentication headers. If no authenticator can successfully handle the authentication, then an exception is raised.

Resending a request, with the appropriate credentials, is one of the more difficult portions of the authentication system. The difficulty arises in recording what was sent originally: the request line, the headers, and the body. By overriding putrequest, putheader, and endheaders, we can capture all but the body. Once the endheaders method is called, then we capture all calls to send() (until the next putrequest method call) to hold the body content. The mixin will have a configurable limit for the amount of data to hold in this fashion (e.g. only hold up to 100k of body content). Assuming that the entire body has been stored, then we can resend the request with the appropriate authentication information.
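
The capture logic described above might be modelled like this (a sketch; RequestRecorder and its default limit are illustrative, not the proposed interface):

```python
class RequestRecorder:
    """Hypothetical replay buffer for resending an authenticated request."""
    def __init__(self, limit=100 * 1024):
        self.limit = limit
        self.request_line = None
        self.headers = []
        self.body = b""
        self.overflow = False

    def putrequest(self, method, url):
        # A new request resets everything captured so far.
        self.request_line = (method, url)
        self.headers, self.body, self.overflow = [], b"", False

    def putheader(self, name, value):
        self.headers.append((name, value))

    def send(self, data):
        # Body bytes are held only up to the configured limit; past it,
        # the caller must regenerate the request itself.
        if len(self.body) + len(data) > self.limit:
            self.overflow = True
        else:
            self.body += data
```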

If the body is too large to be stored, then the getresponse() simply returns the response object, indicating the 401 or 407 error. Since the authentication information has been computed and cached (into the Credentials object; see below), the caller can simply regenerate the request. The mixin will attach the appropriate credentials.

A "protection space" (see RFC 2617, section 1.2) is defined as a tuple of the host, port, and authentication realm. When a request is initially sent to an HTTP server, we do not know the authentication realm (the realm is only returned when authentication fails). However, we do have the path from the URL, and that can be useful in determining the credentials to send to the server. The Basic authentication scheme is typically set up hierarchically: the credentials for /path can be tried for /path/subpath. The Digest authentication scheme has explicit support for the hierarchical setup. The httpx.Credentials object will store credentials for multiple protection spaces, and can be looked up in two different ways:

  1. looked up using (host, port, path) -- this lookup scheme is used when generating a request for a path where we don't know the authentication realm.
  2. looked up using (host, port, realm) -- this mechanism is used during the authentication process when the server has specified that the Request-URI resides within a specific realm.
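
A sketch of the two lookup paths (the class shape is assumed, modelled loosely on the httpx.Credentials description above; the method names are invented):

```python
class Credentials:
    """Hypothetical store keyed by both kinds of protection space."""
    def __init__(self):
        self.by_realm = {}    # (host, port, realm) -> credentials
        self.by_path = {}     # (host, port, path)  -> credentials

    def add(self, host, port, realm, path, userpass):
        self.by_realm[(host, port, realm)] = userpass
        self.by_path[(host, port, path)] = userpass

    def for_realm(self, host, port, realm):
        # Used once the server has named the realm in its challenge.
        return self.by_realm.get((host, port, realm))

    def for_path(self, host, port, path):
        # Basic auth is set up hierarchically: credentials stored for
        # /path may be tried for /path/subpath, so walk up to the root.
        while True:
            if (host, port, path) in self.by_path:
                return self.by_path[(host, port, path)]
            if path in ("/", ""):
                return None
            path = path.rsplit("/", 1)[0] or "/"
```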

The HandleAuthentication mixin will override putrequest() to automatically insert credentials, if available. The URL from the putrequest is used to determine the appropriate authentication information to use.

It is also important to note that two sets of credentials are used, and stored by the mixin. One set for any proxy that may be used, and one used for the target origin server. Since proxies do not have paths, the protection spaces in the proxy credentials will always use "/" for storing and looking up via a path.

Proxy Handling

The httpx module will provide a mixin for using a proxy to perform HTTP(S) operations. This mixin (httpx.UseProxy) can be combined with the HTTPConnection and the HTTPSConnection classes (the mixin may possibly work with the HTTP and HTTPS compatibility classes, but that is not a requirement).

The mixin will record the (host, port) of the proxy to use. XXX will be overridden to use this host/port combination for connections and to rewrite request URLs into the absoluteURIs referring to the origin server (these URIs are passed to the proxy server).
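
Rewriting a request URL into the absoluteURI form a proxy expects (RFC 2616, section 5.1.2) is straightforward (absolute_uri is a hypothetical helper, not part of the proposed mixin):

```python
def absolute_uri(host, port, path, scheme="http"):
    # A proxy receives "GET http://host[:port]/path" rather than
    # "GET /path"; default ports are conventionally omitted.
    default = {"http": 80, "https": 443}[scheme]
    netloc = host if port == default else "%s:%d" % (host, port)
    return "%s://%s%s" % (scheme, netloc, path)
```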

Proxy authentication is handled by the httpx.HandleAuthentication class since a user may directly use HTTP(S)Connection to speak with proxies.

WebDAV Features

The davlib module will provide a mixin for sending WebDAV requests to a WebDAV-enabled server. This mixin (davlib.DAVClient) can be combined with the HTTPConnection and the HTTPSConnection classes (the mixin may possibly work with the HTTP and HTTPS compatibility classes, but that is not a requirement).

The mixin provides methods to perform the various HTTP methods defined by HTTP in RFC 2616, and by WebDAV in RFC 2518.

A custom response object is used to decode 207 (Multi-Status) responses. The response object will use the standard library's xml package to parse the multistatus XML information, producing a simple structure of objects to hold the multistatus data. Multiple parsing schemes will be tried/used, in order of decreasing speed.
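
A minimal sketch of decoding a 207 body with the standard xml package (the element layout follows RFC 2518; parse_multistatus is an invented helper, far simpler than the response object the PEP describes):

```python
import xml.etree.ElementTree as ET

# Sample 207 (Multi-Status) body, shaped per RFC 2518.
MULTISTATUS = """<?xml version="1.0"?>
<D:multistatus xmlns:D="DAV:">
  <D:response>
    <D:href>/container/file.txt</D:href>
    <D:status>HTTP/1.1 200 OK</D:status>
  </D:response>
</D:multistatus>"""

def parse_multistatus(text):
    # Pull (href, status) pairs out of a multistatus document.
    ns = {"D": "DAV:"}
    root = ET.fromstring(text)
    return [(r.findtext("D:href", namespaces=ns),
             r.findtext("D:status", namespaces=ns))
            for r in root.findall("D:response", ns)]
```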

Reference Implementation

The actual (future/final) implementation is being developed in the /nondist/sandbox/Lib directory, until it is accepted and moved into the main Lib directory.

pep-0269 Pgen Module for Python

PEP: 269
Title: Pgen Module for Python
Version: $Revision$
Last-Modified: $Date$
Author: Jonathan Riehl <jriehl at spaceship.com>
Status: Deferred
Type: Standards Track
Created: 24-Aug-2001
Python-Version: 2.2
Post-History: 

Abstract

    Much like the parser module exposes the Python parser, this PEP
    proposes that the parser generator used to create the Python
    parser, pgen, be exposed as a module in Python.


Rationale

    Through the course of Pythonic history, there have been numerous
    discussions about the creation of a Python compiler [1].  These
    have resulted in several implementations of Python parsers, most
    notably the parser module currently provided in the Python
    standard library [2] and Jeremy Hylton's compiler module [3].
    However, while multiple language changes have been proposed
    [4][5], experimentation with the Python syntax has lacked the
    benefit of a Python binding to the actual parser generator used to
    build Python.

    By providing a Python wrapper analogous to Fred Drake Jr.'s parser
    wrapper, but targeted at the pgen library, the following
    assertions are made:

    1. Reference implementations of syntax changes will be easier to
       develop.  Currently, a reference implementation of a syntax
       change would require the developer to use the pgen tool from
       the command line.  The resulting parser data structure would
       then either have to be reworked to interface with a custom
       CPython implementation, or wrapped as a C extension module.

    2. Reference implementations of syntax changes will be easier to
       distribute.  Since the parser generator will be available in
       Python, it should follow that the resulting parser will be
       accessible from Python.  Therefore, reference implementations
       should be available as pure Python code, versus using custom
       versions of the existing CPython distribution, or as compilable
       extension modules.

    3. Reference implementations of syntax changes will be easier to
       discuss with a larger audience.  This somewhat falls out of the
       second assertion, since the community of Python users is most
       likely larger than the community of CPython developers.

    4. Development of small languages in Python will be further
       enhanced, since the additional module will be a fully
       functional LL(1) parser generator.


Specification

    The proposed module will be called pgen.  The pgen module will
    contain the following functions:

    parseGrammarFile (fileName) -> AST
        The parseGrammarFile() function will read the file pointed to
        by fileName and create an AST object.  The AST nodes will
        contain the nonterminal, numeric values of the parser
        generator meta-grammar.  The output AST will be an instance of
        the AST extension class as provided by the parser module.
        Syntax errors in the input file will cause the SyntaxError
        exception to be raised.

    parseGrammarString (text) -> AST
        The parseGrammarString() function will follow the semantics of
        the parseGrammarFile(), but accept the grammar text as a
        string for input, as opposed to the file name.

    buildParser (grammarAst) -> DFA
        The buildParser() function will accept an AST object for input
        and return a DFA (deterministic finite automaton) data
        structure.  The DFA data structure will be a C extension
        class, much like the AST structure is provided in the parser
        module.  If the input AST does not conform to the nonterminal
        codes defined for the pgen meta-grammar, buildParser() will
        throw a ValueError exception.

    parseFile (fileName, dfa, start) -> AST
        The parseFile() function will essentially be a wrapper for the
        PyParser_ParseFile() C API function.  The wrapper code will
        accept the DFA C extension class, and the file name.  An AST
        instance that conforms to the lexical values in the token
        module and the nonterminal values contained in the DFA will be
        output.

    parseString (text, dfa, start) -> AST
        The parseString() function will operate in a similar fashion
        to the parseFile() function, but accept the parse text as an
        argument.  Much like parseFile() will wrap the
        PyParser_ParseFile() C API function, parseString() will wrap
        the PyParser_ParseString() function.

    symbolToStringMap (dfa) -> dict
        The symbolToStringMap() function will accept a DFA instance
        and return a dictionary object that maps from the DFA's
        numeric values for its nonterminals to the string names of the
        nonterminals as found in the original grammar specification
        for the DFA.

    stringToSymbolMap (dfa) -> dict
        The stringToSymbolMap() function outputs a dictionary mapping
        the nonterminal names of the input DFA to their corresponding
        numeric values.

    Extra credit will be awarded if the map generation functions and
    parsing functions are also methods of the DFA extension class.
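    The relationship between the two proposed map functions can be
    sketched in plain Python.  The numeric codes, the dict standing in
    for the DFA extension class, and the snake_case names below are
    illustrative assumptions only, not the real pgen data structures:

```python
# Sketch: stringToSymbolMap() would simply be the inverse of
# symbolToStringMap().  Here a plain dict of {code: name} stands in
# for the proposed DFA extension class; the codes are made up.

def symbol_to_string_map(dfa):
    # In the proposal this would query the DFA object itself.
    return dict(dfa)

def string_to_symbol_map(dfa):
    # Invert the mapping: nonterminal name -> numeric value.
    return {name: code for code, name in dfa.items()}

fake_dfa = {256: "single_input", 257: "file_input"}
print(string_to_symbol_map(fake_dfa)["file_input"])  # 257
```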


Implementation Plan

    A cunning plan has been devised to accomplish this enhancement:

    1. Rename the pgen functions to conform to the CPython naming
       standards.  This action may involve adding some header files to
       the Include subdirectory.

    2. Move the pgen C modules in the Makefile.pre.in from unique pgen
       elements to the Python C library.

    3. Make any needed changes to the parser module so the AST
       extension class understands that there are AST types it may not
       understand.  Cursory examination of the AST extension class
       shows that it keeps track of whether the tree is a suite or an
       expression.

    4. Code an additional C module in the Modules directory.  The C
       extension module will implement the DFA extension class and the
       functions outlined in the previous section.

    5. Add the new module to the build process.  Black magic, indeed.


Limitations

    Under this proposal, would-be designers of Python 3000 will still
    be constrained to Python's lexical conventions.  The addition,
    subtraction, or modification of the Python lexer is outside the
    scope of this PEP.


Reference Implementation

    No reference implementation is currently provided. A patch
    was provided at some point in
    http://sourceforge.net/tracker/index.php?func=detail&aid=599331&group_id=5470&atid=305470
    but that patch is no longer maintained.


References

    [1] The (defunct) Python Compiler-SIG
        http://www.python.org/sigs/compiler-sig/

    [2] Parser Module Documentation
        http://docs.python.org/library/parser.html

    [3] Hylton, Jeremy.  Compiler Module Documentation
        http://docs.python.org/library/compiler.html

    [4] Pelletier, Michel. "Python Interface Syntax", PEP-245.
        http://www.python.org/dev/peps/pep-0245/

    [5] The Python Types-SIG
        http://www.python.org/sigs/types-sig/


Copyright

    This document has been placed in the public domain.



pep-0270 uniq method for list objects

PEP: 270
Title: uniq method for list objects
Version: $Revision$
Last-Modified: $Date$
Author: Jason Petrone <jp at demonseed.net>
Status: Rejected
Type: Standards Track
Created: 21-Aug-2001
Python-Version: 2.2
Post-History: 

Notice

    This PEP is withdrawn by the author.  He writes:

        Removing duplicate elements from a list is a common task, but
        there are only two reasons I can see for making it a built-in.
        The first is if it could be done much faster, which isn't the
        case.  The second is if it makes it significantly easier to
        write code.  The introduction of sets.py eliminates this
        situation since creating a sequence without duplicates is just
        a matter of choosing a different data structure: a set instead
        of a list.

    As described in PEP 218, sets are being added to the standard
    library for Python 2.3.


Abstract

    This PEP proposes adding a method for removing duplicate elements to
    the list object.


Rationale

    Removing duplicates from a list is a common task.  I think it is
    useful and general enough to belong as a method in list objects.
    It also has potential for faster execution when implemented in C,
    especially if optimizations using hashing or sorting cannot be used.

    On comp.lang.python there are many, many posts [1] asking about
    the best way to do this task.  It's a little tricky to implement
    optimally, and it would be nice to save people the trouble of
    figuring it out themselves.


Considerations

    Tim Peters suggests trying to use a hash table, then trying to
    sort, and finally falling back on brute force [2].  Should uniq
    maintain list order at the expense of speed?
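    That tiered strategy can be sketched in Python (this is an
    order-preserving illustration only, not the C implementation the
    PEP proposes, and it omits the sort-based middle tier for
    unhashable but orderable elements):

```python
def uniq(seq):
    """Remove duplicates, keeping first occurrences: try a hash
    table first, and fall back on brute force for unhashable
    elements, as Tim Peters suggests."""
    try:
        seen = set()
        return [x for x in seq if not (x in seen or seen.add(x))]
    except TypeError:
        # Unhashable elements: O(n**2) brute force, still in order.
        result = []
        for x in seq:
            if x not in result:
                result.append(x)
        return result

print(uniq([1, 2, 1, 3, 2]))   # [1, 2, 3]
print(uniq([[1], [2], [1]]))   # [[1], [2]]
```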

    Is it spelled 'uniq' or 'unique'? 


Reference Implementation

    I've written the brute force version.  It's about 20 lines of code
    in listobject.c.  Adding support for hash table and sorted
    duplicate removal would only take another hour or so.


References

    [1] http://groups.google.com/groups?as_q=duplicates&as_ugroup=comp.lang.python  

    [2] Tim Peters unique() entry in the Python cookbook:
        http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/52560/index_txt


Copyright

    This document has been placed in the public domain.



pep-0271 Prefixing sys.path by command line option

PEP: 271
Title: Prefixing sys.path by command line option
Version: $Revision$
Last-Modified: $Date$
Author: Frédéric B. Giacometti <fred at arakne.com>
Status: Rejected
Type: Standards Track
Created: 15-Aug-2001
Python-Version: 2.2
Post-History: 

Abstract

    At present, setting the PYTHONPATH environment variable is the
    only method for defining additional Python module search
    directories.

    This PEP introduces the '-P' valued option to the python command
    as an alternative to PYTHONPATH.


Rationale

    On Unix:

        python -P $SOMEVALUE

    will be equivalent to

        env PYTHONPATH=$SOMEVALUE python

    On Windows 2K:

        python -P %SOMEVALUE%

    will (almost) be equivalent to

        set __PYTHONPATH=%PYTHONPATH% && set PYTHONPATH=%SOMEVALUE%\
            && python && set PYTHONPATH=%__PYTHONPATH%
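    The PYTHONPATH side of this equivalence can be demonstrated from
    present-day Python (the '/tmp/extra' entry is an arbitrary example
    directory, and the proposed '-P' option itself does not exist):

```python
import os
import subprocess
import sys

# Run a child interpreter with PYTHONPATH set, as the proposed
# `python -P /tmp/extra` is defined to be equivalent to doing.
env = dict(os.environ, PYTHONPATH="/tmp/extra")
out = subprocess.run(
    [sys.executable, "-c", "import sys; print('/tmp/extra' in sys.path)"],
    env=env, capture_output=True, text=True,
).stdout.strip()
print(out)  # True
```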
        

Other Information

    This option is equivalent to the 'java -classpath' option.


When to use this option

    This option is intended to make the use of Python in test or
    build scripts, for instance, easier and more robust.


Reference Implementation

    A patch implementing this is available from SourceForge:

    http://sourceforge.net/tracker/download.php?group_id=5470&atid=305470&file_id=6916&aid=429614
  
    with the patch discussion at:

    http://sourceforge.net/tracker/?func=detail&atid=305470&aid=429614&group_id=5470


Copyright

    This document has been placed in the public domain.



pep-0272 API for Block Encryption Algorithms v1.0

PEP: 272
Title: API for Block Encryption Algorithms v1.0
Version: $Revision$
Last-Modified: $Date$
Author: A.M. Kuchling <amk at amk.ca>
Status: Final
Type: Informational
Created: 18-Sep-2001
Post-History: 17-Apr-2002, 29-May-2002

Abstract

    This document specifies a standard API for secret-key block
    encryption algorithms such as DES or Rijndael, making it easier to
    switch between different algorithms and implementations.  


Introduction

    Encryption algorithms transform their input data (called
    plaintext) in some way that is dependent on a variable key,
    producing ciphertext.  The transformation can easily be reversed
    if and only if one knows the key.  The key is a sequence of bits
    chosen from some very large space of possible keys.  There are two
    classes of encryption algorithms: block ciphers and stream ciphers.

    Block ciphers encrypt multibyte inputs of a fixed size (frequently
    8 or 16 bytes long), and can be operated in various feedback
    modes.  The feedback modes supported in this specification are:

        Number    Constant      Description
        1         MODE_ECB      Electronic Code Book     
        2         MODE_CBC      Cipher Block Chaining
        3         MODE_CFB      Cipher Feedback 
        5         MODE_OFB      Output Feedback
        6         MODE_CTR      Counter

    These modes are to be implemented as described in NIST publication
    SP 800-38A [1].  Descriptions of the first three feedback modes can
    also be found in Bruce Schneier's book _Applied
    Cryptography_ [2]. 

    (The numeric value 4 is reserved for MODE_PGP, a variant of CFB
    described in RFC 2440: "OpenPGP Message Format" [3]. This mode
    isn't considered important enough to make it worth requiring it
    for all block encryption ciphers, though supporting it is a nice
    extra feature.)

    In a strict formal sense, stream ciphers encrypt data bit-by-bit;
    practically, stream ciphers work on a character-by-character
    basis.  This PEP only aims at specifying an interface for block
    ciphers, though stream ciphers can support the interface described
    here by fixing 'block_size' to 1.  Feedback modes also don't make
    sense for stream ciphers, so the only reasonable feedback mode
    would be ECB mode.


Specification

    Encryption modules can add additional functions, methods, and
    attributes beyond those described in this PEP, but all of the
    features described in this PEP must be present for a module to 
    claim compliance with it.  

    Secret-key encryption modules should define one function:

    new(key, mode, [IV], **kwargs)

    Returns a ciphering object, using the secret key contained in the
    string 'key', and using the feedback mode 'mode', which must be
    one of the constants from the table above. 

    If 'mode' is MODE_CBC or MODE_CFB, 'IV' must be provided and must
    be a string of the same length as the block size.  Not providing a
    value of 'IV' will result in a ValueError exception being raised.

    Depending on the algorithm, a module may support additional
    keyword arguments to this function.  Some keyword arguments are
    specified by this PEP, and modules are free to add additional
    keyword arguments.  If a value isn't provided for a given keyword,
    a secure default value should be used.  For example, if an
    algorithm has a selectable number of rounds between 1 and 16, and
    1-round encryption is insecure and 8-round encryption is believed
    secure, the default value for 'rounds' should be 8 or more.
    (Module implementors can choose a very slow but secure value, too,
    such as 16 in this example.  This decision is left up to the
    implementor.)

    The following table lists keyword arguments defined by this PEP:
    
      Keyword               Meaning
      counter               Callable object that returns counter blocks
                            (see below; CTR mode only)

      rounds                Number of rounds of encryption to use

      segment_size          Size of data and ciphertext segments,
                            measured in bits (see below; CFB mode only)
                               
    The Counter feedback mode requires a sequence of input blocks,
    called counters, that are used to produce the output.  When 'mode'
    is MODE_CTR, the 'counter' keyword argument must be provided, and
    its value must be a callable object, such as a function or method.
    Successive calls to this callable object must return strings of
    length 'block_size', and the sequence of returned strings must
    never repeat.  (Appendix B of the NIST publication gives a way to
    generate such a sequence, but that's beyond the scope of this
    PEP.)
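    A minimal counter callable might look like this sketch (a simple
    big-endian integer counter, not the NIST Appendix B construction;
    note that on modern Python the "strings" are bytes objects):

```python
def make_counter(block_size=16):
    """Return a callable producing successive counter blocks of
    exactly `block_size` bytes that never repeat, as CTR mode
    requires."""
    state = {"n": 0}
    def counter():
        # Encode the running integer as a fixed-width block.
        block = state["n"].to_bytes(block_size, "big")
        state["n"] += 1
        return block
    return counter

ctr = make_counter(8)
first, second = ctr(), ctr()
print(len(first), first != second)  # 8 True
```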
    
    The CFB mode operates on segments of the plaintext and ciphertext
    that are 'segment_size' bits long.  Therefore, when using this
    mode, the input and output strings must be a multiple of
    'segment_size' bits in length.  'segment_size' must be an integer
    between 1 and block_size*8, inclusive.  (The factor of 8 comes
    from 'block_size' being measured in bytes and not in bits).  The
    default value for this parameter should be block_size*8.
    Implementors are allowed to constrain 'segment_size' to be a
    multiple of 8 for simplicity, but they're encouraged to support
    arbitrary values for generality.

    Secret-key encryption modules should define two variables:

    block_size

        An integer value; the size of the blocks encrypted by this
        module, measured in bytes.  For all feedback modes, the length
        of strings passed to the encrypt() and decrypt() must be a
        multiple of the block size.  

    key_size

        An integer value; the size of the keys required by this
        module, measured in bytes.  If key_size is None, then the
        algorithm accepts variable-length keys.  This may mean the
        module accepts keys of any random length, or that there are a
        few different possible lengths, e.g. 16, 24, or 32 bytes.  You
        cannot pass a key of length 0 (that is, the null string '') as
        a variable-length key.

    Cipher objects should have two attributes:

    block_size

        An integer value equal to the size of the blocks encrypted by
        this object.  For algorithms with a variable block size, this
        value is equal to the block size selected for this object.

    IV
    
        Contains the initial value which will be used to start a
        cipher feedback mode; it will always be a string exactly one
        block in length.  After encrypting or decrypting a string,
        this value is updated to reflect the modified feedback text.
        It is read-only, and cannot be assigned a new value.

    Cipher objects require the following methods:

    decrypt(string)

        Decrypts 'string', using the key-dependent data in the object
        and with the appropriate feedback mode.  The string's length
        must be an exact multiple of the algorithm's block size or, in
        CFB mode, of the segment size.  Returns a string containing
        the plaintext.

    encrypt(string)

        Encrypts a non-empty string, using the key-dependent data in
        the object, and with the appropriate feedback mode.  The
        string's length must be an exact multiple of the algorithm's
        block size or, in CFB mode, of the segment size.  Returns a
        string containing the ciphertext.

    Here's an example, using a module named 'DES':

    >>> import DES
    >>> obj = DES.new('abcdefgh', DES.MODE_ECB)
    >>> plaintext = "Guido van Rossum is a space alien."
    >>> len(plaintext)
    34
    >>> obj.encrypt(plaintext)
    Traceback (innermost last):
      File "<stdin>", line 1, in ?
    ValueError: Strings for DES must be a multiple of 8 in length
    >>> ciphertext = obj.encrypt(plaintext + 'XXXXXX')   # Add padding
    >>> ciphertext
    '\021,\343Nq\214DY\337T\342pA\372\255\311s\210\363,\300j\330\250\312\347\342I\3215w\03561\303dgb/\006'
    >>> obj.decrypt(ciphertext)
    'Guido van Rossum is a space alien.XXXXXX'


References

    [1] NIST publication SP 800-38A, "Recommendation for Block Cipher
    Modes of Operation" (http://csrc.nist.gov/encryption/modes/)

    [2] Schneier, Bruce.  _Applied Cryptography_, Second Edition.
        John Wiley & Sons, 1996.

    [3] RFC2440: "OpenPGP Message Format" (http://rfc2440.x42.com,
    http://www.faqs.org/rfcs/rfc2440.html)


Changes

    2002-04: Removed references to stream ciphers; retitled PEP;
    prefixed feedback mode constants with MODE_; removed PGP feedback
    mode; added CTR and OFB feedback modes; clarified where numbers 
    are measured in bytes and where in bits.

    2002-09: Clarified the discussion of key length by using
    "variable-length keys" instead of "arbitrary-length".


Acknowledgements

    Thanks to the readers of the python-crypto list for their comments on
    this PEP.


Copyright

    This document has been placed in the public domain.



pep-0273 Import Modules from Zip Archives

PEP: 273
Title: Import Modules from Zip Archives
Version: $Revision$
Last-Modified: $Date$
Author: James C. Ahlstrom <jim at interet.com>
Status: Final
Type: Standards Track
Created: 11-Oct-2001
Python-Version: 2.3
Post-History: 26-Oct-2001

Abstract

    This PEP adds the ability to import Python modules
    *.py, *.py[co] and packages from zip archives.  The
    same code is used to speed up normal directory imports
    provided os.listdir is available.


Note

    Zip imports were added to Python 2.3, but the final implementation
    uses an approach different from the one described in this PEP.
    The 2.3 implementation is SourceForge patch #652586, which adds
    new import hooks described in PEP 302.  

    The rest of this PEP is therefore only of historical interest.

   

Specification

    Currently, sys.path is a list of directory names as strings.  If
    this PEP is implemented, an item of sys.path can be a string
    naming a zip file archive.  The zip archive can contain a
    subdirectory structure to support package imports.  The zip
    archive satisfies imports exactly as a subdirectory would.

    The implementation is in C code in the Python core and works on
    all supported Python platforms.

    Any files may be present in the zip archive, but only files
    *.py and *.py[co] are available for import.  Zip import of
    dynamic modules (*.pyd, *.so) is disallowed.

    Just as sys.path currently has default directory names, a default
    zip archive name is added too.  Otherwise there is no way to
    import all Python library files from an archive.


Subdirectory Equivalence

    The zip archive must be treated exactly as a subdirectory tree so
    we can support package imports based on current and future rules.
    All zip data is taken from the Central Directory; the data must
    be correct, and brain-dead zip files are not accommodated.

    Suppose sys.path contains "/A/B/SubDir" and "/C/D/E/Archive.zip",
    and we are trying to import modfoo from the Q package.  Then
    import.c will generate a list of paths and extensions and will
    look for the file.  The list of generated paths does not change
    for zip imports.  Suppose import.c generates the path
    "/A/B/SubDir/Q/R/modfoo.pyc".  Then it will also generate the path
    "/C/D/E/Archive.zip/Q/R/modfoo.pyc".  Finding the SubDir path is
    exactly equivalent to finding "Q/R/modfoo.pyc" in the archive.

    Suppose you zip up /A/B/SubDir/* and all its subdirectories.  Then
    your zip file will satisfy imports just as your subdirectory did.
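    This behaviour can be demonstrated with the zipimport mechanism
    that present-day Python ships with (the Q/modfoo names echo the
    PEP's example; the archive built here is a throwaway):

```python
import os
import sys
import tempfile
import zipfile

# Zip up a small package, put the archive on sys.path, and import
# from it exactly as if the archive were a subdirectory.
tmp = tempfile.mkdtemp()
archive = os.path.join(tmp, "Archive.zip")
with zipfile.ZipFile(archive, "w") as zf:
    zf.writestr("Q/__init__.py", "")
    zf.writestr("Q/modfoo.py", "VALUE = 42\n")

sys.path.insert(0, archive)   # the archive now satisfies imports
from Q import modfoo
print(modfoo.VALUE)           # 42
```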

    Well, not quite.  You can't satisfy dynamic modules from a zip
    file.  Dynamic modules have extensions like .dll, .pyd, and .so.
    They are operating system dependent, and probably can't be loaded
    except from a file.  It might be possible to extract the dynamic
    module from the zip file, write it to a plain file and load it.
    But that would mean creating temporary files, and dealing with all
    the dynload_*.c, and that's probably not a good idea.

    When trying to import *.pyc, if it is not available then
    *.pyo will be used instead.  And vice versa when looking for *.pyo.
    If neither *.pyc nor *.pyo is available, or if the magic numbers
    are invalid, then *.py will be compiled and used to satisfy the
    import, but the compiled file will not be saved.  Python would
    normally write it to the same directory as *.py, but surely we
    don't want to write to the zip file.  We could write to the
    directory of the zip archive, but that would clutter it up, not
    good if it is /usr/bin for example.

    Failing to write the compiled files will make zip imports very slow,
    and the user will probably not figure out what is wrong.  So it
    is best to put *.pyc and *.pyo in the archive with the *.py.


Efficiency

    The only way to find files in a zip archive is linear search.  So
    for each zip file in sys.path, we search for its names once, and
    put the names plus other relevant data into a static Python
    dictionary.  The key is the archive name from sys.path joined with
    the file name (including any subdirectories) within the archive.
    This is exactly the name generated by import.c, and makes lookup
    easy.

    This same mechanism is used to speed up directory (non-zip) imports.
    See below.


zlib

    Compressed zip archives require zlib for decompression.  Prior to
    any other imports, we attempt an import of zlib.  Import of
    compressed files will fail with a message "missing zlib" unless
    zlib is available.


Booting

    Python imports site.py itself, and this imports os, nt, ntpath,
    stat, and UserDict.  It also imports sitecustomize.py which may
    import more modules.  Zip imports must be available before site.py
    is imported.

    Just as there are default directories in sys.path, there must be
    one or more default zip archives too.

    The problem is what the name should be.  The name should be linked
    with the Python version, so the Python executable can correctly
    find its corresponding libraries even when there are multiple
    Python versions on the same machine.

    We add one name to sys.path.  On Unix, the directory is
    sys.prefix + "/lib", and the file name is
    "python%s%s.zip" % (sys.version[0], sys.version[2]).
    So for Python 2.2 and prefix /usr/local, the path
    /usr/local/lib/python2.2/ is already on sys.path, and
    /usr/local/lib/python22.zip would be added.
    On Windows, the file is the full path to python22.dll, with
    "dll" replaced by "zip".  The zip archive name is always inserted
    as the second item in sys.path.  The first is the directory of the
    main.py (thanks Tim).


Directory Imports

    The static Python dictionary used to speed up zip imports can be
    used to speed up normal directory imports too.  For each item in
    sys.path that is not a zip archive, we call os.listdir, and add
    the directory contents to the dictionary.  Then instead of calling
    fopen() in a double loop, we just check the dictionary.  This
    greatly speeds up imports.  If os.listdir doesn't exist, the
    dictionary is not used.


Benchmarks

    Case  Original 2.2a3    Using os.listdir   Zip Uncomp  Zip Compr
    ---- -----------------  -----------------  ----------  ----------
      1  3.2 2.5 3.2->1.02  2.3 2.5 2.3->0.87  1.66->0.93  1.5->1.07
      2  2.8 3.9 3.0->1.32  Same as Case 1.
      3  5.7 5.7 5.7->5.7   2.1 2.1 2.1->1.8   1.25->0.99  1.19->1.13
      4  9.4 9.4 9.3->9.35  Same as Case 3.

    Case 1: Local drive C:, sys.path has its default value.
    Case 2: Local drive C:, directory with files is at the end of sys.path.
    Case 3: Network  drive, sys.path has its default value.
    Case 4: Network  drive, directory with files is at the end of sys.path.

    Benchmarks were performed on a Pentium 4 clone, 1.4 GHz, 256 Meg.
    The machine was running Windows 2000 with a Linux/Samba network server.
    Times are in seconds, and are the time to import about 100 Lib modules.
    Case 2 and 4 have the "correct" directory moved to the end of sys.path.
    "Uncomp" means uncompressed zip archive, "Compr" means compressed.

    Initial times are after a re-boot of the system; the time after
    "->" is the time after repeated runs.  Times to import from C:
    after a re-boot are rather highly variable for the "Original" case,
    but are more realistic.


Custom Imports

    The logic demonstrates the ability to import using default searching
    until a needed Python module (in this case, os) becomes available.
    This can be used to bootstrap custom importers.  For example, if
    "importer()" in __init__.py exists, then it could be used for imports.
    The "importer()" can freely import os and other modules, and these
    will be satisfied from the default mechanism.  This PEP does not
    define any custom importers, and this note is for information only.


Implementation

    A C implementation is available as SourceForge patch 492105.
    Superseded by patch 652586 and current CVS.
    http://python.org/sf/492105

    A newer version (updated for recent CVS by Paul Moore) is 645650.
    Superseded by patch 652586 and current CVS.
    http://python.org/sf/645650

    A competing implementation by Just van Rossum is 652586, which is
    the basis for the final implementation of PEP 302.  PEP 273 has
    been implemented using PEP 302's import hooks.
    http://python.org/sf/652586


Copyright

    This document has been placed in the public domain.



pep-0274 Dict Comprehensions

PEP: 274
Title: Dict Comprehensions
Version: $Revision$
Last-Modified: $Date$
Author: Barry Warsaw <barry at python.org>
Status: Final
Type: Standards Track
Created: 25-Oct-2001
Python-Version: 2.7, 3.0 (originally 2.3)
Post-History: 29-Oct-2001

Abstract

    PEP 202 introduces a syntactical extension to Python called the
    "list comprehension" [1].  This PEP proposes a similar syntactical
    extension called the "dictionary comprehension" or "dict
    comprehension" for short.  You can use dict comprehensions in ways
    very similar to list comprehensions, except that they produce
    Python dictionary objects instead of list objects.


Resolution

    This PEP was originally written for inclusion in Python 2.3.  It
    was withdrawn after observation that substantially all of its
    benefits were subsumed by generator expressions coupled with the
    dict() constructor.

    However, Python 2.7 and 3.0 introduce this exact feature, as well
    as the closely related set comprehensions.  On 2012-04-09, the PEP
    was changed to reflect this reality by updating its Status to
    Accepted, and updating the Python-Version field.  The Open
    Questions section was also removed since these have been long
    resolved by the current implementation.


Proposed Solution

    Dict comprehensions are just like list comprehensions, except that
    you group the expression using curly braces instead of square
    brackets.  Also, the left part before the `for' keyword expresses
    both a key and a value, separated by a colon.  The notation is
    specifically designed to remind you of list comprehensions as
    applied to dictionaries.


Rationale

    There are times when you have some data arranged as a sequence of
    length-2 sequences, and you want to turn that into a dictionary.
    In Python 2.2, the dict() constructor accepts an argument that is
    a sequence of length-2 sequences, used as (key, value) pairs to
    initialize a new dictionary object.

    However, the act of turning some data into a sequence of length-2
    sequences can be inconvenient or inefficient from a memory or
    performance standpoint.  Also, for some common operations, such as
    turning a list of things into a set of things for quick duplicate
    removal or set inclusion tests, a better syntax can help code
    clarity.

    As with list comprehensions, an explicit for loop can always be
    used (and in fact was the only way to do it in earlier versions of
    Python).  But as with list comprehensions, dict comprehensions can
    provide a more syntactically succinct idiom than the traditional
    for loop.


Semantics

    The semantics of dict comprehensions can actually be demonstrated
    in stock Python 2.2, by passing a list comprehension to the
    built-in dictionary constructor:

    >>> dict([(i, chr(65+i)) for i in range(4)])

    is semantically equivalent to

    >>> {i : chr(65+i) for i in range(4)}

    The dictionary constructor approach has two distinct disadvantages
    from the proposed syntax though.  First, it isn't as legible as a
    dict comprehension.  Second, it forces the programmer to create an
    in-core list object first, which could be expensive.
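    The two spellings side by side (runnable on present-day Python,
    which includes the dict comprehension this PEP proposes):

```python
# Constructor-plus-list-comprehension versus dict comprehension;
# both build the same dictionary, but the comprehension is more
# legible and skips the intermediate in-core list.
via_constructor = dict([(i, chr(65 + i)) for i in range(4)])
via_comprehension = {i: chr(65 + i) for i in range(4)}
print(via_comprehension)  # {0: 'A', 1: 'B', 2: 'C', 3: 'D'}
```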


Examples

    >>> print {i : chr(65+i) for i in range(4)}
    {0 : 'A', 1 : 'B', 2 : 'C', 3 : 'D'}

    >>> print {k : v for k, v in someDict.iteritems()} == someDict.copy()
    1

    >>> print {x.lower() : 1 for x in list_of_email_addrs}
    {'barry@zope.com'   : 1, 'barry@python.org' : 1, 'guido@python.org' : 1}

    >>> def invert(d):
    ...     return {v : k for k, v in d.iteritems()}
    ...
    >>> d = {0 : 'A', 1 : 'B', 2 : 'C', 3 : 'D'}
    >>> print invert(d)
    {'A' : 0, 'B' : 1, 'C' : 2, 'D' : 3}

    >>> {(k, v): k+v for k in range(4) for v in range(4)}
    {(3, 3): 6, (3, 2): 5, (3, 1): 4, (0, 1): 1, (2, 1): 3,
     (0, 2): 2, (3, 0): 3, (0, 3): 3, (1, 1): 2, (1, 0): 1,
     (0, 0): 0, (1, 2): 3, (2, 0): 2, (1, 3): 4, (2, 2): 4,
     (2, 3): 5}


Implementation

    All implementation details were resolved in the Python 2.7 and 3.0
    time-frame. 


References

    [1] PEP 202, List Comprehensions
        http://www.python.org/dev/peps/pep-0202/


Copyright

    This document has been placed in the public domain.



pep-0275 Switching on Multiple Values

PEP: 275
Title: Switching on Multiple Values
Version: $Revision$
Last-Modified: $Date$
Author: Marc-AndrĂŠ Lemburg <mal at lemburg.com>
Status: Rejected
Type: Standards Track
Created: 10-Nov-2001
Python-Version: 2.6
Post-History: 

Rejection Notice

    A similar PEP for Python 3000, PEP 3103 [2], was already rejected,
    so this proposal has no chance of being accepted either.

Abstract

    This PEP proposes strategies to enhance Python's performance
    with respect to handling switching on a single variable having
    one of multiple possible values.

Problem

    Up to Python 2.5, the typical way of writing multi-value switches 
    has been to use long switch constructs of the following type:

    if x == 'first state':
        ...
    elif x == 'second state':
        ...
    elif x == 'third state':
        ...
    elif x == 'fourth state':
        ...
    else:
        # default handling
        ...

    This works fine for short switch constructs, since the overhead of
    repeated loading of a local (the variable x in this case) and
    comparing it to some constant is low (it has a complexity of O(n)
    on average). However, when using such a construct to write a state
    machine such as is needed for writing parsers, the number of
    possible states can easily reach 10 or more cases.

    The current solution to this problem lies in using a dispatch
    table to find the method implementing a given case, depending on
    the value of the switch variable (this can be tuned to have a
    complexity of O(1) on average, e.g. by using perfect hash
    tables). This works well for state machines which require complex
    and lengthy processing in the different case methods. It does not
    perform well for ones which only process one or two instructions
    per case, e.g.

    def handle_data(self, data):
        self.stack.append(data)
 
    A nice example of this is the state machine implemented in
    pickle.py which is used to serialize Python objects. Other
    prominent cases include XML SAX parsers and Internet protocol
    handlers.
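    The dispatch-table approach described above can be sketched in a
    few lines; the opcodes and handler names below are illustrative
    only, not taken from pickle.py:

```python
# Sketch of the dispatch-table technique: each opcode maps to a handler
# function, so dispatch is a single dict lookup (O(1) on average) instead
# of a walk down an if-elif chain.  Opcodes here are illustrative.

def run(program):
    stack = []

    def push(arg):
        stack.append(arg)

    def add(arg):
        right, left = stack.pop(), stack.pop()
        stack.append(left + right)

    dispatch = {'push': push, 'add': add}
    for opcode, arg in program:
        dispatch[opcode](arg)  # one lookup replaces the if-elif chain
    return stack

run([('push', 1), ('push', 2), ('add', None)])  # leaves [3] on the stack
```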

Proposed Solutions

    This PEP proposes two different but not necessarily conflicting
    solutions:

    1. Adding an optimization to the Python compiler and VM
       which detects the above if-elif-else construct and
       generates special opcodes for it which use a read-only
       dictionary for storing jump offsets.

    2. Adding new syntax to Python which mimics the C style
       switch statement.

    The first solution has the benefit of not relying on adding new
    keywords to the language, while the second looks cleaner. Both
    involve some run-time overhead to assure that the switching
    variable is immutable and hashable.

    Both solutions use a dictionary lookup to find the right
    jump location, so they both share the same problem space in
    terms of requiring that both the switch variable and the
    constants be compatible with the dictionary implementation
    (hashable, comparable, a==b => hash(a)==hash(b)).

Solution 1: Optimizing if-elif-else

     Implementation:

         It should be possible for the compiler to detect an
         if-elif-else construct which has the following signature:

                      if x == 'first':...
                      elif x == 'second':...
                      else:...

         i.e. the left hand side always references the same variable,
         the right hand side a hashable immutable builtin type.  The
         right hand sides need not be all of the same type, but they
         should be comparable to the type of the left hand switch
         variable.

         The compiler could then setup a read-only (perfect) hash
         table, store it in the constants and add an opcode SWITCH in
         front of the standard if-elif-else byte code stream which
         triggers the following run-time behaviour:

         At runtime, SWITCH would check x for being one of the
         well-known immutable types (strings, unicode, numbers) and
         use the hash table for finding the right opcode snippet. If
         this condition is not met, the interpreter should revert to
         the standard if-elif-else processing by simply skipping the
         SWITCH opcode and proceeding with the usual if-elif-else byte
         code stream.
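         A pure-Python model of this run-time rule (all names here are
         hypothetical; the actual proposal operates at the byte-code
         level):

```python
# Model of the proposed SWITCH behaviour: fast-path via a read-only jump
# table for well-known immutable types, falling back to the ordinary
# if-elif-else chain otherwise.  All names are illustrative.

def switch_dispatch(x, jump_table, if_elif_fallback):
    if isinstance(x, (int, float, str)):   # "well-known immutable types"
        handler = jump_table.get(x)
        if handler is not None:
            return handler()
    return if_elif_fallback(x)             # skip SWITCH, run the chain

table = {'first': lambda: 1, 'second': lambda: 2}
fallback = lambda x: 0
switch_dispatch('second', table, fallback)            # fast path
switch_dispatch(['not', 'immutable'], table, fallback)  # fallback path
```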

     Issues:

         The new optimization should not change the current Python
         semantics (by reducing the number of __cmp__ calls and adding
         __hash__ calls in if-elif-else constructs which are affected
         by the optimization). To ensure this, switching can only
         safely be implemented either if a "from __future__" style
         flag is used, or the switching variable is one of the builtin
         immutable types: int, float, string, unicode, etc. (not
         subtypes, since it's not clear whether these are still
         immutable or not)

         To prevent post-modifications of the jump-table dictionary
         (which could be used to reach protected code), the jump-table
         will have to be a read-only type (e.g. a read-only
         dictionary).

         The optimization should only be used for if-elif-else
         constructs which have a minimum number of n cases (where n is
         a number which has yet to be defined depending on performance
         tests).

Solution 2: Adding a switch statement to Python

     New Syntax:

         switch EXPR:
             case CONSTANT:
                 SUITE  
             case CONSTANT:
                 SUITE  
             ...
             else:
                 SUITE  

         (modulo indentation variations)

         The "else" part is optional. If no else part is given and
         none of the defined cases matches, no action is taken and 
         the switch statement is ignored. This is in line with the
         current if-behaviour. A user who wants to signal this
         situation using an exception can define an else-branch
         which then implements the intended action.

         Note that the constants need not be all of the same type, but 
         they should be comparable to the type of the switch variable.

     Implementation:

         The compiler would have to compile this into byte code
         similar to this:

          def whatis(x):
              switch(x):
                  case 'one': 
                      print '1'
                  case 'two': 
                      print '2'
                  case 'three': 
                      print '3'
                  else: 
                      print "D'oh!"

         into (omitting POP_TOP's and SET_LINENO's):

           6  LOAD_FAST         0 (x)
           9  LOAD_CONST        1 (switch-table-1)
          12  SWITCH            26 (to 38)

          14  LOAD_CONST        2 ('1')
          17  PRINT_ITEM
          18  PRINT_NEWLINE
          19  JUMP 43

          22  LOAD_CONST        3 ('2')
          25  PRINT_ITEM
          26  PRINT_NEWLINE
          27  JUMP 43

          30  LOAD_CONST        4 ('3')
          33  PRINT_ITEM
          34  PRINT_NEWLINE
          35  JUMP 43

          38  LOAD_CONST        5 ("D'oh!")
          41  PRINT_ITEM
          42  PRINT_NEWLINE

        >>43  LOAD_CONST        0 (None)
          46  RETURN_VALUE
        
        Where the 'SWITCH' opcode would jump to 14, 22, 30 or 38
        depending on 'x'.

        Thomas Wouters has written a patch which demonstrates the
        above. You can download it from [1].

    Issues:

        The switch statement should not implement fall-through
        behaviour (as the switch statement in C does). Each case
        defines a complete and independent suite, much like in an
        if-elif-else statement. This also enables using break in
        switch statements inside loops.

        If the interpreter finds that the switch variable x is
        not hashable, it should raise a TypeError at run-time
        pointing out the problem.
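        This matches what dictionaries already enforce today: looking
        up an unhashable key raises TypeError, which is the behaviour
        the proposed SWITCH would surface for the switch variable:

```python
# A dict lookup with an unhashable key already raises TypeError; the
# proposed SWITCH opcode would surface the same error for an unhashable
# switch variable.
jump_table = {'one': 1, 'two': 2}

def lookup(x):
    try:
        return jump_table[x]
    except TypeError:
        return 'TypeError: unhashable'
    except KeyError:
        return 'no such case'

lookup('one')           # -> 1
lookup(['a', 'list'])   # -> 'TypeError: unhashable'
```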

        There have been other proposals for the syntax which reuse
        existing keywords and avoid adding two new ones ("switch" and
        "case"). Others have argued that the keywords should use new
        terms to avoid confusion with the C keywords of the same name
        but slightly different semantics (e.g. fall-through without
        break). Some of the proposed variants:

            case EXPR:
                of CONSTANT:
                    SUITE  
                of CONSTANT:
                    SUITE  
                else:
                    SUITE  

            case EXPR:
                if CONSTANT:
                     SUITE  
                if CONSTANT:
                    SUITE  
                else:
                    SUITE  

            when EXPR:
                in CONSTANT_TUPLE:
                    SUITE  
                in CONSTANT_TUPLE:
                    SUITE  
                ...
            else:
                 SUITE  
        
        The switch statement could be extended to allow multiple
        values for one section (e.g. case 'a', 'b', 'c': ...). Another
        proposed extension would allow ranges of values (e.g. case
        10..14: ...). These should probably be postponed, but already
        kept in mind when designing and implementing a first version.
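        Both extensions can already be emulated with a plain dict
        whose keys share one handler; the sketch below illustrates
        the idea, not the proposed syntax:

```python
# Emulating "case 'a', 'b', 'c':" and "case 10..14:" with a dict whose
# keys share a handler; the handler names are illustrative.
def letters():
    return 'letter case'

def teens():
    return 'range case'

table = {}
for key in ('a', 'b', 'c'):      # case 'a', 'b', 'c':
    table[key] = letters
for key in range(10, 15):        # case 10..14:
    table[key] = teens

table['b']()   # -> 'letter case'
table[12]()    # -> 'range case'
```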

Examples:

    The following examples all use a new syntax as proposed by
    solution 2. However, all of these examples would work with
    solution 1 as well.

         switch EXPR:                   switch x:
             case CONSTANT:                 case "first":
                 SUITE                          print x
             case CONSTANT:                 case "second":
                 SUITE                          x = x**2
             ...                                print x
             else:                          else:
                 SUITE                          print "whoops!"


         case EXPR:                     case x:
             of CONSTANT:                   of "first":
                 SUITE                          print x
             of CONSTANT:                   of "second":
                 SUITE                          print x**2
             else:                          else:
                 SUITE                          print "whoops!"


         case EXPR:                     case state:
             if CONSTANT:                   if "first":
                  SUITE                         state = "second"
             if CONSTANT:                   if "second":
                 SUITE                          state = "third"
             else:                          else:
                 SUITE                          state = "first"


         when EXPR:                     when state:
             in CONSTANT_TUPLE:             in ("first", "second"):
                 SUITE                          print state
             in CONSTANT_TUPLE:                 state = next_state(state)
                 SUITE                      in ("seventh",):
             ...                                print "done"
         else:                                  break    # out of loop!
              SUITE                     else:
                                            print "middle state"
                                            state = next_state(state)

    Here's another nice application found by Jack Jansen (switching
    on argument types):

         switch type(x).__name__:
             case 'int':
                 SUITE
             case 'string':
                 SUITE

Scope

     XXX Explain "from __future__ import switch"

Credits

    Martin von Löwis (issues with the optimization idea)
    Thomas Wouters (switch statement + byte code compiler example)
    Skip Montanaro (dispatching ideas, examples)
    Donald Beaudry (switch syntax)
    Greg Ewing (switch syntax)
    Jack Jansen (type switching examples)

References

    [1] https://sourceforge.net/tracker/index.php?func=detail&aid=481118&group_id=5470&atid=305470
    [2] http://www.python.org/dev/peps/pep-3103

Copyright

    This document has been placed in the public domain.


pep-0276 Simple Iterator for ints

PEP: 276
Title: Simple Iterator for ints
Version: $Revision$
Last-Modified: $Date$
Author: Jim Althoff <james_althoff at i2.com>
Status: Rejected
Type: Standards Track
Created: 12-Nov-2001
Python-Version: 2.3
Post-History: 

Abstract

    Python 2.1 added new functionality to support iterators [1].
    Iterators have proven to be useful and convenient in many coding
    situations.  It is noted that the implementation of Python's
    for-loop control structure uses the iterator protocol as of
    release 2.1.  It is also noted that Python provides iterators for
    the following builtin types: lists, tuples, dictionaries, strings,
    and files.  This PEP proposes the addition of an iterator for the
    builtin type int (types.IntType).  Such an iterator would simplify
    the coding of certain for-loops in Python.

BDFL Pronouncement

    This PEP was rejected on 17 June 2005 with a note to python-dev.

    Much of the original need was met by the enumerate() function which
    was accepted for Python 2.3.

    Also, the proposal both allowed and encouraged misuses such as:

        >>> for i in 3: print i
        0
        1
        2

    Likewise, it was not helpful that the proposal would disable the
    syntax error in statements like:

        x, = 1

Specification

    Define an iterator for types.IntType (i.e., the builtin type
    "int") that is returned from the builtin function "iter" when
    called with an instance of types.IntType as the argument.

    The returned iterator has the following behavior:

    - Assume that object i is an instance of types.intType (the
      builtin type int) and that i > 0

    - iter(i) returns an iterator object

    - said iterator object iterates through the sequence of ints
      0,1,2,...,i-1

    Example:

        iter(5) returns an iterator object that iterates through the
        sequence of ints 0,1,2,3,4

    - if i <= 0, iter(i) returns an "empty" iterator, i.e., one that
      throws StopIteration upon the first call of its "next" method

    In other words, the conditions and semantics of said iterator are
    consistent with the conditions and semantics of the range() and
    xrange() functions.

    Note that the sequence 0,1,2,...,i-1 associated with the int i is
    considered "natural" in the context of Python programming because
    it is consistent with the builtin indexing protocol of sequences
    in Python.  Python lists and tuples, for example, are indexed
    starting at 0 and ending at len(object)-1 (when using positive
    indices).  In other words, such objects are indexed with the
    sequence 0,1,2,...,len(object)-1
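    The specified behaviour can be prototyped with an int subclass
    (a sketch only; the PEP proposes this for the builtin int itself,
    and the class name below is illustrative):

```python
# Prototype of the proposed iterator via an int subclass.
class IterInt(int):
    def __iter__(self):
        # Iterate 0, 1, ..., self-1; empty for self <= 0, matching range().
        return iter(range(max(int(self), 0)))

list(IterInt(5))    # -> [0, 1, 2, 3, 4]
list(IterInt(-3))   # -> []
```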


Rationale

    A common programming idiom is to take a collection of objects and
    apply some operation to each item in the collection in some
    established sequential order.  Python provides the "for in"
    looping control structure for handling this common idiom.  Cases
    arise, however, where it is necessary (or more convenient) to
    access each item in an "indexed" collection by iterating through
    each index and accessing each item in the collection using the
    corresponding index.

    For example, one might have a two-dimensional "table" object where one
    requires the application of some operation to the first column of
    each row in the table.  Depending on the implementation of the table
    it might not be possible to access first each row and then each
    column as individual objects.  It might, rather, be possible to
    access a cell in the table using a row index and a column index.
    In such a case it is necessary to use an idiom where one iterates
    through a sequence of indices (indexes) in order to access the
    desired items in the table.  (Note that the commonly used
    DefaultTableModel class in Java-Swing-Jython has this very protocol).

    Another common example is where one needs to process two or more
    collections in parallel.  Another example is where one needs to
    access, say, every second item in a collection.

    There are many other examples where access to items in a
    collection is facilitated by a computation on an index thus
    necessitating access to the indices rather than direct access to
    the items themselves.

    Let's call this idiom the "indexed for-loop" idiom.  Some
    programming languages provide builtin syntax for handling this
    idiom.  In Python the common convention for implementing the
    indexed for-loop idiom is to use the builtin range() or xrange()
    function to generate a sequence of indices as in, for example:

       for rowcount in range(table.getRowCount()):
           print table.getValueAt(rowcount, 0)

    or

       for rowcount in xrange(table.getRowCount()):
           print table.getValueAt(rowcount, 0)

    From time to time there are discussions in the Python community
    about the indexed for-loop idiom.  It is sometimes argued that the
    need for using the range() or xrange() function for this design
    idiom is:

    - Not obvious (to new-to-Python programmers),

    - Error prone (easy to forget, even for experienced Python
      programmers)

    - Confusing and distracting for those who feel compelled to understand
      the differences and recommended usage of xrange() vis-a-vis range()

    - Unwieldy, especially when combined with the len() function,
      i.e., xrange(len(sequence))

    - Not as convenient as equivalent mechanisms in other languages,

    - Annoying, a "wart", etc.

    And from time to time proposals are put forth for ways in which
    Python could provide a better mechanism for this idiom.  Recent
    examples include PEP 204, "Range Literals", and PEP 212, "Loop
    Counter Iteration".

    Most often, such proposals include changes to Python's syntax and
    other "heavyweight" changes.

    Part of the difficulty here is that advocating new syntax implies
    a comprehensive solution for "general indexing" that has to
    include aspects like:

    - starting index value

    - ending index value

    - step value

    - open intervals versus closed intervals versus half opened intervals

    Finding a new syntax that is comprehensive, simple, general,
    Pythonic, appealing to many, easy to implement, not in conflict
    with existing structures, not excessively overloading existing
    structures, etc. has proven to be more difficult than one might
    anticipate.

    The proposal outlined in this PEP tries to address the problem by
    suggesting a simple "lightweight" solution that helps the most
    common case by using a proven mechanism that is already available
    (as of Python 2.1): namely, iterators.

    Because for-loops already use "iterator" protocol as of Python
    2.1, adding an iterator for types.IntType as proposed in this PEP
    would enable by default the following shortcut for the indexed
    for-loop idiom:

       for rowcount in table.getRowCount():
           print table.getValueAt(rowcount, 0)

    The following benefits for this approach vis-a-vis the current
    mechanism of using the range() or xrange() functions are claimed
    to be:

    - Simpler,

    - Less cluttered,

    - Focuses on the problem at hand without the need to resort to
      secondary implementation-oriented functions (range() and
      xrange())

    And compared to other proposals for change:

    - Requires no new syntax

    - Requires no new keywords

    - Takes advantage of the new and well-established iterator mechanism

    And generally:

    -  Is consistent with iterator-based "convenience" changes already
       included (as of Python 2.1) for other builtin types such as:
       lists, tuples, dictionaries, strings, and files.


Backwards Compatibility

    The proposed mechanism is generally backwards compatible as it
    calls for neither new syntax nor new keywords.  All existing,
    valid Python programs should continue to work unmodified.

    However, this proposal is not perfectly backwards compatible in
    the sense that certain statements that are currently invalid
    would, under the current proposal, become valid.

    Tim Peters has pointed out two such examples:

    1) The common case where one forgets to include range() or
       xrange(), for example:

          for rowcount in table.getRowCount():
              print table.getValueAt(rowcount, 0)

       in Python 2.2 raises a TypeError exception.

       Under the current proposal, the above statement would be valid
       and would work as (presumably) intended.  Presumably, this is a
       good thing.

       As noted by Tim, this is the common case of the "forgotten
       range" mistake (which one currently corrects by adding a call
       to range() or xrange()).

    2) The (hopefully) very uncommon case where one makes a typing
       mistake when using tuple unpacking.  For example:

           x, = 1

       in Python 2.2 raises a TypeError exception.

       Under the current proposal, the above statement would be valid
       and would set x to 0.  The PEP author has no data as to how
       common this typing error is nor how difficult it would be to
       catch such an error under the current proposal.  He imagines
       that it does not occur frequently and that it would be
       relatively easy to correct should it happen.


Issues:

    Extensive discussions concerning PEP 276 on the Python interest
    mailing list suggest a range of opinions: some in favor, some
    neutral, some against.  Those in favor tend to agree with the
    claims above of the usefulness, convenience, ease of learning,
    and simplicity of a simple iterator for integers.

    Issues with PEP 276 include:

    - Using range/xrange is fine as is.

      Response: Some posters feel this way.  Others disagree.

    - Some feel that iterating over the sequence "0, 1, 2, ..., n-1"
      for an integer n is not intuitive.  "for i in 5:" is considered
      (by some) to be "non-obvious", for example.  Some dislike this
      usage because it doesn't have "the right feel".  Some dislike it
      because they believe that this type of usage forces one to view
      integers as sequences and this seems wrong to them.  Some
      dislike it because they prefer to view for-loops as dealing
      with explicit sequences rather than with arbitrary iterators.

      Response: Some like the proposed idiom and see it as simple,
      elegant, easy to learn, and easy to use.  Some are neutral on
      this issue.  Others, as noted, dislike it.

    - Is it obvious that iter(5) maps to the sequence 0,1,2,3,4?

      Response: Given, as noted above, that Python has a strong
      convention for indexing sequences starting at 0 and stopping at
      (inclusively) the index whose value is one less than the length
      of the sequence, it is argued that the proposed sequence is
      reasonably intuitive to the Python programmer while being useful
      and practical.  More importantly, it is argued that once learned
      this convention is very easy to remember.  Note that the doc
      string for the range function makes a reference to the
      natural and useful association between range(n) and the indices
      for a list whose length is n.

    - Possible ambiguity

          for i in 10: print i

      might be mistaken for

          for i in (10,): print i

      Response: This is exactly the same situation with strings in
      current Python (replace 10 with 'spam' in the above, for
      example).

    - Too general: in the newest releases of Python there are
      contexts -- as with for-loops -- where iterators are called
      implicitly.  Some fear that having an iterator invoked for
      an integer in one of these contexts (excluding for-loops) might
      lead to unexpected behavior and bugs.  The "x, = 1" example
      noted above is a case in point.

      Response: From the author's perspective the examples of the
      above that were identified in the PEP 276 discussions did
      not appear to be ones that would be accidentally misused
      in ways that would lead to subtle and hard-to-detect errors.

      In addition, it seems that there is a way to deal with this
      issue by using a variation of what is outlined in the
      specification section of this proposal.  Instead of adding
      an __iter__ method to class int, change the for-loop handling
      code to convert (in essence) from

          for i in n:  # when isinstance(n,int) is 1

      to

          for i in xrange(n):

      This approach gives the same results in a for-loop as an
      __iter__ method would but would prevent iteration on integer
      values in any other context.  Lists and tuples, for example,
      don't have __iter__ and are handled with special code.
      Integer values would be one more special case.

    - "i in n" seems very unnatural.

      Response: Some feel that "i in len(mylist)" would be easily
      understandable and useful.  Some don't like it, particularly
      when a literal is used as in "i in 5".  If the variant
      mentioned in the response to the previous issue is implemented,
      this issue is moot.  If not, then one could also address this
      issue by defining a __contains__ method in class int that would
      always raise a TypeError.  This would then make the behavior of
      "i in n" identical to that of current Python.

    - Might dissuade newbies from using the indexed for-loop idiom when
      the standard "for item in collection:" idiom is clearly better.

      Response: The standard idiom is so nice when it fits that it
      needs neither extra "carrot" nor "stick".  On the other hand,
      one does notice cases of overuse/misuse of the standard idiom
      (due, most likely, to the awkwardness of the indexed for-loop
      idiom), as in:

       for item in sequence:
           print sequence.index(item)
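      That misuse is quadratic and gives wrong answers for duplicate
      items; enumerate(), accepted for Python 2.3, expresses the
      intent directly:

```python
# sequence.index(item) rescans from the front on every iteration, so the
# loop above is O(n**2) overall and returns the first match for
# duplicate items; enumerate() yields the true index.
seq = ['a', 'b', 'a']
via_index = [seq.index(item) for item in seq]      # [0, 1, 0] -- wrong
via_enumerate = [i for i, item in enumerate(seq)]  # [0, 1, 2]
```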

    - Why not propose even bigger changes?

    The majority of disagreement with PEP 276 came from those who
    favor much larger changes to Python to address the more general
    problem of specifying a sequence of integers where such
    a specification is general enough to handle the starting value,
    ending value, and stepping value of the sequence and also
    addresses variations of open, closed, and half-open (half-closed)
    integer intervals.  Many suggestions of such were discussed.

    These include:

    - adding Haskell-like notation for specifying a sequence of
      integers in a literal list,

    - various uses of slicing notation to specify sequences,

    - changes to the syntax of for-in loops to allow the use of
      relational operators in the loop header,

    - creation of an integer-interval class along with methods that
      overload relational operators or division operators
      to provide "slicing" on integer-interval objects,

    - and more.

    It should be noted that there was much debate but not an
    overwhelming consensus for any of these larger-scale suggestions.

    Clearly, PEP 276 does not propose such a large-scale change
    and instead focuses on a specific problem area.  Towards the
    end of the discussion period, several posters expressed favor
    for the narrow focus and simplicity of PEP 276 vis-a-vis the more
    ambitious suggestions that were advanced.  There did appear to be
    consensus for the need for a PEP for any such larger-scale,
    alternative suggestion.  In light of this recognition, details of
    the various alternative suggestions are not discussed here further.


Implementation

    An implementation is not available at this time but is expected
    to be straightforward.  The author has implemented a subclass of
    int with an __iter__ method (written in Python) as a means to test
    out the ideas in this proposal, however.


References

    [1] PEP 234, Iterators
    http://www.python.org/dev/peps/pep-0234/

    [2] PEP 204, Range Literals
    http://www.python.org/dev/peps/pep-0204/

    [3] PEP 212, Loop Counter Iteration
    http://www.python.org/dev/peps/pep-0212/


Copyright

    This document has been placed in the public domain.



pep-0277 Unicode file name support for Windows NT

PEP: 277
Title: Unicode file name support for Windows NT
Version: $Revision$
Last-Modified: $Date$
Author: Neil Hodgson <neilh at scintilla.org>
Status: Final
Type: Standards Track
Created: 11-Jan-2002
Python-Version: 2.3
Post-History: 

Abstract

    This PEP discusses supporting access to all files possible on
    Windows NT by passing Unicode file names directly to the system's
    wide-character functions.


Rationale

    Python 2.2 on Win32 platforms converts Unicode file names passed
    to open and to functions in the os module into the 'mbcs' encoding
    before passing the result to the operating system.  This is often
    successful in the common case where the script is operating with
    the locale set to the same value as when the file was created.
    Most machines are set up as one locale and rarely if ever changed
    from this locale.  For some users, locale is changed more often
    and on servers there are often files saved by users using
    different locales.

    On Windows NT and descendent operating systems, including Windows
    2000 and Windows XP, wide-character APIs are available that
    provide direct access to all file names, including those that are
    not representable using the current locale.  The purpose of this
    proposal is to provide access to these wide-character APIs through
    the standard Python file object and posix module and so provide
    access to all files on Windows NT.


Specification

    On Windows platforms which provide wide-character file APIs, when
    Unicode arguments are provided to file APIs, wide-character calls
    are made instead of the standard C library and posix calls.

    The Python file object is extended to use a Unicode file name
    argument directly rather than converting it.  This affects the
    file object constructor file(filename[, mode[, bufsize]]) and also
    the open function which is an alias of this constructor.  When a
    Unicode filename argument is used here then the name attribute of
    the file object will be Unicode.  The representation of a file
    object, repr(f) will display Unicode file names as an escaped
    string in a similar manner to the representation of Unicode
    strings.

    The posix module contains functions that take file or directory
    names: chdir, listdir, mkdir, open, remove, rename, rmdir, stat,
    and _getfullpathname.  These will use Unicode arguments directly
    rather than converting them.  For the rename function, this
    behaviour is triggered when either of the arguments is Unicode,
    and the other argument is converted to Unicode using the default
    encoding.

    The listdir function currently returns a list of strings.  Under
    this proposal, it will return a list of Unicode strings when its
    path argument is Unicode.


Restrictions

    On the consumer Windows operating systems, Windows 95, Windows 98,
    and Windows ME, there are no wide-character file APIs so behaviour
    is unchanged under this proposal.  It may be possible in the
    future to extend this proposal to cover these operating systems as
    the VFAT-32 file system used by them does support Unicode file
    names but access is difficult and so implementing this would
    require much work.  The "Microsoft Layer for Unicode" could be a
    starting point for implementing this.

    Python can be compiled with the size of Unicode characters set to
    4 bytes rather than 2 by defining PY_UNICODE_TYPE to be a 4 byte
    type and Py_UNICODE_SIZE to be 4.  As the Windows API does not
    accept 4 byte characters, the features described in this proposal
    will not work in this mode so the implementation falls back to the
    current 'mbcs' encoding technique. This restriction could be lifted
    in the future by performing extra conversions using
    PyUnicode_AsWideChar but for now that would add too much
    complexity for a very rarely used feature.


Reference Implementation

    An experimental implementation is available from
    [2] http://scintilla.sourceforge.net/winunichanges.zip

    [3] An updated version is available at
        http://python.org/sf/594001


References

    [1] Microsoft Windows APIs
        http://msdn.microsoft.com/


Copyright

    This document has been placed in the public domain.



pep-0278 Universal Newline Support

PEP: 278
Title: Universal Newline Support
Version: $Revision$
Last-Modified: $Date$
Author: Jack Jansen <jack at cwi.nl>
Status: Final
Type: Standards Track
Created: 14-Jan-2002
Python-Version: 2.3
Post-History: 

Abstract

    This PEP discusses a way in which Python can support I/O on files
    which have a newline format that is not the native format on the
    platform, so that Python on each platform can read and import
    files with CR (Macintosh), LF (Unix) or CR LF (Windows) line
    endings.

    It is more and more common to come across files that have an end
    of line that does not match the standard on the current platform:
    files downloaded over the net, remotely mounted filesystems on a
    different platform, Mac OS X with its double standard of Mac and
    Unix line endings, etc.
    
    Many tools such as editors and compilers already handle this
    gracefully; it would be good if Python did so too.


Specification

    Universal newline support is enabled by default, but can be
    disabled when Python is configured.
    
    In a Python with universal newline support the feature is
    automatically enabled for all import statements and execfile()
    calls. There is no special support for eval() or exec.
    
    In a Python with universal newline support, the mode parameter to
    open() can also be "U", meaning "open for input as a text file
    with universal newline interpretation".  Mode "rU" is also allowed,
    for symmetry with "rb". Mode "U" cannot be
    combined with other mode flags such as "+". Any line ending in the
    input file will be seen as a '\n' in Python, so little other code has
    to change to handle universal newlines.
    
    Conversion of newlines happens in all calls that read data: read(),
    readline(), readlines(), etc.
    
    There is no special support for output to files with a different
    newline convention, and so mode "wU" is also illegal.
    
    A file object that has been opened in universal newline mode gets
    a new attribute "newlines" which reflects the newline convention
    used in the file.  The value for this attribute is one of None (no
    newline read yet), "\r", "\n", "\r\n" or a tuple containing all the
    newline types seen.
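    This behaviour carries forward into modern Python, where universal
    newline handling became the default for text-mode files; a small
    sketch of the "newlines" attribute (Python 3 syntax, shown for
    illustration):

```python
import os
import tempfile

# Sketch: the "newlines" attribute on a file read with universal
# newline handling (the default for text mode in Python 3).
fd, path = tempfile.mkstemp()
os.close(fd)
with open(path, "wb") as f:
    f.write(b"mac\runix\nwindows\r\n")   # three newline conventions

with open(path) as f:       # text mode: every ending is read as '\n'
    lines = f.readlines()
    seen = f.newlines       # tuple of all newline styles encountered

os.remove(path)
print(lines)                # ['mac\n', 'unix\n', 'windows\n']
print(sorted(seen))         # ['\n', '\r', '\r\n']
```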

    

Rationale

    Universal newline support is implemented in C, not in Python.
    This is done because we want files with a foreign newline
    convention to be import-able, so a Python Lib directory can be
    shared over a remote file system connection, or between MacPython
    and Unix-Python on Mac OS X.  For this to be feasible the
    universal newline convention needs to have a reasonably small
    impact on performance, which means a Python implementation is not
    an option as it would bog down all imports. And because of files
    with multiple newline conventions, which Visual C++ and other
    Windows tools will happily produce, doing a quick check for the
    newlines used in a file (handing off the import to C code if a
    platform-local newline is seen) will not work.  Finally, a C
    implementation also allows tracebacks and such (which open the
    Python source module) to be handled easily.
    
    There is no output implementation of universal newlines, Python
    programs are expected to handle this by themselves or write files
    with platform-local convention otherwise.  The reason for this is
    that input is the difficult case, outputting different newlines to
    a file is already easy enough in Python.
    
    Also, an output implementation would be much more difficult than an
    input implementation, surprisingly: a lot of output is done through
    PyXXX_Print() methods, and at this point the file object is not
    available anymore, only a FILE *. So, an output implementation would
    need to somehow go from the FILE* to the file object, because that
    is where the current newline delimiter is stored.

    The input implementation has no such problem: there are no cases in
    the Python source tree where files are partially read from C,
    partially from Python, and such cases are expected to be rare in
    extension modules. If such cases exist the only problem is that the
    newlines attribute of the file object is not updated during the
    fread() or fgets() calls that are done direct from C.

    A partial output implementation, where strings passed to fp.write()
    would be converted to use fp.newlines as their line terminator but
    all other output would not be, is far too surprising, in my view.

    Because there is no output support for universal newlines there is
    also no support for a mode "rU+": the surprise factor of the
    previous paragraph would hold to an even stronger degree.

    There is no support for universal newlines in strings passed to
    eval() or exec.  It is envisioned that such strings always have the
    standard \n line feed; if the strings come from a file, that file
    can be read with universal newlines.

    I think there are no special issues with Unicode.  UTF-16 shouldn't
    pose any new problems, as such files need to be opened in binary
    mode anyway.  Interaction with UTF-8 is fine too: the values 0x0a
    and 0x0d cannot occur as part of a multibyte sequence.

    Universal newline files should work fine with iterators and
    xreadlines() as these eventually call the normal file
    readline/readlines methods.

    
    While universal newlines are automatically enabled for import they
    are not for opening, where you have to specifically say open(...,
    "U"). This is open to debate, but here are a few reasons for this
    design:

    - Compatibility.  Programs which already do their own
      interpretation of \r\n in text files would break. Examples of such
      programs would be editors which warn you when you open a file with
      a different newline convention. If universal newlines were made the
      default such an editor would silently convert your line endings to
      the local convention on save. Programs which open binary files as
      text files on Unix would also break (but it could be argued they
      deserve it :-).
      
    - Interface clarity.  Universal newlines are only supported for
      input files, not for input/output files, as the semantics would
      become muddy.  Would you write Mac newlines if all reads so far
      had encountered Mac newlines?  But what if you then later read a
      Unix newline?
    
    The newlines attribute is included so that programs that really
    care about the newline convention, such as text editors, can
    examine what was in a file.  They can then save (a copy of) the
    file with the same newline convention (or, in case of a file with
    mixed newlines, ask the user what to do, or output in platform
    convention).
    
    Feedback is explicitly solicited on one item in the reference
    implementation: whether or not the universal newlines routines
    should grab the global interpreter lock.  Currently they do not,
    but this could be considered living dangerously, as they may
    modify fields in a FileObject.  But as these routines are
    replacements for fgets() and fread() as well it may be difficult
    to decide whether or not the lock is held when the routine is
    called.  Moreover, the only danger is that if two threads read the
    same FileObject at the same time an extraneous newline may be seen
    or the "newlines" attribute may inadvertently be set to mixed.  I
    would argue that if you read the same FileObject in two threads
    simultaneously you are asking for trouble anyway.
    
    Note that no globally accessible pointers are manipulated in the
    fgets() or fread() replacement routines, just some integer-valued
    flags, so the chances of core dumps are zero (he said:-).
    
    Universal newline support can be disabled during configure because it does
    have a small performance penalty, and moreover the implementation has
    not been tested on all conceivable platforms yet. It might also be silly
    on some platforms (WinCE or Palm devices, for instance). If universal
    newline support is not enabled then file objects do not have the "newlines"
    attribute, so testing whether the current Python has it can be done with a
    simple

        if hasattr(open, 'newlines'):
            print 'We have universal newline support'

    Note that this test uses the open() function rather than the file
    type so that it won't fail for versions of Python where the file
    type was not available (the file type was added to the built-in
    namespace in the same release as the universal newline feature was
    added).

    Additionally, note that this test fails again on Python versions
    >= 2.5, when open() was made a function again and is not synonymous
    with the file type anymore.

    

Reference Implementation

    A reference implementation is available in SourceForge patch
    #476814: http://www.python.org/sf/476814


References

    None.


Copyright

    This document has been placed in the public domain.



pep-0279 The enumerate() built-in function

PEP: 279
Title: The enumerate() built-in function
Version: $Revision$
Last-Modified: $Date$
Author: Raymond Hettinger <python at rcn.com>
Status: Final
Type: Standards Track
Created: 30-Jan-2002
Python-Version: 2.3
Post-History: 

Abstract

    This PEP introduces a new built-in function, enumerate(), to
    simplify a commonly used looping idiom.  It provides all iterable
    collections with the same advantage that iteritems() affords to
    dictionaries -- a compact, readable, reliable index notation.


Rationale

    Python 2.2 introduced the concept of an iterable interface as
    proposed in PEP 234 [3].  The iter() factory function was provided
    as a common calling convention and deep changes were made to use
    iterators as a unifying theme throughout Python.  The unification
    came in the form of establishing a common iterable interface for
    mappings, sequences, and file objects.

    Generators, as proposed in PEP 255 [1], were introduced as a means
    for making it easier to create iterators, especially ones with
    complex internal execution or variable states.  The availability
    of generators makes it possible to improve on the loop counter
    ideas in PEP 212 [2].  Those ideas provided a clean syntax for
    iteration with indices and values, but did not apply to all
    iterable objects.  Also, that approach did not have the memory
    friendly benefit provided by generators which do not evaluate the
    entire sequence all at once.

    The new proposal is to add a built-in function, enumerate(), which
    was made possible once iterators and generators became available.
    It provides all iterables with the same advantage that iteritems()
    affords to dictionaries -- a compact, readable, reliable index
    notation.  Like zip(), it is expected to become a commonly used
    looping idiom.

    This suggestion is designed to take advantage of the existing
    implementation and require little additional effort to
    incorporate.  It is backwards compatible and requires no new
    keywords.  The proposal will go into Python 2.3 when generators
    become final and are not imported from __future__.


BDFL Pronouncements

    The new built-in function is ACCEPTED.  


Specification for a new built-in:

    def enumerate(collection):
        'Generates an indexed series:  (0,coll[0]), (1,coll[1]) ...'     
        i = 0
        it = iter(collection)
        while 1:
            yield (i, it.next())
            i += 1

    Note A: PEP 212 Loop Counter Iteration [2] discussed several
    proposals for achieving indexing.  Some of the proposals only work
    for lists unlike the above function which works for any generator,
    xrange, sequence, or iterable object.  Also, those proposals were
    presented and evaluated in the world prior to Python 2.2 which did
    not include generators.  As a result, the non-generator version in
    PEP 212 had the disadvantage of consuming memory with a giant list
    of tuples.  The generator version presented here is fast and
    light, works with all iterables, and allows users to abandon the
    sequence in mid-stream with no loss of computation effort.

    There are other PEPs which touch on related issues: integer
    iterators, integer for-loops, and one for modifying the arguments
    to range and xrange.  The enumerate() proposal does not preclude
    the other proposals and it still meets an important need even if
    those are adopted -- the need to count items in any iterable.  The
    other proposals give a means of producing an index but not the
    corresponding value.  This is especially problematic if a sequence
    is given which doesn't support random access such as a file
    object, generator, or sequence defined with __getitem__.

    Note B: Almost all of the PEP reviewers welcomed the function but
    were divided as to whether there should be any built-ins.  The
    main argument for a separate module was to slow the rate of
    language inflation.  The main argument for a built-in was that the
    function is destined to be part of a core programming style,
    applicable to any object with an iterable interface.  Just as
    zip() solves the problem of looping over multiple sequences, the
    enumerate() function solves the loop counter problem.

    If only one built-in is allowed, then enumerate() is the most
    important general purpose tool, solving the broadest class of
    problems while improving program brevity, clarity and reliability.

    Note C:  Various alternative names were discussed:

        iterindexed()-- five syllables is a mouthful
        index()      -- nice verb but could be confused with the .index() method
        indexed()    -- widely liked however adjectives should be avoided
        indexer()    -- noun did not read well in a for-loop
        count()      -- direct and explicit but often used in other contexts
        itercount()  -- direct, explicit and hated by more than one person
        iteritems()  -- conflicts with key:value concept for dictionaries
        itemize()    -- confusing because amap.items() != list(itemize(amap))
        enum()       -- pithy; less clear than enumerate; too similar to enum
                        in other languages where it has a different meaning

    All of the names involving 'count' had the further disadvantage of
    implying that the count would begin from one instead of zero.

    All of the names involving 'index' clashed with usage in database
    languages where indexing implies a sorting operation rather than
    linear sequencing.

    Note D: This function was originally proposed with optional start
    and stop arguments.  GvR pointed out that the function call
    enumerate(seqn,4,6) had an alternate, plausible interpretation as
    a slice that would return the fourth and fifth elements of the
    sequence.  To avoid the ambiguity, the optional arguments were
    dropped even though it meant losing flexibility as a loop counter.
    That flexibility was most important for the common case of
    counting from one, as in:
        
        for linenum, line in enumerate(source,1):  print linenum, line
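    The optional start argument did eventually return: Python 2.6 and
    later accept enumerate(iterable, start), so the line-numbering idiom
    above now works as written.  A sketch in modern Python:

```python
# The start argument (restored in Python 2.6) supports counting from one.
source = ["first line", "second line"]
numbered = [(n, line) for n, line in enumerate(source, 1)]
print(numbered)   # [(1, 'first line'), (2, 'second line')]
```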

    Comments from GvR:  filter and map should die and be subsumed into list
        comprehensions, not grow more variants. I'd rather introduce
        built-ins that do iterator algebra (e.g. the iterzip that I've
        often used as an example).

        I like the idea of having some way to iterate over a sequence
        and its index set in parallel.  It's fine for this to be a
        built-in.

        I don't like the name "indexed"; adjectives do not make good
        function names.  Maybe iterindexed()?

    Comments from Ka-Ping Yee:  I'm also quite happy with everything  you
        proposed ... and the extra built-ins (really 'indexed' in
        particular) are things I have wanted for a long time.

    Comments from Neil Schemenauer:  The new built-ins sound okay.  Guido
        may be concerned with increasing the number of built-ins too
        much.  You might be better off selling them as part of a
        module.  If you use a module then you can add lots of useful
        functions (Haskell has lots of them that we could steal).

    Comments from Magnus Lie Hetland:  I think indexed would be a useful and
        natural built-in function. I would certainly use it a lot.  I
        like indexed() a lot; +1. I'm quite happy to have it make PEP
        281 obsolete. Adding a separate module for iterator utilities
        seems like a good idea.

    Comments from the Community:  The response to the enumerate() proposal
        has been close to 100% favorable.  Almost everyone loves the
        idea.

    Author response:  Prior to these comments, four built-ins were proposed.
        After the comments, xmap xfilter and xzip were withdrawn.  The
        one that remains is vital for the language and is proposed by
        itself.  Indexed() is trivially easy to implement and can be
        documented in minutes.  More importantly, it is useful in
        everyday programming which does not otherwise involve explicit
        use of generators.

        This proposal originally included another function iterzip().
        That was subsequently implemented as the izip() function in
        the itertools module.


References

    [1] PEP 255 Simple Generators
        http://www.python.org/dev/peps/pep-0255/

    [2] PEP 212 Loop Counter Iteration
        http://www.python.org/dev/peps/pep-0212/

    [3] PEP 234 Iterators
        http://www.python.org/dev/peps/pep-0234/


Copyright

    This document has been placed in the public domain.



pep-0280 Optimizing access to globals

PEP: 280
Title: Optimizing access to globals
Version: $Revision$
Last-Modified: $Date$
Author: Guido van Rossum <guido at python.org>
Status: Deferred
Type: Standards Track
Created: 10-Feb-2002
Python-Version: 2.3
Post-History: 

Deferral

    While this PEP is a nice idea, no-one has yet emerged to do the work of
    hashing out the differences between this PEP, PEP 266 and PEP 267.
    Hence, it is being deferred.


Abstract

    This PEP describes yet another approach to optimizing access to
    module globals, providing an alternative to PEP 266 (Optimizing
    Global Variable/Attribute Access by Skip Montanaro) and PEP 267
    (Optimized Access to Module Namespaces by Jeremy Hylton).

    The expectation is that eventually one approach will be picked and
    implemented; possibly multiple approaches will be prototyped
    first.


Description

    (Note: Jason Orendorff writes: """I implemented this once, long
    ago, for Python 1.5-ish, I believe.  I got it to the point where
    it was only 15% slower than ordinary Python, then abandoned it.
    ;) In my implementation, "cells" were real first-class objects,
    and "celldict" was a copy-and-hack version of dictionary.  I
    forget how the rest worked."""  Reference:
    http://mail.python.org/pipermail/python-dev/2002-February/019876.html)

    Let a cell be a really simple Python object, containing a pointer
    to a Python object and a pointer to a cell.  Both pointers may be
    NULL.  A Python implementation could be:

        class cell(object):

            def __init__(self):
                self.objptr = NULL
                self.cellptr = NULL

    The cellptr attribute is used for chaining cells together for
    searching built-ins; this will be explained later.

    Let a celldict be a mapping from strings (the names of a module's
    globals) to objects (the values of those globals), implemented
    using a dict of cells.  A Python implementation could be:

        class celldict(object):

            def __init__(self):
                self.__dict = {} # dict of cells

            def getcell(self, key):
                c = self.__dict.get(key)
                if c is None:
                    c = cell()
                    self.__dict[key] = c
                return c

            def cellkeys(self):
                return self.__dict.keys()

            def __getitem__(self, key):
                c = self.__dict.get(key)
                if c is None:
                    raise KeyError, key
                value = c.objptr
                if value is NULL:
                    raise KeyError, key
                else:
                    return value

            def __setitem__(self, key, value):
                c = self.__dict.get(key)
                if c is None:
                    c = cell()
                    self.__dict[key] = c
                c.objptr = value

            def __delitem__(self, key):
                c = self.__dict.get(key)
                if c is None or c.objptr is NULL:
                    raise KeyError, key
                c.objptr = NULL

            def keys(self):
                return [k for k, c in self.__dict.iteritems()
                        if c.objptr is not NULL]

            def items(self):
                return [(k, c.objptr) for k, c in self.__dict.iteritems()
                        if c.objptr is not NULL]

            def values(self):
                return [c.objptr for c in self.__dict.itervalues()
                        if c.objptr is not NULL]

            def clear(self):
                for c in self.__dict.values():
                    c.objptr = NULL

            # Etc.

    It is possible that a cell exists corresponding to a given key,
    but the cell's objptr is NULL; let's call such a cell empty.  When
    the celldict is used as a mapping, it is as if empty cells don't
    exist.  However, once added, a cell is never deleted from a
    celldict, and it is possible to get at empty cells using the
    getcell() method.

    The celldict implementation never uses the cellptr attribute of
    cells.
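    The pseudo-code above can be exercised directly; here is a minimal
    runnable sketch in modern Python, with the C-level NULL modeled as
    a sentinel object (an assumption of the sketch, not part of the
    proposal):

```python
# Runnable sketch of cell/celldict; NULL stands in for the C NULL pointer.
NULL = object()

class Cell:
    def __init__(self):
        self.objptr = NULL    # value of the global, if any
        self.cellptr = None   # chain to a builtins cell (unused here)

class CellDict:
    def __init__(self):
        self._cells = {}      # name -> Cell; cells are never removed

    def getcell(self, key):
        return self._cells.setdefault(key, Cell())

    def __getitem__(self, key):
        c = self._cells.get(key)
        if c is None or c.objptr is NULL:
            raise KeyError(key)   # empty cells behave as missing keys
        return c.objptr

    def __setitem__(self, key, value):
        self.getcell(key).objptr = value

    def __delitem__(self, key):
        c = self._cells.get(key)
        if c is None or c.objptr is NULL:
            raise KeyError(key)
        c.objptr = NULL           # the cell itself survives

g = CellDict()
handle = g.getcell("x")        # grab the (empty) cell before the global exists
g["x"] = 42
print(handle.objptr)           # 42: the previously obtained cell sees the value
del g["x"]
print(handle.objptr is NULL)   # True: empty again, but the cell remains
```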

    We change the module implementation to use a celldict for its
    __dict__.  The module's getattr, setattr and delattr operations
    now map to getitem, setitem and delitem on the celldict.  The type
    of <module>.__dict__ and globals() is probably the only backwards
    incompatibility.

    When a module is initialized, its __builtins__ is initialized from
    the __builtin__ module's __dict__, which is itself a celldict.
    For each cell in __builtins__, the new module's __dict__ adds a
    cell with a NULL objptr, whose cellptr points to the corresponding
    cell of __builtins__.  Python pseudo-code (ignoring rexec):

        import __builtin__

        class module(object):

            def __init__(self):
                self.__dict__ = d = celldict()
                d['__builtins__'] = bd = __builtin__.__dict__
                for k in bd.cellkeys():
                    c = self.__dict__.getcell(k)
                    c.cellptr = bd.getcell(k)

            def __getattr__(self, k):
                try:
                    return self.__dict__[k]
                except KeyError:
                    raise AttributeError, k

            def __setattr__(self, k, v):
                self.__dict__[k] = v

            def __delattr__(self, k):
                del self.__dict__[k]

    The compiler generates LOAD_GLOBAL_CELL <i> (and STORE_GLOBAL_CELL
    <i> etc.) opcodes for references to globals, where <i> is a small
    index with meaning only within one code object like the const
    index in LOAD_CONST.  The code object has a new tuple, co_globals,
    giving the names of the globals referenced by the code indexed by
    <i>.  No new analysis is required to be able to do this.

    When a function object is created from a code object and a celldict,
    the function object creates an array of cell pointers by asking the
    celldict for cells corresponding to the names in the code object's
    co_globals.  If the celldict doesn't already have a cell for a
    particular name, it creates an empty one.  This array of cell
    pointers is stored on the function object as func_cells.  When a
    function object is created from a regular dict instead of a
    celldict, func_cells is a NULL pointer.

    When the VM executes a LOAD_GLOBAL_CELL <i> instruction, it gets
    cell number <i> from func_cells.  It then looks in the cell's
    PyObject pointer, and if not NULL, that's the global value.  If it
    is NULL, it follows the cell's cell pointer to the next cell, if it
    is not NULL, and looks in the PyObject pointer in that cell.  If
    that's also NULL, or if there is no second cell, NameError is
    raised.  (It could follow the chain of cell pointers until a NULL
    cell pointer is found; but I have no use for this.)  Similar for
    STORE_GLOBAL_CELL <i>, except it doesn't follow the cell pointer
    chain -- it always stores in the first cell.

    There are fallbacks in the VM for the case where the function's
    globals aren't a celldict, and hence func_cells is NULL.  In that
    case, the code object's co_globals is indexed with <i> to find the
    name of the corresponding global and this name is used to index the
    function's globals dict.


Additional Ideas

    - Never make func_cell a NULL pointer; instead, make up an array
      of empty cells, so that LOAD_GLOBAL_CELL can index func_cells
      without a NULL check.

    - Make c.cellptr equal to c when a cell is created, so that
      LOAD_GLOBAL_CELL can always dereference c.cellptr without a NULL
      check.

    With these two additional ideas added, here's Python pseudo-code
    for LOAD_GLOBAL_CELL:

        def LOAD_GLOBAL_CELL(self, i):
            # self is the frame
            c = self.func_cells[i]
            obj = c.objptr
            if obj is not NULL:
                return obj # Existing global
            return c.cellptr.objptr # Built-in or NULL
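    With those two ideas the fast path needs no NULL-pointer checks at
    all; a runnable sketch in modern Python (names are illustrative,
    and NULL is again modeled as a sentinel object):

```python
NULL = object()

class Cell:
    def __init__(self):
        self.objptr = NULL
        self.cellptr = self    # self-pointing by default: no None check needed

def load_global_cell(func_cells, i):
    # Mirrors the pseudo-code above: try the module cell, then its
    # chained builtins cell.
    c = func_cells[i]
    obj = c.objptr
    if obj is not NULL:
        return obj             # existing module global
    return c.cellptr.objptr    # builtin value, or NULL -> NameError in the VM

builtin_cell = Cell()
builtin_cell.objptr = len            # pretend 'len' lives in __builtin__

module_cell = Cell()                 # module has no 'len' global of its own...
module_cell.cellptr = builtin_cell   # ...so its cell chains to the builtin

print(load_global_cell([module_cell], 0) is len)   # True

module_cell.objptr = "shadow"                      # a module global shadows it
print(load_global_cell([module_cell], 0))          # shadow
```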

    - Be more aggressive:  put the actual values of builtins into module
      dicts, not just pointers to cells containing the actual values.

    There are two points to this:  (1) Simplify and speed access, which
    is the most common operation.  (2) Support faithful emulation of
    extreme existing corner cases.

    WRT #2, the set of builtins in the scheme above is captured at the
    time a module dict is first created.  Mutations to the set of builtin
    names following that don't get reflected in the module dicts.  Example:
    consider files main.py and cheater.py:

    [main.py]
    import cheater
    def f():
        cheater.cheat()
        return pachinko()
    print f()

    [cheater.py]
    def cheat():
        import __builtin__
        __builtin__.pachinko = lambda: 666

    If main.py is run under Python 2.2 (or before), 666 is printed.  But
    under the proposal, __builtin__.pachinko doesn't exist at the time
    main's __dict__ is initialized.  When the function object for
    f is created, main.__dict__ grows a pachinko cell mapping to two
    NULLs.  When cheat() is called, __builtin__.__dict__ grows a pachinko
    cell too, but main.__dict__ doesn't know -- and will never know --
    about that.  When f's return stmt references pachinko, it will still
    find the double-NULLs in main.__dict__'s pachinko cell, and so raise
    NameError.

    A similar (in cause) break in compatibility can occur if a module
    global foo is del'ed, but a builtin foo was created prior to that
    but after the module dict was first created.  Then the builtin foo
    becomes visible in the module under 2.2 and before, but remains
    invisible under the proposal.

    Mutating builtins is extremely rare (most programs never mutate the
    builtins, and it's hard to imagine a plausible use for frequent
    mutation of the builtins -- I've never seen or heard of one), so it
    doesn't matter how expensive mutating the builtins becomes.  OTOH,
    referencing globals and builtins is very common.  Combining those
    observations suggests a more aggressive caching of builtins in module
    globals, speeding access at the expense of making mutations of the
    builtins (potentially much) more expensive to keep the caches in
    synch.

    Much of the scheme above remains the same, and most of the rest is
    just a little different.  A cell changes to:

        class cell(object):
            def __init__(self, obj=NULL, builtin=0):
                self.objptr = obj
                self.builtinflag = builtin

    and a celldict maps strings to this version of cells.  builtinflag
    is true when and only when objptr contains a value obtained from
    the builtins; in other words, it's true when and only when a cell
    is acting as a cached value.  When builtinflag is false, objptr is
    the value of a module global (possibly NULL).  celldict changes to:

        class celldict(object):

            def __init__(self, builtindict=()):
                self.basedict = builtindict
                self.__dict = d = {}
                for k, v in builtindict.items():
                    d[k] = cell(v, 1)

            def __getitem__(self, key):
                c = self.__dict.get(key)
                if c is None or c.objptr is NULL or c.builtinflag:
                    raise KeyError, key
                return c.objptr

            def __setitem__(self, key, value):
                c = self.__dict.get(key)
                if c is None:
                    c = cell()
                    self.__dict[key] = c
                c.objptr = value
                c.builtinflag = 0

            def __delitem__(self, key):
                c = self.__dict.get(key)
                if c is None or c.objptr is NULL or c.builtinflag:
                    raise KeyError, key
                c.objptr = NULL
                # We may have unmasked a builtin.  Note that because
                # we're checking the builtin dict for that *now*, this
                # still works if the builtin first came into existence
                # after we were constructed.  Note too that del on
                # namespace dicts is rare, so the expense of this check
                # shouldn't matter.
                if key in self.basedict:
                    c.objptr = self.basedict[key]
                    assert c.objptr is not NULL # else "in" lied
                    c.builtinflag = 1
                else:
                    # There is no builtin with the same name.
                    assert not c.builtinflag

            def keys(self):
                return [k for k, c in self.__dict.iteritems()
                        if c.objptr is not NULL and not c.builtinflag]

            def items(self):
                return [(k, c.objptr) for k, c in self.__dict.iteritems()
                        if c.objptr is not NULL and not c.builtinflag]

            def values(self):
                return [c.objptr for c in self.__dict.itervalues()
                        if c.objptr is not NULL and not c.builtinflag]

            def clear(self):
                for c in self.__dict.values():
                    if not c.builtinflag:
                        c.objptr = NULL

            # Etc.

    The speed benefit comes from simplifying LOAD_GLOBAL_CELL, which
    I expect is executed more frequently than all other namespace
    operations combined:

        def LOAD_GLOBAL_CELL(self, i):
            # self is the frame
            c = self.func_cells[i]
            return c.objptr   # may be NULL (also true before)

    That is, accessing builtins and accessing module globals are equally
    fast.  For module globals, a NULL-pointer test+branch is saved.  For
    builtins, an additional pointer chase is also saved.
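    A runnable sketch of this more aggressive scheme, exercising the
    delete-unmasks-a-builtin path of __delitem__ above (modern Python;
    NULL modeled as a sentinel, and only the paths used below are
    implemented):

```python
# Sketch of the builtinflag variant; NULL stands in for the C NULL pointer.
NULL = object()

class Cell:
    def __init__(self, obj=NULL, builtin=False):
        self.objptr = obj
        self.builtinflag = builtin

class CellDict:
    def __init__(self, builtindict=()):
        self.basedict = dict(builtindict)
        self._cells = {k: Cell(v, True) for k, v in self.basedict.items()}

    def __getitem__(self, key):
        # Module-global view: cached builtins are invisible here.
        c = self._cells.get(key)
        if c is None or c.objptr is NULL or c.builtinflag:
            raise KeyError(key)
        return c.objptr

    def __setitem__(self, key, value):
        c = self._cells.setdefault(key, Cell())
        c.objptr = value
        c.builtinflag = False

    def __delitem__(self, key):
        c = self._cells.get(key)
        if c is None or c.objptr is NULL or c.builtinflag:
            raise KeyError(key)
        c.objptr = NULL
        if key in self.basedict:       # we may have unmasked a builtin
            c.objptr = self.basedict[key]
            c.builtinflag = True

d = CellDict({"len": len})
d["len"] = "shadow"                    # a module global shadows the builtin
print(d["len"])                        # shadow
del d["len"]                           # deleting unmasks and re-caches len
print(d._cells["len"].objptr is len)   # True
```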

    The other part needed to make this fly is expensive, propagating
    mutations of builtins into the module dicts that were initialized
    from the builtins.  This is much like, in 2.2, propagating changes
    in new-style base classes to their descendants:  the builtins need to
    maintain a list of weakrefs to the modules (or module dicts)
    initialized from the builtin's dict.  Given a mutation to the builtin
    dict (adding a new key, changing the value associated with an
    existing key, or deleting a key), traverse the list of module dicts
    and make corresponding mutations to them.  This is straightforward;
    for example, if a key is deleted from builtins, execute
    reflect_bltin_del in each module:

        def reflect_bltin_del(self, key):
            c = self.__dict.get(key)
            assert c is not None # else we were already out of synch
            if c.builtinflag:
                # Put us back in synch.
                c.objptr = NULL
                c.builtinflag = 0
            # Else we're shadowing the builtin, so don't care that
            # the builtin went away.

    Note that c.builtinflag protects us from erroneously deleting a
    module global of the same name.  Adding a new (key, value) builtin
    pair is similar:

        def reflect_bltin_new(self, key, value):
            c = self.__dict.get(key)
            if c is None:
                # Never heard of it before:  cache the builtin value.
                self.__dict[key] = cell(value, 1)
            elif c.objptr is NULL:
                # This used to exist in the module or the builtins,
                # but doesn't anymore; rehabilitate it.
                assert not c.builtinflag
                c.objptr = value
                c.builtinflag = 1
            else:
                # We're shadowing it already.
                assert not c.builtinflag

    Changing the value of an existing builtin:

        def reflect_bltin_change(self, key, newvalue):
            c = self.__dict.get(key)
            assert c is not None # else we were already out of synch
            if c.builtinflag:
                # Put us back in synch.
                c.objptr = newvalue
            # Else we're shadowing the builtin, so don't care that
            # the builtin changed.
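    The propagation machinery above can be modelled in ordinary Python.
    The following is a hypothetical pure-Python sketch of the celldict
    scheme (the names Cell and CellDict are invented for illustration,
    and None plays the role of NULL):

```python
class Cell:
    def __init__(self, objptr=None, builtinflag=False):
        self.objptr = objptr          # None stands in for NULL
        self.builtinflag = builtinflag

class CellDict:
    """Module globals whose values live in cells shared with builtins."""
    def __init__(self, builtins):
        # Pre-populate with cells caching the builtin values.
        self._dict = {k: Cell(v, True) for k, v in builtins.items()}

    def __setitem__(self, key, value):
        c = self._dict.setdefault(key, Cell())
        c.objptr = value
        c.builtinflag = False         # a module global shadows any builtin

    def __getitem__(self, key):
        c = self._dict[key]
        if c.objptr is None:
            raise KeyError(key)
        return c.objptr

    def reflect_bltin_change(self, key, newvalue):
        # Mirror a mutation of the builtins dict, as described above.
        c = self._dict.get(key)
        if c is not None and c.builtinflag:
            c.objptr = newvalue       # propagate; shadowed cells untouched

# A builtin change reaches the module unless the name is shadowed.
d = CellDict({"len": len})
d["max"] = 3                          # module global shadows builtin max
d.reflect_bltin_change("len", "fake_len")
```

    Since d["max"] was set at module level, a later builtin change to
    "max" would leave it alone, exactly as reflect_bltin_change shows.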


FAQs

    Q. Will it still be possible to:
       a) install new builtins in the __builtin__ namespace and have
          them available in all already loaded modules right away ?
       b) override builtins (e.g. open()) with my own copies
          (e.g. to increase security) in a way that makes these new
          copies override the previous ones in all modules ?

    A. Yes, this is the whole point of this design.  In the original
       approach, when LOAD_GLOBAL_CELL finds a NULL in the second
       cell, it should go back to see if the __builtins__ dict has
       been modified (the pseudo code doesn't have this yet).  Tim's
       "more aggressive" alternative also takes care of this.

    Q. How does the new scheme get along with the restricted execution
       model?

    A. It is intended to support that fully.

    Q. What happens when a global is deleted?

    A. The module's celldict would have a cell with a NULL objptr for
       that key.  This is true in both variations, but the "aggressive"
       variation goes on to see whether this unmasks a builtin of the
       same name, and if so copies its value (just a pointer-copy of the
       ultimate PyObject*) into the cell's objptr and sets the cell's
       builtinflag to true.

    Q. What would the C code for LOAD_GLOBAL_CELL look like?

    A. The first version, with the first two bullets under "Additional
       ideas" incorporated, could look like this:

       case LOAD_GLOBAL_CELL:
           cell = func_cells[oparg];
           x = cell->objptr;
           if (x == NULL) {
               x = cell->cellptr->objptr;
               if (x == NULL) {
                   ... error recovery ...
                   break;
               }
           }
           Py_INCREF(x);
           PUSH(x);
           continue;

       We could even write it like this (idea courtesy of Ka-Ping Yee):

       case LOAD_GLOBAL_CELL:
           cell = func_cells[oparg];
           x = cell->cellptr->objptr;
           if (x != NULL) {
               Py_INCREF(x);
               PUSH(x);
               continue;
           }
           ... error recovery ...
           break;

       In modern CPU architectures, this reduces the number of
       branches taken for built-ins, which might be a really good
       thing, while any decent memory cache should realize that
       cell->cellptr is the same as cell for regular globals and hence
       this should be very fast in that case too.

       For the aggressive variant:

       case LOAD_GLOBAL_CELL:
           cell = func_cells[oparg];
           x = cell->objptr;
           if (x != NULL) {
               Py_INCREF(x);
               PUSH(x);
               continue;
           }
           ... error recovery ...
           break;

    Q. What happens in the module's top-level code where there is
       presumably no func_cells array?

    A. We could do some code analysis and create a func_cells array,
       or we could use LOAD_NAME which should use PyMapping_GetItem on
       the globals dict.


Graphics

    Ka-Ping Yee supplied a drawing of the state of things after
    "import spam", where spam.py contains:

        import eggs

        i = -2
        max = 3

        def foo(n):
            y = abs(i) + max
            return eggs.ham(y + n)

    The drawing is at http://web.lfw.org/repo/cells.gif; a larger
    version is at http://lfw.org/repo/cells-big.gif; the source is at
    http://lfw.org/repo/cells.ai.


Comparison

    XXX Here, a comparison of the three approaches could be added.


Copyright

    This document has been placed in the public domain.



pep-0281 Loop Counter Iteration with range and xrange

PEP: 281
Title: Loop Counter Iteration with range and xrange
Version: $Revision$
Last-Modified: $Date$
Author: Magnus Lie Hetland <magnus at hetland.org>
Status: Rejected
Type: Standards Track
Created: 11-Feb-2002
Python-Version: 2.3
Post-History: 

Abstract

   This PEP describes yet another way of exposing the loop counter in
   for-loops. It basically proposes that the functionality of the
   function indices() from PEP 212 [1] be included in the existing
   functions range() and xrange().

Pronouncement

   In commenting on PEP 279's enumerate() function, this PEP's author
   offered, "I'm quite happy to have it make PEP 281 obsolete."
   Subsequently, PEP 279 was accepted into Python 2.3.

   On 17 June 2005, the BDFL concurred with it being obsolete and
   hereby rejected the PEP.  For the record, he found some of the
    examples to be somewhat jarring in appearance:

      >>> range(range(5), range(10), range(2))
      [5, 7, 9]


Motivation

   It is often desirable to loop over the indices of a sequence.  PEP
   212 describes several ways of doing this, including adding a
   built-in function called indices, conceptually defined as

       def indices(sequence):
           return range(len(sequence))

   On the assumption that adding functionality to an existing built-in
   function may be less intrusive than adding a new built-in function,
   this PEP proposes adding this functionality to the existing
   functions range() and xrange().


Specification

   It is proposed that all three arguments to the built-in functions
   range() and xrange() are allowed to be objects with a length
   (i.e. objects implementing the __len__ method).  If an argument
   cannot be interpreted as an integer (i.e. it has no __int__
   method), its length will be used instead.

   Examples:

   >>> range(range(10))
   [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
   >>> range(range(5), range(10))
   [5, 6, 7, 8, 9]
   >>> range(range(5), range(10), range(2))
   [5, 7, 9]
   >>> list(xrange(range(10)))
   [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
   >>> list(xrange(xrange(10)))
   [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

   # Number the lines of a file:
   lines = file.readlines()
   for num in range(lines):
       print num, lines[num]
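   On modern Python the proposed semantics can be sketched with a
   hypothetical helper; len_range() stands in for the extended
   range()/xrange() and is invented purely for illustration:

```python
def len_range(*args):
    """Sketch of the proposed range(): fall back to len() per argument."""
    def as_int(x):
        try:
            return int(x)        # use the integer value if there is one
        except TypeError:
            return len(x)        # otherwise use the object's length
    return list(range(*map(as_int, args)))

# The examples from the specification:
len_range(range(10))                      # [0, 1, ..., 9]
len_range(range(5), range(10))            # [5, 6, 7, 8, 9]
len_range(range(5), range(10), range(2))  # [5, 7, 9]
```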


Alternatives

   A natural alternative to the above specification is allowing
   xrange() to access its arguments in a lazy manner.  Thus, instead
   of using their length explicitly, xrange can return one index for
   each element of the stop argument until the end is reached.  A
   similar lazy treatment makes little sense for the start and step
   arguments since their length must be calculated before iteration
   can begin.  (Actually, the length of the step argument isn't needed
   until the second element is returned.)

   A pseudo-implementation (using only the stop argument, and assuming
   that it is iterable) is:

   def xrange(stop):
       i = 0
       for x in stop:
           yield i
           i += 1

   Testing whether to use int() or lazy iteration could be done by
   checking for an __iter__ attribute.  (This example assumes the
   presence of generators, but could easily have been implemented as a
   plain iterator object.)

   It may be questionable whether this feature is truly useful, since
   one would not be able to access the elements of the iterable object
   inside the for loop through indexing.

   Example:

   # Printing the numbers of the lines of a file:
   for num in range(file):
       print num # The line itself is not accessible

   A more controversial alternative (to deal with this) would be to
   let range() behave like the function irange() of PEP 212 when
   supplied with a sequence.

   Example:

   >>> range(5)
   [0, 1, 2, 3, 4]
   >>> range('abcde')
   [(0, 'a'), (1, 'b'), (2, 'c'), (3, 'd'), (4, 'e')]


Backwards Compatibility

   The proposal could cause backwards incompatibilities if arguments
   are used which implement both __int__ and __len__ (or __iter__ in
   the case of lazy iteration with xrange).  The author does not
   believe that this is a significant problem.


References and Footnotes

   [1] PEP 212, Loop Counter Iteration
       http://www.python.org/dev/peps/pep-0212/


Copyright

    This document has been placed in the public domain.



pep-0282 A Logging System

PEP: 282
Title: A Logging System
Version: $Revision$
Last-Modified: $Date$
Author: vinay_sajip at red-dove.com (Vinay Sajip), Trent Mick <trentm at activestate.com>
Status: Final
Type: Standards Track
Created: 4-Feb-2002
Python-Version: 2.3
Post-History: 

Abstract

    This PEP describes a proposed logging package for Python's
    standard library.

    Basically the system involves the user creating one or more logger
    objects on which methods are called to log debugging notes,
    general information, warnings, errors etc.  Different logging
    'levels' can be used to distinguish important messages from less
    important ones.

    A registry of named singleton logger objects is maintained so that

        1) different logical logging streams (or 'channels') exist
           (say, one for 'zope.zodb' stuff and another for
           'mywebsite'-specific stuff)

        2) one does not have to pass logger object references around.

    The system is configurable at runtime.  This configuration
    mechanism allows one to tune the level and type of logging done
    while not touching the application itself.


Motivation

    If a single logging mechanism is enshrined in the standard
    library, 1) logging is more likely to be done 'well', and 2)
    multiple libraries will be able to be integrated into larger
    applications which can be logged reasonably coherently.


Influences

    This proposal was put together after having studied the
    following logging packages:

        o java.util.logging in JDK 1.4 (a.k.a. JSR047) [1]
        o log4j [2]
        o the Syslog package from the Protomatter project [3]
        o MAL's mx.Log package [4]


Simple Example

    This shows a very simple example of how the logging package can be
    used to generate simple logging output on stderr.

        --------- mymodule.py -------------------------------
        import logging
        log = logging.getLogger("MyModule")

        def doIt():
                log.debug("Doin' stuff...")
                #do stuff...
                raise TypeError, "Bogus type error for testing"
        -----------------------------------------------------

        --------- myapp.py ----------------------------------
        import mymodule, logging

        logging.basicConfig()

        log = logging.getLogger("MyApp")

        log.info("Starting my app")
        try:
                mymodule.doIt()
        except Exception, e:
                log.exception("There was a problem.")
        log.info("Ending my app")
        -----------------------------------------------------

        % python myapp.py

        INFO:MyApp: Starting my app
        DEBUG:MyModule: Doin' stuff...
        ERROR:MyApp: There was a problem.
        Traceback (most recent call last):
                File "myapp.py", line 9, in ?
                        mymodule.doIt()
                File "mymodule.py", line 7, in doIt
                        raise TypeError, "Bogus type error for testing"
        TypeError: Bogus type error for testing

        INFO:MyApp: Ending my app

        The above example shows the default output format.  All
        aspects of the output format should be configurable, so that
        you could have output formatted like this:

        2002-04-19 07:56:58,174 MyModule   DEBUG - Doin' stuff...

        or just

        Doin' stuff...


Control Flow

    Applications make logging calls on *Logger* objects.  Loggers are
    organized in a hierarchical namespace and child Loggers inherit
    some logging properties from their parents in the namespace.

    Logger names fit into a "dotted name" namespace, with dots
    (periods) indicating sub-namespaces.  The namespace of logger
    objects therefore corresponds to a single tree data structure.

       "" is the root of the namespace
       "Zope" would be a child node of the root
       "Zope.ZODB" would be a child node of "Zope"

    These Logger objects create *LogRecord* objects which are passed
    to *Handler* objects for output.  Both Loggers and Handlers may
    use logging *levels* and (optionally) *Filters* to decide if they
    are interested in a particular LogRecord.  When it is necessary to
    output a LogRecord externally, a Handler can (optionally) use a
    *Formatter* to localize and format the message before sending it
    to an I/O stream.

    Each Logger keeps track of a set of output Handlers.  By default
    all Loggers also send their output to all Handlers of their
    ancestor Loggers.  Loggers may, however, also be configured to
    ignore Handlers higher up the tree.
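    This propagation behaviour can be observed in the logging package as
    it eventually shipped; a minimal sketch using an in-memory stream:

```python
import io
import logging

buf = io.StringIO()
parent = logging.getLogger("Zope")
parent.addHandler(logging.StreamHandler(buf))
parent.setLevel(logging.DEBUG)

child = logging.getLogger("Zope.ZODB")
child.warning("cache miss")     # travels up to the "Zope" handler

child.propagate = False         # now ignore handlers higher up the tree
child.warning("not seen")       # no longer reaches the "Zope" handler

out = buf.getvalue()
```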

    The APIs are structured so that calls on the Logger APIs can be
    cheap when logging is disabled.  If logging is disabled for a
    given log level, then the Logger can make a cheap comparison test
    and return.  If logging is enabled for a given log level, the
    Logger is still careful to minimize costs before passing the
    LogRecord into the Handlers.  In particular, localization and
    formatting (which are relatively expensive) are deferred until the
    Handler requests them.

    The overall Logger hierarchy can also have a level associated with
    it, which takes precedence over the levels of individual Loggers.
    This is done through a module-level function:

        def disable(lvl):
            """
            Do not generate any LogRecords for requests with a severity less
            than 'lvl'.
            """
            ...
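    The module as eventually shipped provides exactly this hook as
    logging.disable(); note that in the shipped version the given level
    itself is also suppressed (severity lvl and below), a slight
    difference from the "less than 'lvl'" wording above.  A sketch:

```python
import logging

logger = logging.getLogger("demo")
logger.setLevel(logging.DEBUG)

logging.disable(logging.INFO)     # suppress INFO and below, globally
info_on = logger.isEnabledFor(logging.INFO)       # False
warn_on = logger.isEnabledFor(logging.WARNING)    # True

logging.disable(logging.NOTSET)   # restore normal behaviour
```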


Levels

    The logging levels, in increasing order of importance, are:

        DEBUG
        INFO
        WARN
        ERROR
        CRITICAL

    The term CRITICAL is used in preference to FATAL, which is used by
    log4j.  The levels are conceptually the same - that of a serious,
    or very serious, error.  However, FATAL implies death, which in
    Python implies a raised and uncaught exception, traceback, and
    exit.  Since the logging module does not enforce such an outcome
    from a FATAL-level log entry, it makes sense to use CRITICAL in
    preference to FATAL.

    These are just integer constants, to allow simple comparison of
    importance.  Experience has shown that too many levels can be
    confusing, as they lead to subjective interpretation of which
    level should be applied to any particular log request.

    Although the above levels are strongly recommended, the logging
    system should not be prescriptive.  Users may define their own
    levels, as well as the textual representation of any levels.  User
    defined levels must, however, obey the constraints that they are
    all positive integers and that they increase in order of
    increasing severity.

    User-defined logging levels are supported through two module-level
    functions:

        def getLevelName(lvl):
                """Return the text for level 'lvl'."""
                ...

        def addLevelName(lvl, lvlName):
                """
                Add the level 'lvl' with associated text 'levelName', or
                set the textual representation of existing level 'lvl' to be
                'lvlName'."""
                ...
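    Both functions shipped with these names; a short sketch of their use
    (the NOTICE level and the value 25 are invented for illustration):

```python
import logging

logging.addLevelName(25, "NOTICE")      # register a custom level
name = logging.getLevelName(25)         # "NOTICE"
num = logging.getLevelName("NOTICE")    # the reverse mapping also works
```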

Loggers

    Each Logger object keeps track of a log level (or threshold) that
    it is interested in, and discards log requests below that level.

    A *Manager* class instance maintains the hierarchical namespace of
    named Logger objects.  Generations are denoted with dot-separated
    names: Logger "foo" is the parent of Loggers "foo.bar" and
    "foo.baz".

    The Manager class instance is a singleton and is not directly
    exposed to users, who interact with it using various module-level
    functions.

    The general logging method is:

        class Logger:
            def log(self, lvl, msg, *args, **kwargs):
                """Log 'str(msg) % args' at logging level 'lvl'."""
                ...

    However, convenience functions are defined for each logging level:

        class Logger:
            def debug(self, msg, *args, **kwargs): ...
            def info(self, msg, *args, **kwargs): ...
            def warn(self, msg, *args, **kwargs): ...
            def error(self, msg, *args, **kwargs): ...
            def critical(self, msg, *args, **kwargs): ...

    Only one keyword argument is recognized at present - "exc_info".
    If true, the caller wants exception information to be provided in
    the logging output.  This mechanism is only needed if exception
    information needs to be provided at *any* logging level.  In the
    more common case, where exception information needs to be added to
    the log only when errors occur, i.e. at the ERROR level, then
    another convenience method is provided:

        class Logger:
            def exception(self, msg, *args): ...

    This should only be called in the context of an exception handler,
    and is the preferred way of indicating a desire for exception
    information in the log.  The other convenience methods are
    intended to be called with exc_info only in the unusual situation
    where you might want to provide exception information in the
    context of an INFO message, for example.
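    A minimal sketch of exception() in use, capturing output with an
    in-memory stream so the logged traceback can be inspected:

```python
import io
import logging

buf = io.StringIO()
logger = logging.getLogger("excdemo")
logger.setLevel(logging.DEBUG)
logger.propagate = False
logger.addHandler(logging.StreamHandler(buf))

try:
    1 / 0
except ZeroDivisionError:
    # Logs at ERROR level and appends the current traceback.
    logger.exception("There was a problem.")

out = buf.getvalue()   # message plus "ZeroDivisionError" traceback
```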

    The "msg" argument shown above will normally be a format string;
    however, it can be any object x for which str(x) returns the
    format string.  This facilitates, for example, the use of an
    object which fetches a locale- specific message for an
    internationalized/localized application, perhaps using the
    standard gettext module.  An outline example:

        class Message:
            """Represents a message"""
            def __init__(self, id):
                """Initialize with the message ID"""

            def __str__(self):
                """Return an appropriate localized message text"""

        ...

        logger.info(Message("abc"), ...)

    Gathering and formatting data for a log message may be expensive,
    and a waste if the logger was going to discard the message anyway.
    To see if a request will be honoured by the logger, the
    isEnabledFor() method can be used:

        class Logger:
            def isEnabledFor(self, lvl):
                """
                Return true if requests at level 'lvl' will NOT be
                discarded.
                """
                ...

    so instead of this expensive and possibly wasteful DOM to XML
    conversion:

        ...
        hamletStr = hamletDom.toxml()
        log.info(hamletStr)
        ...

    one can do this:

        if log.isEnabledFor(logging.INFO):
            hamletStr = hamletDom.toxml()
            log.info(hamletStr)

    When new loggers are created, they are initialized with a level
    which signifies "no level".  A level can be set explicitly using
    the setLevel() method:

        class Logger:
            def setLevel(self, lvl): ...

    If a logger's level is not set, the system consults all its
    ancestors, walking up the hierarchy until an explicitly set level
    is found.  That is regarded as the "effective level" of the
    logger, and can be queried via the getEffectiveLevel() method:

        def getEffectiveLevel(self): ...

    Loggers are never instantiated directly.  Instead, a module-level
    function is used:

        def getLogger(name=None): ...

    If no name is specified, the root logger is returned.  Otherwise,
    if a logger with that name exists, it is returned.  If not, a new
    logger is initialized and returned.  Here, "name" is synonymous
    with "channel name".

    Users can specify a custom subclass of Logger to be used by the
    system when instantiating new loggers:

        def setLoggerClass(klass): ...

    The passed class should be a subclass of Logger, and its __init__
    method should call Logger.__init__.
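    A sketch of such a subclass; AuditLogger and its audit() method are
    invented for illustration:

```python
import logging

class AuditLogger(logging.Logger):
    """Hypothetical Logger subclass adding a convenience method."""
    def audit(self, msg, *args, **kwargs):
        self.log(logging.INFO, "AUDIT: %s" % msg, *args, **kwargs)

logging.setLoggerClass(AuditLogger)
log = logging.getLogger("myapp.audit")    # new loggers use AuditLogger
is_audit = isinstance(log, AuditLogger)   # True

logging.setLoggerClass(logging.Logger)    # restore the default
```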


Handlers

    Handlers are responsible for doing something useful with a given
    LogRecord.  The following core Handlers will be implemented:

        - StreamHandler: A handler for writing to a file-like object.
        - FileHandler: A handler for writing to a single file or set
          of rotating files.
        - SocketHandler: A handler for writing to remote TCP ports.
        - DatagramHandler: A handler for writing to UDP sockets, for
          low-cost logging.  Jeff Bauer already had such a system [5].
        - MemoryHandler: A handler that buffers log records in memory
          until the buffer is full or a particular condition occurs
          [1].
        - SMTPHandler: A handler for sending to email addresses via SMTP.
        - SysLogHandler: A handler for writing to Unix syslog via UDP.
        - NTEventLogHandler: A handler for writing to event logs on
          Windows NT, 2000 and XP.
        - HTTPHandler: A handler for writing to a Web server with
          either GET or POST semantics.

          Handlers can also have levels set for them using the
          setLevel() method:

              def setLevel(self, lvl): ...


    The FileHandler can be set up to create a rotating set of log
    files.  In this case, the file name passed to the constructor is
    taken as a "base" file name.  Additional file names for the
    rotation are created by appending .1, .2, etc. to the base file
    name, up to a maximum as specified when rollover is requested.
    The setRollover method is used to specify a maximum size for a log
    file and a maximum number of backup files in the rotation.

        def setRollover(self, maxBytes, backupCount): ...

    If maxBytes is specified as zero, no rollover ever occurs and the
    log file grows indefinitely.  If a non-zero size is specified,
    when that size is about to be exceeded, rollover occurs.  The
    rollover method ensures that the base file name is always the most
    recent, .1 is the next most recent, .2 the next most recent after
    that, and so on.
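    In the logging package as eventually shipped, this rollover
    behaviour moved into a dedicated handler class,
    logging.handlers.RotatingFileHandler, with maxBytes and backupCount
    passed to the constructor rather than to a setRollover() method.  A
    sketch:

```python
import logging
import logging.handlers
import os
import tempfile

logdir = tempfile.mkdtemp()
base = os.path.join(logdir, "app.log")

# Keep each file under ~1 KiB, with at most 3 numbered backups.
handler = logging.handlers.RotatingFileHandler(
    base, maxBytes=1024, backupCount=3)
logger = logging.getLogger("rotdemo")
logger.propagate = False
logger.addHandler(handler)
logger.setLevel(logging.INFO)

for i in range(100):
    logger.info("line %d: %s", i, "x" * 40)
handler.close()

files = sorted(os.listdir(logdir))   # base name is always most recent
```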

    There are many additional handlers implemented in the test/example
    scripts provided with [6] - for example, XMLHandler and
    SOAPHandler.


LogRecords

        A LogRecord acts as a receptacle for information about a
        logging event.  It is little more than a dictionary, though it
        does define a getMessage method which merges a message with
        optional runtime arguments.


Formatters

    A Formatter is responsible for converting a LogRecord to a string
    representation.  A Handler may call its Formatter before writing a
    record.  The following core Formatters will be implemented:

        - Formatter: Provide printf-like formatting, using the % operator.

        - BufferingFormatter: Provide formatting for multiple
          messages, with header and trailer formatting support.

    Formatters are associated with Handlers by calling setFormatter()
    on a handler:

        def setFormatter(self, form): ...

    Formatters use the % operator to format the logging message.  The
    format string should contain %(name)x and the attribute dictionary
    of the LogRecord is used to obtain message-specific data.  The
    following attributes are provided:

    %(name)s            Name of the logger (logging channel)

    %(levelno)s         Numeric logging level for the message (DEBUG,
                        INFO, WARN, ERROR, CRITICAL)

    %(levelname)s       Text logging level for the message ("DEBUG", "INFO",
                        "WARN", "ERROR", "CRITICAL")

    %(pathname)s        Full pathname of the source file where the logging
                        call was issued (if available)

    %(filename)s        Filename portion of pathname

    %(module)s          Module from which logging call was made

    %(lineno)d          Source line number where the logging call was issued
                        (if available)

    %(created)f         Time when the LogRecord was created (time.time()
                        return value)

    %(asctime)s         Textual time when the LogRecord was created

    %(msecs)d           Millisecond portion of the creation time

    %(relativeCreated)d Time in milliseconds when the LogRecord was created,
                        relative to the time the logging module was loaded
                        (typically at application startup time)

    %(thread)d          Thread ID (if available)

    %(message)s         The result of record.getMessage(), computed just as
                        the record is emitted

    If a formatter sees that the format string includes "%(asctime)s",
    the creation time is formatted into the LogRecord's asctime
    attribute.  To allow flexibility in formatting dates, Formatters
    are initialized with a format string for the message as a whole,
    and a separate format string for date/time.  The date/time format
    string should be in time.strftime format.  The default value for
    the message format is "%(message)s".  The default date/time format
    is ISO8601.

    The formatter uses a class attribute, "converter", to indicate how
    to convert a time from seconds to a tuple.  By default, the value
    of "converter" is "time.localtime".  If needed, a different
    converter (e.g.  "time.gmtime") can be set on an individual
    formatter instance, or the class attribute changed to affect all
    formatter instances.
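    A sketch of a Formatter configured with a custom date/time format
    and a per-instance converter override, as described above:

```python
import io
import logging
import time

buf = io.StringIO()
handler = logging.StreamHandler(buf)
fmt = logging.Formatter("%(asctime)s %(name)s %(levelname)s - %(message)s",
                        datefmt="%Y-%m-%d")
fmt.converter = time.gmtime      # per-instance override of the default
handler.setFormatter(fmt)

logger = logging.getLogger("fmtdemo")
logger.propagate = False
logger.addHandler(handler)
logger.setLevel(logging.DEBUG)
logger.debug("Doin' stuff...")

out = buf.getvalue()
```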


Filters

    When level-based filtering is insufficient, a Filter can be called
    by a Logger or Handler to decide if a LogRecord should be output.
    Loggers and Handlers can have multiple filters installed, and any
    one of them can veto a LogRecord being output.

        class Filter:
            def filter(self, record):
                """
                Return a value indicating true if the record is to be
                processed.  Possibly modify the record, if deemed
                appropriate by the filter.
                """

    The default behaviour allows a Filter to be initialized with a
    Logger name.  This will only allow through events which are
    generated using the named logger or any of its children.  For
    example, a filter initialized with "A.B" will allow events logged
    by loggers "A.B", "A.B.C", "A.B.C.D", "A.B.D" etc. but not "A.BB",
    "B.A.B" etc.  If initialized with the empty string, all events are
    passed by the Filter.  This filter behaviour is useful when it is
    desired to focus attention on one particular area of an
    application; the focus can be changed simply by changing a filter
    attached to the root logger.

    There are many examples of Filters provided in [6].
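    The name-based behaviour can be checked directly against the
    logging.Filter class as it eventually shipped:

```python
import logging

f = logging.Filter("A.B")    # pass only "A.B" and its descendants

def record_for(name):
    # Build a bare LogRecord just to exercise the filter.
    return logging.LogRecord(name, logging.INFO, __file__, 1, "msg", (), None)

allowed = [bool(f.filter(record_for(n)))
           for n in ("A.B", "A.B.C", "A.BB", "B.A.B")]
# -> [True, True, False, False]
```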


Configuration

    The main benefit of a logging system like this is that one can
    control how much and what logging output one gets from an
    application without changing that application's source code.
    Therefore, although configuration can be performed through the
    logging API, it must also be possible to change the logging
    configuration without changing an application at all.  For
    long-running programs like Zope, it should be possible to change
    the logging configuration while the program is running.

    Configuration includes the following:

        - What logging level a logger or handler should be interested in.
        - What handlers should be attached to which loggers.
        - What filters should be attached to which handlers and loggers.
        - Specifying attributes specific to certain handlers and filters.

    In general each application will have its own requirements for how
    a user may configure logging output.  However, each application
    will specify the required configuration to the logging system
    through a standard mechanism.

    The simplest configuration is that of a single handler, writing
    to stderr, attached to the root logger.  This configuration is set
    up by calling the basicConfig() function once the logging module
    has been imported.

        def basicConfig(): ...

    For more sophisticated configurations, this PEP makes no specific
    proposals, for the following reasons:

        - A specific proposal may be seen as prescriptive.
        - Without the benefit of wide practical experience in the
          Python community, there is no way to know whether any given
          configuration approach is a good one.  That practice can't
          really come until the logging module is used, and that means
          until *after* Python 2.3 has shipped.
        - There is a likelihood that different types of applications
          may require different configuration approaches, so that no
          "one size fits all".

    The reference implementation [6] has a working configuration file
    format, implemented for the purpose of proving the concept and
    suggesting one possible alternative.  It may be that separate
    extension modules, not part of the core Python distribution, are
    created for logging configuration and log viewing, supplemental
    handlers and other features which are not of interest to the bulk
    of the community.


Thread Safety

    The logging system should support thread-safe operation without
    any special action needing to be taken by its users.


Module-Level Functions

    To support use of the logging mechanism in short scripts and small
    applications, module-level functions debug(), info(), warn(),
    error(), critical() and exception() are provided.  These work in
    the same way as the correspondingly named methods of Logger - in
    fact they delegate to the corresponding methods on the root
    logger.  A further convenience provided by these functions is that
    if no configuration has been done, basicConfig() is automatically
    called.

    At application exit, all handlers can be flushed by calling the function

        def shutdown(): ...

    This will flush and close all handlers.


Implementation

    The reference implementation is Vinay Sajip's logging module [6].


Packaging

    The reference implementation is implemented as a single module.
    This offers the simplest interface - all users have to do is
    "import logging" and they are in a position to use all the
    functionality available.


References

    [1] java.util.logging
        http://java.sun.com/j2se/1.4/docs/guide/util/logging/

    [2] log4j: a Java logging package
        http://jakarta.apache.org/log4j/docs/index.html

    [3] Protomatter's Syslog
        http://protomatter.sourceforge.net/1.1.6/index.html
        http://protomatter.sourceforge.net/1.1.6/javadoc/com/protomatter/syslog/syslog-whitepaper.html

    [4] MAL mentions his mx.Log logging module:
        http://mail.python.org/pipermail/python-dev/2002-February/019767.html

    [5] Jeff Bauer's Mr. Creosote
        http://starship.python.net/crew/jbauer/creosote/

    [6] Vinay Sajip's logging module.
        http://www.red-dove.com/python_logging.html


Copyright

    This document has been placed in the public domain.



pep-0283 Python 2.3 Release Schedule

PEP: 283
Title: Python 2.3 Release Schedule
Version: $Revision$
Last-Modified: $Date$
Author: Guido van Rossum
Status: Final
Type: Informational
Created: 27-Feb-2002
Python-Version: 2.3
Post-History: 27-Feb-2002

Abstract

    This document describes the development and release schedule for
    Python 2.3.  The schedule primarily concerns itself with PEP-sized
    items.  Small features may be added up to and including the first
    beta release.  Bugs may be fixed until the final release.

    There will be at least two alpha releases, two beta releases, and
    one release candidate.  Alpha and beta releases will be spaced at
    least 4 weeks apart (except if an emergency release must be made
    to correct a blunder in the previous release; then the blunder
    release does not count).  Release candidates will be spaced at
    least one week apart (excepting again blunder corrections).

      alpha 1      --  31 Dec 2002
      alpha 2      --  19 Feb 2003
      beta 1       --  25 Apr 2003
      beta 2       --  29 Jun 2003
      candidate 1  --  18 Jul 2003
      candidate 2  --  24 Jul 2003
      final        --  29 Jul 2003

Release Manager

    Barry Warsaw, Jeremy Hylton, Tim Peters


Completed features for 2.3

    This list is not complete.  See Doc/whatsnew/whatsnew23.tex in CVS
    for more, and of course Misc/NEWS for the full list.

    - Tk 8.4 update.

    - The bool type and its constants, True and False (PEP 285).

    - PyMalloc was greatly enhanced and is enabled by default.

    - Universal newline support (PEP 278).

    - PEP 263  Defining Python Source Code Encodings        Lemburg

      Implemented (at least phase 1, which is all that's planned for
      2.3).

    - Extended slice notation for all built-in sequences.  The patch
      by Michael Hudson is now all checked in.

    - Speed up list iterations by filling tp_iter and other tweaks.
      See http://www.python.org/sf/560736; also done for xrange and
      tuples.

    - Timeout sockets.  http://www.python.org/sf/555085

    - Stage B0 of the int/long integration (PEP 237).  This means
      issuing a FutureWarning about situations where hex or oct
      conversions or left shifts return a different value for an int
      than for a long with the same value.  The semantics do *not*
      change in Python 2.3; that will happen in Python 2.4.

    - Nuke SET_LINENO from all code objects (providing a different way
      to set debugger breakpoints).  This can boost pystone by >5%.
      http://www.python.org/sf/587993, now checked in.  (Unfortunately
      the pystone boost didn't happen.  What happened?)

    - Write a pymemcompat.h that people can bundle with their
      extensions and then use the 2.3 memory interface with all
      Pythons in the range 1.5.2 to 2.3.  (Michael Hudson checked in
      Misc/pymemcompat.h.)

    - Add a new concept, "pending deprecation", with associated
      warning PendingDeprecationWarning.  This warning is normally
      suppressed, but can be enabled by a suitable -W option.  Only a
      few things use this at this time.

    - Warn when an extension type's tp_compare returns anything except
      -1, 0 or 1.  http://www.python.org/sf/472523

    - Warn for assignment to None (in various forms).

    - PEP 218  Adding a Built-In Set Object Type            Wilson

      Alex Martelli contributed a new version of Greg Wilson's
      prototype, and I've reworked that quite a bit.  It's in the
      standard library now as the module "sets", although some details
      may still change until the first beta release.  (There are no
      plans to make this a built-in type, for now.)

    - PEP 293  Codec error handling callbacks               Dörwald

      Fully implemented.  Error handling in unicode.encode or
      str.decode can now be customized.

    - PEP 282  A Logging System                             Mick

      Vinay Sajip's implementation has been packagized and imported.
      (Documentation and unit tests still pending.)
      http://www.python.org/sf/578494

    - A modified MRO (Method Resolution Order) algorithm.  Consensus
      is that we should adopt C3.  Samuele Pedroni has contributed a
      draft implementation in C, see http://www.python.org/sf/619475
      This has now been checked in.

    - A new command line option parser.  Greg Ward's Optik package
      (http://optik.sf.net) has been adopted, converted to a single
      module named optparse.  See also
      http://www.python.org/sigs/getopt-sig/

    - A standard datetime type.  This started as a wiki:
      http://www.zope.org/Members/fdrake/DateTimeWiki/FrontPage .  A
      prototype was coded in nondist/sandbox/datetime/.  Tim Peters
      has finished the C implementation and checked it in.

    - PEP 273  Import Modules from Zip Archives             Ahlstrom

      Implemented as a part of the PEP 302 implementation work.

    - PEP 302  New Import Hooks                             JvR

      Implemented (though the 2.3a1 release contained some bugs that
      have been fixed post-release).

    - A new pickling protocol.  See PEP 307.

    - PEP 305 (CSV File API, by Skip Montanaro et al.) is in; this is
      the csv module.

    - Raymond Hettinger's itertools module is in.

    - PEP 311 (Simplified GIL Acquisition for Extensions, by Mark
      Hammond) has been included in beta 1.

    - Two new PyArg_Parse*() format codes, 'k' returns an unsigned C
      long int that receives the lower LONG_BIT bits of the Python
      argument, truncating without range checking. 'K' returns an
      unsigned C long long int that receives the lower LONG_LONG_BIT
      bits, truncating without range checking.  (SF 595026; Thomas
      Heller did this work.)

    - A new version of IDLE was imported from the IDLEfork project
      (http://idlefork.sf.net).  The code now lives in the idlelib
      package in the standard library and the idle script is installed
      by setup.py.
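    Several of the items above can be exercised directly; the sketch
    below uses their modern stdlib forms (optparse, datetime, csv,
    itertools, and pickle protocol 2):

```python
import csv
import io
import pickle
from datetime import date, timedelta
from itertools import chain, count, islice
from optparse import OptionParser   # adopted from Greg Ward's Optik

# optparse: the adopted Optik command line interface.
parser = OptionParser()
parser.add_option("-v", "--verbose", action="store_true", default=False)
opts, args = parser.parse_args(["-v", "spam"])
assert opts.verbose and args == ["spam"]

# datetime: the new standard date/time types.
assert date(2003, 7, 29) - date(2003, 4, 25) == timedelta(days=95)

# csv (PEP 305): round-trip a row through the csv module.
buf = io.StringIO()
csv.writer(buf).writerow(["305", "CSV File API"])
buf.seek(0)
assert next(csv.reader(buf)) == ["305", "CSV File API"]

# itertools: lazy iterator building blocks.
assert list(islice(count(10), 3)) == [10, 11, 12]
assert list(chain("ab", "cd")) == ["a", "b", "c", "d"]

# pickle (PEP 307): protocol-2 pickles begin with the PROTO 2 opcode.
data = pickle.dumps({"final": True}, protocol=2)
assert data[:2] == b"\x80\x02"
assert pickle.loads(data) == {"final": True}
```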

Planned features for 2.3

    Too late for anything more to get done here.


Ongoing tasks

    The following are ongoing TO-DO items which we should attempt to
    work on without hoping for completion by any particular date.

    - Documentation: complete the distribution and installation
      manuals.

    - Documentation: complete the documentation for new-style
      classes.

    - Look over the Demos/ directory and update where required (Andrew
      Kuchling has done a lot of this)

    - New tests.

    - Fix doc bugs on SF.

    - Remove use of deprecated features in the core.

    - Document deprecated features appropriately.

    - Mark deprecated C APIs with Py_DEPRECATED.

    - Deprecate modules which are unmaintained, or perhaps make a new
      category, 'Unmaintained', for such modules.

    - In general, lots of cleanup so it is easier to move forward.


Open issues

    There are some issues that may need more work and/or thought
    before the final release (and preferably before the first beta
    release):  No issues remaining.


Features that did not make it into Python 2.3

    - The import lock could use some redesign.  (SF 683658.)

    - Set API issues; is the sets module perfect?

      I expect it's good enough to stop polishing it until we've had
      more widespread user experience.

    - A nicer API to open text files, replacing the ugly (in some
      people's eyes) "U" mode flag.  There's a proposal out there to
      have a new built-in type textfile(filename, mode, encoding).
      (Shouldn't it have a bufsize argument too?)

      Ditto.

    - New widgets for Tkinter???

      Has anyone gotten the time for this?  *Are* there any new
      widgets in Tk 8.4?  Note that we've got better Tix support
      already (though not on Windows yet).

    - Fredrik Lundh's basetime proposal:
      http://effbot.org/ideas/time-type.htm

      I believe this is dead now.

    - PEP 304 (Controlling Generation of Bytecode Files by Montanaro)
      seems to have lost steam.

    - For a class defined inside another class, the __name__ should be
      "outer.inner", and pickling should work.  (SF 633930.  I'm no
      longer certain this is easy or even right.)

    - reST is going to be used a lot in Zope3.  Maybe it could become
      a standard library module?  (Since reST's author thinks it's too
      unstable, I'm inclined not to do this.)

    - Decide on a clearer deprecation policy (especially for modules)
      and act on it.  For a start, see this message from Neal Norwitz:
      http://mail.python.org/pipermail/python-dev/2002-April/023165.html
      There seems insufficient interest in moving this further in an
      organized fashion, and it's not particularly important.

    - Provide alternatives for common uses of the types module;
      Skip Montanaro has posted a proto-PEP for this idea:
      http://mail.python.org/pipermail/python-dev/2002-May/024346.html
      There hasn't been any progress on this, AFAICT.

    - Use pending deprecation for the types and string modules.  This
      requires providing alternatives for the parts that aren't
      covered yet (e.g. string.whitespace and types.TracebackType).
      It seems we can't get consensus on this.

    - Deprecate the buffer object.
      http://mail.python.org/pipermail/python-dev/2002-July/026388.html
      http://mail.python.org/pipermail/python-dev/2002-July/026408.html
      It seems that this is never going to be resolved.

    - PEP 269  Pgen Module for Python                       Riehl

      (Some necessary changes are in; the pgen module itself needs to
      mature more.)

    - Add support for the long-awaited Python catalog.  Kapil
      Thangavelu has a Zope-based implementation that he demoed at
      OSCON 2002.  Now all we need is a place to host it and a person
      to champion it.  (Some changes to distutils to support this are
      in, at least.)

    - PEP 266  Optimizing Global Variable/Attribute Access  Montanaro
      PEP 267  Optimized Access to Module Namespaces        Hylton
      PEP 280  Optimizing access to globals                 van Rossum

      These are basically three friendly competing proposals.  Jeremy
      has made a little progress with a new compiler, but it's going
      slow and the compiler is only the first step.  Maybe we'll be
      able to refactor the compiler in this release.  I'm tempted to
      say we won't hold our breath.  In the mean time, Oren Tirosh has
      a much simpler idea that may give a serious boost to the
      performance of accessing globals and built-ins, by optimizing
      and inlining the dict access:
      http://tothink.com/python/fastnames/

    - Lazily tracking tuples?
      http://mail.python.org/pipermail/python-dev/2002-May/023926.html
      http://www.python.org/sf/558745
      Not much enthusiasm I believe.

    - PEP 286  Enhanced Argument Tuples                     von Loewis

      I haven't had the time to review this thoroughly.  It seems a
      deep optimization hack (also makes better correctness guarantees
      though).

    - Make 'as' a keyword.  It has been a pseudo-keyword long enough.
      Too much effort to bother.


Copyright

    This document has been placed in the public domain.



pep-0284 Integer for-loops

PEP: 284
Title: Integer for-loops
Version: $Revision$
Last-Modified: $Date$
Author: David Eppstein <eppstein at ics.uci.edu>, Greg Ewing <greg.ewing at canterbury.ac.nz>
Status: Rejected
Type: Standards Track
Created: 1-Mar-2002
Python-Version: 2.3
Post-History: 

Abstract

    This PEP proposes to simplify iteration over intervals of
    integers, by extending the range of expressions allowed after a
    "for" keyword to allow three-way comparisons such as

        for lower <= var < upper:

    in place of the current

        for item in list:

    syntax.  The resulting loop or list iteration will loop over all
    values of var that make the comparison true, starting from the
    left endpoint of the given interval.

Pronouncement

    This PEP is rejected.  There were a number of fixable issues with
    the proposal (see the fixups listed in Raymond Hettinger's
    python-dev post on 18 June 2005).  However, even with the fixups the
    proposal did not garner support.  Specifically, Guido did not buy
    the premise that the range() format needed fixing, "The whole point
    (15 years ago) of range() was to *avoid* needing syntax to specify a
    loop over numbers. I think it's worked out well and there's nothing
    that needs to be fixed (except range() needs to become an iterator,
    which it will in Python 3.0)."

Rationale

    One of the most common uses of for-loops in Python is to iterate
    over an interval of integers.  Python provides functions range()
    and xrange() to generate lists and iterators for such intervals,
    which work best for the most frequent case: half-open intervals
    increasing from zero.  However, the range() syntax is more awkward
    for open or closed intervals, and lacks symmetry when reversing
    the order of iteration.  In addition, the call to an unfamiliar
    function makes it difficult for newcomers to Python to understand
    code that uses range() or xrange().

    The perceived lack of a natural, intuitive integer iteration
    syntax has led to heated debate on python-list, and spawned at
    least four PEPs before this one.  PEP 204 [1] (rejected) proposed
    to re-use Python's slice syntax for integer ranges, leading to a
    terser syntax but not solving the readability problem of
    multi-argument range().  PEP 212 [2] (deferred) proposed several
    syntaxes for directly converting a list to a sequence of integer
    indices, in place of the current idiom

        range(len(list))

    for such conversion, and PEP 281 [3] proposes to simplify the same
    idiom by allowing it to be written as

        range(list).

    PEP 276 [4] proposes to allow automatic conversion of integers to
    iterators, simplifying the most common half-open case but not
    addressing the complexities of other types of interval.
    Additional alternatives have been discussed on python-list.

    The solution described here is to allow a three-way comparison
    after a "for" keyword, both in the context of a for-loop and of a
    list comprehension:

        for lower <= var < upper:

    This would cause iteration over an interval of consecutive
    integers, beginning at the left bound in the comparison and ending
    at the right bound.  The exact comparison operations used would
    determine whether the interval is open or closed at either end and
    whether the integers are considered in ascending or descending
    order.

    This syntax closely matches standard mathematical notation, so is
    likely to be more familiar to Python novices than the current
    range() syntax.  Open and closed interval endpoints are equally
    easy to express, and the reversal of an integer interval can be
    formed simply by swapping the two endpoints and reversing the
    comparisons.  In addition, the semantics of such a loop would
    closely resemble one way of interpreting the existing Python
    for-loops:

        for item in list

    iterates over exactly those values of item that cause the
    expression

        item in list

    to be true.  Similarly, the new format

        for lower <= var < upper:

    would iterate over exactly those integer values of var that cause
    the expression

        lower <= var < upper

    to be true.


Specification

    We propose to extend the syntax of a for statement, currently

        for_stmt: "for" target_list "in" expression_list ":" suite
                  ["else" ":" suite]

    as described below:

        for_stmt: "for" for_test ":" suite ["else" ":" suite]
        for_test: target_list "in" expression_list |
                  or_expr less_comp or_expr less_comp or_expr |
                  or_expr greater_comp or_expr greater_comp or_expr
        less_comp: "<" | "<="
        greater_comp: ">" | ">="

    Similarly, we propose to extend the syntax of list comprehensions,
    currently

        list_for: "for" expression_list "in" testlist [list_iter]

    by replacing it with:

        list_for: "for" for_test [list_iter]

    In all cases the expression formed by for_test would be subject to
    the same precedence rules as comparisons in expressions.  The two
    comparison operators in a for_test must both be of the same kind
    (both less_comp or both greater_comp), unlike chained comparisons
    in expressions, which carry no such restriction.

    We refer to the two or_expr's occurring on the left and right
    sides of the for-loop syntax as the bounds of the loop, and the
    middle or_expr as the variable of the loop.  When a for-loop using
    the new syntax is executed, the expressions for both bounds will
    be evaluated, and an iterator object created that iterates through
    all integers between the two bounds according to the comparison
    operations used.  The iterator will begin with an integer equal or
    near to the left bound, and then step through the remaining
    integers with a step size of +1 or -1 if the comparison operation
    is in the set described by less_comp or greater_comp respectively.
    The execution will then proceed as if the expression had been

        for variable in iterator

    where "variable" refers to the variable of the loop and "iterator"
    refers to the iterator created for the given integer interval.

    The values taken by the loop variable in an integer for-loop may
    be either plain integers or long integers, according to the
    magnitude of the bounds.  Both bounds of an integer for-loop must
    evaluate to a real numeric type (integer, long, or float).  Any
    other value will cause the for-loop statement to raise a TypeError
    exception.
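    The proposed semantics can be sketched with the existing range()
    built-in; interval_iter below is a hypothetical helper covering
    only the ascending (less_comp) case, not part of the proposal:

```python
def interval_iter(lower, lower_strict, upper, upper_strict):
    # Hypothetical helper: mimic 'for lower <(=) var <(=) upper:' for
    # the ascending (less_comp) case using the existing range().
    start = lower + 1 if lower_strict else lower   # '<' excludes lower
    stop = upper if upper_strict else upper + 1    # '<' excludes upper
    return range(start, stop)

# 'for 1 <= x < 5:' would bind x to 1, 2, 3, 4
assert list(interval_iter(1, False, 5, True)) == [1, 2, 3, 4]
# 'for 0 < x <= 3:' would bind x to 1, 2, 3
assert list(interval_iter(0, True, 3, False)) == [1, 2, 3]
# An empty interval loops zero times, just as range() does:
assert list(interval_iter(5, False, 5, True)) == []
```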


Issues

    The following issues were raised in discussion of this and related
    proposals on the Python list.

    - Should the right bound be evaluated once, or every time through
      the loop?  Clearly, it only makes sense to evaluate the left
      bound once.  For reasons of consistency and efficiency, we have
      chosen the same convention for the right bound.

    - Although the new syntax considerably simplifies integer
      for-loops, list comprehensions using the new syntax are not as
      simple.  We feel that this is appropriate since for-loops are
      more frequent than comprehensions.

    - The proposal does not allow access to integer iterator objects
      such as would be created by xrange.  True, but we see this as a
      shortcoming in the general list-comprehension syntax, beyond the
      scope of this proposal.  In addition, xrange() will still be
      available.

    - The proposal does not allow increments other than 1 and -1.
      More general arithmetic progressions would need to be created by
      range() or xrange(), or by a list comprehension syntax such as

        [2*x for 0 <= x <= 100]

    - The position of the loop variable in the middle of a three-way
      comparison is not as apparent as the variable in the present

        for item in list

      syntax, leading to a possible loss of readability.  We feel that
      this loss is outweighed by the increase in readability from a
      natural integer iteration syntax.

    - To some extent, this PEP addresses the same issues as PEP 276
      [4].  We feel that the two PEPs are not in conflict since PEP
      276 is primarily concerned with half-open ranges starting in 0
      (the easy case of range()) while this PEP is primarily concerned
      with simplifying all other cases.  However, if this PEP is
      approved, its new simpler syntax for integer loops could to some
      extent reduce the motivation for PEP 276.

    - It is not clear whether it makes sense to allow floating point
      bounds for an integer loop: if a float represents an inexact
      value, how can it be used to determine an exact sequence of
      integers?  On the other hand, disallowing float bounds would
      make it difficult to use floor() and ceiling() in integer
      for-loops, as it is difficult to use them now with range().  We
      have erred on the side of flexibility, but this may lead to some
      implementation difficulties in determining the smallest and
      largest integer values that would cause a given comparison to be
      true.

    - Should types other than int, long, and float be allowed as
      bounds?  Another choice would be to convert all bounds to
      integers by int(), and allow as bounds anything that can be so
      converted instead of just floats.  However, this would change
      the semantics: 0.3 <= x is not the same as int(0.3) <= x, and it
      would be confusing for a loop with 0.3 as lower bound to start
      at zero.  Also, in general int(f) can be very far from f.


Implementation

    An implementation is not available at this time.  Implementation
    is not expected to pose any great difficulties: the new syntax
    could, if necessary, be recognized by parsing a general expression
    after each "for" keyword and testing whether the top level
    operation of the expression is "in" or a three-way comparison.
    The Python compiler would convert any instance of the new syntax
    into a loop over the items in a special iterator object.


References

    [1] PEP 204, Range Literals
        http://www.python.org/dev/peps/pep-0204/

    [2] PEP 212, Loop Counter Iteration
        http://www.python.org/dev/peps/pep-0212/

    [3] PEP 281, Loop Counter Iteration with range and xrange
        http://www.python.org/dev/peps/pep-0281/

    [4] PEP 276, Simple Iterator for ints
        http://www.python.org/dev/peps/pep-0276/


Copyright

    This document has been placed in the public domain.



pep-0285 Adding a bool type

PEP: 285
Title: Adding a bool type
Version: $Revision$
Last-Modified: $Date$
Author: Guido van Rossum <guido at python.org>
Status: Final
Type: Standards Track
Created: 8-Mar-2002
Python-Version: 2.3
Post-History: 8-Mar-2002, 30-Mar-2002, 3-Apr-2002

Abstract

    This PEP proposes the introduction of a new built-in type, bool,
    with two constants, False and True.  The bool type would be a
    straightforward subtype (in C) of the int type, and the values
    False and True would behave like 0 and 1 in most respects (for
    example, False==0 and True==1 would be true) except repr() and
    str().  All built-in operations that conceptually return a Boolean
    result will be changed to return False or True instead of 0 or 1;
    for example, comparisons, the "not" operator, and predicates like
    isinstance().


Review

    I've collected enough feedback to last me a lifetime, so I declare
    the review period officially OVER.  I had Chinese food today; my
    fortune cookie said "Strong and bitter words indicate a weak
    cause."  It reminded me of some of the posts against this
    PEP... :-)

    Anyway, here are my BDFL pronouncements.  (Executive summary: I'm
    not changing a thing; all variants are rejected.)

    1) Should this PEP be accepted?

    => Yes.

       There have been many arguments against the PEP.  Many of them
       were based on misunderstandings.  I've tried to clarify some of
       the most common misunderstandings below in the main text of the
       PEP.  The only issue that weighs at all for me is the tendency
       of newbies to write "if x == True" where "if x" would suffice.
       More about that below too.  I think this is not a sufficient
       reason to reject the PEP.

    2) Should str(True) return "True" or "1"?  "1" might reduce
       backwards compatibility problems, but looks strange.
       (repr(True) would always return "True".)

    => "True".

       Almost all reviewers agree with this.

    3) Should the constants be called 'True' and 'False' (similar to
       None) or 'true' and 'false' (as in C++, Java and C99)?

    => True and False.

       Most reviewers agree that consistency within Python is more
       important than consistency with other languages.

    4) Should we strive to eliminate non-Boolean operations on bools
       in the future, through suitable warnings, so that for example
       True+1 would eventually (in Python 3000) be illegal?

    => No.

       There's a small but vocal minority that would prefer to see
       "textbook" bools that don't support arithmetic operations at
       all, but most reviewers agree with me that bools should always
       allow arithmetic operations.

    5) Should operator.truth(x) return an int or a bool?

    => bool.

       Tim Peters believes it should return an int, but almost all
       other reviewers agree that it should return a bool.  My
       rationale: operator.truth() exists to force a Boolean context
       on its argument (it calls the C API PyObject_IsTrue()).
       Whether the outcome is reported as int or bool is secondary; if
       bool exists there's no reason not to use it.  (Under the PEP,
       operator.truth() now becomes an alias for bool(); that's fine.)

    6) Should bool inherit from int?

    => Yes.

       In an ideal world, bool might be better implemented as a
       separate integer type that knows how to perform mixed-mode
       arithmetic.  However, inheriting bool from int eases the
       implementation enormously (in part since all C code that calls
       PyInt_Check() will continue to work -- this returns true for
       subclasses of int).  Also, I believe this is right in terms of
       substitutability: code that requires an int can be fed a bool
       and it will behave the same as 0 or 1.  Code that requires a
       bool may not work when it is given an int; for example, 3 & 4
       is 0, but both 3 and 4 are true when considered as truth
       values.

    7) Should the name 'bool' be changed?

    => No.

       Some reviewers have argued for boolean instead of bool, because
       this would be easier to understand (novices may have heard of
       Boolean algebra but may not make the connection with bool) or
       because they hate abbreviations.  My take: Python uses
       abbreviations judiciously (like 'def', 'int', 'dict') and I
       don't think these are a burden to understanding.  To a newbie,
       it doesn't matter whether it's called a waffle or a bool; it's
       a new word, and they learn quickly what it means.

       One reviewer has argued to make the name 'truth'.  I find this
       an unattractive name, and would actually prefer to reserve this
       term (in documentation) for the more abstract concept of truth
       values that already exists in Python.  For example: "when a
       container is interpreted as a truth value, an empty container
       is considered false and a non-empty one is considered true."

    8) Should we strive to require that Boolean operations (like "if",
       "and", "not") have a bool as an argument in the future, so that
       for example "if []:" would become illegal and would have to be
       written as "if bool([]):" ???

    => No!!!

       Some people believe that this is how a language with a textbook
       Boolean type should behave.  Because it was brought up, others
       have worried that I might agree with this position.  Let me
       make my position on this quite clear.  This is not part of the
       PEP's motivation and I don't intend to make this change.  (See
       also the section "Clarification" below.)


Rationale

    Most languages eventually grow a Boolean type; even C99 (the new
    and improved C standard, not yet widely adopted) has one.

    Many programmers apparently feel the need for a Boolean type; most
    Python documentation contains a bit of an apology for the absence
    of a Boolean type.  I've seen lots of modules that defined
    constants "False=0" and "True=1" (or similar) at the top and used
    those.  The problem with this is that everybody does it
    differently.  For example, should you use "FALSE", "false",
    "False", "F" or even "f"?  And should false be the value zero or
    None, or perhaps a truth value of a different type that will print
    as "true" or "false"?  Adding a standard bool type to the language
    resolves those issues.

    Some external libraries (like databases and RPC packages) need to
    be able to distinguish between Boolean and integral values, and
    while it's usually possible to craft a solution, it would be
    easier if the language offered a standard Boolean type.  This also
    applies to Jython: some Java classes have separately overloaded
    methods or constructors for int and boolean arguments.  The bool
    type can be used to select the boolean variant.  (The same is
    apparently the case for some COM interfaces.)

    The standard bool type can also serve as a way to force a value to
    be interpreted as a Boolean, which can be used to normalize
    Boolean values.  When a Boolean value needs to be normalized to
    one of two values, bool(x) is much clearer than "not not x" and
    much more concise than

        if x:
            return 1
        else:
            return 0
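    A quick check that the three spellings agree on common truth
    values, bool() simply being the clearest of them:

```python
def verbose_norm(x):
    # The four-line normalization idiom quoted above.
    if x:
        return 1
    else:
        return 0

for value in (0, 1, [], [1], "", "spam", None, {}):
    assert bool(value) == (not not value) == verbose_norm(value)
```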

    Here are some arguments derived from teaching Python.  When
    showing people comparison operators etc. in the interactive shell,
    I think this is a bit ugly:

        >>> a = 13
        >>> b = 12
        >>> a > b
        1
        >>>

    If this was:

        >>> a > b
        True
        >>>

    it would require a millisecond less thinking each time a 0 or 1
    was printed.

    There's also the issue (which I've seen baffling even experienced
    Pythonistas who had been away from the language for a while) that
    if you see:

        >>> cmp(a, b)
        1
        >>> cmp(a, a)
        0
        >>> 

    you might be tempted to believe that cmp() also returned a truth
    value, whereas in reality it can return three different values
    (-1, 0, 1).  If ints were not (normally) used to represent
    Boolean results, this would stand out much more clearly as
    something completely different.


Specification

    The following Python code specifies most of the properties of the
    new type:

        class bool(int):

            def __new__(cls, val=0):
                # This constructor always returns an existing instance
                if val:
                    return True
                else:
                    return False

            def __repr__(self):
                if self:
                    return "True"
                else:
                    return "False"

            __str__ = __repr__

            def __and__(self, other):
                if isinstance(other, bool):
                    return bool(int(self) & int(other))
                else:
                    return int.__and__(self, other)

            __rand__ = __and__

            def __or__(self, other):
                if isinstance(other, bool):
                    return bool(int(self) | int(other))
                else:
                    return int.__or__(self, other)

            __ror__ = __or__

            def __xor__(self, other):
                if isinstance(other, bool):
                    return bool(int(self) ^ int(other))
                else:
                    return int.__xor__(self, other)

            __rxor__ = __xor__

        # Bootstrap truth values through sheer willpower
        False = int.__new__(bool, 0)
        True = int.__new__(bool, 1)

    The values False and True will be singletons, like None.  Because
    the type has two values, perhaps these should be called
    "doubletons"?  The real implementation will not allow other
    instances of bool to be created.
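    A minimal interactive sketch (run under a present-day Python, where
    bool behaves as specified here) shows the singleton property and the
    restriction on creating other instances:

```python
# bool() always returns one of the two existing instances,
# so identity tests with "is" are reliable.
assert bool(0) is False
assert bool(42) is True
assert bool([]) is False
assert bool("x") is True

# The real implementation also blocks subclassing, which keeps
# False and True the only instances of bool.
subclassing_blocked = False
try:
    class Flag(bool):
        pass
except TypeError:
    subclassing_blocked = True
assert subclassing_blocked
```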

    True and False will properly round-trip through pickling and
    marshalling; for example pickle.loads(pickle.dumps(True)) will
    return True, and so will marshal.loads(marshal.dumps(True)).

    All built-in operations that are defined to return a Boolean
    result will be changed to return False or True instead of 0 or 1.
    In particular, this affects comparisons (<, <=, ==, !=, >, >=, is,
    is not, in, not in), the unary operator 'not', the built-in
    functions callable(), hasattr(), isinstance() and issubclass(),
    the dict method has_key(), the string and unicode methods
    endswith(), isalnum(), isalpha(), isdigit(), islower(), isspace(),
    istitle(), isupper(), and startswith(), the unicode methods
    isdecimal() and isnumeric(), and the 'closed' attribute of file
    objects.  The predicates in the operator module are also changed
    to return a bool, including operator.truth().

    Because bool inherits from int, True+1 is valid and equals 2, and
    so on.  This is important for backwards compatibility: because
    comparisons and so on currently return integer values, there's no
    way of telling what uses existing applications make of these
    values.
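    The compatibility behaviour described above can be demonstrated
    directly:

```python
# bool is a subclass of int, so arithmetic and indexing keep working.
assert isinstance(True, int)
assert True + 1 == 2
assert False * 10 == 0
assert sum([True, True, False]) == 2   # counting with bools
assert ["no", "yes"][True] == "yes"    # bool as a sequence index
```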

    It is expected that over time, the standard library will be
    updated to use False and True when appropriate (but not to require
    a bool argument type where previously an int was allowed).  This
    change should not pose additional problems and is not specified in
    detail by this PEP.


C API

    The header file "boolobject.h" defines the C API for the bool
    type.  It is included by "Python.h" so there is no need to include
    it directly.

    The existing names Py_False and Py_True reference the unique bool
    objects False and True (previously these referenced static int
    objects with values 0 and 1, which were not unique amongst int
    values).

    A new API, PyObject *PyBool_FromLong(long), takes a C long int
    argument and returns a new reference to either Py_False (when the
    argument is zero) or Py_True (when it is nonzero).

    To check whether an object is a bool, the macro PyBool_Check() can
    be used.

    The type of bool instances is PyBoolObject *.

    The bool type object is available as PyBool_Type.


Clarification

    This PEP does *not* change the fact that almost all object types
    can be used as truth values.  For example, when used in an if
    statement, an empty list is false and a non-empty one is true;
    this does not change and there is no plan to ever change this.

    The only thing that changes is the preferred values to represent
    truth values when returned or assigned explicitly.  Previously,
    these preferred truth values were 0 and 1; the PEP changes the
    preferred values to False and True, and changes built-in
    operations to return these preferred values.


Compatibility

    Because of backwards compatibility, the bool type lacks many
    properties that some would like to see.  For example, arithmetic
    operations with one or two bool arguments are allowed, treating
    False as 0 and True as 1.  Also, a bool may be used as a sequence
    index.

    I don't see this as a problem, and I don't want to evolve the
    language in this direction either.  I don't believe that a
    stricter interpretation of "Booleanness" makes the language any
    clearer.

    Another consequence of the compatibility requirement is that the
    expression "True and 6" has the value 6, and similarly the
    expression "False or None" has the value None.  The "and" and "or"
    operators are usefully defined to return the first argument that
    determines the outcome, and this won't change; in particular, they
    don't force the outcome to be a bool.  Of course, if both
    arguments are bools, the outcome is always a bool.  It can also
    easily be coerced into being a bool by writing for example "bool(x
    and y)".
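    A short sketch of the short-circuit behaviour and the bool()
    coercion mentioned above:

```python
# "and"/"or" return the operand that determines the outcome,
# without coercing it to a bool.
assert (True and 6) == 6
assert (False or None) is None
assert (0 or "fallback") == "fallback"

# Wrap the expression in bool() when a genuine truth value is wanted.
assert bool(True and 6) is True
assert bool(False or None) is False
```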


Resolved Issues

    (See also the Review section above.)

    - Because the repr() or str() of a bool value is different from
      that of an int value, some code (for example doctest-based unit
      tests, and possibly database code that relies on things like
      "%s" % truth)
      may fail.  It is easy to work around this (without explicitly
      referencing the bool type), and it is expected that this only
      affects a very small amount of code that can easily be fixed.
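    The breakage, and the easy workaround that avoids naming the bool
    type, look like this (a sketch in present-day Python):

```python
# str()/repr() of a bool differ from those of the equivalent int,
# which is what can break doctests and "%s" formatting.
assert str(True) == "True"
assert str(1) == "1"
assert "%s" % (1 < 2) == "True"

# Workaround: force an int first, without referencing bool at all.
assert "%s" % int(1 < 2) == "1"

# Numeric format codes still treat bools as ints.
assert "%d" % True == "1"
```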

    - Other languages (C99, C++, Java) name the constants "false" and
      "true", in all lowercase.  For Python, I prefer to stick with
      the example set by the existing built-in constants, which all
      use CapitalizedWords: None, Ellipsis, NotImplemented (as well as
      all built-in exceptions).  Python's built-in namespace uses all
      lowercase for functions and types only.

    - It has been suggested that, in order to satisfy user
      expectations, for every x that is considered true in a Boolean
      context, the expression x == True should be true, and likewise
      if x is considered false, x == False should be true.  In
      particular newbies who have only just learned about Boolean
      variables are likely to write

          if x == True: ...

      instead of the correct form,

          if x: ...

      There seem to be strong psychological and linguistic reasons why
      many people are at first uncomfortable with the latter form, but
      I believe that the solution should be in education rather than
      in crippling the language.  After all, == is generally seen as a
      transitive operator, meaning that from a==b and b==c we can
      deduce a==c.  But if any comparison to True were to report
      equality when the other operand was a true value of any type,
      atrocities like 6==True==7 would hold true, from which one could
      infer the falsehood 6==7.  That's unacceptable.  (In addition,
      it would break backwards compatibility.  But even if it didn't,
      I'd still be against this, for the stated reasons.)

      Newbies should also be reminded that there's never a reason to
      write

          if bool(x): ...

      since the bool is implicit in the "if".  Explicit is *not*
      better than implicit here, since the added verbiage impairs
      readability and there's no other interpretation possible.  There
      is, however, sometimes a reason to write

          b = bool(x)

      This is useful when it is unattractive to keep a reference to an
      arbitrary object x, or when normalization is required for some
      other reason.  It is also sometimes appropriate to write

          i = int(bool(x))

      which converts the bool to an int with the value 0 or 1.  This
      conveys the intention to henceforth use the value as an int.
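    Both normalization idioms can be sketched as follows:

```python
# bool(x) normalizes an arbitrary object to one of two values,
# keeping no reference to x itself.
big = list(range(1000))
b = bool(big)
assert b is True

# int(bool(x)) goes one step further, yielding 0 or 1 for use as an int.
assert int(bool("")) == 0
assert int(bool("abc")) == 1
```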


Implementation

    A complete implementation in C has been uploaded to the
    SourceForge patch manager:

        http://python.org/sf/528022

    This will soon be checked into CVS for Python 2.3a0.


Copyright

    This document has been placed in the public domain.



pep-0286 Enhanced Argument Tuples

PEP: 286
Title: Enhanced Argument Tuples
Version: $Revision$
Last-Modified: $Date$
Author: Martin von Löwis <martin at v.loewis.de>
Status: Deferred
Type: Standards Track
Created: 3-Mar-2002
Python-Version: 2.3
Post-History: 

Abstract

    PyArg_ParseTuple is confronted with difficult memory management if
    an argument converter creates new memory.  To deal with these
    cases, a specialized argument type is proposed.

PEP Deferral

    Further exploration of the concepts covered in this PEP has been deferred
    for lack of a current champion interested in promoting the goals of the
    PEP and collecting and incorporating feedback, and with sufficient
    available time to do so effectively.

    The resolution of this PEP may also be affected by the resolution of
    PEP 426, which proposes the use of a preprocessing step to generate
    some aspects of C API interface code.

Problem description

    Today, argument tuples keep references to the function arguments,
    which are guaranteed to live as long as the argument tuple exists,
    which is at least as long as the function call is being executed.

    In some cases, parsing an argument will allocate new memory, which
    is then to be released by the caller.  This has two problems:

    1. In case of failure, the application cannot know what memory to
       release; most callers don't even know that they have the
       responsibility to release that memory.  Examples of this are
       the N converter (bug #416288) and the es# converter (bug
       #501716).

    2. Even for successful argument parsing, it is still inconvenient
       for the caller to be responsible for releasing the memory.  In
       some cases, this is unnecessarily inefficient.  For example,
       the es converter copies the conversion result into memory, even
       though there already is a string object that has the right
       contents.


Proposed solution

    A new type 'argument tuple' is introduced.  This type derives from
    tuple, adding an __dict__ member (at tp_dictoffset -4).  Instances
    of this type might get the following attributes:

       - 'failobjects', a list of objects which need to be deallocated
         in case of failure

       - 'okobjects', a list of objects which will be released when the
         argument tuple is released

    To manage this type, the following functions will be added, and
    used appropriately in ceval.c and getargs.c:

       - PyArgTuple_New(int);
       - PyArgTuple_AddFailObject(PyObject*, PyObject*);
       - PyArgTuple_AddFailMemory(PyObject*, void*);
       - PyArgTuple_AddOkObject(PyObject*, PyObject*);
       - PyArgTuple_AddOkMemory(PyObject*, void*);
       - PyArgTuple_ClearFailed(PyObject*);

    When argument parsing fails, all fail objects will be released
    through Py_DECREF, and all fail memory will be released through
    PyMem_Free.  If parsing succeeds, the references to the fail
    objects and fail memory are dropped, without releasing anything.

    When the argument tuple is released, all ok objects and memory
    will be released.

    If those functions are called with an object of a different type,
    a warning is issued and no further action is taken; usage of the
    affected converters without using argument tuples is deprecated.


Affected converters

    The following converters will add fail memory and fail objects: N,
    es, et, es#, et# (unless memory is passed into the converter)


New converters

    To simplify Unicode conversion, the e* converters are duplicated
    as E* converters (Es, Et, Es#, Et#).  The usage of the E*
    converters is identical to that of the e* converters, except that
    the application will not need to manage the resulting memory.
    This will be implemented through registration of Ok objects with
    the argument tuple.  The e* converters are deprecated.


Copyright

    This document has been placed in the public domain.



pep-0287 reStructuredText Docstring Format

PEP:287
Title:reStructuredText Docstring Format
Version:$Revision$
Last-Modified:$Date$
Author:David Goodger <goodger at python.org>
Discussions-To:<doc-sig at python.org>
Status:Active
Type:Informational
Content-Type:text/x-rst
Created:25-Mar-2002
Post-History:02-Apr-2002
Replaces:216

Abstract

When plaintext hasn't been expressive enough for inline documentation, Python programmers have sought out a format for docstrings. This PEP proposes that the reStructuredText markup [5] be adopted as a standard markup format for structured plaintext documentation in Python docstrings, and for PEPs and ancillary documents as well. reStructuredText is a rich and extensible yet easy-to-read, what-you-see-is-what-you-get plaintext markup syntax.

Only the low-level syntax of docstrings is addressed here. This PEP is not concerned with docstring semantics or processing at all (see PEP 256 for a "Road Map to the Docstring PEPs"). Nor is it an attempt to deprecate pure plaintext docstrings, which are always going to be legitimate. The reStructuredText markup is an alternative for those who want more expressive docstrings.

Benefits

Programmers are by nature a lazy breed. We reuse code with functions, classes, modules, and subsystems. Through its docstring syntax, Python allows us to document our code from within. The "holy grail" of the Python Documentation Special Interest Group (Doc-SIG [6]) has been a markup syntax and toolset to allow auto-documentation, where the docstrings of Python systems can be extracted in context and processed into useful, high-quality documentation for multiple purposes.

Document markup languages have three groups of customers: the authors who write the documents, the software systems that process the data, and the readers, who are the final consumers and the most important group. Most markups are designed for the authors and software systems; readers are only meant to see the processed form, either on paper or via browser software. ReStructuredText is different: it is intended to be easily readable in source form, without prior knowledge of the markup. ReStructuredText is entirely readable in plaintext format, and many of the markup forms match common usage (e.g., *emphasis*), so it reads quite naturally. Yet it is rich enough to produce complex documents, and extensible so that there are few limits. Of course, to write reStructuredText documents some prior knowledge is required.

The markup offers functionality and expressivity, while maintaining easy readability in the source text. The processed form (HTML etc.) makes it all accessible to readers: inline live hyperlinks; live links to and from footnotes; automatic tables of contents (with live links!); tables; images for diagrams etc.; pleasant, readable styled text.

The reStructuredText parser is available now, part of the Docutils [24] project. Standalone reStructuredText documents and PEPs can be converted to HTML; other output format writers are being worked on and will become available over time. Work is progressing on a Python source "Reader" which will implement auto-documentation from docstrings. Authors of existing auto-documentation tools are encouraged to integrate the reStructuredText parser into their projects, or better yet, to join forces to produce a world-class toolset for the Python standard library.

Tools will become available in the near future, which will allow programmers to generate HTML for online help, XML for multiple purposes, and eventually PDF, DocBook, and LaTeX for printed documentation, essentially "for free" from the existing docstrings. The adoption of a standard will, at the very least, benefit docstring processing tools by preventing further "reinventing the wheel".

Eventually PyDoc, the one existing standard auto-documentation tool, could have reStructuredText support added. In the interim it will have no problem with reStructuredText markup, since it treats all docstrings as preformatted plaintext.

Goals

These are the generally accepted goals for a docstring format, as discussed in the Doc-SIG:

  1. It must be readable in source form by the casual observer.
  2. It must be easy to type with any standard text editor.
  3. It must not need to contain information which can be deduced from parsing the module.
  4. It must contain sufficient information (structure) so it can be converted to any reasonable markup format.
  5. It must be possible to write a module's entire documentation in docstrings, without feeling hampered by the markup language.

reStructuredText meets and exceeds all of these goals, and sets its own goals as well, even more stringent. See Docstring-Significant Features below.

The goals of this PEP are as follows:

  1. To establish reStructuredText as a standard structured plaintext format for docstrings (inline documentation of Python modules and packages), PEPs, README-type files and other standalone documents. "Accepted" status will be sought through Python community consensus and eventual BDFL pronouncement.

    Please note that reStructuredText is being proposed as a standard, not the only standard. Its use will be entirely optional. Those who don't want to use it need not.

  2. To solicit and address any related concerns raised by the Python community.

  3. To encourage community support. As long as multiple competing markups are out there, the development community remains fractured. Once a standard exists, people will start to use it, and momentum will inevitably gather.

  4. To consolidate efforts from related auto-documentation projects. It is hoped that interested developers will join forces and work on a joint/merged/common implementation.

Once reStructuredText is a Python standard, effort can be focused on tools instead of arguing for a standard. Python needs a standard set of documentation tools.

With regard to PEPs, one or both of the following strategies may be applied:

  1. Keep the existing PEP section structure constructs (one-line section headers, indented body text). Subsections can either be forbidden, or supported with reStructuredText-style underlined headers in the indented body text.
  2. Replace the PEP section structure constructs with the reStructuredText syntax. Section headers will require underlines, subsections will be supported out of the box, and body text need not be indented (except for block quotes).

Strategy 2 is recommended, and its implementation is complete.

Support for RFC 2822 headers has been added to the reStructuredText parser for PEPs (unambiguous given a specific context: the first contiguous block of the document). It may be desired to concretely specify what over/underline styles are allowed for PEP section headers, for uniformity.

Rationale

The lack of a standard syntax for docstrings has hampered the development of standard tools for extracting and converting docstrings into documentation in standard formats (e.g., HTML, DocBook, TeX). There have been a number of proposed markup formats and variations, and many tools tied to these proposals, but without a standard docstring format they have failed to gain a strong following and/or floundered half-finished.

Throughout the existence of the Doc-SIG, consensus on a single standard docstring format has never been reached. A lightweight, implicit markup has been sought, for the following reasons (among others):

  1. Docstrings written within Python code are available from within the interactive interpreter, and can be "print"ed. Thus the use of plaintext for easy readability.
  2. Programmers want to add structure to their docstrings, without sacrificing raw docstring readability. Unadorned plaintext cannot be transformed ("up-translated") into useful structured formats.
  3. Explicit markup (like XML or TeX) is widely considered unreadable by the uninitiated.
  4. Implicit markup is aesthetically compatible with the clean and minimalist Python syntax.

Many alternative markups for docstrings have been proposed on the Doc-SIG over the years; a representative sample is listed below. Each is briefly analyzed in terms of the goals stated above. Please note that this is not intended to be an exclusive list of all existing markup systems; there are many other markups (Texinfo, Doxygen, TIM, YODL, AFT, ...) which are not mentioned.

  • XML [7], SGML [8], DocBook [9], HTML [10], XHTML [11]

    XML and SGML are explicit, well-formed meta-languages suitable for all kinds of documentation. XML is a variant of SGML. They are best used behind the scenes, because to untrained eyes they are verbose, difficult to type, and too cluttered to read comfortably as source. DocBook, HTML, and XHTML are all applications of SGML and/or XML, and all share the same basic syntax and the same shortcomings.

  • TeX [12]

    TeX is similar to XML/SGML in that it's explicit, but not very easy to write, and not easy for the uninitiated to read.

  • Perl POD [13]

    Most Perl modules are documented in a format called POD (Plain Old Documentation). This is an easy-to-type, very low level format with strong integration with the Perl parser. Many tools exist to turn POD documentation into other formats: info, HTML and man pages, among others. However, the POD syntax takes after Perl itself in terms of readability.

  • JavaDoc [14]

    Special comments before Java classes and functions serve to document the code. A program to extract these, and turn them into HTML documentation is called javadoc, and is part of the standard Java distribution. However, JavaDoc has a very intimate relationship with HTML, using HTML tags for most markup. Thus it shares the readability problems of HTML.

  • Setext [15], StructuredText [16]

    Early on, variants of Setext (Structure Enhanced Text), including Zope Corp's StructuredText, were proposed for Python docstring formatting. Hereafter these variants will collectively be called "STexts". STexts have the advantage of being easy to read without special knowledge, and relatively easy to write.

    Although used by some (including in most existing Python auto-documentation tools), until now STexts have failed to become standard because:

    • STexts have been incomplete. Lacking "essential" constructs that people want to use in their docstrings, STexts are rendered less than ideal. Note that these "essential" constructs are not universal; everyone has their own requirements.
    • STexts have been sometimes surprising. Bits of text are unexpectedly interpreted as being marked up, leading to user frustration.
    • SText implementations have been buggy.
    • Most STexts have had no formal specification except for the implementation itself. A buggy implementation meant a buggy spec, and vice-versa.
    • There has been no mechanism to get around the SText markup rules when a markup character is used in a non-markup context. In other words, no way to escape markup.

Proponents of implicit STexts have vigorously opposed proposals for explicit markup (XML, HTML, TeX, POD, etc.), and the debates have continued off and on since 1996 or earlier.

reStructuredText is a complete revision and reinterpretation of the SText idea, addressing all of the problems listed above.

Specification

The specification and user documentation for reStructuredText is quite extensive. Rather than repeating or summarizing it all here, links to the originals are provided.

Please first take a look at A ReStructuredText Primer [17], a short and gentle introduction. The Quick reStructuredText [18] user reference quickly summarizes all of the markup constructs. For complete and extensive details, please refer to the following documents:

In addition, Problems With StructuredText [22] explains many markup decisions made with regards to StructuredText, and A Record of reStructuredText Syntax Alternatives [23] records markup decisions made independently.

Docstring-Significant Features

  • A markup escaping mechanism.

    Backslashes (\) are used to escape markup characters when needed for non-markup purposes. However, the inline markup recognition rules have been constructed in order to minimize the need for backslash-escapes. For example, although asterisks are used for emphasis, in non-markup contexts such as "*" or "(*)" or "x * y", the asterisks are not interpreted as markup and are left unchanged. For many non-markup uses of backslashes (e.g., describing regular expressions), inline literals or literal blocks are applicable; see the next item.

  • Markup to include Python source code and Python interactive sessions: inline literals, literal blocks, and doctest blocks.

    Inline literals use double-backquotes to indicate program I/O or code snippets. No markup interpretation (including backslash-escape [\] interpretation) is done within inline literals.

    Literal blocks (block-level literal text, such as code excerpts or ASCII graphics) are indented, and indicated with a double-colon ("::") at the end of the preceding paragraph (right here -->):

    if literal_block:
        text = 'is left as-is'
        spaces_and_linebreaks = 'are preserved'
        markup_processing = None
    

    Doctest blocks begin with ">>> " and end with a blank line. Neither indentation nor literal block double-colons are required. For example:

    Here's a doctest block:
    
    >>> print 'Python-specific usage examples; begun with ">>>"'
    Python-specific usage examples; begun with ">>>"
    >>> print '(cut and pasted from interactive sessions)'
    (cut and pasted from interactive sessions)
    
  • Markup that isolates a Python identifier: interpreted text.

    Text enclosed in single backquotes is recognized as "interpreted text", whose interpretation is application-dependent. In the context of a Python docstring, the default interpretation of interpreted text is as Python identifiers. The text will be marked up with a hyperlink connected to the documentation for the identifier given. Lookup rules are the same as in Python itself: LGB namespace lookups (local, global, builtin). The "role" of the interpreted text (identifying a class, module, function, etc.) is determined implicitly from the namespace lookup. For example:

    class Keeper(Storer):
    
        """
        Keep data fresher longer.
    
        Extend `Storer`.  Class attribute `instances` keeps track
        of the number of `Keeper` objects instantiated.
        """
    
        instances = 0
        """How many `Keeper` objects are there?"""
    
        def __init__(self):
            """
            Extend `Storer.__init__()` to keep track of
            instances.  Keep count in `self.instances` and data
            in `self.data`.
            """
            Storer.__init__(self)
            self.instances += 1
    
            self.data = []
            """Store data in a list, most recent last."""
    
        def storedata(self, data):
            """
            Extend `Storer.storedata()`; append new `data` to a
            list (in `self.data`).
            """
            self.data = data
    

    Each piece of interpreted text is looked up according to the local namespace of the block containing its docstring.

  • Markup that isolates a Python identifier and specifies its type: interpreted text with roles.

    Although the Python source context reader is designed not to require explicit roles, they may be used. To classify identifiers explicitly, the role is given along with the identifier in either prefix or suffix form:

    Use :method:`Keeper.storedata` to store the object's data in
    `Keeper.data`:instance_attribute:.
    

    The syntax chosen for roles is verbose, but necessarily so (if anyone has a better alternative, please post it to the Doc-SIG [6]). The intention of the markup is that there should be little need to use explicit roles; their use is to be kept to an absolute minimum.

  • Markup for "tagged lists" or "label lists": field lists.

    Field lists represent a mapping from field name to field body. These are mostly used for extension syntax, such as "bibliographic field lists" (representing document metadata such as author, date, and version) and extension attributes for directives (see below). They may be used to implement methodologies (docstring semantics), such as identifying parameters, exceptions raised, etc.; such usage is beyond the scope of this PEP.

    A modified RFC 2822 syntax is used, with a colon before as well as after the field name. Field bodies are more versatile as well; they may contain multiple field bodies (even nested field lists). For example:

    :Date: 2002-03-22
    :Version: 1
    :Authors:
        - Me
        - Myself
        - I
    

    Standard RFC 2822 header syntax cannot be used for this construct because it is ambiguous. A word followed by a colon at the beginning of a line is common in written text.

  • Markup extensibility: directives and substitutions.

    Directives are used as an extension mechanism for reStructuredText, a way of adding support for new block-level constructs without adding new syntax. Directives for images, admonitions (note, caution, etc.), and tables of contents generation (among others) have been implemented. For example, here's how to place an image:

    .. image:: mylogo.png
    

    Substitution definitions allow the power and flexibility of block-level directives to be shared by inline text. For example:

    The |biohazard| symbol must be used on containers used to
    dispose of medical waste.
    
    .. |biohazard| image:: biohazard.png
    
  • Section structure markup.

    Section headers in reStructuredText use adornment via underlines (and possibly overlines) rather than indentation. For example:

    This is a Section Title
    =======================
    
    This is a Subsection Title
    --------------------------
    
    This paragraph is in the subsection.
    
    This is Another Section Title
    =============================
    
    This paragraph is in the second section.
    

Questions & Answers

  1. Is reStructuredText rich enough?

    Yes, it is for most people. If it lacks some construct that is required for a specific application, it can be added via the directive mechanism. If a useful and common construct has been overlooked and a suitably readable syntax can be found, it can be added to the specification and parser.

  2. Is reStructuredText too rich?

    For specific applications or individuals, perhaps. In general, no.

    Since the very beginning, whenever a docstring markup syntax has been proposed on the Doc-SIG [6], someone has complained about the lack of support for some construct or other. The reply was often something like, "These are docstrings we're talking about, and docstrings shouldn't have complex markup." The problem is that a construct that seems superfluous to one person may be absolutely essential to another.

    reStructuredText takes the opposite approach: it provides a rich set of implicit markup constructs (plus a generic extension mechanism for explicit markup), allowing for all kinds of documents. If the set of constructs is too rich for a particular application, the unused constructs can either be removed from the parser (via application-specific overrides) or simply omitted by convention.

  3. Why not use indentation for section structure, like StructuredText does? Isn't it more "Pythonic"?

    Guido van Rossum wrote the following in a 2001-06-13 Doc-SIG post:

    I still think that using indentation to indicate sectioning is wrong. If you look at how real books and other print publications are laid out, you'll notice that indentation is used frequently, but mostly at the intra-section level. Indentation can be used to offset lists, tables, quotations, examples, and the like. (The argument that docstrings are different because they are input for a text formatter is wrong: the whole point is that they are also readable without processing.)

    I reject the argument that using indentation is Pythonic: text is not code, and different traditions and conventions hold. People have been presenting text for readability for over 30 centuries. Let's not innovate needlessly.

    See Section Structure via Indentation [25] in Problems With StructuredText [22] for further elaboration.

  4. Why use reStructuredText for PEPs? What's wrong with the existing standard?

    The existing standard for PEPs is very limited in terms of general expressibility, and referencing is especially lacking for such a reference-rich document type. PEPs are currently converted into HTML, but the results (mostly monospaced text) are less than attractive, and most of the value-added potential of HTML (especially inline hyperlinks) is untapped.

    Making reStructuredText a standard markup for PEPs will enable much richer expression, including support for section structure, inline markup, graphics, and tables. In several PEPs there are ASCII graphics diagrams, which are all that plaintext documents can support. Since PEPs are made available in HTML form, the ability to include proper diagrams would be immediately useful.

    Current PEP practices allow for reference markers in the form "[1]" in the text, and the footnotes/references themselves are listed in a section toward the end of the document. There is currently no hyperlinking between the reference marker and the footnote/reference itself (it would be possible to add this to pep2html.py, but the "markup" as it stands is ambiguous and mistakes would be inevitable). A PEP with many references (such as this one ;-) requires a lot of flipping back and forth. When revising a PEP, often new references are added or unused references deleted. It is painful to renumber the references, since it has to be done in two places and can have a cascading effect (insert a single new reference 1, and every other reference has to be renumbered; always adding new references to the end is suboptimal). It is easy for references to go out of sync.

    PEPs use references for two purposes: simple URL references and footnotes. reStructuredText differentiates between the two. A PEP might contain references like this:

    Abstract
    
        This PEP proposes adding frungible doodads [1] to the core.
        It extends PEP 9876 [2] via the BCA [3] mechanism.
    
    ...
    
    References and Footnotes
    
        [1] http://www.example.org/
    
        [2] PEP 9876, Let's Hope We Never Get Here
            http://www.python.org/dev/peps/pep-9876/
    
        [3] "Bogus Complexity Addition"
    

    Reference 1 is a simple URL reference. Reference 2 is a footnote containing text and a URL. Reference 3 is a footnote containing text only. Rewritten using reStructuredText, this PEP could look like this:

    Abstract
    ========
    
    This PEP proposes adding `frungible doodads`_ to the core.  It
    extends PEP 9876 [#pep9876]_ via the BCA [#]_ mechanism.
    
    ...
    
    References & Footnotes
    ======================
    
    .. _frungible doodads: http://www.example.org/
    
    .. [#pep9876] PEP 9876, Let's Hope We Never Get Here
    
    .. [#] "Bogus Complexity Addition"
    

    URLs and footnotes can be defined close to their references if desired, making them easier to read in the source text, and making the PEPs easier to revise. The "References and Footnotes" section can be auto-generated with a document tree transform. Footnotes from throughout the PEP would be gathered and displayed under a standard header. If URL references should likewise be written out explicitly (in citation form), another tree transform could be used.

    URL references can be named ("frungible doodads"), and can be referenced from multiple places in the document without additional definitions. When converted to HTML, references will be replaced with inline hyperlinks (HTML <a> tags). The two footnotes are automatically numbered, so they will always stay in sync. The first footnote also contains an internal reference name, "pep9876", so it's easier to see the connection between reference and footnote in the source text. Named footnotes can be referenced multiple times, maintaining consistent numbering.

    The "#pep9876" footnote could also be written in the form of a citation:

    It extends PEP 9876 [PEP9876]_ ...
    
    .. [PEP9876] PEP 9876, Let's Hope We Never Get Here
    

    Footnotes are numbered, whereas citations use text for their references.

  5. Wouldn't it be better to keep the docstring and PEP proposals separate?

    The PEP markup proposal may be removed if it is deemed that there is no need for PEP markup, or it could be made into a separate PEP. If accepted, PEP 1, PEP Purpose and Guidelines [1], and PEP 9, Sample PEP Template [2] will be updated.

    It seems natural to adopt a single consistent markup standard for all uses of structured plaintext in Python, and to propose it all in one place.

  6. The existing pep2html.py script converts the existing PEP format to HTML. How will the new-format PEPs be converted to HTML?

    A new version of pep2html.py with integrated reStructuredText parsing has been completed. The Docutils project supports PEPs with a "PEP Reader" component, including all functionality currently in pep2html.py (auto-recognition of PEP & RFC references, email masking, etc.).

  7. Who's going to convert the existing PEPs to reStructuredText?

    PEP authors or volunteers may convert existing PEPs if they like, but there is no requirement to do so. The reStructuredText-based PEPs will coexist with the old PEP standard. The pep2html.py script mentioned in answer 6 processes both the old and new formats.

  8. Why use reStructuredText for README and other ancillary files?

    The reasoning given for PEPs in answer 4 above also applies to README and other ancillary files. By adopting a standard markup, these files can be converted to attractive cross-referenced HTML and put up on python.org. Developers of other projects can also take advantage of this facility for their own documentation.

  9. Won't the superficial similarity to existing markup conventions cause problems, and result in people writing invalid markup (and not noticing, because the plaintext looks natural)? How forgiving is reStructuredText of "not quite right" markup?

    There will be some mis-steps, as there would be when moving from one programming language to another. As with any language, proficiency grows with experience. Luckily, reStructuredText is a very little language indeed.

    As with any syntax, there is the possibility of syntax errors. It is expected that a user will run the processing system over their input and check the output for correctness.

    In a strict sense, the reStructuredText parser is very unforgiving (as it should be; "In the face of ambiguity, refuse the temptation to guess" [3] applies to parsing markup as well as computer languages). Here's design goal 3 from An Introduction to reStructuredText [19]:

    Unambiguous. The rules for markup must not be open for interpretation. For any given input, there should be one and only one possible output (including error output).

    While unforgiving, at the same time the parser does try to be helpful by producing useful diagnostic output ("system messages"). The parser reports problems, indicating their level of severity (from least to most: debug, info, warning, error, severe). The user or the client software can decide on reporting thresholds; they can ignore low-level problems or cause high-level problems to bring processing to an immediate halt. Problems are reported during the parse as well as included in the output, often with two-way links between the source of the problem and the system message explaining it.

  10. Will the docstrings in the Python standard library modules be converted to reStructuredText?

    No. Python's library reference documentation is maintained separately from the source. Docstrings in the Python standard library should not try to duplicate the library reference documentation. The current policy for docstrings in the Python standard library is that they should be no more than concise hints, simple and markup-free (although many do contain ad-hoc implicit markup).

  11. I want to write all my strings in Unicode. Will anything break?

    The parser fully supports Unicode. Docutils supports arbitrary input and output encodings.

  12. Why does the community need a new structured text design?

    The existing structured text designs are deficient, for the reasons given in "Rationale" above. reStructuredText aims to be a complete markup syntax, within the limitations of the "readable plaintext" medium.

  13. What is wrong with existing documentation methodologies?

    What existing methodologies? For Python docstrings, there is no official standard markup format, let alone a documentation methodology akin to JavaDoc. The question of methodology is at a much higher level than syntax (which this PEP addresses). It is potentially much more controversial and difficult to resolve, and is intentionally left out of this discussion.

References & Footnotes

[1] PEP 1, PEP Purpose and Guidelines, Warsaw, Hylton (http://www.python.org/dev/peps/pep-0001/)
[2] PEP 9, Sample PEP Template, Warsaw (http://www.python.org/dev/peps/pep-0009/)
[3] From The Zen of Python (by Tim Peters) [26] (or just "import this" in Python)
[4] PEP 216, Docstring Format, Zadka (http://www.python.org/dev/peps/pep-0216/)
[5] http://docutils.sourceforge.net/rst.html
[6] http://www.python.org/sigs/doc-sig/
[7] http://www.w3.org/XML/
[8] http://www.oasis-open.org/cover/general.html
[9] http://docbook.org/tdg/en/html/docbook.html
[10] http://www.w3.org/MarkUp/
[11] http://www.w3.org/MarkUp/#xhtml1
[12] http://www.tug.org/interest.html
[13] http://perldoc.perl.org/perlpod.html
[14] http://java.sun.com/j2se/javadoc/
[15] http://docutils.sourceforge.net/mirror/setext.html
[16] http://www.zope.org/DevHome/Members/jim/StructuredTextWiki/FrontPage
[17] http://docutils.sourceforge.net/docs/user/rst/quickstart.html
[18] http://docutils.sourceforge.net/docs/user/rst/quickref.html
[19] http://docutils.sourceforge.net/docs/ref/rst/introduction.html
[20] http://docutils.sourceforge.net/docs/ref/rst/restructuredtext.html
[21] http://docutils.sourceforge.net/docs/ref/rst/directives.html
[22] http://docutils.sourceforge.net/docs/dev/rst/problems.html
[23] http://docutils.sourceforge.net/docs/dev/rst/alternatives.html
[24] http://docutils.sourceforge.net/
[25] http://docutils.sourceforge.net/docs/dev/rst/problems.html#section-structure-via-indentation
[26] http://www.python.org/doc/Humor.html#zen

Acknowledgements

Some text is borrowed from PEP 216, Docstring Format [4], by Moshe Zadka.

Special thanks to all members past & present of the Python Doc-SIG [6].

pep-0288 Generators Attributes and Exceptions

PEP: 288
Title: Generators Attributes and Exceptions
Version: $Revision$
Last-Modified: $Date$
Author: Raymond Hettinger <python at rcn.com>
Status: Withdrawn
Type: Standards Track
Created: 21-Mar-2002
Python-Version: 2.5
Post-History: 

Abstract

    This PEP proposes to enhance generators by providing mechanisms for
    raising exceptions and sharing data with running generators.


Status

    This PEP is withdrawn.  The exception raising mechanism was extended
    and subsumed into PEP 343.  The attribute passing capability
    never built a following, did not have a clear implementation,
    and did not have a clean way for the running generator to access
    its own namespace.


Rationale

    Currently, only class based iterators can provide attributes and
    exception handling.  However, class based iterators are harder to
    write, less compact, less readable, and slower.  A better solution
    is to enable these capabilities for generators.

    Enabling attribute assignments allows data to be passed to and from
    running generators.  The approach of sharing data using attributes
    pervades Python.  Other approaches exist but are somewhat hackish
    in comparison.

    Another evolutionary step is to add a generator method to allow
    exceptions to be passed to a generator.  Currently, there is no
    clean method for triggering exceptions from outside the generator.
    Also, generator exception passing helps mitigate the try/finally
    prohibition for generators.  The need is especially acute for
    generators needing to flush buffers or close resources upon termination.
    
    The two proposals are backwards compatible and require no new
    keywords.  They are being recommended for Python version 2.5.



Specification for Generator Attributes

    Essentially, the proposal is to emulate attribute writing for classes.
    The only wrinkle is that generators lack a way to refer to instances of
    themselves.  So, the proposal is to provide a function for discovering
    the reference.  For example:

        def mygen(filename):
            self = sys.get_generator()
            myfile = open(filename)
            for line in myfile:
                if len(line) < 10:
                    continue
                self.pos = myfile.tell()
                yield line.upper()

        g = mygen('sample.txt')
        line1 = g.next()
        print 'Position', g.pos

    Uses for generator attributes include:

        1. Providing generator clients with extra information (as shown
           above).
        2. Externally setting control flags governing generator operation
           (possibly telling a generator when to step in or step over
           data groups).
        3. Writing lazy consumers with complex execution states
           (an arithmetic encoder output stream for example).
        4. Writing co-routines (as demonstrated in Dr. Mertz's articles [1]).

    The control flow of 'yield' and 'next' is unchanged by this
    proposal.  The only change is that data can be passed to and from
    the generator.  Most of the underlying machinery is already in
    place; only the access function needs to be added.



Specification for Generator Exception Passing:

    Add a .throw(exception) method to the generator interface:

        def logger():
            start = time.time()
            log = []
            try:
                while True:
                    log.append(time.time() - start)
                    yield log[-1]
            except WriteLog:
                writelog(log)

        g = logger()
        for i in [10,20,40,80,160]:
            testsuite(i)
            g.next()
        g.throw(WriteLog)

    There is no existing work-around for triggering an exception
    inside a generator.  It is the only case in Python where active
    code cannot have an exception raised into it or propagated through it.

    Generator exception passing also helps address an intrinsic
    limitation on generators, the prohibition against their using
    try/finally to trigger clean-up code [2].

    Note A: The name of the throw method was selected for several
    reasons.  Raise is a keyword and so cannot be used as a method
    name.  Unlike raise, which immediately raises an exception from the
    current execution point, throw will first return to the generator
    and then raise the exception.  The word throw is suggestive of
    putting the exception in another location.  The word throw is
    already associated with exceptions in other languages.

    Alternative method names were considered: resolve(), signal(),
    genraise(), raiseinto(), and flush().  None of these fit as well
    as throw().

    Note B:  To keep the throw() syntax simple only the instance
    version of the raise syntax would be supported (no variants for
    "raise string" or "raise class, instance").

    Calling "g.throw(instance)" would correspond to writing
    "raise instance" immediately after the most recent yield.
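    Although this PEP was withdrawn, throw() was later adopted via PEP 342
    with essentially these semantics. A variation of the logger example,
    rewritten in Python 3 syntax (an assumption, since the PEP targets
    Python 2.5) and returning the log instead of calling writelog():

    ```python
    # Hedged sketch: throw() raises the exception at the generator's most
    # recent yield; the except clause catches it and the generator returns.
    class WriteLog(Exception):
        pass

    def logger():
        log = []
        try:
            while True:
                log.append(len(log))   # stand-in for elapsed-time entries
                yield log[-1]
        except WriteLog:
            return log                  # in Python 3, carried by StopIteration

    g = logger()
    next(g)
    next(g)
    try:
        g.throw(WriteLog)
    except StopIteration as e:
        assert e.value == [0, 1]
    ```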



References

    [1] Dr. David Mertz's draft columns for Charming Python:
        http://gnosis.cx/publish/programming/charming_python_b5.txt
        http://gnosis.cx/publish/programming/charming_python_b7.txt

    [2] PEP 255 Simple Generators:
        http://www.python.org/dev/peps/pep-0255/

    [3] Proof-of-concept recipe:
        http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/164044



Copyright

    This document has been placed in the public domain.



pep-0289 Generator Expressions

PEP: 289
Title: Generator Expressions
Version: $Revision$
Last-Modified: $Date$
Author: python at rcn.com (Raymond Hettinger)
Status: Final
Type: Standards Track
Content-Type: text/x-rst
Created: 30-Jan-2002
Python-Version: 2.4
Post-History: 22-Oct-2003

Abstract

This PEP introduces generator expressions as a high performance, memory efficient generalization of list comprehensions [1] and generators [2].

Rationale

Experience with list comprehensions has shown their wide-spread utility throughout Python. However, many of the use cases do not need to have a full list created in memory. Instead, they only need to iterate over the elements one at a time.

For instance, the following summation code will build a full list of squares in memory, iterate over those values, and, when the reference is no longer needed, delete the list:

sum([x*x for x in range(10)])

Memory is conserved by using a generator expression instead:

sum(x*x for x in range(10))

Similar benefits are conferred on constructors for container objects:

s = Set(word  for line in page  for word in line.split())
d = dict( (k, func(k)) for k in keylist)

Generator expressions are especially useful with functions like sum(), min(), and max() that reduce an iterable input to a single value:

max(len(line)  for line in file  if line.strip())

Generator expressions also address some examples of functionals coded with lambda:

reduce(lambda s, a: s + a.myattr, data, 0)
reduce(lambda s, a: s + a[3], data, 0)

These simplify to:

sum(a.myattr for a in data)
sum(a[3] for a in data)

List comprehensions greatly reduced the need for filter() and map(). Likewise, generator expressions are expected to minimize the need for itertools.ifilter() and itertools.imap(). In contrast, the utility of other itertools will be enhanced by generator expressions:

dotproduct = sum(x*y for x,y in itertools.izip(x_vector, y_vector))
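A runnable version of the dot product example, with Python 3's built-in zip standing in for itertools.izip (Python 3 syntax is an assumption; the PEP targets Python 2.4):

```python
# Hedged sketch: zip pairs elements lazily, and the generator expression
# feeds the products to sum() one at a time without building a list.
x_vector = [1, 2, 3]
y_vector = [4, 5, 6]
dotproduct = sum(x * y for x, y in zip(x_vector, y_vector))
assert dotproduct == 32   # 1*4 + 2*5 + 3*6
```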

Having a syntax similar to list comprehensions also makes it easy to convert existing code into a generator expression when scaling up an application.

Early timings showed that generators had a significant performance advantage over list comprehensions. However, the latter were highly optimized for Py2.4 and now the performance is roughly comparable for small to mid-sized data sets. As the data volumes grow larger, generator expressions tend to perform better because they do not exhaust cache memory and they allow Python to re-use objects between iterations.

BDFL Pronouncements

This PEP is ACCEPTED for Py2.4.

The Details

(None of this is exact enough in the eye of a reader from Mars, but I hope the examples convey the intention well enough for a discussion in c.l.py. The Python Reference Manual should contain a 100% exact semantic and syntactic specification.)

  1. The semantics of a generator expression are equivalent to creating an anonymous generator function and calling it. For example:

    g = (x**2 for x in range(10))
    print g.next()
    

    is equivalent to:

    def __gen(exp):
        for x in exp:
            yield x**2
    g = __gen(iter(range(10)))
    print g.next()
    

    Only the outermost for-expression is evaluated immediately, the other expressions are deferred until the generator is run:

    g = (tgtexp  for var1 in exp1 if exp2 for var2 in exp3 if exp4)
    

    is equivalent to:

    def __gen(bound_exp):
        for var1 in bound_exp:
            if exp2:
                for var2 in exp3:
                    if exp4:
                        yield tgtexp
    g = __gen(iter(exp1))
    del __gen
    
  2. The syntax requires that a generator expression always appear directly inside a set of parentheses and that it cannot have a comma on either side. With reference to the file Grammar/Grammar in CVS, two rules change:

    1. The rule:

      atom: '(' [testlist] ')'
      

      changes to:

      atom: '(' [testlist_gexp] ')'
      

      where testlist_gexp is almost the same as listmaker, but only allows a single test after 'for' ... 'in':

      testlist_gexp: test ( gen_for | (',' test)* [','] )
      
    2. The rule for arglist needs similar changes.

    This means that you can write:

    sum(x**2 for x in range(10))
    

    but you would have to write:

    reduce(operator.add, (x**2 for x in range(10)))
    

    and also:

    g = (x**2 for x in range(10))
    

    i.e. if a function call has a single positional argument, it can be a generator expression without extra parentheses, but in all other cases you have to parenthesize it.

    The exact details were checked in to Grammar/Grammar version 1.49.

  3. The loop variable (if it is a simple variable or a tuple of simple variables) is not exposed to the surrounding function. This facilitates the implementation and makes typical use cases more reliable. In some future version of Python, list comprehensions will also hide the induction variable from the surrounding code (and, in Py2.4, warnings will be issued for code accessing the induction variable).

    For example:

    x = "hello"
    y = list(x for x in "abc")
    print x    # prints "hello", not "c"
    
  4. List comprehensions will remain unchanged. For example:

    [x for x in S]    # This is a list comprehension.
    [(x for x in S)]  # This is a list containing one generator
                      # expression.
    

    Unfortunately, there is currently a slight syntactic difference. The expression:

    [x for x in 1, 2, 3]
    

    is legal, meaning:

    [x for x in (1, 2, 3)]
    

    But generator expressions will not allow the former version:

    (x for x in 1, 2, 3)
    

    is illegal.

    The former list comprehension syntax will become illegal in Python 3.0, and should be deprecated in Python 2.4 and beyond.

    List comprehensions also "leak" their loop variable into the surrounding scope. This will also change in Python 3.0, so that the semantic definition of a list comprehension in Python 3.0 will be equivalent to list(<generator expression>). Python 2.4 and beyond should issue a deprecation warning if a list comprehension's loop variable has the same name as a variable used in the immediately surrounding scope.
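    The equivalence stated in point 1 can be checked directly. A minimal
    sketch in Python 3 syntax (an assumption; the PEP's examples use
    Python 2's g.next()):

    ```python
    # Hedged sketch: the generator expression and the hand-written
    # anonymous generator function yield the same sequence of values.
    g = (x**2 for x in range(10))
    assert next(g) == 0
    assert next(g) == 1

    def __gen(exp):
        for x in exp:
            yield x**2

    h = __gen(iter(range(10)))
    assert next(h) == 0
    assert next(h) == 1
    ```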

Early Binding versus Late Binding

After much discussion, it was decided that the first (outermost) for-expression should be evaluated immediately and that the remaining expressions be evaluated when the generator is executed.

Asked to summarize the reasoning for binding the first expression, Guido offered [5]:

Consider sum(x for x in foo()). Now suppose there's a bug in foo()
that raises an exception, and a bug in sum() that raises an
exception before it starts iterating over its argument. Which
exception would you expect to see? I'd be surprised if the one in
sum() was raised rather than the one in foo(), since the call to foo()
is part of the argument to sum(), and I expect arguments to be
processed before the function is called.

OTOH, in sum(bar(x) for x in foo()), where sum() and foo()
are bugfree, but bar() raises an exception, we have no choice but
to delay the call to bar() until sum() starts iterating -- that's
part of the contract of generators. (They do nothing until their
next() method is first called.)

Various use cases were proposed for binding all free variables when the generator is defined. And some proponents felt that the resulting expressions would be easier to understand and debug if bound immediately.

However, Python takes a late binding approach to lambda expressions and has no precedent for automatic, early binding. It was felt that introducing a new paradigm would unnecessarily introduce complexity.

After exploring many possibilities, a consensus emerged that binding issues were hard to understand and that users should be strongly encouraged to use generator expressions inside functions that consume their arguments immediately. For more complex applications, full generator definitions are always superior in terms of being obvious about scope, lifetime, and binding [6].
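The eager evaluation of the outermost iterable can be demonstrated directly; a minimal sketch in Python 3 syntax (assumed):

```python
# Hedged sketch: the outermost iterable is evaluated when the generator
# expression is created, so an exception in it surfaces immediately,
# before any iteration takes place.
def boom():
    raise ValueError("outermost iterable evaluated eagerly")

try:
    g = (x for x in boom())   # raises here, not at first next()
    reached_iteration = True
except ValueError:
    reached_iteration = False

assert reached_iteration is False
```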

Reduction Functions

The utility of generator expressions is greatly enhanced when combined with reduction functions like sum(), min(), and max(). The heapq module in Python 2.4 includes two new reduction functions: nlargest() and nsmallest(). Both work well with generator expressions and keep no more than n items in memory at one time.
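For example (Python 3 syntax assumed; nlargest() and nsmallest() accept any iterable, including a generator expression):

```python
# Hedged sketch: heapq's reduction functions keep at most n items in
# memory while consuming the generator expression lazily.
import heapq

data = [5, 1, 9, 3, 7]
assert heapq.nlargest(2, (x * x for x in data)) == [81, 49]
assert heapq.nsmallest(2, (x * x for x in data)) == [1, 9]
```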

Acknowledgements

  • Raymond Hettinger first proposed the idea of "generator comprehensions" in January 2002.
  • Peter Norvig resurrected the discussion in his proposal for Accumulation Displays.
  • Alex Martelli provided critical measurements that proved the performance benefits of generator expressions. He also provided strong arguments that they were a desirable thing to have.
  • Phillip Eby suggested "iterator expressions" as the name.
  • Subsequently, Tim Peters suggested the name "generator expressions".
  • Armin Rigo, Tim Peters, Guido van Rossum, Samuele Pedroni, Hye-Shik Chang and Raymond Hettinger teased out the issues surrounding early versus late binding [5].
  • Jiwon Seo single handedly implemented various versions of the proposal including the final version loaded into CVS. Along the way, there were periodic code reviews by Hye-Shik Chang and Raymond Hettinger. Guido van Rossum made the key design decisions after comments from Armin Rigo and newsgroup discussions. Raymond Hettinger provided the test suite, documentation, tutorial, and examples [6].

References

[1]PEP 202 List Comprehensions http://www.python.org/dev/peps/pep-0202/
[2]PEP 255 Simple Generators http://www.python.org/dev/peps/pep-0255/
[3]Peter Norvig's Accumulation Display Proposal http://www.norvig.com/pyacc.html
[4]Jeff Epler had worked up a patch demonstrating the previously proposed bracket and yield syntax http://python.org/sf/795947
[5](1, 2) Discussion over the relative merits of early versus late binding http://mail.python.org/pipermail/python-dev/2004-April/044555.html
[6](1, 2) Patch discussion and alternative patches on Source Forge http://www.python.org/sf/872326

pep-0290 Code Migration and Modernization

PEP: 290
Title: Code Migration and Modernization
Version: $Revision$
Last-Modified: $Date$
Author: Raymond Hettinger <python at rcn.com>
Status: Active
Type: Informational
Content-Type: text/x-rst
Created: 6-Jun-2002
Post-History:

Abstract

This PEP is a collection of procedures and ideas for updating Python applications when newer versions of Python are installed.

The migration tips highlight possible areas of incompatibility and make suggestions on how to find and resolve those differences. The modernization procedures show how older code can be updated to take advantage of new language features.

Rationale

This repository of procedures serves as a catalog or checklist of known migration issues and procedures for addressing those issues.

Migration issues can arise for several reasons. Some obsolete features are slowly deprecated according to the guidelines in PEP 4 [1]. Also, some code relies on undocumented behaviors which are subject to change between versions. Some code may rely on behavior which was subsequently shown to be a bug and that behavior changes when the bug is fixed.

Modernization options arise when new versions of Python add features that allow improved clarity or higher performance than previously available.

Guidelines for New Entries

Developers with commit access may update this PEP directly. Others can send their ideas to a developer for possible inclusion.

While a consistent format makes the repository easier to use, feel free to add or subtract sections to improve clarity.

Grep patterns may be supplied as a tool to help maintainers locate code for possible updates. However, fully automated search/replace style regular expressions are not recommended. Instead, each code fragment should be evaluated individually.

The contra-indications section is the most important part of a new entry. It lists known situations where the update SHOULD NOT be applied.

Migration Issues

Comparison Operators Not a Shortcut for Producing 0 or 1

Prior to Python 2.3, comparison operations returned 0 or 1 rather than True or False. Some code may have used this as a shortcut for producing zero or one in places where their boolean counterparts are not appropriate. For example:

def identity(m=1):
    """Create and m-by-m identity matrix"""
    return [[i==j for i in range(m)] for j in range(m)]

In Python 2.2, a call to identity(2) would produce:

[[1, 0], [0, 1]]

In Python 2.3, the same call would produce:

[[True, False], [False, True]]

Since booleans are a subclass of integers, the matrix would continue to calculate normally, but it would not print as expected. The list comprehension should be changed to read:

return [[int(i==j) for i in range(m)] for j in range(m)]

There are similar concerns when storing data to be used by other applications which may expect a number instead of True or False.
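The corrected example is runnable as-is under modern Python:

```python
# int() converts each boolean comparison result back to 0 or 1,
# so the matrix prints as numbers regardless of Python version.
def identity(m=1):
    """Create an m-by-m identity matrix with integer entries."""
    return [[int(i == j) for i in range(m)] for j in range(m)]

assert identity(2) == [[1, 0], [0, 1]]
```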

Modernization Procedures

Procedures are grouped by the Python version required to be able to take advantage of the modernization.

Python 2.4 or Later

Inserting and Popping at the Beginning of Lists

Python's lists are implemented to perform best with appends and pops on the right. Use of pop(0) or insert(0, x) triggers O(n) data movement for the entire list. To help address this need, Python 2.4 introduces a new container, collections.deque(), which has efficient append and pop operations on both the left and right (the trade-off is much slower getitem/setitem access). The new container is especially helpful for implementing data queues:

Pattern:

c = list(data)   -->   c = collections.deque(data)
c.pop(0)         -->   c.popleft()
c.insert(0, x)   -->   c.appendleft(x)

Locating:

grep pop(0 or
grep insert(0
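A minimal sketch of the pattern (Python 3 syntax assumed; the deque API shown is the same as in 2.4):

```python
# Hedged sketch: appendleft() and popleft() are O(1) on a deque,
# versus O(n) for insert(0, x) and pop(0) on a list.
from collections import deque

c = deque([1, 2, 3])
c.appendleft(0)
assert c.popleft() == 0
assert list(c) == [1, 2, 3]
```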

Simplifying Custom Sorts

In Python 2.4, the sort method for lists and the new sorted built-in function both accept a key function for computing sort keys. Unlike the cmp function which gets applied to every comparison, the key function gets applied only once to each record. It is much faster than cmp and typically more readable while using less code. The key function also maintains the stability of the sort (records with the same key are left in their original order).

Original code using a comparison function:

names.sort(lambda x,y: cmp(x.lower(), y.lower()))

Alternative original code with explicit decoration:

tempnames = [(n.lower(), n) for n in names]
tempnames.sort()
names = [original for decorated, original in tempnames]

Revised code using a key function:

names.sort(key=str.lower)       # case-insensitive sort

Locating: grep sort *.py
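A runnable sketch of the revised approach (the sample names are illustrative):

```python
# key=str.lower is applied once per element, giving a case-insensitive
# sort without an explicit decorate-sort-undecorate step.
names = ["Banana", "apple", "Cherry"]
names.sort(key=str.lower)
assert names == ["apple", "Banana", "Cherry"]
```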

Replacing Common Uses of Lambda

In Python 2.4, the operator module gained two new functions, itemgetter() and attrgetter() that can replace common uses of the lambda keyword. The new functions run faster and are considered by some to improve readability.

Pattern:

lambda r: r[2]      -->  itemgetter(2)
lambda r: r.myattr  -->  attrgetter('myattr')

Typical contexts:

studentrecords.sort(key=attrgetter('gpa'))    # set a sort field
map(attrgetter('lastname'), studentrecords)   # extract a field

Locating: grep lambda *.py
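A minimal sketch of both replacements (Python 3 syntax; SimpleNamespace stands in here for a record class, which is an assumption for illustration only):

```python
# itemgetter(n) replaces lambda r: r[n]; attrgetter('a') replaces
# lambda r: r.a. Both return callables suitable as key functions.
from operator import attrgetter, itemgetter
from types import SimpleNamespace

rows = [(1, "b"), (2, "a")]
assert sorted(rows, key=itemgetter(1)) == [(2, "a"), (1, "b")]

recs = [SimpleNamespace(name="ann", gpa=3.1),
        SimpleNamespace(name="bob", gpa=3.9)]
assert max(recs, key=attrgetter("gpa")).name == "bob"
```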

Simplified Reverse Iteration

Python 2.4 introduced the reversed builtin function for reverse iteration. The existing approaches to reverse iteration suffered from wordiness, performance issues (speed and memory consumption), and/or lack of clarity. A preferred style is to express the sequence in a forwards direction, apply reversed to the result, and then loop over the resulting fast, memory friendly iterator.

Original code expressed with half-open intervals:

for i in range(n-1, -1, -1):
    print seqn[i]

Alternative original code reversed in multiple steps:

rseqn = list(seqn)
rseqn.reverse()
for value in rseqn:
    print value

Alternative original code expressed with extended slicing:

for value in seqn[::-1]:
    print value

Revised code using the reversed function:

for value in reversed(seqn):
    print value
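For example, in modern Python (Python 3 syntax assumed):

```python
# reversed() returns a lazy iterator over the original sequence;
# no reversed copy is built in memory.
seqn = [1, 2, 3]
assert list(reversed(seqn)) == [3, 2, 1]
assert seqn == [1, 2, 3]   # the original is left untouched
```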

Python 2.3 or Later

Testing String Membership

In Python 2.3, in the expression string2 in string1, the length restriction on string2 is lifted; it can now be a string of any length rather than a single character. When searching for a substring, where you don't care about the position of the substring in the original string, using the in operator makes the meaning clear.

Pattern:

string1.find(string2) >= 0   -->  string2 in string1
string1.find(string2) != -1  -->  string2 in string1
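A quick check that the spellings agree (Python 3 syntax assumed):

```python
# The in operator expresses membership directly; find() returns an
# index, with -1 signalling absence.
string1, string2 = "eggs and spam", "spam"
assert (string2 in string1) == (string1.find(string2) >= 0)
assert (string2 in string1) == (string1.find(string2) != -1)
assert "bacon" not in string1
```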

Replace apply() with a Direct Function Call

In Python 2.3, apply() was marked for Pending Deprecation because it was made obsolete by Python 1.6's introduction of * and ** in function calls. Using a direct function call was always a little faster than apply() because it saved the lookup for the builtin. Now, apply() is even slower due to its use of the warnings module.

Pattern:

apply(f, args, kwds)  -->  f(*args, **kwds)

Note: The Pending Deprecation was removed from apply() in Python 2.3.3 since it creates pain for people who need to maintain code that works with Python versions as far back as 1.5.2, where there was no alternative to apply(). The function remains deprecated, however.
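The rewrite is mechanical; a minimal sketch with a made-up function:

```python
def scale_sum(a, b, scale=1):
    return (a + b) * scale

args = (2, 3)
kwds = {"scale": 10}
result = scale_sum(*args, **kwds)   # replaces apply(scale_sum, args, kwds)
```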

Python 2.2 or Later

Testing Dictionary Membership

For testing dictionary membership, use the 'in' keyword instead of the 'has_key()' method. The result is shorter and more readable. The style becomes consistent with tests for membership in lists. The result is slightly faster because has_key requires an attribute search and uses a relatively expensive function call.

Pattern:

if d.has_key(k):  -->  if k in d:

Contra-indications:

  1. Some dictionary-like objects may not define a __contains__() method:

    if dictlike.has_key(k)
    

Locating: grep has_key
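The membership form reads the same for dictionaries as for lists; an illustrative sketch:

```python
d = {"a": 1, "b": 2}
has_a = "a" in d        # replaces d.has_key("a")
has_z = "z" in d
```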

Looping Over Dictionaries

Use the new iter methods for looping over dictionaries. The iter methods are faster because they do not have to create a new list object with a complete copy of all of the keys, values, or items. Selecting only keys, values, or items (key/value pairs) as needed saves the time for creating throwaway object references and, in the case of items, saves a second hash look-up of the key.

Pattern:

for key in d.keys():      -->  for key in d:
for value in d.values():  -->  for value in d.itervalues():
for key, value in d.items():
                          -->  for key, value in d.iteritems():

Contra-indications:

  1. If you need a list, do not change the return type:

    def getids():  return d.keys()
    
  2. Some dictionary-like objects may not define iter methods:

    for k in dictlike.keys():
    
  3. Iterators do not support slicing, sorting or other operations:

    k = d.keys(); j = k[:]
    
  4. Dictionary iterators prohibit modifying the dictionary:

    for k in d.keys(): del d[k]
    

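The iter methods above are Python 2 spellings (Python 3 later dropped them in favour of views); the form that iterates the dictionary directly works identically in both, as this small sketch shows:

```python
d = {"x": 1, "y": 2}
keys = sorted(k for k in d)       # no intermediate list of keys is built
total = sum(d[k] for k in d)
```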
stat Methods

Replace stat constants or indices with new os.stat attributes and methods. The os.stat attributes and methods are not order-dependent and do not require an import of the stat module.

Pattern:

os.stat("foo")[stat.ST_MTIME]  -->  os.stat("foo").st_mtime
os.stat("foo")[stat.ST_MTIME]  -->  os.path.getmtime("foo")

Locating: grep os.stat or grep stat.S
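A self-contained check of the attribute form, using a throwaway temporary file:

```python
import os
import tempfile

fd, path = tempfile.mkstemp()
os.close(fd)
try:
    # attribute access replaces os.stat(path)[stat.ST_MTIME]
    mtime = os.stat(path).st_mtime
    agrees = (mtime == os.path.getmtime(path))
finally:
    os.remove(path)
```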

Reduce Dependency on types Module

The types module is likely to be deprecated in the future. Use built-in constructor functions instead. They may be slightly faster.

Pattern:

isinstance(v, types.IntType)      -->  isinstance(v, int)
isinstance(s, types.StringTypes)  -->  isinstance(s, basestring)

Full use of this technique requires Python 2.3 or later (basestring was introduced in Python 2.3), but Python 2.2 is sufficient for most uses.

Locating: grep types *.py | grep import
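The constructor form in action (note that basestring exists only in Python 2; the int and str checks below behave the same in Python 3):

```python
checks = [
    isinstance(3, int),        # replaces isinstance(3, types.IntType)
    isinstance("abc", str),
    isinstance(3.5, int),      # a float is not an int
]
```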

Avoid Variable Names that Clash with the __builtins__ Module

In Python 2.2, new built-in types were added for dict and file. Scripts should avoid assigning variable names that mask those types. The same advice also applies to existing builtins like list.

Pattern:

file = open('myfile.txt') --> f = open('myfile.txt')
dict = obj.__dict__ --> d = obj.__dict__

Locating: grep 'file ' *.py

Python 2.1 or Later

whrandom Module Deprecated

All random-related methods have been collected in one place, the random module.

Pattern:

import whrandom --> import random

Locating: grep whrandom

Python 2.0 or Later

String Methods

The string module is likely to be deprecated in the future. Use string methods instead. They're faster too.

Pattern:

import string ; string.method(s, ...)  -->  s.method(...)
c in string.whitespace                 -->  c.isspace()

Locating: grep string *.py | grep import
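A minimal sketch of the method form (the sample strings are illustrative):

```python
s = "  Monty Python  "
trimmed = s.strip()        # replaces string.strip(s)
shouted = trimmed.upper()  # replaces string.upper(trimmed)
is_blank = " ".isspace()   # replaces ' ' in string.whitespace
```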

startswith and endswith String Methods

Use these string methods instead of slicing. No slice has to be created and there's no risk of miscounting.

Pattern:

"foobar"[:3] == "foo"   -->  "foobar".startswith("foo")
"foobar"[-3:] == "bar"  -->  "foobar".endswith("bar")

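A small check of both methods (the filename is made up for this example):

```python
filename = "archive.tar.gz"
is_gzipped = filename.endswith(".gz")       # no slice, no miscounting
is_backup = filename.startswith("backup")
```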
The atexit Module

The atexit module allows multiple functions to be executed upon program termination and supports parameterized functions. Unfortunately, its implementation conflicts with the sys.exitfunc attribute, which supports only a single exit function. Code relying on sys.exitfunc may interfere with other modules (including library modules) that elect to use the newer and more versatile atexit module.

Pattern:

sys.exitfunc = myfunc  -->  atexit.register(myfunc)
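A sketch of the registration form; the handler and message list are illustrative. Note that the handler runs only at interpreter shutdown, so nothing is appended while the script body executes:

```python
import atexit

messages = []

def goodbye(name):
    # executed once, at normal interpreter shutdown
    messages.append("goodbye, %s" % name)

atexit.register(goodbye, "world")   # replaces sys.exitfunc; arguments allowed
```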

Python 1.5 or Later

Class-Based Exceptions

String exceptions are deprecated, so derive from the Exception base class. Unlike the obsolete string exceptions, class exceptions all derive from another exception or the Exception base class. This allows meaningful groupings of exceptions. It also allows an "except Exception" clause to catch all exceptions.

Pattern:

NewError = 'NewError'  -->  class NewError(Exception): pass

Locating: Use PyChecker [2].
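The class-based form allows catching by base class; a minimal sketch:

```python
class NewError(Exception):      # replaces NewError = 'NewError'
    pass

try:
    raise NewError("something broke")
except Exception:               # an "except Exception" clause catches the subclass
    caught = True
```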

All Python Versions

Testing for None

Since there is only one None object, equality can be tested with identity. Identity tests are slightly faster than equality tests. Also, some object types may overload comparison, so equality testing may be much slower.

Pattern:

if v == None:  -->  if v is None:
if v != None:  -->  if v is not None:

Locating: grep '== None' or grep '!= None'
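A quick illustration of identity testing against None:

```python
v = None
w = 0                    # falsy, but not None
checks = (v is None, w is None, w is not None)
```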

pep-0291 Backward Compatibility for the Python 2 Standard Library

PEP: 291
Title: Backward Compatibility for the Python 2 Standard Library
Version: $Revision$
Last-Modified: $Date$
Author: Neal Norwitz <nnorwitz at gmail.com>
Status: Final
Type: Informational
Created: 06-Jun-2002
Python-Version: 2.3
Post-History: 

Abstract

    This PEP describes the packages and modules in the Python 2
    standard library which should remain backward compatible with
    previous versions of Python.  If a package is not listed here,
    then it need only remain compatible with the version of Python it
    is distributed with.

    This PEP has no bearing on the Python 3 standard library.


Rationale

    Authors have various reasons why packages and modules should
    continue to work with previous versions of Python.  In order to
    maintain backward compatibility for these modules while moving the
    rest of the standard library forward, it is necessary to know
    which modules can be modified and which should use old and
    possibly deprecated features.

    Generally, authors should attempt to keep changes backward
    compatible with the previous released version of Python in order
    to make bug fixes easier to backport.

    In addition to a package or module being listed in this PEP, 
    authors must add a comment at the top of each file documenting
    the compatibility requirement.

    When a major version of Python is released, a Subversion branch is
    created for continued maintenance and bug fix releases.  A package
    version on a branch may have a different compatibility requirement
    than the same package on the trunk (i.e. current bleeding-edge
    development).  Where appropriate, these branch compatibilities are
    listed below.


Features to Avoid

    The following list contains common features to avoid in order
    to maintain backward compatibility with each version of Python.
    This list is not complete!  It is only meant as a general guide.

    Note that the features below were implemented in the version
    following the one listed.  For example, features listed next to
    1.5.2 were implemented in 2.0.

        Version    Features to Avoid
        -------    -----------------
          1.5.2    string methods, Unicode, list comprehensions, 
                   augmented assignment (eg, +=), zip(), import x as y,
                   dict.setdefault(), print >> f,
                   calling f(*args, **kw), plus all features below

          2.0      nested scopes, rich comparisons,
                   function attributes, plus all features below

          2.1      use of object or new-style classes, iterators, 
                   using generators, nested scopes, or //
                   without from __future__ import ... statement,
                   isinstance(X, TYP) where TYP is a tuple of types,
                   plus all features below

          2.2      bool, True, False, basestring, enumerate(),
                   {}.pop(), PendingDeprecationWarning,
                   Universal Newlines, plus all features below

          2.3      generator expressions, multi-line imports,
                   decorators, int/long unification, set/frozenset,
                   reversed(), sorted(), "".rsplit(),
                   plus all features below

          2.4      with statement, conditional expressions,
                   combined try/except/finally, relative imports,
                   yield expressions or generator.throw/send/close(),
                   plus all features below

          2.5      with statement without from __future__ import,
                   io module, str.format(), except as,
                   bytes, b'' literals, property.setter/deleter


Backward Compatible Packages, Modules, and Tools

    Package/Module     Maintainer(s)          Python Version     Notes
    --------------     -------------          --------------     -----
    2to3               Benjamin Peterson           2.5
    bsddb              Greg Smith                  2.1
                       Barry Warsaw
    compiler           Jeremy Hylton               2.1
    ctypes             Thomas Heller               2.3
    decimal            Raymond Hettinger           2.3           [2]
    distutils          Tarek Ziade                 2.3
    email              Barry Warsaw                2.1 / 2.3     [1]
    modulefinder       Thomas Heller               2.2
                       Just van Rossum
    pkgutil            Phillip Eby                 2.3
    platform           Marc-Andre Lemburg          1.5.2
    pybench            Marc-Andre Lemburg          1.5.2         [3]
    sre                Fredrik Lundh               2.1
    subprocess         Peter Astrand               2.2
    wsgiref            Phillip J. Eby              2.1
    xml (PyXML)        Martin v. Loewis            2.0
    xmlrpclib          Fredrik Lundh               2.1

    Tool                         Maintainer(s)   Python Version
    ----                         -------------   --------------
    None


    Notes
    -----

    [1] The email package version 2 was distributed with Python up to
        Python 2.3, and this must remain Python 2.1 compatible.  email
        package version 3 will be distributed with Python 2.4 and will
        need to remain compatible only with Python 2.3.

    [2] Specification updates will be treated as bugfixes and backported.
        Python 2.3 compatibility will be kept for at least Python 2.4.
        The decision will be revisited for Python 2.5 and not changed
        unless compelling advantages arise.

    [3] pybench lives under the Tools/ directory. Compatibility with
        older Python versions is needed in order to be able to compare
        performance between Python versions. New features may still
        be used in new tests, which may then be configured to fail 
        gracefully on import by the tool in older Python versions.


Copyright

    This document has been placed in the public domain.



pep-0292 Simpler String Substitutions

PEP: 292
Title: Simpler String Substitutions
Version: $Revision$
Last-Modified: $Date$
Author: Barry Warsaw <barry at python.org>
Status: Final
Type: Standards Track
Created: 18-Jun-2002
Python-Version: 2.4
Post-History: 18-Jun-2002, 23-Mar-2004, 22-Aug-2004

Abstract

    This PEP describes a simpler string substitution feature, also
    known as string interpolation.  This PEP is "simpler" in two
    respects:

    1. Python's current string substitution feature
       (i.e. %-substitution) is complicated and error prone.  This PEP
       is simpler at the cost of some expressiveness.

    2. PEP 215 proposed an alternative string interpolation feature,
       introducing a new `$' string prefix.  PEP 292 is simpler than
       this because it involves no syntax changes and has much simpler
       rules for what substitutions can occur in the string.


Rationale

    Python currently supports a string substitution syntax based on
    C's printf() '%' formatting character [1].  While quite rich,
    %-formatting codes are also error prone, even for
    experienced Python programmers.  A common mistake is to leave off
    the trailing format character, e.g. the `s' in "%(name)s".

    In addition, the rules for what can follow a % sign are fairly
    complex, while the usual application rarely needs such complexity.
    Most scripts need to do some string interpolation, but most of
    those use simple `stringification' formats, i.e. %s or %(name)s.
    This form should be made simpler and less error prone.


A Simpler Proposal

    We propose the addition of a new class, called 'Template', which
    will live in the string module.  The Template class supports new
    rules for string substitution; its value contains placeholders,
    introduced with the $ character.  The following rules for
    $-placeholders apply:

    1. $$ is an escape; it is replaced with a single $

    2. $identifier names a substitution placeholder matching a mapping
       key of "identifier".  By default, "identifier" must spell a
       Python identifier as defined in [2].  The first non-identifier
       character after the $ character terminates this placeholder
       specification.

    3. ${identifier} is equivalent to $identifier.  It is required
       when valid identifier characters follow the placeholder but are
       not part of the placeholder, e.g. "${noun}ification".

    If the $ character appears at the end of the line, or is followed
    by any other character than those described above, a ValueError
    will be raised at interpolation time.  Values in mapping are
    converted automatically to strings.

    No other characters have special meaning; however, it is possible
    to derive from the Template class to define different substitution
    rules.  For example, a derived class could allow for periods in
    the placeholder (e.g. to support a kind of dynamic namespace and
    attribute path lookup), or could define a delimiter character
    other than '$'.
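As a sketch of such a derived class: the subclass name and dotted-key convention below are illustrative, assuming the Template implementation exposes an overridable idpattern class attribute as in the standard library version:

```python
from string import Template

class PathTemplate(Template):
    # allow periods inside placeholder names, e.g. $user.name
    idpattern = r'[_a-z][_a-z0-9.]*'

t = PathTemplate('$user.name was born in $user.country')
result = t.substitute({'user.name': 'Guido',
                       'user.country': 'the Netherlands'})
```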

    Once the Template has been created, substitutions can be performed
    by calling one of two methods:

    - substitute().  This method returns a new string which results
      when the values of a mapping are substituted for the
      placeholders in the Template.  If there are placeholders which
      are not present in the mapping, a KeyError will be raised.

    - safe_substitute().  This is similar to the substitute() method,
      except that KeyErrors are never raised (due to placeholders
      missing from the mapping).  When a placeholder is missing, the
      original placeholder will appear in the resulting string.

    Here are some examples:

        >>> from string import Template
        >>> s = Template('${name} was born in ${country}')
        >>> print s.substitute(name='Guido', country='the Netherlands')
        Guido was born in the Netherlands
        >>> print s.substitute(name='Guido')
        Traceback (most recent call last):
        [...]
        KeyError: 'country'
        >>> print s.safe_substitute(name='Guido')
        Guido was born in ${country}

    The signature of substitute() and safe_substitute() allows for
    passing the mapping of placeholders to values, either as a single
    dictionary-like object in the first positional argument, or as
    keyword arguments as shown above.  The exact details and
    signatures of these two methods are reserved for the standard
    library documentation.


Why `$' and Braces?

    The BDFL said it best [4]: "The $ means "substitution" in so many
    languages besides Perl that I wonder where you've been. [...]
    We're copying this from the shell."

    Thus the substitution rules are chosen because of the similarity
    with so many other languages.  This makes the substitution rules
    easier to teach, learn, and remember.


Comparison to PEP 215

    PEP 215 describes an alternate proposal for string interpolation.
    Unlike that PEP, this one does not propose any new syntax for
    Python.  All the proposed new features are embodied in a new
    library module.  PEP 215 proposes a new string prefix
    representation such as $"" which signals to Python that a new type
    of string is present.  $-strings would have to interact with the
    existing r-prefixes and u-prefixes, essentially doubling the
    number of string prefix combinations.

    PEP 215 also allows for arbitrary Python expressions inside the
    $-strings, so that you could do things like:

        import sys
        print $"sys = $sys, sys = $sys.modules['sys']"

    which would return

        sys = <module 'sys' (built-in)>, sys = <module 'sys' (built-in)>

    It's generally accepted that the rules in PEP 215 are safe in the
    sense that they introduce no new security issues (see PEP 215,
    "Security Issues" for details).  However, the rules are still
    quite complex, and make it more difficult to see the substitution
    placeholder in the original $-string.

    The interesting thing is that the Template class defined in this
    PEP is designed for inheritance and, with a little extra work,
    it's possible to support PEP 215's functionality using existing
    Python syntax.

    For example, one could define subclasses of Template and dict that
    allowed for a more complex placeholder syntax and a mapping that
    evaluated those placeholders.


Internationalization

    The implementation supports internationalization by recording the
    original template string in the Template instance's 'template'
    attribute.  This attribute would serve as the lookup key in a
    gettext-based catalog.  It is up to the application to turn the
    resulting string back into a Template for substitution.

    However, the Template class was designed to work more intuitively
    in an internationalized application, by supporting the mixing-in
    of Template and unicode subclasses.  Thus an internationalized
    application could create an application-specific subclass,
    multiply inheriting from Template and unicode, and using instances
    of that subclass as the gettext catalog key.  Further, the
    subclass could alias the special __mod__() method to either
    .substitute() or .safe_substitute() to provide a more traditional
    string/unicode like %-operator substitution syntax.


Reference Implementation

    The implementation has been committed to the Python 2.4 source tree.


References

    [1] String Formatting Operations
        http://docs.python.org/library/stdtypes.html#string-formatting-operations

    [2] Identifiers and Keywords
        http://docs.python.org/reference/lexical_analysis.html#identifiers-and-keywords

    [3] Guido's python-dev posting from 21-Jul-2002
        http://mail.python.org/pipermail/python-dev/2002-July/026397.html

    [4] http://mail.python.org/pipermail/python-dev/2002-June/025652.html

    [5] Reference Implementation
        http://sourceforge.net/tracker/index.php?func=detail&aid=1014055&group_id=5470&atid=305470

Copyright

    This document has been placed in the public domain.



pep-0293 Codec Error Handling Callbacks

PEP: 293
Title: Codec Error Handling Callbacks
Version: $Revision$
Last-Modified: $Date$
Author: Walter DĂśrwald <walter at livinglogic.de>
Status: Final
Type: Standards Track
Created: 18-Jun-2002
Python-Version: 2.3
Post-History: 19-Jun-2002

Abstract

    This PEP aims at extending Python's fixed codec error handling
    schemes with a more flexible callback based approach.

    Python currently uses a fixed error handling for codec error
    handlers.  This PEP describes a mechanism which allows Python to
    use function callbacks as error handlers.  With these more
    flexible error handlers it is possible to add new functionality to
    existing codecs by e.g. providing fallback solutions or different
    encodings for cases where the standard codec mapping does not
    apply.


Specification

    Currently the set of codec error handling algorithms is fixed to
    either "strict", "replace" or "ignore" and the semantics of these
    algorithms are implemented separately for each codec.

    The proposed patch will make the set of error handling algorithms
    extensible through a codec error handler registry which maps
    handler names to handler functions.  This registry consists of the
    following two C functions:

        int PyCodec_RegisterError(const char *name, PyObject *error)

        PyObject *PyCodec_LookupError(const char *name)

    and their Python counterparts

        codecs.register_error(name, error)

        codecs.lookup_error(name)

    PyCodec_LookupError raises a LookupError if no callback function
    has been registered under this name.

    Similar to the encoding name registry there is no way of
    unregistering callback functions or iterating through the
    available functions.

    The callback functions will be used in the following way by the
    codecs: when the codec encounters an encoding/decoding error, the
    callback function is looked up by name, the information about the
    error is stored in an exception object and the callback is called
    with this object.  The callback returns information about how to
    proceed (or raises an exception).
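The registry can be exercised from Python directly; the handler name and replacement policy below are made up for illustration (shown with Python 3 semantics, where encode() returns bytes):

```python
import codecs

def hyphenreplace(exc):
    # replace each unencodable character with a hyphen and resume
    if isinstance(exc, UnicodeEncodeError):
        return (u"-" * (exc.end - exc.start), exc.end)
    raise exc

codecs.register_error("hyphenreplace", hyphenreplace)
encoded = u"a\u20acb".encode("ascii", "hyphenreplace")
```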

    For encoding, the exception object will look like this:

       class UnicodeEncodeError(UnicodeError):
           def __init__(self, encoding, object, start, end, reason):
               UnicodeError.__init__(self,
                   ("encoding '%s' can't encode characters "
                    "in positions %d-%d: %s") % (encoding,
                        start, end-1, reason))
               self.encoding = encoding
               self.object = object
               self.start = start
               self.end = end
               self.reason = reason

    This type will be implemented in C with the appropriate setter and
    getter methods for the attributes, which have the following
    meaning:

      * encoding: The name of the encoding;
      * object: The original unicode object for which encode() has
        been called;
      * start: The position of the first unencodable character;
      * end: (The position of the last unencodable character)+1 (or
        the length of object, if all characters from start to the end
        of object are unencodable);
      * reason: The reason why object[start:end] couldn't be encoded.

    If object has consecutive unencodable characters, the encoder
    should collect those characters for one call to the callback if
    those characters can't be encoded for the same reason.  The
    encoder is not required to implement this behaviour and may call
    the callback for every single character, but implementing the
    collecting method is strongly recommended.

    The callback must not modify the exception object.  If the
    callback does not raise an exception (either the one passed in, or
    a different one), it must return a tuple:

        (replacement, newpos)

    replacement is a unicode object that the encoder will encode and
    emit instead of the unencodable object[start:end] part, newpos
    specifies a new position within object, where (after encoding the
    replacement) the encoder will continue encoding.

    Negative values for newpos are treated as being relative to
    end of object. If newpos is out of bounds the encoder will raise
    an IndexError.

    If the replacement string itself contains an unencodable character
    the encoder raises the exception object (but may set a different
    reason string before raising).

    Should further encoding errors occur, the encoder is allowed to
    reuse the exception object for the next call to the callback.
    Furthermore the encoder is allowed to cache the result of
    codecs.lookup_error.

    If the callback does not know how to handle the exception, it must
    raise a TypeError.

    Decoding works similarly to encoding, with the following differences:
    The exception class is named UnicodeDecodeError and the attribute
    object is the original 8bit string that the decoder is currently
    decoding.

    The decoder will call the callback with those bytes that
    constitute one undecodable sequence, even if several undecodable
    sequences with the same reason follow directly after the first
    one.  E.g. for the "unicode-escape"
    encoding, when decoding the illegal string "\\u00\\u01x", the
    callback will be called twice (once for "\\u00" and once for
    "\\u01").  This is done to be able to generate the correct number
    of replacement characters.

    The replacement returned from the callback is a unicode object
    that will be emitted by the decoder as-is without further
    processing instead of the undecodable object[start:end] part.

    There is a third API that uses the old strict/ignore/replace error
    handling scheme:

        PyUnicode_TranslateCharmap/unicode.translate

    The proposed patch will enhance PyUnicode_TranslateCharmap, so
    that it also supports the callback registry.  This has the
    additional side effect that PyUnicode_TranslateCharmap will
    support multi-character replacement strings (see SF feature
    request #403100 [1]).

    For PyUnicode_TranslateCharmap the exception class will be named
    UnicodeTranslateError.  PyUnicode_TranslateCharmap will collect
    all consecutive untranslatable characters (i.e. those that map to
    None) and call the callback with them.  The replacement returned
    from the callback is a unicode object that will be put in the
    translated result as-is, without further processing.

    All encoders and decoders are allowed to implement the callback
    functionality themselves, if they recognize the callback name
    (i.e. if it is a system callback like "strict", "replace" and
    "ignore").  The proposed patch will add two additional system
    callback names: "backslashreplace" and "xmlcharrefreplace", which
    can be used for encoding and translating and which will also be
    implemented in-place for all encoders and
    PyUnicode_TranslateCharmap.

    The Python equivalent of these five callbacks will look like this:

        def strict(exc):
            raise exc

        def ignore(exc):
            if isinstance(exc, UnicodeError):
                return (u"", exc.end)
            else:
                raise TypeError("can't handle %s" % exc.__class__.__name__)

        def replace(exc):
            if isinstance(exc, UnicodeEncodeError):
                return ((exc.end-exc.start)*u"?", exc.end)
            elif isinstance(exc, UnicodeDecodeError):
                return (u"\ufffd", exc.end)
            elif isinstance(exc, UnicodeTranslateError):
                return ((exc.end-exc.start)*u"\ufffd", exc.end)
            else:
                raise TypeError("can't handle %s" % exc.__class__.__name__)

        def backslashreplace(exc):
            if isinstance(exc,
                (UnicodeEncodeError, UnicodeTranslateError)):
                s = u""
                for c in exc.object[exc.start:exc.end]:
                    if ord(c)<=0xff:
                        s += u"\\x%02x" % ord(c)
                    elif ord(c)<=0xffff:
                        s += u"\\u%04x" % ord(c)
                    else:
                        s += u"\\U%08x" % ord(c)
                return (s, exc.end)
            else:
                raise TypeError("can't handle %s" % exc.__class__.__name__)

        def xmlcharrefreplace(exc):
            if isinstance(exc,
                (UnicodeEncodeError, UnicodeTranslateError)):
                s = u""
                for c in exc.object[exc.start:exc.end]:
                    s += u"&#%d;" % ord(c)
                return (s, exc.end)
            else:
                raise TypeError("can't handle %s" % exc.__class__.__name__)

    These five callback handlers will also be accessible to Python as
    codecs.strict_error, codecs.ignore_error, codecs.replace_error,
    codecs.backslashreplace_error and codecs.xmlcharrefreplace_error.
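The two new system callbacks can be tried directly (shown with Python 3 semantics, where encode() returns bytes):

```python
text = u"div\u203d"
as_xml = text.encode("ascii", "xmlcharrefreplace")       # numeric character reference
as_backslash = text.encode("ascii", "backslashreplace")  # \u escape sequence
```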


Rationale

    Most legacy encodings do not support the full range of Unicode
    characters.  For these cases many high level protocols support a
    way of escaping a Unicode character (e.g. Python itself supports
    the \x, \u and \U convention, XML supports character references
    via &#xxx; etc.).

    When implementing such an encoding algorithm, a problem with the
    current implementation of the encode method of Unicode objects
    becomes apparent: For determining which characters are unencodable
    by a certain encoding, every single character has to be tried,
    because encode does not provide any information about the location
    of the error(s), so

        # (1)
        us = u"xxx"
        s = us.encode(encoding)

    has to be replaced by

        # (2)
        us = u"xxx"
        v = []
        for c in us:
            try:
                v.append(c.encode(encoding))
            except UnicodeError:
                v.append("&#%d;" % ord(c))
        s = "".join(v)

    This slows down encoding dramatically as now the loop through the
    string is done in Python code and no longer in C code.

    Furthermore this solution poses problems with stateful encodings.
    For example UTF-16 uses a Byte Order Mark at the start of the
    encoded byte string to specify the byte order.  Using (2) with
    UTF-16 results in an 8-bit string with a BOM between every
    character.

    To work around this problem, a stream writer - which keeps state
    between calls to the encoding function - has to be used:

        # (3)
        us = u"xxx"
        import codecs, cStringIO as StringIO
        writer = codecs.getwriter(encoding)

        v = StringIO.StringIO()
        uv = writer(v)
        for c in us:
            try:
                uv.write(c)
            except UnicodeError:
                uv.write(u"&#%d;" % ord(c))
        s = v.getvalue()

    To compare the speed of (1) and (3) the following test script has
    been used:

        # (4)
        import time
        us = u"äa"*1000000
        encoding = "ascii"
        import codecs, cStringIO as StringIO

        t1 = time.time()

        s1 = us.encode(encoding, "replace")

        t2 = time.time()

        writer = codecs.getwriter(encoding)

        v = StringIO.StringIO()
        uv = writer(v)
        for c in us:
            try:
                uv.write(c)
            except UnicodeError:
                uv.write(u"?")
        s2 = v.getvalue()

        t3 = time.time()

        assert(s1==s2)
        print "1:", t2-t1
        print "2:", t3-t2
        print "factor:", (t3-t2)/(t2-t1)

    On Linux this gives the following output (with Python 2.3a0):

        1: 0.274321913719
        2: 51.1284689903
        factor: 186.381278466

    i.e. (3) is 180 times slower than (1).

    Callbacks must be stateless, because as soon as a callback is
    registered it is available globally and can be called by multiple
    encode() calls.  To be able to use stateful callbacks, the errors
    parameter for encode/decode/translate would have to be changed
    from char * to PyObject *, so that the callback could be used
    directly, without the need to register the callback globally.  As
    this requires changes to lots of C prototypes, this approach was
    rejected.
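    (The global registry described here survives in today's Python as
    codecs.register_error().  A minimal sketch of a stateless handler,
    written in modern Python 3 syntax rather than the 2.x of this PEP:)

```python
import codecs

# A stateless handler: it sees only the exception object, never any
# per-call state from a particular encode() invocation.
def charref_replace(exc):
    if isinstance(exc, UnicodeEncodeError):
        # Replace the unencodable span with XML character references
        # and tell the codec to resume encoding after it.
        replacement = "".join("&#%d;" % ord(c)
                              for c in exc.object[exc.start:exc.end])
        return replacement, exc.end
    raise exc

codecs.register_error("charref", charref_replace)

print("a\u00e4b".encode("ascii", "charref"))   # b'a&#228;b'
```

    Once registered, the handler name is available to every encode()
    call in the process, which is exactly why the handler may keep no
    state of its own.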

    Currently all encoding/decoding functions have arguments

        const Py_UNICODE *p, int size

    or

        const char *p, int size

    to specify the unicode characters/8bit characters to be
    encoded/decoded.  So in case of an error the codec has to create a
    new unicode or str object from these parameters and store it in
    the exception object.  The callers of these encoding/decoding
    functions extract these parameters from str/unicode objects
    themselves most of the time, so it could speed up error handling
    if these objects were passed directly.  As this again requires
    changes to many C functions, this approach has been rejected.

    For stream readers/writers the errors attribute must be changeable
    to be able to switch between different error handling methods
    during the lifetime of the stream reader/writer. This is currently
    the case for codecs.StreamReader and codecs.StreamWriter and
    all their subclasses. All core codecs and probably most of the
    third party codecs (e.g. JapaneseCodecs) derive their stream
    readers/writers from these classes so this already works,
    but the attribute errors should be documented as a requirement.
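    (A minimal sketch of this in modern Python 3, with io.BytesIO
    standing in for cStringIO: the errors attribute of a stream writer
    can be reassigned between write() calls.)

```python
import codecs
import io

raw = io.BytesIO()
writer = codecs.getwriter("ascii")(raw, errors="strict")

writer.write("abc")          # encodes fine under "strict"
writer.errors = "replace"    # switch handlers during the writer's lifetime
writer.write("d\u00e4f")     # 'ä' now becomes '?' instead of raising

print(raw.getvalue())        # b'abcd?f'
```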


Implementation Notes

    A sample implementation is available as SourceForge patch #432401
    [2] including a script for testing the speed of various
    string/encoding/error combinations and a test script.

    Currently the new exception classes are old-style Python
    classes.  This means that accessing attributes results in a dict
    lookup.  The C API is implemented in a way that makes it possible
    to switch to new-style classes behind the scenes, if Exception
    (and UnicodeError) are ever changed to new-style classes
    implemented in C for improved performance.

    The class codecs.StreamReaderWriter uses the errors parameter for
    both reading and writing.  To be more flexible this should
    probably be changed to two separate parameters for reading and
    writing.

    The errors parameter of PyUnicode_TranslateCharmap is not
    available to Python, which makes testing of the new functionality
    of PyUnicode_TranslateCharmap impossible with Python scripts.  The
    patch should add an optional argument errors to unicode.translate
    to expose the functionality and make testing possible.

    Codecs that do something other than encoding/decoding from/to
    unicode and want to use the new machinery can define their own
    exception classes, and the strict handlers will automatically work
    with them.  The other predefined error handlers are unicode
    specific and expect to get a Unicode(Encode|Decode|Translate)Error
    exception object, so they won't work.


Backwards Compatibility

    The semantics of unicode.encode with errors="replace" have changed:
    The old version always stored a ? character in the output string
    even if no character was mapped to ? in the mapping.  With the
    proposed patch, the replacement string from the callback will
    again be looked up in the mapping dictionary.  But as all
    supported encodings are ASCII based, and thus map ? to ?, this
    should not be a problem in practice.

    Illegal values for the errors argument raised ValueError before,
    now they will raise LookupError.
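    (This is still the observable behavior in today's Python; a quick
    check in modern Python 3 syntax:)

```python
import codecs

# Probing the handler registry directly:
try:
    codecs.lookup_error("nonexistent-handler")
    registry_error = None
except LookupError as exc:
    registry_error = exc

# The same error surfaces when an encode call first needs the handler:
try:
    "\u00e4".encode("ascii", "nonexistent-handler")
    encode_error = None
except LookupError as exc:
    encode_error = exc

print(type(registry_error).__name__, type(encode_error).__name__)
```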


References

    [1] SF feature request #403100
        "Multicharacter replacements in PyUnicode_TranslateCharmap"
        http://www.python.org/sf/403100

    [2] SF patch #432401 "unicode encoding error callbacks"
        http://www.python.org/sf/432401


Copyright

    This document has been placed in the public domain.



pep-0294 Type Names in the types Module

PEP: 294
Title: Type Names in the types Module
Version: $Revision$
Last-Modified: $Date$
Author: oren at hishome.net (Oren Tirosh)
Status: Rejected
Type: Standards Track
Created: 19-Jun-2002
Python-Version: 2.5
Post-History: 

Abstract

    This PEP proposes that, for all basic Python types, a symbol
    matching the type name should be added to the types module:

        types.IntegerType -> types.int
        types.FunctionType -> types.function
        types.TracebackType -> types.traceback
         ...    

    The long capitalized names currently in the types module will be
    deprecated.

    With this change the types module can serve as a replacement for
    the new module.  The new module shall be deprecated and listed in
    PEP 4.


Pronouncement

    A centralized repository of type names was a mistake.  Neither the
    "types" nor "new" modules should be carried forward to Python 3.0.

    In the meantime, it does not make sense to make the proposed updates
    to the modules.  This would cause disruption without any compensating
    benefit.

    Instead, the problem that some internal types (frames, functions,
    etc.) don't live anywhere outside those modules may be addressed
    by adding them to either __builtin__ or sys.  This will provide a
    smoother transition to Python 3.0.
    

Rationale

    Using two sets of names for the same objects is redundant and
    confusing.

    In Python versions prior to 2.2 the symbols matching many type
    names were taken by the factory functions for those types.  Now
    all basic types have been unified with their factory functions and
    therefore the type names are available to be consistently used to
    refer to the type object.

    Most types are accessible as either builtins or in the new module,
    but some types such as traceback and generator are only accessible
    through the types module under names which do not match the type
    name.  This PEP provides a uniform way to access all basic types
    under a single set of names.


Specification

    The types module shall pass the following test:

        import types
        for t in vars(types).values():
            if type(t) is type:
                assert getattr(types, t.__name__) is t

    The types 'class', 'instance method' and 'dict-proxy' have already
    been renamed to the valid Python identifiers 'classobj',
    'instancemethod' and 'dictproxy', making this possible.
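    (The rejection stuck: in modern Python 3 the capitalized names
    remain and the lowercase aliases proposed here were never added,
    so the test above would still fail.  A quick check:)

```python
import types

# The capitalized names survived into Python 3 ...
print(types.FunctionType.__name__)    # 'function'
print(types.TracebackType.__name__)   # 'traceback'

# ... but the short aliases this PEP proposed were never added:
print(hasattr(types, "function"))     # False
print(hasattr(types, "traceback"))    # False
```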


Backward compatibility

    Because of their widespread use it is not planned to actually
    remove the long names from the types module in some future
    version.  However, the long names should be changed in
    documentation and library sources to discourage their use in new
    code.


Reference Implementation

    A reference implementation is available in SourceForge patch
    #569328: http://www.python.org/sf/569328
  

Copyright

    This document has been placed in the public domain.



pep-0295 Interpretation of multiline string constants

PEP: 295
Title: Interpretation of multiline string constants
Version: $Revision$
Last-Modified: $Date$
Author: Stepan Koltsov <yozh at mx1.ru>
Status: Rejected
Type: Standards Track
Created: 22-Jul-2002
Python-Version: 3.0
Post-History: 

Abstract

    This PEP describes an interpretation of multiline string constants
    for Python.  It suggests stripping spaces after newlines and
    stripping a newline if it is the first character after an opening
    quotation.


Rationale

    This PEP proposes an interpretation of multiline string constants
    in Python.  Currently, the value of a string constant is all the
    text between the quotation marks, possibly with escape sequences
    substituted, e.g.:

        def f():
            """
            la-la-la
            limona, banana
            """
        
        def g():
            return "This is \
            string"
        
        print repr(f.__doc__)
        print repr(g())
    
    prints:
    
        '\n\tla-la-la\n\tlimona, banana\n\t'
        'This is \tstring'
    
    This PEP suggests two things:

	- ignore the first character after the opening quotation, if it
	  is a newline
	- ignore in string constants all spaces and tabs up to the
	  first non-whitespace character, but no more than the current
	  indentation.

    After applying this, the previous program will print:
    
        'la-la-la\nlimona, banana\n'
        'This is string'
    
    To get this result, the previous programs could be rewritten for
    current Python as follows (note that this gives the same result
    with the new string meaning):
    
        def f():
            """\
        la-la-la
        limona, banana
        """
        
        def g():
            "This is \
        string"
    
    Or stripping can be done with library routines at runtime (as
    pydoc does), but this decreases program readability.
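    (One such library routine in today's standard library is
    textwrap.dedent; a sketch in modern Python 3 showing it producing
    the stripped result this PEP proposes:)

```python
import textwrap

def f():
    """
    la-la-la
    limona, banana
    """

# dedent() strips the common leading whitespace from every line;
# lstrip("\n") drops the newline that immediately follows the
# opening quotation.
stripped = textwrap.dedent(f.__doc__).lstrip("\n")
print(repr(stripped))       # 'la-la-la\nlimona, banana\n'
```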


Implementation

    I'll say nothing about CPython, Jython or Python.NET.
    
    In original Python, there is no info about the current indentation
    (in spaces) at compile time, so space and tab stripping should be
    done at parse time.  Currently no flags can be passed to the
    parser in program text (like from __future__ import xxx).  I
    suggest enabling or disabling this feature at Python compile
    time depending on the CPP flag Py_PARSE_MULTILINE_STRINGS.


Alternatives

    The new interpretation of string constants can be implemented with
    flags 'i' and 'o' on string constants, like
    
        i"""
        SELECT * FROM car
        WHERE model = 'i525'
        """ is in new style,
        
        o"""SELECT * FROM employee
        WHERE birth < 1982
        """ is in old style, and
        
        """
        SELECT employee.name, car.name, car.price FROM employee, car
        WHERE employee.salary * 36 > car.price
        """ is in new style after Python-x.y.z and in old style otherwise.
    
    Also, this feature can be disabled if the string is raw, i.e. if
    the flag 'r' is specified.


Copyright

    This document has been placed in the Public Domain.



pep-0296 Adding a bytes Object Type

PEP: 296
Title: Adding a bytes Object Type
Version: $Revision$
Last-Modified: $Date$
Author: xscottg at yahoo.com (Scott Gilbert)
Status: Withdrawn
Type: Standards Track
Created: 12-Jul-2002
Python-Version: 2.3
Post-History: 

Notice

    This PEP is withdrawn by the author (in favor of PEP 358).


Abstract

    This PEP proposes the creation of a new standard type and builtin
    constructor called 'bytes'.  The bytes object is an efficiently
    stored array of bytes with some additional characteristics that
    set it apart from several similar existing implementations.


Rationale

    Python currently has many objects that implement something akin to
    the bytes object of this proposal.  For instance the standard
    string, buffer, array, and mmap objects are all very similar in
    some regards to the bytes object.  Additionally, several
    significant third party extensions have created similar objects to
    try and fill similar needs.  Frustratingly, each of these objects
    is too narrow in scope and is missing critical features to make it
    applicable to a wider category of problems.


Specification

    The bytes object has the following important characteristics:

    1. Efficient underlying array storage via the standard C type "unsigned
    char".  This allows fine-grained control over how much memory is
    allocated.  With the alignment restrictions designated in the next
    item, it is trivial for low level extensions to cast the pointer
    to a different type as needed.
    
    Also, since the object is implemented as an array of bytes, it is
    possible to pass the bytes object to the extensive library of
    routines already in the standard library that presently work with
    strings.  For instance, the bytes object in conjunction with the
    struct module could be used to provide a complete replacement for
    the array module using only Python script.

    If an unusual platform comes to light, one where there isn't a
    native unsigned 8 bit type, the object will do its best to
    represent itself at the Python script level as though it were an
    array of 8 bit unsigned values.  It is doubtful whether many
    extensions would handle this correctly, but Python script could be
    portable in these cases.

    2. Alignment of the allocated byte array is whatever is promised by the
    platform implementation of malloc.  A bytes object created from an
    extension can be supplied that provides any arbitrary alignment as
    the extension author sees fit.

    This alignment restriction should allow the bytes object to be
    used as storage for all standard C types - including PyComplex
    objects or other structs of standard C type types.  Further
    alignment restrictions can be provided by extensions as necessary.

    3. The bytes object implements a subset of the sequence operations
    provided by string/array objects, but with slightly different
    semantics in some cases.  In particular, a slice always returns a
    new bytes object, but the underlying memory is shared between the
    two objects.  This type of slice behavior has been called creating
    a "view".  Additionally, repetition and concatenation are
    undefined for bytes objects and will raise an exception.

    As these objects are likely to find use in high performance
    applications, one motivation for the decision to use view slicing
    is that copying between bytes objects should be very efficient and
    not require the creation of temporary objects.  The following code
    illustrates this:

        # create two 10 Meg bytes objects
        b1 = bytes(10000000)
        b2 = bytes(10000000)

        # copy from part of one to another without creating a 1 Meg temporary
        b1[2000000:3000000] = b2[4000000:5000000]

    Slice assignment where the rvalue is not the same length as the
    lvalue will raise an exception.  However, slice assignment will
    work correctly with overlapping slices (typically implemented with
    memmove).
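    (Python later grew exactly this kind of shared-memory "view"
    slicing in the memoryview object of PEP 3118, rather than in bytes
    itself; a modern Python 3 sketch of the semantics described above:)

```python
buf = bytearray(b"abcdefgh")
view = memoryview(buf)

part = view[2:6]            # a "view": shares memory with buf, no copy
buf[2] = ord("X")           # mutate through the original object...
print(bytes(part))          # ...and the slice sees it: b'Xdef'

# Length-mismatched slice assignment raises, as specified above:
try:
    view[0:4] = b"too long"
    mismatch = None
except ValueError as exc:
    mismatch = exc
print(type(mismatch).__name__)
```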

    4. The bytes object will be recognized as a native type by the pickle and
    cPickle modules for efficient serialization.  (In truth, this is
    the only requirement that can't be implemented via a third party
    extension.)

    Partial solutions to address the need to serialize the data stored
    in a bytes-like object without creating a temporary copy of the
    data into a string have been implemented in the past.  The tofile
    and fromfile methods of the array object are good examples of
    this.  The bytes object will support these methods too.  However,
    pickling is useful in other situations - such as in the shelve
    module, or implementing RPC of Python objects, and requiring the
    end user to use two different serialization mechanisms to get an
    efficient transfer of data is undesirable.

    XXX: Will try to implement pickling of the new bytes object in
    such a way that previous versions of Python will unpickle it as a
    string object.

    When unpickling, the bytes object will be created from memory
    allocated from Python (via malloc).  As such, it will lose any
    additional properties that an extension supplied pointer might
    have provided (special alignment, or special types of memory).

    XXX: Will try to make it so that C subclasses of bytes type can
    supply the memory that will be unpickled into.  For instance, a
    derived class called PageAlignedBytes would unpickle to memory
    that is also page aligned.

    On any platform where an int is 32 bits (most of them), it is
    currently impossible to create a string with a length larger than
    can be represented in 31 bits.  As such, pickling to a string will
    raise an exception when the operation is not possible.

    At least on platforms supporting large files (many of them),
    pickling large bytes objects to files should be possible via
    repeated calls to the file.write() method.

    5. The bytes type supports the PyBufferProcs interface, but a bytes object
    provides the additional guarantee that the pointer will not be
    deallocated or reallocated as long as a reference to the bytes
    object is held.  This implies that a bytes object is not resizable
    once it is created, but allows the global interpreter lock (GIL)
    to be released while a separate thread manipulates the memory
    pointed to if the PyBytes_Check(...) test passes.

    This characteristic of the bytes object allows it to be used in
    situations such as asynchronous file I/O or on multiprocessor
    machines where the pointer obtained by PyBufferProcs will be used
    independently of the global interpreter lock.

    Knowing that the pointer can not be reallocated or freed after the
    GIL is released gives extension authors the capability to get true
    concurrency and make use of additional processors for long running
    computations on the pointer.

    6. In C/C++ extensions, the bytes object can be created from a supplied
    pointer and destructor function to free the memory when the
    reference count goes to zero.

    The special implementation of slicing for the bytes object allows
    multiple bytes objects to refer to the same pointer/destructor.
    As such, a refcount will be kept on the actual
    pointer/destructor.  This refcount is separate from the refcount
    typically associated with Python objects.

    XXX: It may be desirable to expose the inner refcounted object as an
    actual Python object.  If a good use case arises, it should be possible
    for this to be implemented later with no loss to backwards compatibility.

    7. It is also possible to mark a bytes object as read-only; in that
    case it isn't actually mutable, but it still provides the other
    features of a bytes object.

    8. The bytes object keeps track of the length of its data with a Python
    LONG_LONG type.  Even though the current definition for PyBufferProcs
    restricts the length to be the size of an int, this PEP does not propose
    to make any changes there.  Instead, extensions can work around this limit
    by making an explicit PyBytes_Check(...) call, and if that succeeds they
    can make a PyBytes_GetReadBuffer(...) or PyBytes_GetWriteBuffer call to
    get the pointer and full length of the object as a LONG_LONG.

    The bytes object will raise an exception if the standard PyBufferProcs
    mechanism is used and the size of the bytes object is greater than can be
    represented by an integer.

    From Python scripting, the bytes object will be subscriptable with longs
    so the 32 bit int limit can be avoided.

    There is still a problem with the len() function as it is PyObject_Size()
    and this returns an int as well.  As a workaround, the bytes object will
    provide a .length() method that will return a long.

    9. The bytes object can be constructed at the Python scripting level by
    passing an int/long to the bytes constructor with the number of bytes to
    allocate.  For example:

       b = bytes(100000) # alloc 100K bytes

    The constructor can also take another bytes object.  This will be useful
    for the implementation of unpickling, and in converting a read-write bytes
    object into a read-only one.  An optional second argument will be used to
    designate creation of a readonly bytes object.
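    (Modern Python 3 split this design into immutable bytes and
    mutable bytearray; the count-argument constructor survives in both
    as a zero-filled buffer, and bytes(m) plays the role of the
    read-only conversion.  A sketch, not this PEP's exact API:)

```python
b = bytes(100000)          # immutable, zero-filled 100K buffer
m = bytearray(100000)      # the mutable counterpart

print(len(b), b[0])        # 100000 0

# Constructing from another object: freeze a mutable buffer into a
# read-only (immutable) one.
m[0] = 65
frozen = bytes(m)
print(frozen[:1])          # b'A'
```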

    10. From the C API, the bytes object can be allocated using any of the
    following signatures:

       PyObject* PyBytes_FromLength(LONG_LONG len, int readonly);
       PyObject* PyBytes_FromPointer(void* ptr, LONG_LONG len, int readonly,
                void (*dest)(void *ptr, void *user), void* user);
    
    In the PyBytes_FromPointer(...) function, if the dest function pointer is
    passed in as NULL, it will not be called.  This should only be used for
    creating bytes objects from statically allocated space.
    
    The user pointer has been called a closure in other places.  It is
    a pointer that the user can use for whatever purpose.  It will be
    passed to the destructor function on cleanup, where it can be
    useful for a number of things.  If the user pointer is not needed,
    NULL should be passed instead.
 
    11. The bytes type will be a new style class as that seems to be where all
    standard Python types are headed.


Contrast to existing types

    The most common way to work around the lack of a bytes object has been to
    simply use a string object in its place.  Binary files, the struct/array
    modules, and several other examples exist of this.  Putting aside the
    style issue that these uses typically have nothing to do with text
    strings, there is the real problem that strings are not mutable, so direct
    manipulation of the data returned in these cases is not possible.  Also,
    numerous optimizations in the string module (such as caching the hash
    value or interning the pointers) mean that extension authors are on very
    thin ice if they try to break the rules with the string object.

    The buffer object seems like it was intended to address the purpose
    that the bytes object is trying to fulfill, but several
    shortcomings in its implementation [1] have made it less useful in
    many common cases.  The
    buffer object made a different choice for its slicing behavior (it returns
    new strings instead of buffers for slicing and other operations), and it
    doesn't make many of the promises on alignment or being able to release
    the GIL that the bytes object does.

    Also, in regard to the buffer object, it is not possible to simply
    replace the buffer object with the bytes object and maintain
    backwards compatibility.  The buffer object provides a mechanism
    to take the
    PyBufferProcs supplied pointer of another object and present it as its
    own.  Since the behavior of the other object can not be guaranteed to
    follow the same set of strict rules that a bytes object does, it can't be
    used in places that a bytes object could.

    The array module supports the creation of an array of bytes, but it does
    not provide a C API for supplying pointers and destructors to extension
    supplied memory.  This makes it unusable for constructing objects out of
    shared memory, or memory that has special alignment or locking for things
    like DMA transfers.  Also, the array object does not currently pickle.
    Finally since the array object allows its contents to grow, via the extend
    method, the pointer can be changed if the GIL is not held while using it.

    Creating a buffer object from an array object has the same problem of
    leaving an invalid pointer when the array object is resized.

    The mmap object caters to its particular niche, but does not attempt to
    solve a wider class of problems.

    Finally, a third party extension can not implement pickling without
    creating a temporary object of a standard Python type.  For example, in the
    Numeric community, it is unpleasant that a large array can't pickle
    without creating a large binary string to duplicate the array data.


Backward Compatibility

    The only possibility for backwards compatibility problems that the
    author is aware of is in previous versions of Python that try to
    unpickle data containing the new bytes type.


Reference Implementation

    XXX: Actual implementation is in progress, but changes are still possible
    as this PEP gets further review.

    The following new files will be added to the Python baseline:

        Include/bytesobject.h  # C interface
        Objects/bytesobject.c  # C implementation
        Lib/test/test_bytes.py # unit testing
        Doc/lib/libbytes.tex   # documentation

    The following files will also be modified:

        Include/Python.h       # adding bytesmodule.h include file
        Python/bltinmodule.c   # adding the bytes type object
        Modules/cPickle.c      # adding bytes to the standard types
        Lib/pickle.py          # adding bytes to the standard types

    It is possible that several other modules could be cleaned up and
    implemented in terms of the bytes object.  The mmap module comes to mind
    first, but as noted above it would be possible to reimplement the array
    module as a pure Python module.  While it is attractive that this PEP
    could actually reduce the amount of source code by some amount, the author
    feels that this could cause unnecessary risk for breaking existing
    applications and should be avoided at this time.


Additional Notes/Comments

    - Guido van Rossum wondered whether it would make sense to be able
    to create a bytes object from a mmap object.  The mmap object
    appears to support the requirements necessary to provide memory
    for a bytes object.  (It doesn't resize, and the pointer is valid
    for the lifetime of the object.)  As such, a method could be added
    to the mmap module such that a bytes object could be created
    directly from a mmap object.  An initial stab at how this would be
    implemented would be to use the PyBytes_FromPointer() function
    described above and pass the mmap_object as the user pointer.  The
    destructor function would decref the mmap_object for cleanup.

    - Todd Miller notes that it may be useful to have two new functions:
    PyObject_AsLargeReadBuffer() and PyObject_AsLargeWriteBuffer that are
    similar to PyObject_AsReadBuffer() and PyObject_AsWriteBuffer(), but
    support getting a LONG_LONG length in addition to the void* pointer.
    These functions would allow extension authors to work transparently with
    bytes objects (which support LONG_LONG lengths) and most other buffer-like
    objects (which only support int lengths).  These functions could be in
    lieu of, or in addition to, creating specific PyBytes_GetReadBuffer() and
    PyBytes_GetWriteBuffer() functions.

    XXX: The author thinks this is a very good idea as it paves the way for
    other objects to eventually support large (64 bit) pointers, and it should
    only affect abstract.c and abstract.h.  Should this be added above?

    - It was generally agreed that abusing the segment count of the
    PyBufferProcs interface is not a good hack to work around the 31 bit
    limitation of the length.  If you don't know what this means, then you're
    in good company.  Most code in the Python baseline, and presumably in many
    third party extensions, punt when the segment count is not 1.


References

    [1] The buffer interface
        http://mail.python.org/pipermail/python-dev/2000-October/009974.html


Copyright

    This document has been placed in the public domain.



pep-0297 Support for System Upgrades

PEP: 297
Title: Support for System Upgrades
Version: $Revision$
Last-Modified: $Date$
Author: Marc-André Lemburg <mal at lemburg.com>
Status: Rejected
Type: Standards Track
Created: 19-Jul-2001
Python-Version: 2.6
Post-History: 

Rejection Notice

    This PEP is rejected for failure to generate significant interest.


Abstract

    This PEP proposes strategies to allow the Python standard library
    to be upgraded in parts without having to reinstall the complete
    distribution or having to wait for a new patch level release.


Problem

    Python currently does not allow overriding modules or packages in
    the standard library by default.  Even though this is possible by
    defining a PYTHONPATH environment variable (the paths defined in
    this variable are prepended to the Python standard library path),
    there is no standard way of achieving this without changing the
    configuration.
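    (The precedence rule this relies on, that earlier sys.path entries
    shadow later ones, can be sketched directly; the directory layout
    and the module name pep297_demo_mod below are made up for this
    illustration:)

```python
import os
import sys
import tempfile

# Two hypothetical install locations, each providing a module named
# pep297_demo_mod with a different VERSION marker.
d1 = tempfile.mkdtemp()
d2 = tempfile.mkdtemp()
for d, version in ((d1, "override"), (d2, "original")):
    with open(os.path.join(d, "pep297_demo_mod.py"), "w") as fh:
        fh.write("VERSION = %r\n" % version)

# Earlier sys.path entries win, which is how a prepended PYTHONPATH
# (or the proposed system-packages entry) lets upgrades shadow the
# stock standard library modules.
sys.path.insert(0, d2)
sys.path.insert(0, d1)
import pep297_demo_mod
print(pep297_demo_mod.VERSION)   # 'override'
```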

    Since Python's standard library is starting to host packages which
    are also available separately, e.g. the distutils, email and PyXML
    packages, which can also be installed independently of the Python
    distribution, it is desirable to have an option to upgrade these
    packages without having to wait for a new patch level release of
    the Python interpreter to bring along the changes.

    On some occasions, it may also be desirable to update modules of
    the standard library without going through the whole Python release
    cycle, e.g. in order to provide hot-fixes for security problems.

Proposed Solutions

    This PEP proposes two different but not necessarily conflicting
    solutions:

    1. Adding a new standard search path to sys.path:
       $stdlibpath/system-packages just before the $stdlibpath
       entry. This complements the already existing entry for site
       add-ons $stdlibpath/site-packages which is appended to the
       sys.path at interpreter startup time.

       To make use of this new standard location, distutils will need
       to grow support for installing certain packages in
       $stdlibpath/system-packages rather than the standard location
       for third-party packages $stdlibpath/site-packages.

    2. Tweaking distutils to install directly into $stdlibpath for the
       system upgrades rather than into $stdlibpath/site-packages.

    The first solution has a few advantages over the second:

    * upgrades can be easily identified (just look in
      $stdlibpath/system-packages)

    * upgrades can be de-installed without affecting the rest
      of the interpreter installation

    * modules can be virtually removed from packages; this is
      due to the way Python imports packages: once it finds the
      top-level package directory it stays in this directory for
      all subsequent package submodule imports

    * the approach has an overall much cleaner design than the
      hackish install on top of an existing installation approach

    The only advantages of the second approach are that the Python
    interpreter does not have to be changed and that it works with
    older Python versions.

    Both solutions require changes to distutils. These changes can
    also be implemented by package authors, but it would be better to
    define a standard way of switching on the proposed behaviour.


Scope

    Solution 1: Python 2.6 and up
    Solution 2: all Python versions supported by distutils


Credits

    None


References

    None


Copyright

    This document has been placed in the public domain.



pep-0298 The Locked Buffer Interface

PEP: 298
Title: The Locked Buffer Interface
Version: $Revision$
Last-Modified: $Date$
Author: Thomas Heller <theller at python.net>
Status: Withdrawn
Type: Standards Track
Created: 26-Jul-2002
Python-Version: 2.3
Post-History: 30-Jul-2002, 1-Aug-2002

Abstract

    This PEP proposes an extension to the buffer interface called the
    'locked buffer interface'.

    The locked buffer interface avoids the flaws of the 'old' buffer
    interface [1] as defined in Python versions up to and including
    2.2, and has the following semantics:

        The lifetime of the retrieved pointer is clearly defined and
        controlled by the client.

        The buffer size is returned as a 'size_t' data type, which
        allows access to large buffers on platforms where sizeof(int)
        != sizeof(void *).

    (Guido comments: This second point sounds like a change we could
    also make to the "old" buffer interface, if we introduce another
    flag bit that's *not* part of the default flags.)


Specification

    The locked buffer interface exposes new functions which return the
    size and the pointer to the internal memory block of any python
    object which chooses to implement this interface.

    Retrieving a buffer from an object puts this object in a locked
    state during which the buffer may not be freed, resized, or
    reallocated.

    The object must be unlocked again, when the buffer is no longer
    needed, by releasing it with a call to another function in the
    locked buffer interface.  If the object never resizes or
    reallocates the buffer during its lifetime, this function may be
    NULL.  Failure to call this function (if it is != NULL) is a
    programming error and may have unexpected results.

    The locked buffer interface omits the memory segment model which
    is present in the old buffer interface - only a single memory
    block can be exposed.

    The memory blocks can be accessed without holding the global
    interpreter lock.
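
    The semantics above can be modelled in pure Python (the class and
    method names below are invented for illustration; the real
    interface is a C-level protocol):

```python
class LockedBuffer:
    """Toy model of the lock-count semantics; not the actual C API."""

    def __init__(self, data):
        self._data = bytearray(data)
        self._locks = 0                    # outstanding acquisitions

    def acquire(self):
        # Corresponds to bf_acquirelocked{read,write}buffer: the block
        # may not be freed, resized, or reallocated while locked.
        self._locks += 1
        return self._data

    def release(self):
        # Corresponds to bf_releaselockedbuffer; releasing more often
        # than acquiring is a programming error.
        assert self._locks > 0, "buffer released without being acquired"
        self._locks -= 1

    def resize(self, size):
        if self._locks:
            raise ValueError("cannot resize a locked buffer")
        self._data = bytearray(size)


buf = LockedBuffer(b"spam")
view = buf.acquire()
try:
    view[:4] = b"eggs"                 # stable while the lock is held
finally:
    buf.release()
buf.resize(8)                          # legal again once unlocked
```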


Implementation

    Define a new flag in Include/object.h:

        /* PyBufferProcs contains bf_acquirelockedreadbuffer,
           bf_acquirelockedwritebuffer, and bf_releaselockedbuffer */
        #define Py_TPFLAGS_HAVE_LOCKEDBUFFER (1L<<15)


    This flag would be included in Py_TPFLAGS_DEFAULT:

        #define Py_TPFLAGS_DEFAULT  ( \
                             ....
                             Py_TPFLAGS_HAVE_LOCKEDBUFFER | \
                             ....
                            0)


    Extend the PyBufferProcs structure by new fields in
    Include/object.h:

        typedef size_t (*acquirelockedreadbufferproc)(PyObject *,
                                                      const void **);
        typedef size_t (*acquirelockedwritebufferproc)(PyObject *,
                                                       void **);
        typedef void (*releaselockedbufferproc)(PyObject *);

        typedef struct {
            getreadbufferproc bf_getreadbuffer;
            getwritebufferproc bf_getwritebuffer;
            getsegcountproc bf_getsegcount;
            getcharbufferproc bf_getcharbuffer;
            /* locked buffer interface functions */
            acquirelockedreadbufferproc bf_acquirelockedreadbuffer;
            acquirelockedwritebufferproc bf_acquirelockedwritebuffer;
            releaselockedbufferproc bf_releaselockedbuffer;
        } PyBufferProcs;


    The new fields are present if the Py_TPFLAGS_HAVE_LOCKEDBUFFER
    flag is set in the object's type.

    The Py_TPFLAGS_HAVE_LOCKEDBUFFER flag implies the
    Py_TPFLAGS_HAVE_GETCHARBUFFER flag.

    The acquirelockedreadbufferproc and acquirelockedwritebufferproc
    functions return the size in bytes of the memory block on success
    and fill in the passed void * pointer.  If these functions fail -
    either because an error occurs or no memory block is exposed -
    they must set the void * pointer to NULL and raise an exception.
    The return value is undefined in these cases and should not be
    used.

    If calls to these functions succeed, the buffer must eventually be
    released by a call to the releaselockedbufferproc, supplying the
    original object as argument.  The releaselockedbufferproc cannot
    fail.  For objects that actually maintain an internal lock count
    it would be a fatal error if the releaselockedbufferproc function
    were called too often, driving the lock count negative.

    Similar to the 'old' buffer interface, any of these functions may
    be set to NULL, but it is strongly recommended to implement the
    releaselockedbufferproc function (even if it does nothing) if any
    of the acquirelockedread/writebufferproc functions are
    implemented, to discourage extension writers from checking for a
    NULL value and not calling it.

    These functions aren't supposed to be called directly; they are
    called through convenience functions declared in
    Include/abstract.h:

        int PyObject_AcquireLockedReadBuffer(PyObject *obj,
                                             const void **buffer,
                                             size_t *buffer_len);

        int PyObject_AcquireLockedWriteBuffer(PyObject *obj,
                                              void **buffer,
                                              size_t *buffer_len);

        void PyObject_ReleaseLockedBuffer(PyObject *obj);

    The former two functions return 0 on success, setting buffer to
    the memory location and buffer_len to the length of the memory
    block in bytes. On failure, or if the locked buffer interface is
    not implemented by obj, they return -1 and set an exception.

    The latter function doesn't return anything, and cannot fail.


Backward Compatibility

    The size of the PyBufferProcs structure changes if this proposal
    is implemented, but the type's tp_flags slot can be used to
    determine if the additional fields are present.


Reference Implementation

    An implementation has been uploaded to the SourceForge patch
    manager as http://www.python.org/sf/652857.


Additional Notes/Comments

    Python strings, unicode strings, mmap objects, and array objects
    would expose the locked buffer interface.

    mmap and array objects would actually enter a locked state while
    the buffer is active; this is not needed for strings and unicode
    objects.  Resizing locked array objects is not allowed and will
    raise an exception. Whether closing a locked mmap object is an
    error or will only be deferred until the lock count reaches zero
    is an implementation detail.

    Guido recommends:

        But I'm still very concerned that if most built-in types
        (e.g. strings, bytes) don't implement the release
        functionality, it's too easy for an extension to seem to work
        while forgetting to release the buffer.

        I recommend that at least some built-in types implement the
        acquire/release functionality with a counter, and assert that
        the counter is zero when the object is deleted -- if the
        assert fails, someone DECREF'ed their reference to the object
        without releasing it.  (The rule should be that you must own a
        reference to the object while you've acquired the object.)

        For strings that might be impractical because the string
        object would have to grow 4 bytes to hold the counter; but the
        new bytes object (PEP 296) could easily implement the counter,
        and the array object too -- that way there will be plenty of
        opportunity to test proper use of the protocol.


Community Feedback

    Greg Ewing doubts that the locked buffer interface is needed at
    all; he thinks the normal buffer interface could be used if the
    pointer is (re)fetched each time it's used.  This seems dangerous,
    because even innocent looking calls to the Python API like
    Py_DECREF() may trigger execution of arbitrary Python code.

    The first version of this proposal didn't have the release
    function, but it turned out that this would have been too
    restrictive: mmap and array objects wouldn't have been able to
    implement it, because mmap objects can be closed anytime if not
    locked, and array objects could resize or reallocate the buffer.

    This PEP will probably be rejected because nobody except the
    author needs it.



References

    [1] The buffer interface
        http://mail.python.org/pipermail/python-dev/2000-October/009974.html

    [2] The Buffer Problem
        http://www.python.org/dev/peps/pep-0296/


Copyright

    This document has been placed in the public domain.



pep-0299 Special __main__() function in modules

PEP: 299
Title: Special __main__() function in modules
Version: $Revision$
Last-Modified: $Date$
Author: Jeff Epler <jepler at unpythonic.net>
Status: Rejected
Type: Standards Track
Created: 12-Aug-2002
Python-Version: 2.3
Post-History: 29-Mar-2006

Abstract

    Many Python modules are also intended to be callable as standalone
    scripts.  This PEP proposes that a special function called
    __main__() should serve this purpose.


Motivation

    There should be one simple and universal idiom for invoking a
    module as a standalone script.

    The semi-standard idiom

        if __name__ == '__main__':
            perform "standalone" functionality

    is unclear to programmers of languages like C and C++.  It also
    does not permit invocation of the standalone function when the
    module is imported.  The variant

        if __name__ == '__main__':
            main_function()

    is sometimes seen, but there exists no standard name for the
    function, and because arguments are taken from sys.argv it is not
    possible to pass specific arguments without changing the argument
    list seen by all other modules.  (Imagine a threaded Python
    program with two threads wishing to invoke the standalone
    functionality of different modules with different argument lists.)


Proposal

    The standard name of the 'main function' should be '__main__'.
    When a module is invoked on the command line, such as

        python mymodule.py

    then the module behaves as though the following lines existed at
    the end of the module (except that the attribute __sys may not be
    used or assumed to exist elsewhere in the script):

        if globals().has_key("__main__"):
            import sys as __sys
            __sys.exit(__main__(__sys.argv))

    Other modules may execute

        import mymodule
        mymodule.__main__(['mymodule', ...])

    It is up to mymodule to document thread-safety issues or other
    issues which might restrict use of __main__.  (Other issues might
    include use of mutually exclusive GUI modules, non-sharable
    resources like hardware devices, reassignment of sys.stdin/stdout,
    etc)


Implementation

    In Modules/main.c, the block near line 385 (after the
    PyRun_AnyFileExFlags call) will be changed so that the above code
    (or its C equivalent) is executed.


Open Issues

    - Should the return value from __main__ be treated as the exit value?

      Yes.  Many __main__ functions will naturally return None, which
      sys.exit translates into a "success" return code.  For those
      that return a numeric result, it behaves just like the argument
      to sys.exit() or the return value from C's main().

    - Should the argument list to __main__ include argv[0], or just the
      "real" arguments argv[1:]?

      argv[0] is included for symmetry with sys.argv and easy
      transition to the new standard idiom.


Rejection

    In a short discussion on python-dev [1], two major backwards
    compatibility problems were brought up and Guido pronounced that he
    doesn't like the idea anyway as it's "not worth the change (in docs,
    user habits, etc.) and there's nothing particularly broken."


References

    [1] Georg Brandl, "What about PEP 299",
        http://mail.python.org/pipermail/python-dev/2006-March/062951.html


Copyright

    This document has been placed in the public domain.



pep-0301 Package Index and Metadata for Distutils

PEP:301
Title:Package Index and Metadata for Distutils
Version:$Revision$
Last-Modified:$Date$
Author:Richard Jones <richard at python.org>
Status:Final
Type:Standards Track
Content-Type:text/x-rst
Created:24-Oct-2002
Python-Version:2.3
Post-History:8-Nov-2002

Abstract

This PEP proposes several extensions to the Distutils packaging system [1]. These enhancements include a central package index server, tools for submitting package information to the index and extensions to the package metadata to include Trove [2] information.

This PEP does not address issues of package dependency. It also does not address storage and download of packages as described in PEP 243 [6]. Nor is it proposing a local database of packages as described in PEP 262 [7].

Existing package repositories such as the Vaults of Parnassus [3], CPAN [4] and PAUSE [5] will be investigated as prior art in this field.

Rationale

Python programmers have long needed a simple method of discovering existing modules and systems available for their use. It is arguable that the existence of these systems for other languages has been a significant contribution to their popularity. The existence of the Catalog-SIG and the many discussions there indicate that there is a large population of users who recognise this need.

The introduction of the Distutils packaging system to Python simplified the process of distributing shareable code, and included mechanisms for the capture of package metadata, but did little with the metadata save ship it with the package.

An interface to the index should be hosted in the python.org domain, giving it an air of legitimacy that existing catalog efforts do not have.

The interface for submitting information to the catalog should be as simple as possible - hopefully just a one-line command for most users.

Issues of package dependency are not addressed due to the complexity of such a system. PEP 262 proposes such a system, but as of this writing the PEP is still unfinished.

Issues of package dissemination (storage on a central server) are not addressed because they require assumptions about availability of storage and bandwidth that I am not in a position to make. PEP 243, which is still being developed, is tackling these issues and many more. This proposal is considered compatible with, and adjunct to the proposal in PEP 243.

Specification

The specification takes three parts, the web interface, the Distutils register command and the Distutils Trove classification.

Web Interface

A web interface is implemented over a simple store. The interface is available through the python.org domain, either directly or as packages.python.org.

The store has columns for all metadata fields. The (name, version) double is used as a uniqueness key. Additional submissions for an existing (name, version) will result in an update operation.

The web interface implements the following commands/interfaces:

index
Lists known packages, optionally filtered. An additional HTML page, search, presents a form to the user which is used to customise the index view. The index will include a browsing interface like that presented in the Trove interface design, section 4.3. The results will be paginated, sorted alphabetically, and will show only the most recent version of each package, as determined using the Distutils LooseVersion class.
display
Displays information about the package. All fields are displayed as plain text. The "url" (or "home_page") field is hyperlinked.
submit

Accepts a POST submission of metadata about a package. The "name" and "version" fields are mandatory, as they uniquely identify an entry in the index. Submit will automatically determine whether to create a new entry or update an existing entry. The metadata is checked for correctness where appropriate - specifically the Trove discriminators are compared with the allowed set. An update will update all information about the package based on the new submitted information.

There will also be a submit/edit form that will allow manual submission and updating for those who do not use Distutils.

submit_pkg_info
Accepts a POST submission of a PKG-INFO file and performs the same function as the submit interface.
user

Registers a new user with the index. Requires username, password and email address. Passwords will be stored in the index database as SHA hashes. If the username already exists in the database:

  1. If valid HTTP Basic authentication is provided, the password and email address are updated with the submission information, or
  2. If no valid authentication is provided, the user is informed that the login is already taken.

Registration will be a three-step process, involving:

  1. User submission of details via the Distutils register command or through the web,
  2. Index server sending email to the user's email address with a URL to visit to confirm registration with a random one-time key, and
  3. User visits URL with the key and confirms registration.
roles
An interface for changing user Role assignments.
password_reset
Using a supplied email address as the key, this resets a user's password and sends an email with the new password to the user.
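
Storing passwords as SHA hashes might look like the following sketch (the helper names are invented; at the time "SHA" meant SHA-1, though modern practice would call for salted key-derivation functions):

```python
import hashlib

def hash_password(password):
    # Only the hex digest is stored in the users table, never cleartext.
    return hashlib.sha1(password.encode('utf-8')).hexdigest()

def check_password(password, stored_digest):
    # Hash the submitted password and compare against the stored digest.
    return hash_password(password) == stored_digest

digest = hash_password('hunter2')
```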

The submit command will require HTTP Basic authentication, preferably over an HTTPS connection.

The server interface will indicate success or failure of the commands through a subset of the standard HTTP response codes:

Code  Meaning        Register command implications
----  -------------  ------------------------------------------------
200   OK             Everything worked just fine
400   Bad request    Data provided for submission was malformed
401   Unauthorised   The username or password supplied were incorrect
403   Forbidden      User does not have permission to update the
                     package information (not Owner or Maintainer)

User Roles

Three user Roles will be assignable to users:

Owner
Owns a package name, may assign Maintainer Role for that name. The first user to register information about a package is deemed Owner of the package name. The Admin user may change this if necessary. May submit updates for the package name.
Maintainer
Can submit and update info for a particular package name.
Admin
Can assign Owner Role and edit user details. Not specific to a package name.

Index Storage (Schema)

The index is stored in a set of relational database tables:

packages
Lists package names and holds package-level metadata (currently just the stable release version)
releases
Each package has an entry in releases for each version of the package that is released. A row holds the bulk of the information given in the package's PKG-INFO file. There is one row for each package (name, version).
trove_discriminators
Lists the Trove discriminator text and assigns each one a unique ID.
release_discriminators
Each entry maps a package (name, version) to a discriminator_id. We map to releases instead of packages because the set of discriminators may change between releases.
journals
Holds information about changes to package information in the index. Changes to the packages, releases, roles, and release_discriminators tables are listed here by package name and version if the change is release-specific.
users
Holds our user database - user name, email address and password.
roles
Maps user_name and role_name to a package_name.

An additional table, rego_otk, holds the One Time Keys generated during registration; it is not interesting in the scope of the index itself.
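
The tables described above might be sketched roughly as follows (a hypothetical schema; column names beyond those described above are invented for illustration):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE packages (
        name            TEXT PRIMARY KEY,
        stable_version  TEXT);
    CREATE TABLE releases (
        name     TEXT,
        version  TEXT,
        summary  TEXT,           -- plus the other PKG-INFO fields
        PRIMARY KEY (name, version));
    CREATE TABLE trove_discriminators (
        id    INTEGER PRIMARY KEY,
        text  TEXT);
    CREATE TABLE release_discriminators (
        name              TEXT,
        version           TEXT,
        discriminator_id  INTEGER);
    CREATE TABLE journals (
        name     TEXT,
        version  TEXT,           -- NULL unless the change is release-specific
        action   TEXT);
    CREATE TABLE users (
        user_name  TEXT PRIMARY KEY,
        email      TEXT,
        password   TEXT);        -- SHA hash, not cleartext
    CREATE TABLE roles (
        user_name     TEXT,
        role_name     TEXT,
        package_name  TEXT);
""")

# One row per released (name, version) pair.
con.execute("INSERT INTO releases VALUES ('roundup', '0.5.2', 'Issue tracker')")
row = con.execute(
    "SELECT version FROM releases WHERE name = 'roundup'").fetchone()
```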

Distutils register Command

An additional Distutils command, register, is implemented which posts the package metadata to the central index. The register command automatically handles user registration; the user is presented with three options:

  1. login and submit package information
  2. register as a new packager
  3. send password reminder email

On systems where the $HOME environment variable is set, the user will be prompted at exit to save their username and password to the file .pypirc in their $HOME directory.

Notification of changes to a package entry will be sent to all users who have submitted information about the package. That is, the original submitter and any subsequent updaters.

The register command will include a --verify option which performs a test submission to the index without actually committing the data. The index will perform its submission verification checks as usual and report any errors it would have reported during a normal submission. This is useful for verifying correctness of Trove discriminators.

Distutils Trove Classification

The Trove concept of discrimination will be added to the metadata set available to package authors through the new attribute "classifiers". The list of classifiers will be available through the web, and added to the package like so:

setup(
    name = "roundup",
    version = __version__,
    classifiers = [
        'Development Status :: 4 - Beta',
        'Environment :: Console',
        'Environment :: Web Environment',
        'Intended Audience :: End Users/Desktop',
        'Intended Audience :: Developers',
        'Intended Audience :: System Administrators',
        'License :: OSI Approved :: Python Software Foundation License',
        'Operating System :: MacOS :: MacOS X',
        'Operating System :: Microsoft :: Windows',
        'Operating System :: POSIX',
        'Programming Language :: Python',
        'Topic :: Communications :: Email',
        'Topic :: Office/Business',
        'Topic :: Software Development :: Bug Tracking',
    ],
    url = 'http://sourceforge.net/projects/roundup/',
    ...
)

It was decided that strings would be used for the classification entries due to the deep nesting that would be involved in a more formal Python structure.

The original Trove specification requires that classification namespaces be separated by slashes ("/"); unfortunately this collides with names that themselves contain slashes (e.g. "OS/2"). The double-colon separator (" :: ") implemented by SourceForge and FreshMeat gets around this limitation.

The list of classification values on the module index has been merged from FreshMeat and SourceForge (with their permission). This list will be made available both through the web interface and through the register command's --list-classifiers option as a text list which may then be copied to the setup.py file. The register command's --verify option will check classifiers values against the server's list.

Unfortunately, the addition of the "classifiers" property is not backwards-compatible. A setup.py file using it will not work under Python 2.1.3. It is hoped that a bug-fix release of Python 2.2 (most likely 2.2.3) will relax the argument checking of the setup() command to allow new keywords, even if they're not actually used. It is preferable that a warning be produced, rather than a show-stopping error. The use of the new keyword should be discouraged in situations where the package is advertised as being compatible with Python versions earlier than 2.2.3 or 2.3.

In the PKG-INFO, the classifiers list items will appear as individual Classifier: entries:

Name: roundup
Version: 0.5.2
Classifier: Development Status :: 4 - Beta
Classifier: Environment :: Console (Text Based)
            .
            .
Classifier: Topic :: Software Development :: Bug Tracking
Url: http://sourceforge.net/projects/roundup/

Implementation

The server is available at:

http://www.python.org/pypi

The code is available from the SourceForge project:

http://sourceforge.net/projects/pypi/

The register command has been integrated into Python 2.3.

Rejected Proposals

Originally, the index server was to return custom headers (inspired by PEP 243):

X-Pypi-Status
Either "success" or "fail".
X-Pypi-Reason
A description of the reason for failure, or additional information in the case of a success.

However, it has been pointed out [8] that this is a bad scheme to use.

Acknowledgements

Anthony Baxter, Martin v. Loewis and David Goodger for encouragement and feedback during initial drafting.

A.M. Kuchling for support including hosting the second prototype.

Greg Stein for recommending that the register command interpret the HTTP response codes rather than custom X-PyPI-* headers.

The many participants of the Distutils and Catalog SIGs for their ideas over the years.

pep-0302 New Import Hooks

PEP:302
Title:New Import Hooks
Version:$Revision$
Last-Modified:$Date$
Author:Just van Rossum <just at letterror.com>, Paul Moore <p.f.moore at gmail.com>
Status:Final
Type:Standards Track
Content-Type:text/x-rst
Created:19-Dec-2002
Python-Version:2.3
Post-History:19-Dec-2002

Warning

The language reference for import [10] and importlib documentation [11] now supersede this PEP. This document is no longer updated and is provided for historical purposes only.

Abstract

This PEP proposes to add a new set of import hooks that offer better customization of the Python import mechanism. Contrary to the current __import__ hook, a new-style hook can be injected into the existing scheme, allowing for a finer grained control of how modules are found and how they are loaded.

Motivation

The only way to customize the import mechanism is currently to override the built-in __import__ function. However, overriding __import__ has many problems. To begin with:

  • An __import__ replacement needs to fully reimplement the entire import mechanism, or call the original __import__ before or after the custom code.
  • It has very complex semantics and responsibilities.
  • __import__ gets called even for modules that are already in sys.modules, which is almost never what you want, unless you're writing some sort of monitoring tool.

The situation gets worse when you need to extend the import mechanism from C: it's currently impossible, apart from hacking Python's import.c or reimplementing much of import.c from scratch.

There is a fairly long history of tools written in Python that allow extending the import mechanism in various ways, based on the __import__ hook. The Standard Library includes two such tools: ihooks.py (by GvR) and imputil.py [1] (Greg Stein), but perhaps the most famous is iu.py by Gordon McMillan, available as part of his Installer package. Their usefulness is somewhat limited because they are written in Python; bootstrapping issues need to be worked around, as you can't load the module containing the hook with the hook itself. So if you want the entire Standard Library to be loadable from an import hook, the hook must be written in C.

Use cases

This section lists several existing applications that depend on import hooks. Among these, a lot of duplicate work was done that could have been saved if there had been a more flexible import hook at the time. This PEP should make life a lot easier for similar projects in the future.

Extending the import mechanism is needed when you want to load modules that are stored in a non-standard way. Examples include modules that are bundled together in an archive; byte code that is not stored in a pyc formatted file; modules that are loaded from a database over a network.

The work on this PEP was partly triggered by the implementation of PEP 273, which adds imports from Zip archives as a built-in feature to Python. While the PEP itself was widely accepted as a must-have feature, the implementation left a few things to be desired. For one thing, it went to great lengths to integrate itself with import.c, adding lots of code that was either specific to Zip file imports or, while not Zip-specific, not generally useful (or even desirable) either. Yet the PEP 273 implementation can hardly be blamed for this: it is simply extremely hard to do, given the current state of import.c.

Packaging applications for end users is a typical use case for import hooks, if not the typical use case. Distributing lots of source or pyc files around is not always appropriate (let alone a separate Python installation), so there is a frequent desire to package all needed modules in a single file. So frequent in fact that multiple solutions have been implemented over the years.

The oldest one is included with the Python source code: Freeze [2]. It puts marshalled byte code into static objects in C source code. Freeze's "import hook" is hard wired into import.c, and has a couple of issues. Later solutions include Fredrik Lundh's Squeeze, Gordon McMillan's Installer, and Thomas Heller's py2exe [3]. MacPython ships with a tool called BuildApplication.

Squeeze, Installer and py2exe use an __import__ based scheme (py2exe currently uses Installer's iu.py, Squeeze used ihooks.py), while MacPython has two Mac-specific import hooks hard wired into import.c that are similar to the Freeze hook. The hooks proposed in this PEP enable us (at least in theory; it's not a short term goal) to get rid of the hard coded hooks in import.c, and would allow the __import__-based tools to get rid of most of their import.c emulation code.

Before work on the design and implementation of this PEP was started, a new BuildApplication-like tool for Mac OS X prompted one of the authors of this PEP (JvR) to expose the table of frozen modules to Python, in the imp module. The main reason was to be able to use the freeze import hook (avoiding fancy __import__ support), yet to also be able to supply a set of modules at runtime. This resulted in issue #642578 [4], which was mysteriously accepted (mostly because nobody seemed to care either way ;-). Yet it is completely superfluous when this PEP gets accepted, as it offers a much nicer and general way to do the same thing.

Rationale

While experimenting with alternative implementation ideas to get built-in Zip import, it was discovered that achieving this is possible with only a fairly small amount of changes to import.c. This made it possible to factor out the Zip-specific stuff into a new source file, while at the same time creating a general new import hook scheme: the one you're reading about now.

An earlier design allowed non-string objects on sys.path. Such an object would have the necessary methods to handle an import. This has two disadvantages: 1) it breaks code that assumes all items on sys.path are strings; 2) it is not compatible with the PYTHONPATH environment variable. The latter is directly needed for Zip imports. A compromise came from Jython: allow string subclasses on sys.path, which would then act as importer objects. This avoids some breakage, and seems to work well for Jython (where it is used to load modules from .jar files), but it was perceived as an "ugly hack".

This led to a more elaborate scheme (mostly copied from McMillan's iu.py), in which each object in a list of candidates is asked whether it can handle the sys.path item, until one is found that can. This list of candidates is a new object in the sys module: sys.path_hooks.

Traversing sys.path_hooks for each path item for each new import can be expensive, so the results are cached in another new object in the sys module: sys.path_importer_cache. It maps sys.path entries to importer objects.

To minimize the impact on import.c as well as to avoid adding extra overhead, it was decided not to add an explicit hook and importer object for the existing file system import logic (as iu.py has), but to simply fall back to the built-in logic if no hook on sys.path_hooks could handle the path item. In that case, a None value is stored in sys.path_importer_cache, again to avoid repeated lookups. (Later we can go further and add a real importer object for the built-in mechanism; for now, the None fallback scheme should suffice.)

A question was raised: what about importers that don't need any entry on sys.path? (Built-in and frozen modules fall into that category.) Again, Gordon McMillan to the rescue: iu.py contains a thing he calls the metapath. In this PEP's implementation, it's a list of importer objects that is traversed before sys.path. This list is yet another new object in the sys module: sys.meta_path. Currently, this list is empty by default, and frozen and built-in module imports are done after traversing sys.meta_path, but still before sys.path.

Specification part 1: The Importer Protocol

This PEP introduces a new protocol: the "Importer Protocol". It is important to understand the context in which the protocol operates, so here is a brief overview of the outer shells of the import mechanism.

When an import statement is encountered, the interpreter looks up the __import__ function in the built-in namespace. __import__ is then called with four arguments, amongst which are the name of the module being imported (which may be a dotted name) and a reference to the current global namespace.

The built-in __import__ function (known as PyImport_ImportModuleEx() in import.c) will then check to see whether the module doing the import is a package or a submodule of a package. If it is indeed a (submodule of a) package, it first tries to do the import relative to the package (the parent package for a submodule). For example if a package named "spam" does "import eggs", it will first look for a module named "spam.eggs". If that fails, the import continues as an absolute import: it will look for a module named "eggs". Dotted name imports work pretty much the same: if package "spam" does "import eggs.bacon" (and "spam.eggs" exists and is itself a package), "spam.eggs.bacon" is tried. If that fails "eggs.bacon" is tried. (There are more subtleties that are not described here, but these are not relevant for implementers of the Importer Protocol.)

Deeper down in the mechanism, a dotted name import is split up by its components. For "import spam.ham", first an "import spam" is done, and only when that succeeds is "ham" imported as a submodule of "spam".

The Importer Protocol operates at this level of individual imports. By the time an importer gets a request for "spam.ham", module "spam" has already been imported.

The protocol involves two objects: a finder and a loader. A finder object has a single method:

finder.find_module(fullname, path=None)

This method will be called with the fully qualified name of the module. If the finder is installed on sys.meta_path, it will receive a second argument, which is None for a top-level module, or package.__path__ for submodules or subpackages [5]. It should return a loader object if the module was found, or None if it wasn't. If find_module() raises an exception, it will be propagated to the caller, aborting the import.

A loader object also has one method:

loader.load_module(fullname)

This method returns the loaded module or raises an exception, preferably ImportError if an existing exception is not being propagated. If load_module() is asked to load a module that it cannot, it must raise ImportError.

In many cases the finder and loader can be one and the same object: finder.find_module() would just return self.
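As an illustration of the protocol, here is a minimal importer that is both finder and loader (the class, the module name, and the source text are all invented for this sketch; imp.new_module() was the API of the day, but types.ModuleType is the equivalent constructor):

```python
import sys
import types

class StringImporter:
    """Hypothetical importer serving modules from an in-memory dict.
    It is both the finder (find_module) and the loader (load_module)."""

    def __init__(self, sources):
        self.sources = sources  # maps fullname -> source code string

    def find_module(self, fullname, path=None):
        # Return a loader (here: ourselves) if we can handle the module.
        if fullname in self.sources:
            return self
        return None

    def load_module(self, fullname):
        # Reuse any existing module object so reload() keeps working;
        # setdefault() also puts the module in sys.modules *before* exec.
        mod = sys.modules.setdefault(fullname, types.ModuleType(fullname))
        mod.__file__ = "<string importer>"
        mod.__loader__ = self
        mod.__package__ = fullname.rpartition('.')[0]
        exec(self.sources[fullname], mod.__dict__)
        return mod

importer = StringImporter({"hello_pep302": "GREETING = 'hi'"})
loader = importer.find_module("hello_pep302")
mod = loader.load_module("hello_pep302")
```

Registered on sys.meta_path, an importer like this would be consulted automatically by the import machinery of that era; here the two methods are simply called directly.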

The fullname argument of both methods is the fully qualified module name, for example "spam.eggs.ham". As explained above, when finder.find_module("spam.eggs.ham") is called, "spam.eggs" has already been imported and added to sys.modules. However, the find_module() method isn't necessarily always called during an actual import: meta tools that analyze import dependencies (such as freeze, Installer or py2exe) don't actually load modules, so a finder shouldn't depend on the parent package being available in sys.modules.

The load_module() method has a few responsibilities that it must fulfill before it runs any code:

  • If there is an existing module object named 'fullname' in sys.modules, the loader must use that existing module. (Otherwise, the reload() builtin will not work correctly.) If a module named 'fullname' does not exist in sys.modules, the loader must create a new module object and add it to sys.modules.

    Note that the module object must be in sys.modules before the loader executes the module code. This is crucial because the module code may (directly or indirectly) import itself; adding it to sys.modules beforehand prevents unbounded recursion in the worst case and multiple loading in the best.

    If the load fails, the loader needs to remove any module it may have inserted into sys.modules. If the module was already in sys.modules then the loader should leave it alone.

  • The __file__ attribute must be set. This must be a string, but it may be a dummy value, for example "<frozen>". The privilege of not having a __file__ attribute at all is reserved for built-in modules.

  • The __name__ attribute must be set. If one uses imp.new_module() then the attribute is set automatically.

  • If it's a package, the __path__ variable must be set. This must be a list, but may be empty if __path__ has no further significance to the importer (more on this later).

  • The __loader__ attribute must be set to the loader object. This is mostly for introspection and reloading, but can be used for importer-specific extras, for example getting data associated with an importer.

  • The __package__ attribute [8] must be set.

    If the module is a Python module (as opposed to a built-in module or a dynamically loaded extension), it should execute the module's code in the module's global name space (module.__dict__).

    Here is a minimal pattern for a load_module() method:

    # Consider using importlib.util.module_for_loader() to handle
    # most of these details for you.
    def load_module(self, fullname):
        code = self.get_code(fullname)
        ispkg = self.is_package(fullname)
        mod = sys.modules.setdefault(fullname, imp.new_module(fullname))
        mod.__file__ = "<%s>" % self.__class__.__name__
        mod.__loader__ = self
        if ispkg:
            mod.__path__ = []
            mod.__package__ = fullname
        else:
            mod.__package__ = fullname.rpartition('.')[0]
        exec(code, mod.__dict__)
        return mod
    

Specification part 2: Registering Hooks

There are two types of import hooks: Meta hooks and Path hooks. Meta hooks are called at the start of import processing, before any other import processing (so that meta hooks can override sys.path processing, frozen modules, or even built-in modules). To register a meta hook, simply add the finder object to sys.meta_path (the list of registered meta hooks).

Path hooks are called as part of sys.path (or package.__path__) processing, at the point where their associated path item is encountered. A path hook is registered by adding an importer factory to sys.path_hooks.

sys.path_hooks is a list of callables, which will be checked in sequence to determine if they can handle a given path item. The callable is called with one argument, the path item. The callable must raise ImportError if it is unable to handle the path item, and return an importer object if it can handle the path item. Note that if the callable returns an importer object for a specific sys.path entry, the builtin import machinery will not be invoked to handle that entry any longer, even if the importer object later fails to find a specific module. The callable is typically the class of the import hook, and hence the class __init__() method is called. (This is also the reason why it should raise ImportError: an __init__() method can't return anything. This would be possible with a __new__() method in a new style class, but we don't want to require anything about how a hook is implemented.)

The results of path hook checks are cached in sys.path_importer_cache, which is a dictionary mapping path entries to importer objects. The cache is checked before sys.path_hooks is scanned. If it is necessary to force a rescan of sys.path_hooks, it is possible to manually clear all or part of sys.path_importer_cache.
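A sketch of a path hook (the cookie string and the class are invented here): the factory is the class itself, so its __init__() doubles as the hook, raising ImportError for every path item it does not recognize; the last lines show a manual cache flush and the cleanup that undoes the registration:

```python
import sys

COOKIE = "<fake-hook>"  # invented marker string for this sketch

class CookieImporter:
    """Hypothetical importer whose class doubles as the path hook."""
    def __init__(self, path_item):
        if path_item != COOKIE:
            # Decline this entry; the machinery tries the next hook.
            raise ImportError("not for us: %r" % path_item)
        self.path_item = path_item

    def find_module(self, fullname, path=None):
        return None  # this sketch never actually finds a module

# Registration: add the factory and make the cookie visible on sys.path.
sys.path_hooks.append(CookieImporter)
sys.path.append(COOKIE)

# Forcing a re-scan of sys.path_hooks for one entry:
sys.path_importer_cache.pop(COOKIE, None)

# Undo the registration so the sketch leaves the interpreter untouched.
sys.path.remove(COOKIE)
sys.path_hooks.remove(CookieImporter)
```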

Just like sys.path itself, the new sys variables must have specific types:

  • sys.meta_path and sys.path_hooks must be Python lists.
  • sys.path_importer_cache must be a Python dict.

Modifying these variables in place is allowed, as is replacing them with new objects.

Packages and the role of __path__

If a module has a __path__ attribute, the import mechanism will treat it as a package. The __path__ variable is used instead of sys.path when importing submodules of the package. The rules for sys.path therefore also apply to pkg.__path__. So sys.path_hooks is also consulted when pkg.__path__ is traversed. Meta importers don't necessarily use sys.path at all to do their work and may therefore ignore the value of pkg.__path__. In this case it is still advised to set it to a list, which may be empty.

Optional Extensions to the Importer Protocol

The Importer Protocol defines three optional extensions. One is to retrieve data files, the second is to support module packaging tools and/or tools that analyze module dependencies (for example Freeze), while the last is to support execution of modules as scripts. The latter two categories of tools usually don't actually load modules, they only need to know if and where they are available. All three extensions are highly recommended for general purpose importers, but may safely be left out if those features aren't needed.

To retrieve the data for arbitrary "files" from the underlying storage backend, loader objects may supply a method named get_data():

loader.get_data(path)

This method returns the data as a string, or raises IOError if the "file" wasn't found. The data is always returned as if "binary" mode was used - there is no CRLF translation of text files, for example. It is meant for importers that have some file-system-like properties. The 'path' argument is a path that can be constructed by munging module.__file__ (or pkg.__path__ items) with the os.path.* functions, for example:

d = os.path.dirname(__file__)
data = __loader__.get_data(os.path.join(d, "logo.gif"))
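A loader backed by an in-memory store (the class and paths are invented for this sketch; in Python 3 terms the "string" is bytes) might implement get_data() like this:

```python
class DictLoader:
    """Hypothetical loader whose storage backend is a plain dict."""
    def __init__(self, files):
        self.files = files  # maps path -> raw bytes

    def get_data(self, path):
        # Return the data untranslated ("binary" mode), or raise
        # IOError if the "file" does not exist in the backend.
        try:
            return self.files[path]
        except KeyError:
            raise IOError("file not found: %r" % path)

loader = DictLoader({"/pkg/logo.gif": b"GIF89a..."})
data = loader.get_data("/pkg/logo.gif")
```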

The following set of methods may be implemented if support for (for example) Freeze-like tools is desirable. It consists of three additional methods which, to make it easier for the caller, should either all be implemented, or none at all:

loader.is_package(fullname)
loader.get_code(fullname)
loader.get_source(fullname)

All three methods should raise ImportError if the module wasn't found.

The loader.is_package(fullname) method should return True if the module specified by 'fullname' is a package and False if it isn't.

The loader.get_code(fullname) method should return the code object associated with the module, or None if it's a built-in or extension module. If the loader doesn't have the code object but it does have the source code, it should return the compiled source code. (This is so that our caller doesn't also need to check get_source() if all it needs is the code object.)

The loader.get_source(fullname) method should return the source code for the module as a string (using newline characters for line endings) or None if the source is not available (yet it should still raise ImportError if the module can't be found by the importer at all).
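As a sketch (class and module names invented here), a loader that keeps module source in a dict could supply all three methods; note that get_code() compiles the stored source itself, so a caller that only wants the code object never has to fall back to get_source():

```python
class SourceDictLoader:
    """Hypothetical loader exposing the three optional methods."""
    def __init__(self, sources, packages=()):
        self.sources = sources        # fullname -> source string
        self.packages = set(packages)

    def _check(self, fullname):
        # All three methods raise ImportError for unknown modules.
        if fullname not in self.sources:
            raise ImportError(fullname)

    def is_package(self, fullname):
        self._check(fullname)
        return fullname in self.packages

    def get_source(self, fullname):
        self._check(fullname)
        return self.sources[fullname]

    def get_code(self, fullname):
        # We only have source, so hand back the compiled source.
        return compile(self.get_source(fullname), "<%s>" % fullname, "exec")

loader = SourceDictLoader({"spam": "x = 1"})
code = loader.get_code("spam")
```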

To support execution of modules as scripts [6], the above three methods for finding the code associated with a module must be implemented. In addition to those methods, the following method may be provided in order to allow the runpy module to correctly set the __file__ attribute:

loader.get_filename(fullname)

This method should return the value that __file__ would be set to if the named module was loaded. If the module is not found, then ImportError should be raised.

Integration with the 'imp' module

The new import hooks are not easily integrated in the existing imp.find_module() and imp.load_module() calls. It's questionable whether it's possible at all without breaking code; it is better to simply add a new function to the imp module. The meaning of the existing imp.find_module() and imp.load_module() calls changes from: "they expose the built-in import mechanism" to "they expose the basic unhooked built-in import mechanism". They simply won't invoke any import hooks. A new imp module function is proposed (but not yet implemented) under the name get_loader(), which is used as in the following pattern:

loader = imp.get_loader(fullname, path)
if loader is not None:
    loader.load_module(fullname)

In the case of a "basic" import, one that the imp.find_module() function would handle, the loader object would be a wrapper for the current output of imp.find_module(), and loader.load_module() would call imp.load_module() with that output.

Note that this wrapper is currently not yet implemented, although a Python prototype exists in the test_importhooks.py script (the ImpWrapper class) included with the patch.

Forward Compatibility

Existing __import__ hooks will not invoke new-style hooks by magic, unless they call the original __import__ function as a fallback. For example, ihooks.py, iu.py and imputil.py are in this sense not forward compatible with this PEP.

Open Issues

Modules often need supporting data files to do their job, particularly in the case of complex packages or full applications. Current practice is generally to locate such files via sys.path (or a package.__path__ attribute). This approach will not work, in general, for modules loaded via an import hook.

There are a number of possible ways to address this problem:

  • "Don't do that". If a package needs to locate data files via its __path__, it is not suitable for loading via an import hook. The package can still be located on a directory in sys.path, as at present, so this should not be seen as a major issue.
  • Locate data files from a standard location, rather than relative to the module file. A relatively simple approach (which is supported by distutils) would be to locate data files based on sys.prefix (or sys.exec_prefix). For example, looking in os.path.join(sys.prefix, "data", package_name).
  • Import hooks could offer a standard way of getting at data files relative to the module file. The standard zipimport object provides a method get_data(name) which returns the content of the "file" called name, as a string. To allow modules to get at the importer object, zipimport also adds an attribute __loader__ to the module, containing the zipimport object used to load the module. If such an approach is used, it is important that client code takes care not to break if the get_data() method is not available, so it is not clear that this approach offers a general answer to the problem.

It was suggested on python-dev that it would be useful to be able to receive a list of available modules from an importer and/or a list of available data files for use with the get_data() method. The protocol could grow two additional extensions, say list_modules() and list_files(). The latter makes sense on loader objects with a get_data() method. However, it's a bit unclear which object should implement list_modules(): the importer or the loader or both?

This PEP is biased towards loading modules from alternative places: it currently doesn't offer dedicated solutions for loading modules from alternative file formats or with alternative compilers. In contrast, the ihooks module from the standard library does have a fairly straightforward way to do this. The Quixote project [7] uses this technique to import PTL files as if they are ordinary Python modules. To do the same with the new hooks would either mean to add a new module implementing a subset of ihooks as a new-style importer, or add a hookable built-in path importer object.

There is no specific support within this PEP for "stacking" hooks. For example, it is not obvious how to write a hook to load modules from tar.gz files by combining separate hooks to load modules from .tar and .gz files. However, there is no support for such stacking in the existing hook mechanisms (either the basic "replace __import__" method, or any of the existing import hook modules) and so this functionality is not an obvious requirement of the new mechanism. It may be worth considering as a future enhancement, however.

It is possible (via sys.meta_path) to add hooks which run before sys.path is processed. However, there is no equivalent way of adding hooks to run after sys.path is processed. For now, if a hook is required after sys.path has been processed, it can be simulated by adding an arbitrary "cookie" string at the end of sys.path, and having the required hook associated with this cookie, via the normal sys.path_hooks processing. In the longer term, the path handling code will become a "real" hook on sys.meta_path, and at that stage it will be possible to insert user-defined hooks either before or after it.

Implementation

The PEP 302 implementation has been integrated with Python as of 2.3a1. An earlier version is available as patch #652586 [9], but more interestingly, the issue contains a fairly detailed history of the development and design.

PEP 273 has been implemented using PEP 302's import hooks.

References and Footnotes

[1]imputil module http://docs.python.org/library/imputil.html
[2]The Freeze tool. See also the Tools/freeze/ directory in a Python source distribution
[3]py2exe by Thomas Heller http://www.py2exe.org/
[4]imp.set_frozenmodules() patch http://bugs.python.org/issue642578
[5]The path argument to finder.find_module() is there because the pkg.__path__ variable may be needed at this point. It may either come from the actual parent module or be supplied by imp.find_module() or the proposed imp.get_loader() function.
[6]PEP 338: Executing modules as scripts http://www.python.org/dev/peps/pep-0338/
[7]Quixote, a framework for developing Web applications http://www.mems-exchange.org/software/quixote/
[8]PEP 366: Main module explicit relative imports http://www.python.org/dev/peps/pep-0366/
[9]New import hooks + Import from Zip files http://bugs.python.org/issue652586
[10]Language reference for imports http://docs.python.org/3/reference/import.html
[11]importlib documentation http://docs.python.org/3/library/importlib.html#module-importlib

pep-0303 Extend divmod() for Multiple Divisors

PEP: 303
Title: Extend divmod() for Multiple Divisors
Version: $Revision$
Last-Modified: $Date$
Author: Thomas Bellman <bellman+pep-divmod at lysator.liu.se>
Status: Rejected
Type: Standards Track
Content-Type: text/plain
Created: 31-Dec-2002
Python-Version: 2.3
Post-History: 

Abstract

    This PEP describes an extension to the built-in divmod() function,
    allowing it to take multiple divisors, chaining several calls to
    divmod() into one.

Pronouncement

    This PEP is rejected.  Most uses for chained divmod() involve a
    constant modulus (in radix conversions for example) and are more
    properly coded as a loop.  The example of splitting seconds
    into days/hours/minutes/seconds does not generalize to months
    and years; rather, the whole use case is handled more flexibly and
    robustly by date and time modules.  The other use cases mentioned
    in the PEP are somewhat rare in real code.  The proposal is also
    problematic in terms of clarity and obviousness.  In the examples,
    it is not immediately clear that the argument order is correct or
    that the target tuple is of the right length.  Users from other
    languages are more likely to understand the standard two argument
    form without having to re-read the documentation.  See python-dev
    discussion on 17 June 2005.

Specification

    The built-in divmod() function would be changed to accept multiple
    divisors, changing its signature from divmod(dividend, divisor) to
    divmod(dividend, *divisors).  The dividend is divided by the last
    divisor, giving a quotient and a remainder.  The quotient is then
    divided by the second to last divisor, giving a new quotient and
    remainder.  This is repeated until all divisors have been used,
    and divmod() then returns a tuple consisting of the quotient from
    the last step, and the remainders from all the steps.

    A Python implementation of the new divmod() behaviour could look
    like:

        def divmod(dividend, *divisors):
            modulos = ()
            q = dividend
            while divisors:
                q,r = q.__divmod__(divisors[-1])
                modulos = (r,) + modulos
                divisors = divisors[:-1]
            return (q,) + modulos
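    The reference implementation above can be exercised as ordinary
    Python today (renamed here so the built-in divmod() is left
    untouched):

```python
def chained_divmod(dividend, *divisors):
    # Pure-Python rendering of the proposed divmod() extension,
    # renamed to avoid shadowing the built-in.
    modulos = ()
    q = dividend
    while divisors:
        q, r = divmod(q, divisors[-1])
        modulos = (r,) + modulos
        divisors = divisors[:-1]
    return (q,) + modulos

# One million seconds as weeks, days, hours, minutes, seconds:
wdhms = chained_divmod(1000000, 7, 24, 60, 60)
```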


Motivation

    Occasionally one wants to perform a chain of divmod() operations,
    calling divmod() on the quotient from the previous step, with
    varying divisors.  The most common case is probably converting a
    number of seconds into weeks, days, hours, minutes and seconds.
    This would today be written as:

        def secs_to_wdhms(seconds):
            m,s = divmod(seconds, 60)
            h,m = divmod(m, 60)
            d,h = divmod(h, 24)
            w,d = divmod(d, 7)
            return (w,d,h,m,s)

    This is tedious and easy to get wrong each time you need it.

    If instead the divmod() built-in is changed according to the
    proposal, the code for converting seconds to weeks, days, hours,
    minutes and seconds then becomes

        def secs_to_wdhms(seconds):
            w,d,h,m,s = divmod(seconds, 7, 24, 60, 60)
            return (w,d,h,m,s)

    which is easier to type, easier to type correctly, and easier to
    read.

    Other applications are:

    - Astronomical angles (declination is measured in degrees, minutes
      and seconds, right ascension is measured in hours, minutes and
      seconds).
    - Old British currency (1 pound = 20 shilling, 1 shilling = 12 pence)
    - Anglo-Saxon length units: 1 mile = 1760 yards, 1 yard = 3 feet,
      1 foot = 12 inches.
    - Anglo-Saxon weight units: 1 long ton = 160 stone, 1 stone = 14
      pounds, 1 pound = 16 ounce, 1 ounce = 16 dram
    - British volumes: 1 gallon = 4 quart, 1 quart = 2 pint, 1 pint
      = 20 fluid ounces


Rationale

    The idea comes from APL, which has an operator that does this.  (I
    don't remember what the operator looks like, and it would probably
    be impossible to render in ASCII anyway.)

    The APL operator takes a list as its second operand, while this
    PEP proposes that each divisor should be a separate argument to
    the divmod() function.  This is mainly because it is expected that
    the most common uses will have the divisors as constants right in
    the call (as the 7, 24, 60, 60 above), and adding a set of
    parentheses or brackets would just clutter the call.

    Requiring an explicit sequence as the second argument to divmod()
    would seriously break backwards compatibility.  Making divmod()
    check its second argument for being a sequence is deemed to be too
    ugly to contemplate.  And in the case where one *does* have a
    sequence that is computed other-where, it is easy enough to write
    divmod(x, *divs) instead.

    Requiring at least one divisor, i.e. rejecting divmod(x), has been
    considered, but no good reason to do so has come to mind, so it is
    allowed in the name of generality.

    Calling divmod() with no divisors should still return a tuple (of
    one element).  Code that calls divmod() with a varying number of
    divisors, and thus gets a return value with an "unknown" number of
    elements, would otherwise have to special case that case.  Code
    that *knows* it is calling divmod() with no divisors is considered
    to be too silly to warrant a special case.

    Processing the divisors in the other direction, i.e. dividing with
    the first divisor first, instead of dividing with the last divisor
    first, has been considered.  However, the result comes with the
    most significant part first and the least significant part last
    (think of the chained divmod as a way of splitting a number into
    "digits", with varying weights), and it is reasonable to specify
    the divisors (weights) in the same order as the result.

    The inverse operation:

        def inverse_divmod(seq, *factors):
            product = seq[0]
            for x,y in zip(factors, seq[1:]):
                product = product * x + y
            return product

    could also be useful.  However, writing

        seconds = (((((w * 7) + d) * 24 + h) * 60 + m) * 60 + s)

    is less cumbersome both to write and to read than the chained
    divmods.  It is therefore deemed to be less important, and its
    introduction can be deferred to its own PEP.  Also, such a
    function needs a good name, and the PEP author has not managed to
    come up with one yet.

    Calling divmod("spam") does not raise an error, despite strings
    supporting neither division nor modulo.  However, unless we know
    the other object too, we can't determine whether divmod() would
    work or not, and thus it seems silly to forbid it.


Backwards Compatibility

    Any module that replaces the divmod() function in the __builtin__
    module may cause other modules using the new syntax to break.  It
    is expected that this is very uncommon.

    Code that expects a TypeError exception when calling divmod() with
    anything but two arguments will break.  This is also expected to
    be very uncommon.

    No other issues regarding backwards compatibility are known.


Reference Implementation

    Not finished yet, but it seems a rather straightforward
    new implementation of the function builtin_divmod() in
    Python/bltinmodule.c


Copyright

    This document has been placed in the public domain.



pep-0304 Controlling Generation of Bytecode Files

PEP:304
Title:Controlling Generation of Bytecode Files
Version:$Revision$
Last-Modified:$Date$
Author:Skip Montanaro
Status:Withdrawn
Type:Standards Track
Content-Type:text/x-rst
Created:22-Jan-2003
Post-History:27-Jan-2003, 31-Jan-2003, 17-Jun-2005

Abstract

This PEP outlines a mechanism for controlling the generation and location of compiled Python bytecode files. This idea originally arose as a patch request [1] and evolved into a discussion thread on the python-dev mailing list [2]. The introduction of an environment variable will allow people installing Python or Python-based third-party packages to control whether or not bytecode files should be generated at installation time, and if so, where they should be written. It will also allow users to control whether or not bytecode files should be generated at application run-time, and if so, where they should be written.

Proposal

Add a new environment variable, PYTHONBYTECODEBASE, to the mix of environment variables which Python understands. PYTHONBYTECODEBASE is interpreted as follows:

  • If not defined, Python bytecode is generated in exactly the same way as is currently done. sys.bytecodebase is set to the root directory (either / on Unix and Mac OSX or the root directory of the startup (installation???) drive -- typically C:\ -- on Windows).

  • If defined and it refers to an existing directory to which the user has write permission, sys.bytecodebase is set to that directory and bytecode files are written into a directory structure rooted at that location.

  • If defined but empty, sys.bytecodebase is set to None and generation of bytecode files is suppressed altogether.

  • If defined and one of the following is true:

    • it does not refer to a directory,
    • it refers to a directory, but not one for which the user has write permission

    a warning is displayed, sys.bytecodebase is set to None and generation of bytecode files is suppressed altogether.

After startup initialization, all runtime references are to sys.bytecodebase, not the PYTHONBYTECODEBASE environment variable. sys.path is not modified.

From the above, we see sys.bytecodebase can only take on two valid types of values: None or a string referring to a valid directory on the system.

During import, this extension works as follows:

  • The normal search for a module is conducted. The search order is roughly: dynamically loaded extension module, Python source file, Python bytecode file. The only time this mechanism comes into play is if a Python source file is found.
  • Once we've found a source module, an attempt to read a byte-compiled file in the same directory is made. (This is the same as before.)
  • If no byte-compiled file is found, an attempt to read a byte-compiled file from the augmented directory is made.
  • If bytecode generation is required, the generated bytecode is written to the augmented directory if possible.

Note that this PEP is explicitly not about providing module-by-module or directory-by-directory control over the disposition of bytecode files.

Glossary

  • "bytecode base" refers to the current setting of sys.bytecodebase.
  • "augmented directory" refers to the directory formed from the bytecode base and the directory name of the source file.
  • PYTHONBYTECODEBASE refers to the environment variable when necessary to distinguish it from "bytecode base".

Locating bytecode files

When the interpreter is searching for a module, it will use sys.path as usual. However, when a possible bytecode file is considered, an extra probe for a bytecode file may be made. First, a check is made for the bytecode file using the directory in sys.path which holds the source file (the current behavior). If a valid bytecode file is not found there (either one does not exist or exists but is out-of-date) and the bytecode base is not None, a second probe is made using the directory in sys.path prefixed appropriately by the bytecode base.

Writing bytecode files

When the bytecode base is not None, a new bytecode file is written to the appropriate augmented directory, never directly to a directory in sys.path.

Defining augmented directories

Conceptually, the augmented directory for a bytecode file is the directory in which the source file exists prefixed by the bytecode base. In a Unix environment this would be:

pcb = os.path.abspath(sys.bytecodebase)
if sourcefile[0] == os.sep: sourcefile = sourcefile[1:]
augdir = os.path.join(pcb, os.path.dirname(sourcefile))

On Windows, which does not have a single-rooted directory tree, the drive letter of the directory containing the source file is treated as a directory component after removing the trailing colon. The augmented directory is thus derived as

pcb = os.path.abspath(sys.bytecodebase)
drive, base = os.path.splitdrive(os.path.dirname(sourcefile))
drive = drive[:-1]
if base[0] == "\\": base = base[1:]
augdir = os.path.join(pcb, drive, base)
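The Unix computation above can be folded into one self-contained helper (the function name is invented for this sketch; posixpath is used so the Unix behaviour is explicit regardless of the host platform, and the base is assumed to be an already-absolute path rather than the proposed sys.bytecodebase):

```python
import posixpath

def augmented_directory(bytecode_base, sourcefile):
    """Unix flavour of the augmented-directory computation:
    the source file's directory, prefixed by the bytecode base."""
    # Strip the leading separator so the join treats the source
    # directory as a relative path under the bytecode base.
    if sourcefile.startswith(posixpath.sep):
        sourcefile = sourcefile[1:]
    return posixpath.join(bytecode_base, posixpath.dirname(sourcefile))

augdir = augmented_directory("/tmp", "/usr/lib/python2.3/urllib.py")
```

With a bytecode base of /tmp, the bytecode for /usr/lib/python2.3/urllib.py lands under /tmp/usr/lib/python2.3, matching the Examples section.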

Fixing the location of the bytecode base

During program startup, the value of the PYTHONBYTECODEBASE environment variable is made absolute, checked for validity and added to the sys module, effectively:

pcb = os.path.abspath(os.environ["PYTHONBYTECODEBASE"])
probe = os.path.join(pcb, "foo")
try:
    open(probe, "w")
except IOError:
    sys.bytecodebase = None
else:
    os.unlink(probe)
    sys.bytecodebase = pcb

This allows the user to specify the bytecode base as a relative path, but not have it subject to changes to the current working directory during program execution. (I can't imagine you'd want it to move around during program execution.)

There is nothing special about sys.bytecodebase. The user may change it at runtime if desired, but normally it will not be modified.

Rationale

In many environments it is not possible for non-root users to write into directories containing Python source files. Most of the time, this is not a problem as Python source is generally byte compiled during installation. However, there are situations where bytecode files are either missing or need to be updated. If the directory containing the source file is not writable by the current user a performance penalty is incurred each time a program importing the module is run. [3] Warning messages may also be generated in certain circumstances. If the directory is writable, nearly simultaneous attempts to write the bytecode file by two separate processes may occur, resulting in file corruption. [4]

In environments with RAM disks available, it may be desirable for performance reasons to write bytecode files to a directory on such a disk. Similarly, in environments where Python source code resides on network file systems, it may be desirable to cache bytecode files on local disks.

Alternatives

The only other alternative proposed so far [1] seems to be to add a -R flag to the interpreter to disable writing bytecode files altogether. This proposal subsumes that. Adding a command-line option is certainly possible, but is probably not sufficient, as the interpreter's command line is not readily available during installation (early during program startup???).

Issues

  • Interpretation of a module's __file__ attribute. I believe the __file__ attribute of a module should reflect the true location of the bytecode file. If people want to locate a module's source code, they should use imp.find_module(module).
  • Security - What if root has PYTHONBYTECODEBASE set? Yes, this can present a security risk, but so can many other things the root user does. The root user should probably not set PYTHONBYTECODEBASE except possibly during installation. Still, perhaps this problem can be minimized. When running as root the interpreter should check to see if PYTHONBYTECODEBASE refers to a directory which is writable by anyone other than root. If so, it could raise an exception or warning and set sys.bytecodebase to None. Or, see the next item.
  • More security - What if PYTHONBYTECODEBASE refers to a general directory (say, /tmp)? In this case, perhaps loading of a preexisting bytecode file should occur only if the file is owned by the current user or root. (Does this matter on Windows?)
  • The interaction of this PEP with import hooks has not been considered yet. In fact, the best way to implement this idea might be as an import hook. See PEP 302. [5]
  • In the current (pre-PEP 304) environment, it is safe to delete a source file after the corresponding bytecode file has been created, since they reside in the same directory. With PEP 304 as currently defined, this is not the case. A bytecode file in the augmented directory is only considered when the source file is present, and is thus never considered when looking for module files ending in ".pyc". I think this behavior may have to change.

Examples

In the examples which follow, the urllib source code resides in /usr/lib/python2.3/urllib.py and /usr/lib/python2.3 is in sys.path but is not writable by the current user.

  • The bytecode base is /tmp. /usr/lib/python2.3/urllib.pyc exists and is valid. When urllib is imported, the contents of /usr/lib/python2.3/urllib.pyc are used. The augmented directory is not consulted. No other bytecode file is generated.
  • The bytecode base is /tmp. /usr/lib/python2.3/urllib.pyc exists, but is out-of-date. When urllib is imported, the generated bytecode file is written to urllib.pyc in the augmented directory which has the value /tmp/usr/lib/python2.3. Intermediate directories will be created as needed.
  • The bytecode base is None. No urllib.pyc file is found. When urllib is imported, no bytecode file is written.
  • The bytecode base is /tmp. No urllib.pyc file is found. When urllib is imported, the generated bytecode file is written to the augmented directory which has the value /tmp/usr/lib/python2.3. Intermediate directories will be created as needed.
  • At startup, PYTHONBYTECODEBASE is /tmp/foobar, which does not exist. A warning is emitted, sys.bytecodebase is set to None and no bytecode files are written during program execution unless sys.bytecodebase is later changed to refer to a valid, writable directory.
  • At startup, PYTHONBYTECODEBASE is set to /, which exists, but is not writable by the current user. A warning is emitted, sys.bytecodebase is set to None and no bytecode files are written during program execution unless sys.bytecodebase is later changed to refer to a valid, writable directory. Note that even though the augmented directory constructed for a particular bytecode file may be writable by the current user, what counts is that the bytecode base directory itself is writable.
  • At startup PYTHONBYTECODEBASE is set to the empty string. sys.bytecodebase is set to None. No warning is generated, however. If no urllib.pyc file is found when urllib is imported, no bytecode file is written.
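
The augmented-directory construction used throughout these examples can be sketched in a few lines of Python. This is an illustrative sketch only (the function name is invented); the PEP's patch implements the logic in C inside the interpreter:

```python
import posixpath

def augmented_dir(bytecode_base, source_dir):
    # Strip the leading "/" so the absolute source directory is treated
    # as relative to the bytecode base:
    #   /tmp + /usr/lib/python2.3 -> /tmp/usr/lib/python2.3
    return posixpath.join(bytecode_base, source_dir.lstrip("/"))

print(augmented_dir("/tmp", "/usr/lib/python2.3"))
```

Intermediate directories under the bytecode base would be created on demand when a bytecode file is first written there.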

In the Windows examples which follow, the urllib source code resides in C:\PYTHON22\urllib.py. C:\PYTHON22 is in sys.path but is not writable by the current user.

  • The bytecode base is set to C:\TEMP. C:\PYTHON22\urllib.pyc exists and is valid. When urllib is imported, the contents of C:\PYTHON22\urllib.pyc are used. The augmented directory is not consulted.
  • The bytecode base is set to C:\TEMP. C:\PYTHON22\urllib.pyc exists, but is out-of-date. When urllib is imported, a new bytecode file is written to the augmented directory which has the value C:\TEMP\C\PYTHON22. Intermediate directories will be created as needed.
  • At startup PYTHONBYTECODEBASE is set to TEMP and the current working directory at application startup is H:\NET. The potential bytecode base is thus H:\NET\TEMP. If this directory exists and is writable by the current user, sys.bytecodebase will be set to that value. If not, a warning will be emitted and sys.bytecodebase will be set to None.
  • The bytecode base is C:\TEMP. No urllib.pyc file is found. When urllib is imported, the generated bytecode file is written to the augmented directory which has the value C:\TEMP\C\PYTHON22. Intermediate directories will be created as needed.
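
On Windows the drive letter loses its colon and becomes the first path component under the bytecode base. A sketch of that mapping (function name invented for illustration; the real patch does this in C):

```python
def augmented_dir_windows(bytecode_base, source_dir):
    # C:\TEMP + C:\PYTHON22 -> C:\TEMP\C\PYTHON22
    drive, rest = source_dir.split(":", 1)
    return "\\".join([bytecode_base.rstrip("\\"), drive, rest.lstrip("\\")])

print(augmented_dir_windows("C:\\TEMP", "C:\\PYTHON22"))
```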

Implementation

See the patch on Sourceforge. [6]

References

[1] patch 602345, Option for not writing py.[co] files, Klose (http://www.python.org/sf/602345)
[2]python-dev thread, Disable writing .py[co], Norwitz (http://mail.python.org/pipermail/python-dev/2003-January/032270.html)
[3]Debian bug report, Mailman is writing to /usr in cron, Wegner (http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=96111)
[4]python-dev thread, Parallel pyc construction, Dubois (http://mail.python.org/pipermail/python-dev/2003-January/032060.html)
[5]PEP 302, New Import Hooks, van Rossum and Moore (http://www.python.org/dev/peps/pep-0302)
[6]patch 677103, PYTHONBYTECODEBASE patch (PEP 304), Montanaro (http://www.python.org/sf/677103)

pep-0305 CSV File API

PEP:305
Title:CSV File API
Version:$Revision$
Last-Modified:$Date$
Author:Kevin Altis <altis at semi-retired.com>, Dave Cole <djc at object-craft.com.au>, Andrew McNamara <andrewm at object-craft.com.au>, Skip Montanaro <skip at pobox.com>, Cliff Wells <LogiplexSoftware at earthlink.net>
Discussions-To:<csv at python.org>
Status:Final
Type:Standards Track
Content-Type:text/x-rst
Created:26-Jan-2003
Post-History:31-Jan-2003, 13-Feb-2003

Abstract

The Comma Separated Values (CSV) file format is the most common import and export format for spreadsheets and databases. Although many CSV files are simple to parse, the format is not formally defined by a stable specification and is subtle enough that parsing lines of a CSV file with something like line.split(",") is eventually bound to fail. This PEP defines an API for reading and writing CSV files. It is accompanied by a corresponding module which implements the API.

To Do (Notes for the Interested and Ambitious)

Application Domain

This PEP is about doing one thing well: parsing tabular data which may use a variety of field separators, quoting characters, quote escape mechanisms and line endings. The authors intend the proposed module to solve this one parsing problem efficiently. The authors do not intend to address any of these related topics:

  • data interpretation (is a field containing the string "10" supposed to be a string, a float or an int? is it a number in base 10, base 16 or base 2? is a number in quotes a number or a string?)
  • locale-specific data representation (should the number 1.23 be written as "1.23" or "1,23" or "1 23"?) -- this may eventually be addressed.
  • fixed width tabular data - can already be parsed reliably.

Rationale

Often, CSV files are formatted simply enough that you can get by reading them line-by-line and splitting on the commas which delimit the fields. This is especially true if all the data being read is numeric. This approach may work for a while, then come back to bite you when somebody puts something unexpected in the data, like a comma. As you dig into the problem you may eventually come to the conclusion that you can solve the problem using regular expressions. This will work for a while, then break mysteriously one day. The problem grows, so you dig deeper and eventually realize that you need a purpose-built parser for the format.

CSV formats are not well-defined and different implementations have a number of subtle corner cases. It has been suggested that the "V" in the acronym stands for "Vague" instead of "Values". Different delimiters and quoting characters are just the start. Some programs generate whitespace after each delimiter which is not part of the following field. Others quote embedded quoting characters by doubling them, others by prefixing them with an escape character. The list of weird ways to do things can seem endless.

All this variability means it is difficult for programmers to reliably parse CSV files from many sources or generate CSV files designed to be fed to specific external programs without a thorough understanding of those sources and programs. This PEP and the software which accompanies it attempt to make the process less fragile.

Existing Modules

This problem has been tackled before. At least three modules currently available in the Python community enable programmers to read and write CSV files:

  • Object Craft's CSV module [2]
  • Cliff Wells' Python-DSV module [3]
  • Laurence Tratt's ASV module [4]

Each has a different API, making it somewhat difficult for programmers to switch between them. More of a problem may be that they interpret some of the CSV corner cases differently, so even after surmounting the differences between the different module APIs, the programmer has to also deal with semantic differences between the packages.

Module Interface

This PEP supports three basic APIs, one to read and parse CSV files, one to write them, and one to identify different CSV dialects to the readers and writers.

Reading CSV Files

CSV readers are created with the reader factory function:

obj = reader(iterable [, dialect='excel']
             [optional keyword args])

A reader object is an iterator which takes an iterable object returning lines as the sole required parameter. If it supports a binary mode (file objects do), the iterable argument to the reader function must have been opened in binary mode. This gives the reader object full control over the interpretation of the file's contents. The optional dialect parameter is discussed below. The reader function also accepts several optional keyword arguments which define specific format settings for the parser (see the section "Formatting Parameters"). Readers are typically used as follows:

csvreader = csv.reader(file("some.csv"))
for row in csvreader:
    process(row)

Each row returned by a reader object is a list of strings or Unicode objects.
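
The snippet above is Python 2 (the file() builtin). For comparison, in the csv module as it eventually shipped in Python 3, the same pattern works on text files opened with newline='', which gives the reader the control over line endings that binary mode provides here. A small self-contained sketch, with io.StringIO standing in for an open file:

```python
import csv
import io

# io.StringIO stands in for open("some.csv", newline="").
data = io.StringIO('a,b,c\n"1,5",2,3\n')
rows = list(csv.reader(data))
print(rows)  # [['a', 'b', 'c'], ['1,5', '2', '3']]
```

Note that the quoted field "1,5" comes back as a single string, which line.split(",") would have gotten wrong.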

When both a dialect parameter and individual formatting parameters are passed to the constructor, first the dialect is queried for formatting parameters, then individual formatting parameters are examined.

Writing CSV Files

Creating writers is similar:

obj = writer(fileobj [, dialect='excel'],
             [optional keyword args])

A writer object is a wrapper around a file-like object opened for writing in binary mode (if such a distinction is made). It accepts the same optional keyword parameters as the reader constructor.

Writers are typically used as follows:

csvwriter = csv.writer(file("some.csv", "w"))
for row in someiterable:
    csvwriter.writerow(row)

To generate a set of field names as the first row of the CSV file, the programmer must explicitly write it, e.g.:

csvwriter = csv.writer(file("some.csv", "w"), fieldnames=names)
csvwriter.write(names)
for row in someiterable:
    csvwriter.write(row)

or arrange for it to be the first row in the iterable being written.

Managing Different Dialects

Because CSV is a somewhat ill-defined format, there are plenty of ways one CSV file can differ from another, yet contain exactly the same data. Many tools which can import or export tabular data allow the user to indicate the field delimiter, quote character, line terminator, and other characteristics of the file. These can be fairly easily determined, but are still mildly annoying to figure out, and make for fairly long function calls when specified individually.

To try and minimize the difficulty of figuring out and specifying a bunch of formatting parameters, reader and writer objects support a dialect argument which is just a convenient handle on a group of these lower level parameters. When a dialect is given as a string it identifies one of the dialects known to the module via its registration functions, otherwise it must be an instance of the Dialect class as described below.

Dialects will generally be named after applications or organizations which define specific sets of format constraints. Two dialects are defined in the module as of this writing, "excel", which describes the default format constraints for CSV file export by Excel 97 and Excel 2000, and "excel-tab", which is the same as "excel" but specifies an ASCII TAB character as the field delimiter.

Dialects are implemented as attribute-only classes to enable users to construct variant dialects by subclassing. The "excel" dialect is a subclass of Dialect and is defined as follows:

class Dialect:
    # placeholders
    delimiter = None
    quotechar = None
    escapechar = None
    doublequote = None
    skipinitialspace = None
    lineterminator = None
    quoting = None

class excel(Dialect):
    delimiter = ','
    quotechar = '"'
    doublequote = True
    skipinitialspace = False
    lineterminator = '\r\n'
    quoting = QUOTE_MINIMAL

The "excel-tab" dialect is defined as:

class exceltsv(excel):
    delimiter = '\t'

(For a description of the individual formatting parameters see the section "Formatting Parameters".)

To enable string references to specific dialects, the module defines several functions:

dialect = get_dialect(name)
names = list_dialects()
register_dialect(name, dialect)
unregister_dialect(name)

get_dialect() returns the dialect instance associated with the given name. list_dialects() returns a list of all registered dialect names. register_dialect() associates a string name with a dialect class. unregister_dialect() deletes a name/dialect association.
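
These registration functions survive essentially unchanged in the csv module that Python ships today. A brief demonstration (the "semicolons" dialect name is invented for illustration):

```python
import csv

# Variant dialect built by subclassing, as described above.
class semicolons(csv.excel):
    delimiter = ';'

csv.register_dialect('semicolons', semicolons)
print('semicolons' in csv.list_dialects())      # True
print(csv.get_dialect('semicolons').delimiter)  # ;
csv.unregister_dialect('semicolons')
```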

Formatting Parameters

In addition to the dialect argument, both the reader and writer constructors take several specific formatting parameters, specified as keyword parameters. The formatting parameters understood are:

  • quotechar specifies a one-character string to use as the quoting character. It defaults to '"'. Setting this to None has the same effect as setting quoting to csv.QUOTE_NONE.
  • delimiter specifies a one-character string to use as the field separator. It defaults to ','.
  • escapechar specifies a one-character string used to escape the delimiter when quotechar is set to None.
  • skipinitialspace specifies how to interpret whitespace which immediately follows a delimiter. It defaults to False, which means that whitespace immediately following a delimiter is part of the following field.
  • lineterminator specifies the character sequence which should terminate rows.
  • quoting controls when quotes should be generated by the writer. It can take on any of the following module constants:
    • csv.QUOTE_MINIMAL means only when required, for example, when a field contains either the quotechar or the delimiter
    • csv.QUOTE_ALL means that quotes are always placed around fields.
    • csv.QUOTE_NONNUMERIC means that quotes are always placed around nonnumeric fields.
    • csv.QUOTE_NONE means that quotes are never placed around fields.
  • doublequote controls the handling of quotes inside fields. When True two consecutive quotes are interpreted as one during read, and when writing, each quote is written as two quotes.
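
The effect of the quoting constants is easiest to see side by side. A short demonstration using the csv module as shipped in modern Python:

```python
import csv
import io

def written(mode):
    # Write one row under the given quoting mode and return the text.
    out = io.StringIO()
    csv.writer(out, quoting=mode, lineterminator='\n').writerow(['abc', 'a,b', 42])
    return out.getvalue()

print(written(csv.QUOTE_MINIMAL))     # abc,"a,b",42
print(written(csv.QUOTE_ALL))         # "abc","a,b","42"
print(written(csv.QUOTE_NONNUMERIC))  # "abc","a,b",42
```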

When processing a dialect setting and one or more of the other optional parameters, the dialect parameter is processed before the individual formatting parameters. This makes it easy to choose a dialect, then override one or more of the settings without defining a new dialect class. For example, if a CSV file was generated by Excel 2000 using single quotes as the quote character and a colon as the delimiter, you could create a reader like:

csvreader = csv.reader(file("some.csv"), dialect="excel",
                       quotechar="'", delimiter=':')

Other details of how Excel generates CSV files would be handled automatically because of the reference to the "excel" dialect.

Reader Objects

Reader objects are iterators whose next() method returns a sequence of strings, one string per field in the row.

Writer Objects

Writer objects have two methods, writerow() and writerows(). The former accepts an iterable (typically a list) of fields which are to be written to the output. The latter accepts a list of iterables and calls writerow() for each.
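
In the csv module as shipped, the two methods behave as described; a minimal demonstration:

```python
import csv
import io

out = io.StringIO()
w = csv.writer(out, lineterminator='\n')
w.writerow(['name', 'size'])                 # one row of fields
w.writerows([['a.txt', 12], ['b.txt', 34]])  # a list of rows
print(out.getvalue())
```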

Implementation

There is a sample implementation available. [1] The goal is for it to efficiently implement the API described in the PEP. It is heavily based on the Object Craft csv module. [2]

Testing

The sample implementation [1] includes a set of test cases.

Issues

  1. Should a parameter control how consecutive delimiters are interpreted? Our thought is "no". Consecutive delimiters should always denote an empty field.

  2. What about Unicode? Is it sufficient to pass a file object gotten from codecs.open()? For example:

    csvreader = csv.reader(codecs.open("some.csv", "r", "cp1252"))
    
    csvwriter = csv.writer(codecs.open("some.csv", "w", "utf-8"))
    

    In the first example, text would be assumed to be encoded as cp1252. Should the system be aggressive in converting to Unicode or should Unicode strings only be returned if necessary?

    In the second example, the file will take care of automatically encoding Unicode strings as utf-8 before writing to disk.

    Note: As of this writing, the csv module doesn't handle Unicode data.

  3. What about alternate escape conventions? If the dialect in use includes an escapechar parameter which is not None and the quoting parameter is set to QUOTE_NONE, delimiters appearing within fields will be prefixed by the escape character when writing and are expected to be prefixed by the escape character when reading.

  4. Should there be a "fully quoted" mode for writing? What about "fully quoted except for numeric values"? Both are implemented (QUOTE_ALL and QUOTE_NONNUMERIC, respectively).

  5. What about end-of-line? If I generate a CSV file on a Unix system, will Excel properly recognize the LF-only line terminators? Files must be opened for reading or writing as appropriate using binary mode. Specify the lineterminator sequence as '\r\n'. The resulting file will be written correctly.

  6. What about an option to generate dicts from the reader and accept dicts by the writer? See the DictReader and DictWriter classes in csv.py.

  7. Are quote character and delimiters limited to single characters? For the time being, yes.

  8. How should rows of different lengths be handled? Interpretation of the data is the application's job. There is no such thing as a "short row" or a "long row" at this level.
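
Two of the answers above (items 3 and 6) are easy to demonstrate with the csv module as shipped in modern Python:

```python
import csv
import io

# Item 3: QUOTE_NONE plus an escapechar escapes embedded delimiters
# instead of quoting the field.
out = io.StringIO()
csv.writer(out, quoting=csv.QUOTE_NONE, escapechar='\\',
           lineterminator='\n').writerow(['a,b', 'c'])
print(out.getvalue(), end='')  # a\,b,c

# Item 6: DictReader maps each row onto the field names in the header.
rows = list(csv.DictReader(io.StringIO('name,size\na.txt,12\n')))
print(rows[0]['name'], rows[0]['size'])  # a.txt 12
```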

References

[1] csv module, Python Sandbox (http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/python/python/nondist/sandbox/csv/)
[2] csv module, Object Craft (http://www.object-craft.com.au/projects/csv)
[3]Python-DSV module, Wells (http://sourceforge.net/projects/python-dsv/)
[4]ASV module, Tratt (http://tratt.net/laurie/python/asv/)

There are many references to other CSV-related projects on the Web. A few are included here.

pep-0306 How to Change Python's Grammar

PEP: 306
Title: How to Change Python's Grammar
Version: $Revision$
Last-Modified: $Date$
Author: Michael Hudson <mwh at python.net>, Jack Diederich <jackdied at gmail.com>, Nick Coghlan <ncoghlan at gmail.com>, Benjamin Peterson <benjamin at python.org>
Status: Withdrawn
Type: Informational
Content-Type: text/plain
Created: 29-Jan-2003
Post-History: 30-Jan-2003

Note

    This PEP has been moved to the Python dev guide.


Abstract

    There's more to changing Python's grammar than editing
    Grammar/Grammar and Python/compile.c.  This PEP aims to be a
    checklist of places that must also be fixed.

    It is probably incomplete.  If you see omissions, just add them if
    you can -- you are not going to offend the author's sense of
    ownership.  Otherwise submit a bug or patch and assign it to mwh.

    This PEP is not intended to be an instruction manual on Python
    grammar hacking, for several reasons.


Rationale

    People are getting this wrong all the time; it took well over a
    year before someone noticed[1] that adding the floor division
    operator (//) broke the parser module.


Checklist

    __ Grammar/Grammar: OK, you'd probably worked this one out :)

    __ Parser/Python.asdl may need changes to match the Grammar.  Run
       make to regenerate Include/Python-ast.h and
       Python/Python-ast.c.

    __ Python/ast.c will need changes to create the AST objects
       involved with the Grammar change.  Lib/compiler/ast.py will
       need matching changes to the pure-python AST objects.

    __ Parser/pgen needs to be rerun to regenerate Include/graminit.h
       and Python/graminit.c. (make should handle this for you.)

    __ Python/symtable.c: This handles the symbol collection pass
       that happens immediately before the compilation pass.

    __ Python/compile.c: You will need to create or modify the
       compiler_* functions to generate opcodes for your productions.

    __ You may need to regenerate Lib/symbol.py and/or Lib/token.py
       and/or Lib/keyword.py.

    __ The parser module.  Add some of your new syntax to test_parser,
       bang on Modules/parsermodule.c until it passes.

    __ Add some usage of your new syntax to test_grammar.py

    __ The compiler package.  A good test is to compile the standard
       library and test suite with the compiler package and then check
       it runs.  Note that this only needs to be done in Python 2.x.

    __ If you've gone so far as to change the token structure of
       Python, then the Lib/tokenize.py library module will need to
       be changed.

    __ Certain changes may require tweaks to the library module
       pyclbr.

    __ Documentation must be written!

    __ After everything's been checked in, you're likely to see a new
       change to Python/Python-ast.c.  This is because this
       (generated) file contains the SVN version of the source from
       which it was generated.  There's no way to avoid this; you just
       have to submit this file separately.


References

    [1] SF Bug #676521, parser module validation failure
        http://www.python.org/sf/676521


Copyright

    This document has been placed in the public domain.



pep-0307 Extensions to the pickle protocol

PEP: 307
Title: Extensions to the pickle protocol
Version: $Revision$
Last-Modified: $Date$
Author: Guido van Rossum, Tim Peters
Status: Final
Type: Standards Track
Content-Type: text/plain
Created: 31-Jan-2003
Post-History: 7-Feb-2003

Introduction

    Pickling new-style objects in Python 2.2 is done somewhat clumsily
    and causes pickle size to bloat compared to classic class
    instances.  This PEP documents a new pickle protocol in Python 2.3
    that takes care of this and many other pickle issues.

    There are two sides to specifying a new pickle protocol: the byte
    stream constituting pickled data must be specified, and the
    interface between objects and the pickling and unpickling engines
    must be specified.  This PEP focuses on API issues, although it
    may occasionally touch on byte stream format details to motivate a
    choice.  The pickle byte stream format is documented formally by
    the standard library module pickletools.py (already checked into
    CVS for Python 2.3).

    This PEP attempts to fully document the interface between pickled
    objects and the pickling process, highlighting additions by
    specifying "new in this PEP".  (The interface to invoke pickling
    or unpickling is not covered fully, except for the changes to the
    API for specifying the pickling protocol to picklers.)


Motivation

    Pickling new-style objects causes serious pickle bloat.  For
    example,

        class C(object): # Omit "(object)" for classic class
            pass
        x = C()
        x.foo = 42
        print len(pickle.dumps(x, 1))

    The binary pickle for the classic object consumed 33 bytes, and for
    the new-style object 86 bytes.

    The reasons for the bloat are complex, but are mostly caused by
    the fact that new-style objects use __reduce__ in order to be
    picklable at all.  After ample consideration we've concluded that
    the only way to reduce pickle sizes for new-style objects is to
    add new opcodes to the pickle protocol.  The net result is that
    with the new protocol, the pickle size in the above example is 35
    (two extra bytes are used at the start to indicate the protocol
    version, although this isn't strictly necessary).
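
    Classic classes are gone in Python 3, so the 33-byte figure cannot
    be reproduced there, but the protocol-2 saving for new-style
    instances is still visible (the sizes printed will differ from the
    Python 2 numbers quoted above):

```python
import pickle

class C:
    pass

x = C()
x.foo = 42

# Protocol 2 produces a noticeably smaller pickle than protocol 0
# for instances of new-style classes.
for proto in (0, 1, 2):
    print(proto, len(pickle.dumps(x, proto)))
```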


Protocol versions

    Previously, pickling (but not unpickling) distinguished between
    text mode and binary mode.  By design, binary mode is a
    superset of text mode, and unpicklers don't need to know in
    advance whether an incoming pickle uses text mode or binary mode.
    The virtual machine used for unpickling is the same regardless of
    the mode; certain opcodes simply aren't used in text mode.

    Retroactively, text mode is now called protocol 0, and binary mode
    protocol 1.  The new protocol is called protocol 2.  In the
    tradition of pickling protocols, protocol 2 is a superset of
    protocol 1.  But just so that future pickling protocols aren't
    required to be supersets of the oldest protocols, a new opcode is
    inserted at the start of a protocol 2 pickle indicating that it is
    using protocol 2.  To date, each release of Python has been able to
    read pickles written by all previous releases.  Of course pickles
    written under protocol N can't be read by versions of Python
    earlier than the one that introduced protocol N.

    Several functions, methods and constructors used for pickling used
    to take a positional argument named 'bin' which was a flag,
    defaulting to 0, indicating binary mode.  This argument is renamed
    to 'protocol' and now gives the protocol number, still defaulting
    to 0.

    It so happens that passing 2 for the 'bin' argument in previous
    Python versions had the same effect as passing 1.  Nevertheless, a
    special case is added here:  passing a negative number selects the
    highest protocol version supported by a particular implementation.
    This works in previous Python versions, too, and so can be used to
    select the highest protocol available in a way that's both backward
    and forward compatible.  In addition, a new module constant
    HIGHEST_PROTOCOL is supplied by both pickle and cPickle, equal to
    the highest protocol number the module can read.  This is cleaner
    than passing -1, but cannot be used before Python 2.3.
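
    Both spellings are still available in the modern pickle module:

```python
import pickle

# A negative number selects the newest protocol; HIGHEST_PROTOCOL
# (added in Python 2.3) names it explicitly.
blob = pickle.dumps([1, 2, 3], -1)
same = pickle.dumps([1, 2, 3], pickle.HIGHEST_PROTOCOL)
print(blob == same)        # True
print(pickle.loads(blob))  # [1, 2, 3]
```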

    The pickle.py module has supported passing the 'bin' value as a
    keyword argument rather than a positional argument.  (This is not
    recommended, since cPickle only accepts positional arguments, but
    it works...)  Passing 'bin' as a keyword argument is deprecated,
    and a PendingDeprecationWarning is issued in this case.  You have
    to invoke the Python interpreter with -Wa or a variation on that
    to see PendingDeprecationWarning messages.  In Python 2.4, the
    warning class may be upgraded to DeprecationWarning.


Security issues

    In previous versions of Python, unpickling would do a "safety
    check" on certain operations, refusing to call functions or
    constructors that weren't marked as "safe for unpickling" by
    either having an attribute __safe_for_unpickling__ set to 1, or by
    being registered in a global registry, copy_reg.safe_constructors.

    This feature gives a false sense of security: nobody has ever done
    the necessary, extensive, code audit to prove that unpickling
    untrusted pickles cannot invoke unwanted code, and in fact bugs in
    the Python 2.2 pickle.py module make it easy to circumvent these
    security measures.

    We firmly believe that, on the Internet, it is better to know that
    you are using an insecure protocol than to trust a protocol to be
    secure whose implementation hasn't been thoroughly checked.  Even
    high quality implementations of widely used protocols are
    routinely found flawed; Python's pickle implementation simply
    cannot make such guarantees without a much larger time investment.
    Therefore, as of Python 2.3, all safety checks on unpickling are
    officially removed, and replaced with this warning:

      *** Do not unpickle data received from an untrusted or
          unauthenticated source ***

    The same warning applies to previous Python versions, despite the
    presence of safety checks there.


Extended __reduce__ API

    There are several APIs that a class can use to control pickling.
    Perhaps the most popular of these are __getstate__ and
    __setstate__; but the most powerful one is __reduce__.  (There's
    also __getinitargs__, and we're adding __getnewargs__ below.)

    There are several ways to provide __reduce__ functionality: a
    class can implement a __reduce__ method or a __reduce_ex__ method
    (see next section), or a reduce function can be declared in
    copy_reg (copy_reg.dispatch_table maps classes to functions).  The
    return values are interpreted exactly the same, though, and we'll
    refer to these collectively as __reduce__.

    IMPORTANT: pickling of classic class instances does not look for a
    __reduce__ or __reduce_ex__ method or a reduce function in the
    copy_reg dispatch table, so that a classic class cannot provide
    __reduce__ functionality in the sense intended here.  A classic
    class must use __getinitargs__ and/or __getstate__ to customize
    pickling.  These are described below.

    __reduce__ must return either a string or a tuple.  If it returns
    a string, this is an object whose state is not to be pickled, but
    instead a reference to an equivalent object referenced by name.
    Surprisingly, the string returned by __reduce__ should be the
    object's local name (relative to its module); the pickle module
    searches the module namespace to determine the object's module.

    The rest of this section is concerned with the tuple returned by
    __reduce__.  It is a variable size tuple, of length 2 through 5.
    The first two items (function and arguments) are required.  The
    remaining items are optional and may be left off from the end;
    giving None for the value of an optional item acts the same as
    leaving it off.  The last two items are new in this PEP.  The items
    are, in order:

    function     Required.
                 A callable object (not necessarily a function) called
                 to create the initial version of the object; state
                 may be added to the object later to fully reconstruct
                 the pickled state.  This function must itself be
                 picklable.  See the section about __newobj__ for a
                 special case (new in this PEP) here.

    arguments    Required.
                 A tuple giving the argument list for the function.
                 As a special case, designed for Zope 2's
                 ExtensionClass, this may be None; in that case,
                 function should be a class or type, and
                 function.__basicnew__() is called to create the
                 initial version of the object.  This exception is
                 deprecated.

    Unpickling invokes function(*arguments) to create an initial object,
    called obj below.  If the remaining items are left off, that's the
    end of unpickling for this object and obj is the result.  Else obj
    is modified at unpickling time by each item specified, as follows.

    state        Optional.
                 Additional state.  If this is not None, the state is
                 pickled, and obj.__setstate__(state) will be called
                 when unpickling.  If no __setstate__ method is
                 defined, a default implementation is provided, which
                 assumes that state is a dictionary mapping instance
                 variable names to their values.  The default
                 implementation calls

                     obj.__dict__.update(state)

                 or, if the update() call fails,

                     for k, v in state.items():
                         setattr(obj, k, v)

    listitems    Optional, and new in this PEP.
                 If this is not None, it should be an iterator (not a
                 sequence!) yielding successive list items.  These list
                 items will be pickled, and appended to the object using
                 either obj.append(item) or obj.extend(list_of_items).
                 This is primarily used for list subclasses, but may
                 be used by other classes as long as they have append()
                 and extend() methods with the appropriate signature.
                 (Whether append() or extend() is used depends on which
                 pickle protocol version is used as well as the number
                 of items to append, so both must be supported.)

    dictitems    Optional, and new in this PEP.
                 If this is not None, it should be an iterator (not a
                 sequence!) yielding successive dictionary items, which
                 should be tuples of the form (key, value).  These items
                 will be pickled, and stored to the object using
                 obj[key] = value.  This is primarily used for dict
                 subclasses, but may be used by other classes as long
                 as they implement __setitem__.
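
    The five-item tuple described above can be sketched in modern
    Python 3 terms.  TaggedList and its tag attribute are hypothetical,
    used only to illustrate the listitems slot:

```python
import pickle

class TaggedList(list):
    """Hypothetical list subclass with one extra attribute."""
    def __init__(self, tag="", items=()):
        super().__init__(items)
        self.tag = tag

    def __reduce__(self):
        # (callable, args, state, listitems, dictitems): the tag travels
        # in the constructor args, the list items travel through the
        # listitems iterator and are re-appended at unpickling time.
        return (TaggedList, (self.tag,), None, iter(self), None)

t = pickle.loads(pickle.dumps(TaggedList("demo", [1, 2, 3])))
print(list(t), t.tag)
```

    Because state is None, no __setstate__ call is made; the items are
    restored purely through append()/extend().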

    Note: in Python 2.2 and before, when using cPickle, state would be
    pickled if present even if it is None; the only safe way to avoid
    the __setstate__ call was to return a two-tuple from __reduce__.
    (But pickle.py would not pickle state if it was None.)  In Python
    2.3, __setstate__ will never be called at unpickling time when
    __reduce__ returns a state with value None at pickling time.

    A __reduce__ implementation that needs to work both under Python
    2.2 and under Python 2.3 could check the variable
    pickle.format_version to determine whether to use the listitems
    and dictitems features.  If this value is >= "2.0" then they are
    supported.  If not, any list or dict items should be incorporated
    somehow in the 'state' return value, and the __setstate__ method
    should be prepared to accept list or dict items as part of the
    state (how this is done is up to the application).


The __reduce_ex__ API

    It is sometimes useful to know the protocol version when
    implementing __reduce__.  This can be done by implementing a
    method named __reduce_ex__ instead of __reduce__.  __reduce_ex__,
    when it exists, is called in preference over __reduce__ (you may
    still provide __reduce__ for backwards compatibility).  The
    __reduce_ex__ method will be called with a single integer
    argument, the protocol version.

    The 'object' class implements both __reduce__ and __reduce_ex__;
    however, if a subclass overrides __reduce__ but not __reduce_ex__,
    the __reduce_ex__ implementation detects this and calls
    __reduce__.
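
    A minimal Python 3 sketch showing that __reduce_ex__ receives the
    protocol version as its argument (the Recorder class is
    hypothetical):

```python
import pickle

class Recorder:
    """Records the protocol argument each time it is pickled."""
    seen = []

    def __reduce_ex__(self, protocol):
        Recorder.seen.append(protocol)   # the version is passed in
        return (Recorder, ())

r = Recorder()
pickle.dumps(r, 0)
pickle.dumps(r, 2)
print(Recorder.seen)   # one entry per dumps() call, in order
```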


Customizing pickling absent a __reduce__ implementation

    If no __reduce__ implementation is available for a particular
    class, there are three cases that need to be considered
    separately, because they are handled differently:

    1. classic class instances, all protocols

    2. new-style class instances, protocols 0 and 1

    3. new-style class instances, protocol 2

    Types implemented in C are considered new-style classes.  However,
    except for the common built-in types, these need to provide a
    __reduce__ implementation in order to be picklable with protocols
    0 or 1.  Protocol 2 supports built-in types providing
    __getnewargs__, __getstate__ and __setstate__ as well.


Case 1: pickling classic class instances

    This case is the same for all protocols, and is unchanged from
    Python 2.1.

    For classic classes, __reduce__ is not used.  Instead, classic
    classes can customize their pickling by providing methods named
    __getstate__, __setstate__ and __getinitargs__.  Absent these, a
    default pickling strategy for classic class instances is
    implemented that works as long as all instance variables are
    picklable.  This default strategy is documented in terms of
    default implementations of __getstate__ and __setstate__.

    The primary way to customize pickling of classic class instances
    is by specifying __getstate__ and/or __setstate__ methods.  It is
    fine if a class implements one of these but not the other, as long
    as it is compatible with the default version.

    The __getstate__ method

      The __getstate__ method should return a picklable value
      representing the object's state without referencing the object
      itself.  If no __getstate__ method exists, a default
      implementation is used that returns self.__dict__.

    The __setstate__ method

      The __setstate__ method should take one argument; it will be
      called with the value returned by __getstate__ (or its default
      implementation).

      If no __setstate__ method exists, a default implementation is
      provided that assumes the state is a dictionary mapping instance
      variable names to values.  The default implementation tries two
      things:

      - First, it tries to call self.__dict__.update(state).

      - If the update() call fails with a RuntimeError exception, it
        calls setattr(self, key, value) for each (key, value) pair in
        the state dictionary.  This only happens when unpickling in
        restricted execution mode (see the rexec standard library
        module).

    The __getinitargs__ method

      The __setstate__ method (or its default implementation) requires
      that a new object already exists so that its __setstate__ method
      can be called.  The point is to create a new object that isn't
      fully initialized; in particular, the class's __init__ method
      should not be called if possible.

      These are the possibilities:

      - Normally, the following trick is used: create an instance of a
        trivial classic class (one without any methods or instance
        variables) and then use __class__ assignment to change its
        class to the desired class.  This creates an instance of the
        desired class with an empty __dict__ whose __init__ has not
        been called.

      - However, if the class has a method named __getinitargs__, the
        above trick is not used, and a class instance is created by
        using the tuple returned by __getinitargs__ as an argument
        list to the class constructor.  This is done even if
        __getinitargs__ returns an empty tuple -- a __getinitargs__
        method that returns () is not equivalent to not having
        __getinitargs__ at all.  __getinitargs__ *must* return a
        tuple.

      - In restricted execution mode, the trick from the first bullet
        doesn't work; in this case, the class constructor is called
        with an empty argument list if no __getinitargs__ method
        exists.  This means that in order for a classic class to be
        unpicklable in restricted execution mode, it must either
        implement __getinitargs__ or its constructor (i.e., its
        __init__ method) must be callable without arguments.


Case 2: pickling new-style class instances using protocols 0 or 1

    This case is unchanged from Python 2.2.  For better pickling of
    new-style class instances when backwards compatibility is not an
    issue, protocol 2 should be used; see case 3 below.

    New-style classes, whether implemented in C or in Python, inherit
    a default __reduce__ implementation from the universal base class
    'object'.

    This default __reduce__ implementation is not used for those
    built-in types for which the pickle module has built-in support.
    Here's a full list of those types:

    - Concrete built-in types: NoneType, bool, int, float, complex,
      str, unicode, tuple, list, dict.  (Complex is supported by
      virtue of a __reduce__ implementation registered in copy_reg.)
      In Jython, PyStringMap is also included in this list.

    - Classic instances.

    - Classic class objects, Python function objects, built-in
      function and method objects, and new-style type objects (==
      new-style class objects).  These are pickled by name, not by
      value: at unpickling time, a reference to an object with the
      same name (the fully qualified module name plus the variable
      name in that module) is substituted.

    The default __reduce__ implementation will fail at pickling time
    for built-in types not mentioned above, and for new-style classes
    implemented in C:  if they want to be picklable, they must supply
    a custom __reduce__ implementation under protocols 0 and 1.

    For new-style classes implemented in Python, the default
    __reduce__ implementation (copy_reg._reduce) works as follows:

    Let D be the class of the object to be pickled.  First, find the
    nearest base class of D that is implemented in C (either as a
    built-in type or as a type defined by an extension class), and
    call this base class B.
    Unless B is the class 'object', instances of class B must be
    picklable, either by having built-in support (as defined in the
    above three bullet points), or by having a non-default
    __reduce__ implementation.  B must not be the same class as D
    (if it were, it would mean that D is not implemented in Python).

    The callable produced by the default __reduce__ is
    copy_reg._reconstructor, and its arguments tuple is
    (D, B, basestate), where basestate is None if B is the builtin
    object class, and basestate is

        basestate = B(obj)

    if B is not the builtin object class.  This is geared toward
    pickling subclasses of builtin types, where, for example,
    list(some_list_subclass_instance) produces "the list part" of
    the list subclass instance.

    The object is recreated at unpickling time by
    copy_reg._reconstructor, like so:

        obj = B.__new__(D, basestate)
        B.__init__(obj, basestate)
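
    This path can still be observed in Python 3, where the module is
    named copyreg; _reconstructor is a private helper, inspected here
    only for illustration:

```python
import copyreg

class MyList(list):
    pass

# Protocols 0 and 1 still go through the default __reduce__ path.
r = MyList([1, 2]).__reduce_ex__(1)
print(r[0] is copyreg._reconstructor)   # the default unpickling callable
print(r[1])                             # (D, B, basestate)
```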

    Objects using the default __reduce__ implementation can customize
    it by defining __getstate__ and/or __setstate__ methods.  These
    work almost the same as described for classic classes above, except
    that if __getstate__ returns an object (of any type) whose value is
    considered false (e.g. None, or a number that is zero, or an empty
    sequence or mapping), this state is not pickled and __setstate__
    will not be called at all.  If __getstate__ exists and returns a
    true value, that value becomes the third element of the tuple
    returned by the default __reduce__, and at unpickling time the
    value is passed to __setstate__.  If __getstate__ does not exist,
    but obj.__dict__ exists, then obj.__dict__ becomes the third
    element of the tuple returned by __reduce__, and again at
    unpickling time the value is passed to obj.__setstate__.  The
    default __setstate__ is the same as that for classic classes,
    described above.
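
    A sketch, in modern Python 3 spelling, of the usual
    __getstate__/__setstate__ pattern (the Conn class and its
    attributes are hypothetical; the lambda merely stands in for an
    unpicklable resource):

```python
import pickle

class Conn:
    """Drops an unpicklable attribute and rebuilds it on load."""
    def __init__(self, host):
        self.host = host
        self.sock = lambda: None       # stand-in for an unpicklable resource

    def __getstate__(self):
        state = self.__dict__.copy()
        del state["sock"]              # a true (non-empty) state is pickled
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)
        self.sock = lambda: None       # re-create the resource

c = pickle.loads(pickle.dumps(Conn("db.example")))
print(c.host)
```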

    Note that this strategy ignores slots.  Instances of new-style
    classes that have slots but no __getstate__ method cannot be
    pickled by protocols 0 and 1; the code explicitly checks for
    this condition.

    Note that pickling new-style class instances ignores
    __getinitargs__ if it exists, under all protocols.  __getinitargs__
    is useful only for classic classes.


Case 3: pickling new-style class instances using protocol 2

    Under protocol 2, the default __reduce__ implementation inherited
    from the 'object' base class is *ignored*.  Instead, a different
    default implementation is used, which allows more efficient
    pickling of new-style class instances than possible with protocols
    0 or 1, at the cost of backward incompatibility with Python 2.2
    (meaning no more than that a protocol 2 pickle cannot be unpickled
    before Python 2.3).

    The customization uses three special methods: __getstate__,
    __setstate__ and __getnewargs__ (note that __getinitargs__ is again
    ignored).  It is fine if a class implements one or more but not all
    of these, as long as it is compatible with the default
    implementations.

    The __getstate__ method

      The __getstate__ method should return a picklable value
      representing the object's state without referencing the object
      itself.  If no __getstate__ method exists, a default
      implementation is used which is described below.

      There's a subtle difference between classic and new-style
      classes here: if a classic class's __getstate__ returns None,
      self.__setstate__(None) will be called as part of unpickling.
      But if a new-style class's __getstate__ returns None, its
      __setstate__ won't be called at all as part of unpickling.

      If no __getstate__ method exists, a default state is computed.
      There are several cases:

      - For a new-style class that has no instance __dict__ and no
        __slots__, the default state is None.

      - For a new-style class that has an instance __dict__ and no
        __slots__, the default state is self.__dict__.

      - For a new-style class that has an instance __dict__ and
        __slots__, the default state is a tuple consisting of two
        dictionaries:  self.__dict__, and a dictionary mapping slot
        names to slot values.  Only slots that have a value are
        included in the latter.

      - For a new-style class that has __slots__ and no instance
        __dict__, the default state is a tuple whose first item is
        None and whose second item is a dictionary mapping slot names
        to slot values described in the previous bullet.
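
    The (dict-or-None, slot-dict) shape can be inspected in Python 3
    via the third item of the protocol-2 reduce tuple (Slotted is a
    hypothetical class; note that the unset slot is omitted):

```python
class Slotted:
    __slots__ = ("x", "y")

s = Slotted()
s.x = 1                         # leave y unset

state = s.__reduce_ex__(2)[2]   # third item: the default state
print(state)                    # first item None: no instance __dict__
```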

    The __setstate__ method

      The __setstate__ method should take one argument; it will be
      called with the value returned by __getstate__ or with the
      default state described above if no __getstate__ method is
      defined.

      If no __setstate__ method exists, a default implementation is
      provided that can handle the state returned by the default
      __getstate__, described above.

    The __getnewargs__ method

      Like for classic classes, the __setstate__ method (or its
      default implementation) requires that a new object already
      exists so that its __setstate__ method can be called.

      In protocol 2, a new pickling opcode is used that causes a new
      object to be created as follows:

        obj = C.__new__(C, *args)

      where C is the class of the pickled object, and args is either
      the empty tuple, or the tuple returned by the __getnewargs__
      method, if defined.  __getnewargs__ must return a tuple.  The
      absence of a __getnewargs__ method is equivalent to the existence
      of one that returns ().
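
    A Python 3 sketch of why __getnewargs__ is needed when __new__
    takes a mandatory argument (the Interned class is hypothetical):

```python
import pickle

class Interned:
    def __new__(cls, name):
        obj = super().__new__(cls)
        obj.name = name
        return obj

    def __getnewargs__(self):
        # Without this, unpickling would call Interned.__new__(Interned)
        # with no arguments and fail, since 'name' is mandatory.
        return (self.name,)

i = pickle.loads(pickle.dumps(Interned("spam"), 2))
print(i.name)
```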


The __newobj__ unpickling function

    When the unpickling function returned by __reduce__ (the first
    item of the returned tuple) has the name __newobj__, something
    special happens for pickle protocol 2.  An unpickling function
    named __newobj__ is assumed to have the following semantics:

      def __newobj__(cls, *args):
          return cls.__new__(cls, *args)

    Pickle protocol 2 special-cases an unpickling function with this
    name, and emits a pickling opcode that, given 'cls' and 'args',
    will return cls.__new__(cls, *args) without also pickling a
    reference to __newobj__ (this is the same pickling opcode used by
    protocol 2 for a new-style class instance when no __reduce__
    implementation exists).  This is the main reason why protocol 2
    pickles are much smaller than classic pickles.  Of course, the
    pickling code cannot verify that a function named __newobj__
    actually has the expected semantics.  If you use an unpickling
    function named __newobj__ that returns something different, you
    deserve what you get.

    It is safe to use this feature under Python 2.2; there's nothing
    in the recommended implementation of __newobj__ that depends on
    Python 2.3.
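
    In Python 3 (where copy_reg became copyreg) the default protocol-2
    reduce tuple indeed names this function; a quick check, for
    illustration:

```python
import copyreg

class C:
    pass

# First item of the protocol-2 reduce tuple is the unpickling callable.
func = C().__reduce_ex__(2)[0]
print(func.__name__)                # special-cased by name
print(func is copyreg.__newobj__)
```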


The extension registry

    Protocol 2 supports a new mechanism to reduce the size of pickles.

    When class instances (classic or new-style) are pickled, the full
    name of the class (module name including package name, and class
    name) is included in the pickle.  Especially for applications that
    generate many small pickles, this is a lot of overhead that has to
    be repeated in each pickle.  For large pickles, when using
    protocol 1, repeated references to the same class name are
    compressed using the "memo" feature; but each class name must be
    spelled in full at least once per pickle, and this causes a lot of
    overhead for small pickles.

    The extension registry allows one to represent the most frequently
    used names by small integers, which are pickled very efficiently:
    an extension code in the range 1-255 requires only two bytes
    including the opcode, one in the range 256-65535 requires only
    three bytes including the opcode.

    One of the design goals of the pickle protocol is to make pickles
    "context-free": as long as you have installed the modules
    containing the classes referenced by a pickle, you can unpickle
    it, without needing to import any of those classes ahead of time.

    Unbridled use of extension codes could jeopardize this desirable
    property of pickles.  Therefore, the main use of extension codes
    is reserved for a set of codes to be standardized by some
    standard-setting body.  This being Python, the standard-setting
    body is the PSF.  From time to time, the PSF will decide on a
    table mapping extension codes to class names (or occasionally
    names of other global objects; functions are also eligible).  This
    table will be incorporated in the next Python release(s).

    However, for some applications, like Zope, context-free pickles
    are not a requirement, and waiting for the PSF to standardize
    some codes may not be practical.  Two solutions are offered for
    such applications.

    First, a few ranges of extension codes are reserved for private
    use.  Any application can register codes in these ranges.
    Two applications exchanging pickles using codes in these ranges
    need to have some out-of-band mechanism to agree on the mapping
    between extension codes and names.

    Second, some large Python projects (e.g. Zope) can be assigned a
    range of extension codes outside the "private use" range that they
    can assign as they see fit.

    The extension registry is defined as a mapping between extension
    codes and names.  When an extension code is unpickled, it produces
    an object, obtained by interpreting the name as a module name
    followed by a class (or function) name.  The
    mapping from names to objects is cached.  It is quite possible
    that certain names cannot be imported; that should not be a
    problem as long as no pickle containing a reference to such names
    has to be unpickled.  (The same issue already exists for direct
    references to such names in pickles that use protocols 0 or 1.)

    Here is the proposed initial assignment of extension code ranges:

      First  Last Count  Purpose

          0     0     1  Reserved -- will never be used
          1   127   127  Reserved for Python standard library
        128   191    64  Reserved for Zope
        192   239    48  Reserved for 3rd parties
        240   255    16  Reserved for private use (will never be assigned)
        256   MAX   MAX  Reserved for future assignment

    MAX stands for 2147483647, or 2**31-1.  This is a hard limitation
    of the protocol as currently defined.

    At the moment, no specific extension codes have been assigned yet.


Extension registry API

    The extension registry is maintained as private global variables
    in the copy_reg module.  The following three functions are defined
    in this module to manipulate the registry:

    add_extension(module, name, code)
        Register an extension code.  The module and name arguments
        must be strings; code must be an int in the inclusive range 1
        through MAX.  This must either register a new (module, name)
        pair to a new code, or be a redundant repeat of a previous
        call that was not canceled by a remove_extension() call; a
        (module, name) pair may not be mapped to more than one code,
        nor may a code be mapped to more than one (module, name)
        pair.  (XXX Aliasing may actually cause a problem for this
        requirement; we'll see as we go.)

    remove_extension(module, name, code)
        Arguments are as for add_extension().  Remove a previously
        registered mapping between (module, name) and code.

    clear_extension_cache()
        The implementation of extension codes may use a cache to speed
        up loading objects that are named frequently.  This cache can
        be emptied (removing references to cached objects) by calling
        this method.

    Note that the API does not enforce the standard range assignments.
    It is up to applications to respect these.
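
    These three functions survive unchanged in Python 3's copyreg
    module; a sketch using a code from the private-use range (the
    module and class names are hypothetical):

```python
import copyreg

# Register a (module, name) pair under a private-use code (240-255).
copyreg.add_extension("mypkg.mymod", "MyClass", 240)
# A redundant repeat of the same registration is allowed:
copyreg.add_extension("mypkg.mymod", "MyClass", 240)

# Remove the mapping again and drop any cached objects.
copyreg.remove_extension("mypkg.mymod", "MyClass", 240)
copyreg.clear_extension_cache()
```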


The copy module

    Traditionally, the copy module has supported an extended subset of
    the pickling APIs for customizing the copy() and deepcopy()
    operations.

    In particular, besides checking for a __copy__ or __deepcopy__
    method, copy() and deepcopy() have always looked for __reduce__,
    and for classic classes, have looked for __getinitargs__,
    __getstate__ and __setstate__.

    In Python 2.2, the default __reduce__ inherited from 'object' made
    copying simple new-style classes possible, but slots and various
    other special cases were not covered.

    In Python 2.3, several changes are made to the copy module:

    - __reduce_ex__ is supported (and always called with 2 as the
      protocol version argument).

    - The four- and five-argument return values of __reduce__ are
      supported.

    - Before looking for a __reduce__ method, the
      copy_reg.dispatch_table is consulted, just like for pickling.

    - When the __reduce__ method is inherited from object, it is
      (unconditionally) replaced by a better one that uses the same
      APIs as pickle protocol 2: __getnewargs__, __getstate__, and
      __setstate__, handling list and dict subclasses, and handling
      slots.

    As a consequence of the latter change, certain new-style classes
    that were copyable under Python 2.2 are not copyable under Python
    2.3.  (These classes are also not picklable using pickle protocol
    2.)  A minimal example of such a class:

        class C(object):
            def __new__(cls, a):
                return object.__new__(cls)

    The problem only occurs when __new__ is overridden and has at
    least one mandatory argument in addition to the class argument.

    To fix this, a __getnewargs__ method should be added that returns
    the appropriate argument tuple (excluding the class).
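
    A Python 3 sketch of that fix, extending the minimal example above
    with the suggested __getnewargs__ method:

```python
import copy

class C:
    def __new__(cls, a):
        obj = object.__new__(cls)
        obj.a = a
        return obj

    def __getnewargs__(self):
        return (self.a,)        # the mandatory argument, minus the class

c = copy.copy(C(3))
print(c.a)
```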


Pickling Python longs

    Pickling and unpickling Python longs takes time quadratic in
    the number of digits, in protocols 0 and 1.  Under protocol 2,
    new opcodes support linear-time pickling and unpickling of longs.
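
    The new encoding can be seen in Python 3's opcode stream (Python 3
    ints correspond to the longs discussed here):

```python
import pickle
import pickletools

# LONG1 stores the long as a length-prefixed little-endian byte string.
ops = [op.name for op, _, _ in pickletools.genops(pickle.dumps(10**50, 2))]
print(ops)
```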


Pickling bools

    Protocol 2 introduces new opcodes for pickling True and False
    directly.  Under protocols 0 and 1, bools are pickled as integers,
    using a trick in the representation of the integer in the pickle
    so that an unpickler can recognize that a bool was intended.  That
    trick consumed 4 bytes per bool pickled.  The new bool opcodes
    consume 1 byte per bool.
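
    Under Python 3 the one-byte encoding is easy to verify (0x88 is the
    NEWTRUE opcode):

```python
import pickle

p = pickle.dumps(True, 2)
print(p)   # PROTO 2, NEWTRUE, STOP: b'\x80\x02\x88.'
```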


Pickling small tuples

    Protocol 2 introduces new opcodes for more-compact pickling of
    tuples of lengths 1, 2 and 3.  Protocol 1 previously introduced
    an opcode for more-compact pickling of empty tuples.
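
    The compact tuple opcodes are visible in the opcode stream under
    Python 3:

```python
import pickle
import pickletools

# TUPLE2 builds the two-item tuple in a single opcode.
ops = [op.name for op, _, _ in pickletools.genops(pickle.dumps((1, 2), 2))]
print(ops)
```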


Protocol identification

    Protocol 2 introduces a new opcode, with which all protocol 2
    pickles begin, identifying that the pickle is protocol 2.
    Attempting to unpickle a protocol 2 pickle under older versions
    of Python will therefore raise an "unknown opcode" exception
    immediately.
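
    The leading PROTO opcode can be observed directly under Python 3:

```python
import pickle

p = pickle.dumps([1], 2)
print(p[:2])   # PROTO opcode (0x80) followed by the protocol number
```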


Pickling of large lists and dicts

    Protocol 1 pickles large lists and dicts "in one piece", which
    minimizes pickle size, but requires that unpickling create a temp
    object as large as the object being unpickled.  Part of the
    protocol 2 changes break large lists and dicts into pieces of no
    more than 1000 elements each, so that unpickling needn't create
    a temp object larger than needed to hold 1000 elements.  This
    isn't part of protocol 2, however:  the opcodes produced are still
    part of protocol 1.  __reduce__ implementations that return the
    optional new listitems or dictitems iterators also benefit from
    this unpickling temp-space optimization.


Copyright

    This document has been placed in the public domain.



pep-0308 Conditional Expressions

PEP: 308
Title: Conditional Expressions
Version: $Revision$
Last-Modified: $Date$
Author: Guido van Rossum, Raymond Hettinger
Status: Final
Type: Standards Track
Content-Type: text/plain
Created: 7-Feb-2003
Post-History: 7-Feb-2003, 11-Feb-2003

Adding a conditional expression

    On 9/29/2005, Guido decided to add conditional expressions in the
    form of "X if C else Y". [1]

    The motivating use case was the prevalence of error-prone attempts
    to achieve the same effect using "and" and "or". [2]
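
    The pitfall with the "and"/"or" emulation is that it silently
    misbehaves when the true-branch value is itself false, which the
    conditional expression avoids:

```python
x = 0
broken = True and x or 99   # x is false, so this yields 99, not 0
fixed = x if True else 99   # yields 0, as intended
print(broken, fixed)
```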
    
    Previous community efforts to add a conditional expression were
    stymied by a lack of consensus on the best syntax.  That issue was
    resolved by simply deferring to a BDFL best judgment call.

    The decision was validated by reviewing how the syntax fared when
    applied throughout the standard library (this review approximates a
    sampling of real-world use cases, across a variety of applications,
    written by a number of programmers with diverse backgrounds). [3]

    The following change will be made to the grammar.  (The or_test
    symbol is new, the others are modified.)

        test: or_test ['if' or_test 'else' test] | lambdef
        or_test: and_test ('or' and_test)*
        ...
        testlist_safe: or_test [(',' or_test)+ [',']]
        ...
        gen_for: 'for' exprlist 'in' or_test [gen_iter]

    The new syntax nearly introduced a minor syntactical backwards
    incompatibility.  In previous Python versions, the following is
    legal:

        [f for f in lambda x: x, lambda x: x**2 if f(1) == 1]

    (I.e. a list comprehension where the sequence following 'in' is an
    unparenthesized series of lambdas -- or just one lambda, even.)

    In Python 3.0, the series of lambdas will have to be
    parenthesized, e.g.:

        [f for f in (lambda x: x, lambda x: x**2) if f(1) == 1]

    This is because lambda binds less tightly than the if-else
    expression, but in this context, the lambda could already be
    followed by an 'if' keyword that binds less tightly still (for
    details, consider the grammar changes shown above).

    However, in Python 2.5, a slightly different grammar is used that
    is more backwards compatible, but constrains the grammar of a
    lambda used in this position by forbidding the lambda's body to
    contain an unparenthesized conditional expression.  Examples:

        [f for f in (1, lambda x: x if x >= 0 else -1)]    # OK
        [f for f in 1, (lambda x: x if x >= 0 else -1)]    # OK
        [f for f in 1, lambda x: (x if x >= 0 else -1)]    # OK
        [f for f in 1, lambda x: x if x >= 0 else -1]      # INVALID


References

    [1] Pronouncement
        http://mail.python.org/pipermail/python-dev/2005-September/056846.html

    [2] Motivating use case:
        http://mail.python.org/pipermail/python-dev/2005-September/056546.html
        http://mail.python.org/pipermail/python-dev/2005-September/056510.html        

    [3] Review in the context of real-world code fragments:
        http://mail.python.org/pipermail/python-dev/2005-September/056803.html


Introduction to earlier draft of the PEP (kept for historical purposes)

    Requests for an if-then-else ("ternary") expression keep coming up
    on comp.lang.python.  This PEP contains a concrete proposal of a
    fairly Pythonic syntax.  This is the community's one chance: if
    this PEP is approved with a clear majority, it will be implemented
    in Python 2.4.  If not, the PEP will be augmented with a summary
    of the reasons for rejection and the subject better not come up
    again.  While the BDFL is co-author of this PEP, he is neither in
    favor nor against this proposal; it is up to the community to
    decide.  If the community can't decide, the BDFL will reject the
    PEP.

    After unprecedented community response (very good arguments were
    made both pro and con) this PEP has been revised with the help of
    Raymond Hettinger.  Without going through a complete revision
    history, the main changes are a different proposed syntax, an
    overview of proposed alternatives, the state of the current
    discussion, and a discussion of short-circuit behavior.

    Following the discussion, a vote was held.  While there was an overall
    interest in having some form of if-then-else expressions, no one
    format was able to draw majority support.  Accordingly, the PEP was
    rejected due to the lack of an overwhelming majority for change.
    Also, a Python design principle has been to prefer the status quo
    whenever there are doubts about which path to take.


Proposal

    The proposed syntax is as follows:

        (if <condition>: <expression1> else: <expression2>)

    Note that the enclosing parentheses are not optional.
    
    The resulting expression is evaluated like this:

    - First, <condition> is evaluated.

    - If <condition> is true, <expression1> is evaluated and is the
      result of the whole thing.

    - If <condition> is false, <expression2> is evaluated and is the
      result of the whole thing.

    A natural extension of this syntax is to allow one or more 'elif'
    parts:

      (if <cond1>: <expr1> elif <cond2>: <expr2> ... else: <exprN>)

    This will be implemented if the proposal is accepted.

    The downsides to the proposal are:

    * the required parentheses
    * confusability with statement syntax
    * additional semantic loading of colons

    Note that at most one of <expression1> and <expression2> is
    evaluated.  This is called a "short-circuit expression"; it is
    similar to the way the second operand of 'and' / 'or' is only
    evaluated if the first operand is true / false.

    A common way to emulate an if-then-else expression is:

        <condition> and <expression1> or <expression2>

    However, this doesn't work the same way: it returns <expression2>
    when <expression1> is false!  See FAQ 4.16 for alternatives that
    work -- however, they are pretty ugly and require much more effort
    to understand.


Alternatives

    Holger Krekel proposed a new, minimally invasive variant:

        <condition> and <expression1> else <expression2>

    The concept behind it is that a nearly complete ternary operator
    already exists with and/or and this proposal is the least invasive
    change that makes it complete.  Many respondents on the
    newsgroup found this to be the most pleasing alternative.
    However, a couple of respondents were able to post examples
    that were mentally difficult to parse.  Later it was pointed
    out that this construct works by having the "else" change the
    existing meaning of "and".

    As a result, there is increasing support for Christian Tismer's
    proposed variant of the same idea:

        <condition> then <expression1> else <expression2>

    The advantages are simple visual parsing, no required parentheses,
    no change in the semantics of existing keywords, less likelihood
    than the main proposal of being confused with statement syntax,
    and no further overloading of the colon.  The disadvantage is the
    implementation cost of introducing a new keyword.  However,
    unlike other new keywords, the word "then" seems unlikely to
    have been used as a name in existing programs.

    ---

    Many C-derived languages use this syntax:

        <condition> ? <expression1> : <expression2>

    Eric Raymond even implemented this.  The BDFL rejected this for
    several reasons: the colon already has many uses in Python (even
    though it would actually not be ambiguous, because the question
    mark requires a matching colon); for people not used to C-derived
    languages, it is hard to understand.

    ---

    The original version of this PEP proposed the following syntax:

        <expression1> if <condition> else <expression2>

    The out-of-order arrangement was found to be too uncomfortable
    for many of the participants in the discussion; especially when
    <expression1> is long, it's easy to miss the conditional while
    skimming.

    ---

    Some have suggested adding a new builtin instead of extending the
    syntax of the language.  For example:

        cond(<condition>, <expression1>, <expression2>)

    This won't work the way a syntax extension will because both
    expression1 and expression2 must be evaluated before the function
    is called.  There's no way to short-circuit the expression
    evaluation.  It could work if 'cond' (or some other name) were
    made a keyword, but that has all the disadvantages of adding a new
    keyword, plus confusing syntax: it *looks* like a function call so
    a casual reader might expect both <expression1> and <expression2>
    to be evaluated.
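    The evaluation-order point can be seen directly by writing cond()
    as an ordinary function (the names below are illustrative, not
    part of any proposal):

```python
calls = []

def cond(condition, true_value, false_value):
    # An ordinary function: by the time this body runs, *both*
    # argument expressions have already been evaluated by the caller.
    if condition:
        return true_value
    return false_value

def expensive(label):
    calls.append(label)      # side effect records each evaluation
    return label

result = cond(True, expensive("x"), expensive("y"))
assert result == "x"
# Both branches were evaluated, even though only one was needed:
assert calls == ["x", "y"]
```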


Summary of the Current State of the Discussion

    Groups are falling into one of three camps:

    1.  Adopt a ternary operator built using punctuation characters:

            <condition> ? <expression1> : <expression2>

    2.  Adopt a ternary operator built using new or existing keywords.
        The leading examples are:

            <condition> then <expression1> else <expression2>
            (if <condition>: <expression1> else: <expression2>) 

    3.  Do nothing.

    The first two positions are relatively similar.

    Some find that any form of punctuation makes the language more
    cryptic.  Others find that punctuation style is appropriate for
    expressions rather than statements and helps avoid a COBOL style:
    3 plus 4 times 5.

    Adapting existing keywords attempts to improve on punctuation
    through explicit meaning and a tidier appearance.  The downside
    is some loss of the economy-of-expression provided by punctuation
    operators.  The other downside is that it creates some degree of
    confusion between the two meanings and two usages of the keywords.

    Those difficulties are overcome by options which introduce new
    keywords which take more effort to implement.

    The last position is doing nothing.  Arguments in favor include
    keeping the language simple and concise; maintaining backwards
    compatibility; and that every use case can already be expressed
    in terms of "if" and "else".  Lambda expressions are an
    exception as they require the conditional to be factored out into
    a separate function definition.

    The arguments against doing nothing are that the other choices
    allow greater economy of expression and that current practices
    show a propensity for erroneous uses of "and", "or", or one of
    their more complex, visually unappealing workarounds.
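    The erroneous uses of "and" and "or" referred to here stem from
    the classic workaround idiom, sketched below with illustrative
    values:

```python
# The workaround "C and x or y" short-circuits like a real
# conditional would...
C = True
assert (C and "yes" or "no") == "yes"
assert (False and "yes" or "no") == "no"

# ...but it silently picks the wrong branch when the true-branch
# value is falsy:
x, y = [], ["fallback"]
assert (C and x or y) == ["fallback"]      # [] was the intended result

# The list-wrapping repair tolerates falsy values while keeping
# short-circuit evaluation, at a real cost in readability:
assert (C and [x] or [y])[0] == []
```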


Short-Circuit Behavior

    The principal difference between the ternary operator and the
    cond() function is that the latter provides an expression form but
    does not provide short-circuit evaluation.

    Short-circuit evaluation is desirable on three occasions:
                                                         
    1. When an expression has side-effects
    2. When one or both of the expressions are resource intensive
    3. When the condition serves as a guard for the validity of the
       expression.

    #  Example where all three reasons apply
    data = isinstance(source, file)  ?  source.readlines()
                                     :  source.split()

    1. readlines() moves the file pointer
    2. for long sources, both alternatives take time
    3. split() is only valid for strings and readlines() is only
       valid for file objects.
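    In today's Python the example can be written with the
    conditional-expression form that was eventually adopted (and with
    a hasattr() check standing in for the file type test):

```python
from io import StringIO

def get_lines(source):
    # The condition guards validity: only the branch that is valid
    # for this kind of source is ever evaluated.
    return source.readlines() if hasattr(source, "readlines") else source.split()

assert get_lines(StringIO("a\nb\n")) == ["a\n", "b\n"]
assert get_lines("a b") == ["a", "b"]
```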

    Supporters of a cond() function point out that the need for
    short-circuit evaluation is rare.  Scanning through existing code
    directories, they found that if/else did not occur often; and of
    those only a few contained expressions that could be helped by
    cond() or a ternary operator; and that most of those had no need
    for short-circuit evaluation.  Hence, cond() would suffice for
    most needs and would spare efforts to alter the syntax of the
    language.

    More supporting evidence comes from scans of C code bases which
    show that its ternary operator is used very rarely (as a percentage
    of lines of code).

    A counterpoint to that analysis is that the availability of a
    ternary operator helped the programmer in every case because it
    spared the need to search for side-effects.  Further, it would
    preclude errors arising from distant modifications which introduce
    side-effects.  The latter case has become more of a reality with
    the advent of properties where even attribute access can be given
    side-effects.
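    A minimal sketch of how a property turns plain attribute access
    into a side-effecting operation (the class and names are
    illustrative):

```python
class Telemetry:
    """Attribute access that quietly does extra work."""
    def __init__(self):
        self.reads = 0
        self._value = 42

    @property
    def value(self):
        self.reads += 1          # side effect on every access
        return self._value

t = Telemetry()
_ = t.value
_ = t.value
assert t.reads == 2   # two plain attribute reads modified state
```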

    The BDFL's position is that short-circuit behavior is essential
    for an if-then-else construct to be added to the language.


Detailed Results of Voting

    Votes rejecting all options:  82
    Votes with rank ordering:     436
                                  ---
    Total votes received:         518


            ACCEPT                  REJECT                  TOTAL
            ---------------------   ---------------------   -----
            Rank1   Rank2   Rank3   Rank1   Rank2   Rank3
    Letter  
    A       51      33      19      18      20      20      161
    B       45      46      21      9       24      23      168
    C       94      54      29      20      20      18      235
    D       71      40      31      5       28      31      206
    E       7       7       10              3       5       32
    F       14      19      10              7       17      67
    G       7       6       10      1       2       4       30
    H       20      22      17      4       10      25      98
    I       16      20      9       5       5       20      75
    J       6       17      5       1               10      39
    K       1               6               4       13      24
    L               1       2               3       3       9
    M       7       3       4       2       5       11      32
    N               2       3               4       2       11
    O       1       6       5       1       4       9       26
    P       5       3       6       1       5       7       27
    Q       18      7       15      6       5       11      62
    Z                                               1       1
            ---     ---     ---     ---     ---     ---     ----
    Total   363     286     202     73      149     230     1303
    RejectAll                       82      82      82      246
            ---     ---     ---     ---     ---     ---     ----
    Total   363     286     202     155     231     312     1549


    CHOICE KEY
    ----------
    A.  x if C else y
    B.  if C then x else y
    C.  (if C: x else: y)
    D.  C ? x : y
    E.  C ? x ! y
    F.  cond(C, x, y)
    G.  C ?? x || y
    H.  C then x else y
    I.  x when C else y
    J.  C ? x else y
    K.  C -> x else y
    L.  C -> (x, y)
    M.  [x if C else y]
    N.  ifelse C: x else y
    O.  <if C then x else y>
    P.  C and x else y
    Q.  any write-in vote


    Detail for write-in votes and their ranking:
    --------------------------------------------
    3:  Q reject y x C elsethenif
    2:  Q accept (C ? x ! y)
    3:  Q reject ...
    3:  Q accept  ? C : x : y
    3:  Q accept (x if C, y otherwise)
    3:  Q reject ...
    3:  Q reject NONE
    1:  Q accept   select : (<c1> : <val1>; [<cx> : <valx>; ]* elseval)
    2:  Q reject if C: t else: f
    3:  Q accept C selects x else y
    2:  Q accept iff(C, x, y)    # "if-function"
    1:  Q accept (y, x)[C]
    1:  Q accept          C true: x false: y
    3:  Q accept          C then: x else: y
    3:  Q reject
    3:  Q accept (if C: x elif C2: y else: z)
    3:  Q accept C -> x : y
    1:  Q accept  x (if C), y
    1:  Q accept if c: x else: y
    3:  Q accept (c).{True:1, False:2}
    2:  Q accept if c: x else: y
    3:  Q accept (c).{True:1, False:2}
    3:  Q accept if C: x else y
    1:  Q accept  (x if C else y)
    1:  Q accept ifelse(C, x, y)
    2:  Q reject x or y <- C
    1:  Q accept (C ? x : y) required parens
    1:  Q accept  iif(C, x, y)
    1:  Q accept ?(C, x, y)
    1:  Q accept switch-case
    2:  Q accept multi-line if/else
    1:  Q accept C: x else: y
    2:  Q accept (C): x else: y
    3:  Q accept if C: x else: y
    1:  Q accept     x if C, else y
    1:  Q reject choice: c1->a; c2->b; ...; z
    3:  Q accept [if C then x else y]
    3:  Q reject no other choice has x as the first element
    1:  Q accept (x,y) ? C
    3:  Q accept x if C else y (The "else y" being optional)
    1:  Q accept (C ? x , y)
    1:  Q accept  any outcome (i.e form or plain rejection) from a usability study
    1:  Q reject (x if C else y)
    1:  Q accept  (x if C else y)
    2:  Q reject   NONE
    3:  Q reject   NONE
    3:  Q accept  (C ? x else y)
    3:  Q accept  x when C else y
    2:  Q accept  (x if C else y)
    2:  Q accept cond(C1, x1, C2, x2, C3, x3,...)
    1:  Q accept  (if C1: x elif C2: y else: z)
    1:  Q reject cond(C, :x, :y)
    3:  Q accept  (C and [x] or [y])[0]
    2:  Q reject
    3:  Q reject
    3:  Q reject all else
    1:  Q reject no-change
    3:  Q reject deliberately omitted as I have no interest in any other proposal
    2:  Q reject (C then x else Y)
    1:  Q accept       if C: x else: y
    1:  Q reject (if C then x else y)
    3:  Q reject C?(x, y)


Copyright

    This document has been placed in the public domain.



pep-0309 Partial Function Application

PEP:309
Title:Partial Function Application
Version:$Revision$
Last-Modified:$Date$
Author:Peter Harris <scav at blueyonder.co.uk>
Status:Final
Type:Standards Track
Content-Type:text/x-rst
Created:08-Feb-2003
Python-Version:2.5
Post-History:10-Feb-2003, 27-Feb-2003, 22-Feb-2004, 28-Apr-2006

Note

Following the acceptance of this PEP, further discussion on python-dev and comp.lang.python revealed a desire for several tools that operated on function objects, but were not related to functional programming. Rather than create a new module for these tools, it was agreed [1] that the "functional" module be renamed to "functools" to reflect its newly-widened focus.

References in this PEP to a "functional" module have been left in for historical reasons.

Abstract

This proposal is for a function or callable class that allows a new callable to be constructed from a callable and a partial argument list (including positional and keyword arguments).

I propose a standard library module called "functional", to hold useful higher-order functions, including the implementation of partial().

An implementation has been submitted to SourceForge [2].

Acceptance

Patch #941881 was accepted and applied in 2005 for Py2.5. It is essentially as outlined here, a partial() type constructor binding leftmost positional arguments and any keywords. The partial object has three read-only attributes func, args, and keywords. Calls to the partial object can specify keywords that override those in the object itself.
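The accepted behaviour can be exercised directly against today's standard library, using the functools name noted above rather than "functional" (the tag function is illustrative):

```python
from functools import partial
from operator import add

inc = partial(add, 1)
assert inc(41) == 42

# The three read-only attributes described above:
assert inc.func is add
assert inc.args == (1,)
assert inc.keywords == {}

# Keywords given at call time override those stored in the object:
def tag(text, colour="black"):
    return (text, colour)

blue = partial(tag, colour="blue")
assert blue("hi") == ("hi", "blue")
assert blue("hi", colour="red") == ("hi", "red")
```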

There is a separate and continuing discussion of whether to modify the partial implementation with a __get__ method to more closely emulate the behavior of an equivalent function.

Motivation

In functional programming, function currying is a way of implementing multi-argument functions in terms of single-argument functions. A function with N arguments is really a function with 1 argument that returns another function taking (N-1) arguments. Function application in languages like Haskell and ML works such that a function call:

f x y z

actually means:

(((f x) y) z)

This would be only an obscure theoretical issue except that in actual programming it turns out to be very useful. Expressing a function in terms of partial application of arguments to another function can be both elegant and powerful, and in functional languages it is heavily used.
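The Haskell-style application above can be imitated in Python with nested single-argument functions (a sketch for illustration only):

```python
def f(x):
    # A three-argument function expressed as nested one-argument
    # functions, Haskell-style.
    def g(y):
        def h(z):
            return x + y + z
        return h
    return g

# (((f x) y) z) in Python spelling:
assert f(1)(2)(3) == 6

# Each intermediate application is a useful value in its own right:
add_three = f(1)(2)
assert add_three(10) == 13
```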

In some functional languages, (e.g. Miranda) you can use an expression such as (+1) to mean the equivalent of Python's (lambda x: x + 1).

In general, languages like that are strongly typed, so the compiler always knows the number of arguments expected and can do the right thing when presented with a functor and fewer arguments than expected.

Python does not implement multi-argument functions by currying, so if you want a function with partially-applied arguments you would probably use a lambda as above, or define a named function for each instance.

However, lambda syntax is not to everyone's taste, to say the least. Furthermore, Python's flexible parameter passing using both positional and keyword arguments presents an opportunity to generalise the idea of partial application and do things that lambda cannot.

Example Implementation

Here is one way to create a callable with partially-applied arguments in Python. The implementation below is based on improvements provided by Scott David Daniels:

class partial(object):

    def __init__(*args, **kw):
        # 'self' is taken positionally from args, so kw remains free to
        # carry a keyword argument literally named 'self' through to fn.
        self = args[0]
        self.fn, self.args, self.kw = (args[1], args[2:], kw)

    def __call__(self, *args, **kw):
        if kw and self.kw:
            d = self.kw.copy()
            d.update(kw)
        else:
            d = kw or self.kw
        return self.fn(*(self.args + args), **d)

(A recipe similar to this has been in the Python Cookbook for some time [3].)

Note that when the object is called as though it were a function, positional arguments are appended to those provided to the constructor, and keyword arguments override and augment those provided to the constructor.

Positional arguments, keyword arguments or both can be supplied both when creating the object and when calling it.
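Under those merging rules, creation-time and call-time arguments combine as follows (the class is restated compactly here so the example is self-contained; the report function and its values are illustrative):

```python
class partial(object):
    # Behaviourally equivalent to the implementation in this PEP,
    # restated compactly for a self-contained example.
    def __init__(self, fn, *args, **kw):
        self.fn, self.args, self.kw = fn, args, kw

    def __call__(self, *args, **kw):
        d = dict(self.kw, **kw)
        return self.fn(*(self.args + args), **d)

def report(a, b, unit="m", scale=1):
    return (a, b, unit, scale)

p = partial(report, 1, unit="km")
# Positional call arguments are appended after the stored ones:
assert p(2) == (1, 2, "km", 1)
# Call-time keywords override and augment the stored ones:
assert p(2, unit="mi", scale=3) == (1, 2, "mi", 3)
```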

Examples of Use

So partial(operator.add, 1) is a bit like (lambda x: 1 + x). Not an example where you see the benefits, of course.

Note too, that you could wrap a class in the same way, since classes themselves are callable factories for objects. So in some cases, rather than defining a subclass, you can specialise classes by partial application of the arguments to the constructor.

For example, partial(Tkinter.Label, fg='blue') makes Tkinter Labels that have a blue foreground by default.

Here's a simple example that uses partial application to construct callbacks for Tkinter widgets on the fly:

from Tkinter import Tk, Canvas, Button
import sys
from functional import partial

win = Tk()
c = Canvas(win,width=200,height=50)
c.pack()

for colour in sys.argv[1:]:
    b = Button(win, text=colour,
               command=partial(c.config, bg=colour))
    b.pack(side='left')

win.mainloop()

Abandoned Syntax Proposal

I originally suggested the syntax fn@(*args, **kw), meaning the same as partial(fn, *args, **kw).

The @ sign is used in some assembly languages to imply register indirection, and the use here is also a kind of indirection. f@(x) is not f(x), but a thing that becomes f(x) when you call it.

It was not well-received, so I have withdrawn this part of the proposal. In any case, @ has been taken for the new decorator syntax.

Feedback from comp.lang.python and python-dev

Among the opinions voiced were the following (which I summarise):

  • Lambda is good enough.
  • The @ syntax is ugly (unanimous).
  • It's really a curry rather than a closure. There is an almost identical implementation of a curry class on ActiveState's Python Cookbook.
  • A curry class would indeed be a useful addition to the standard library.
  • It isn't function currying, but partial application. Hence the name is now proposed to be partial().
  • It maybe isn't useful enough to be in the built-ins.
  • The idea of a module called functional was well received, and there are other things that belong there (for example function composition).
  • For completeness, another object that appends partial arguments after those supplied in the function call (maybe called rightcurry) has been suggested.

I agree that lambda is usually good enough, just not always. And I want the possibility of useful introspection and subclassing.

I disagree that @ is particularly ugly, but it may be that I'm just weird. We have dictionary, list and tuple literals neatly differentiated by special punctuation -- a way of directly expressing partially-applied function literals is not such a stretch. However, not one single person has said they like it, so as far as I'm concerned it's a dead parrot.

I concur with calling the class partial rather than curry or closure, so I have amended the proposal in this PEP accordingly. But not throughout: some incorrect references to 'curry' have been left in since that's where the discussion was at the time.

Partially applying arguments from the right, or inserting arguments at arbitrary positions creates its own problems, but pending discovery of a good implementation and non-confusing semantics, I don't think it should be ruled out.

Carl Banks posted an implementation as a real functional closure:

def curry(fn, *cargs, **ckwargs):
    def call_fn(*fargs, **fkwargs):
        d = ckwargs.copy()
        d.update(fkwargs)
        return fn(*(cargs + fargs), **d)
    return call_fn

which he assures me is more efficient.
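From the caller's side the closure version behaves the same way (reproduced here so the example is self-contained; the power function is illustrative):

```python
def curry(fn, *cargs, **ckwargs):
    # Carl Banks's closure form, as given above.
    def call_fn(*fargs, **fkwargs):
        d = ckwargs.copy()
        d.update(fkwargs)
        return fn(*(cargs + fargs), **d)
    return call_fn

def power(base, exponent=2):
    return base ** exponent

square = curry(power)
cube = curry(power, exponent=3)
assert square(4) == 16
assert cube(2) == 8
assert cube(2, exponent=1) == 2   # call-site keyword wins
```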

I also coded the class in Pyrex, to estimate how the performance might be improved by coding it in C:

cdef class curry:

    cdef object fn, args, kw

    def __init__(self, fn, *args, **kw):
        self.fn=fn
        self.args=args
        self.kw = kw

    def __call__(self, *args, **kw):
        if self.kw:        # from Python Cookbook version
            d = self.kw.copy()
            d.update(kw)
        else:
            d=kw
        return self.fn(*(self.args + args), **d)

The performance gain in Pyrex is less than 100% over the nested function implementation, since to be fully general it has to operate by Python API calls. For the same reason, a C implementation will be unlikely to be much faster, so the case for a built-in coded in C is not very strong.

Summary

I prefer that some means to partially-apply functions and other callables should be present in the standard library.

A standard library module functional should contain an implementation of partial, and any other higher-order functions the community want. Other functions that might belong there fall outside the scope of this PEP though.

Patches for the implementation, documentation and unit tests (SF patches 931005 [4], 931007 [5], and 931010 [6] respectively) have been submitted but not yet checked in.

A C implementation by Hye-Shik Chang has also been submitted, although it is not expected to be included until after the Python implementation has proven itself useful enough to be worth optimising.

pep-0310 Reliable Acquisition/Release Pairs

PEP: 310
Title: Reliable Acquisition/Release Pairs
Version: $Revision$
Last-Modified: $Date$
Author: Michael Hudson <mwh at python.net>, Paul Moore <p.f.moore at gmail.com>
Status: Rejected
Type: Standards Track
Content-Type: text/plain
Created: 18-Dec-2002
Python-Version: 2.4
Post-History: 

Abstract

    It would be nice to have a less typing-intense way of writing:

        the_lock.acquire()
        try:
            ....
        finally:
            the_lock.release()

    This PEP proposes a piece of syntax (a 'with' block) and a
    "small-i" interface that generalizes the above.

Pronouncement

    This PEP is rejected in favor of PEP 343.

Rationale

    One of the advantages of Python's exception handling philosophy is
    that it makes it harder to do the "wrong" thing (e.g. failing to
    check the return value of some system call).  Currently, this does
    not apply to resource cleanup.  The current syntax for acquisition
    and release of a resource (for example, a lock) is

        the_lock.acquire()
        try:
            ....
        finally:
            the_lock.release()

    This syntax separates the acquisition and release by a (possibly
    large) block of code, which makes it difficult to confirm "at a
    glance" that the code manages the resource correctly.  Another
    common error is to code the "acquire" call within the try block,
    which incorrectly releases the lock if the acquire fails.


Basic Syntax and Semantics

    The syntax of a 'with' statement is as follows:

        'with' [ var '=' ] expr ':'
            suite

    This statement is defined as being equivalent to the following
    sequence of statements:

        var = expr

        if hasattr(var, "__enter__"):
            var.__enter__()

        try:
            suite

        finally:
            var.__exit__()

    (The presence of an __exit__ method is *not* checked like that of
    __enter__ to ensure that using inappropriate objects in with:
    statements gives an error).

    If the variable is omitted, an unnamed object is allocated on the
    stack.  In that case, the suite has no access to the unnamed object.
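    The proposed expansion can be simulated in current Python as an
    ordinary function taking the suite as a callable (the class and
    names below are illustrative; note that the __exit__ proposed
    here takes no arguments, unlike the later PEP 343 protocol):

```python
class Resource:
    """A lock-like object implementing the proposed small-i interface."""
    def __init__(self):
        self.events = []

    def __enter__(self):
        self.events.append("enter")

    def __exit__(self):
        self.events.append("exit")

def run_with(var, suite):
    # The expansion given above, with the suite as a callable.
    if hasattr(var, "__enter__"):
        var.__enter__()
    try:
        suite()
    finally:
        var.__exit__()

r = Resource()
run_with(r, lambda: r.events.append("body"))
assert r.events == ["enter", "body", "exit"]

# __exit__ runs even when the suite raises:
r2 = Resource()
try:
    run_with(r2, lambda: 1 / 0)
except ZeroDivisionError:
    pass
assert r2.events == ["enter", "exit"]
```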


Possible Extensions

    A number of potential extensions to the basic syntax have been
    discussed on the Python Developers list.  None of these extensions
    are included in the solution proposed by this PEP.  In many cases,
    the arguments are nearly equally strong in both directions.  In
    such cases, the PEP has always chosen simplicity, simply because
    where extra power is needed, the existing try block is available.

    Multiple expressions

    One proposal was for allowing multiple expressions within one
    'with' statement.  The __enter__ methods would be called left to
    right, and the __exit__ methods right to left.  The advantage of
    doing so is that where more than one resource is being managed,
    nested 'with' statements will result in code drifting towards the
    right margin.  The solution to this problem is the same as for any
    other deep nesting - factor out some of the code into a separate
    function.  Furthermore, the question of what happens if one of the
    __exit__ methods raises an exception (should the other __exit__
    methods be called?) needs to be addressed.

    Exception handling

    An extension to the protocol to include an optional __except__
    handler, which is called when an exception is raised, and which
    can handle or re-raise the exception, has been suggested.  It is
    not at all clear that the semantics of this extension can be made
    precise and understandable.  For example, should the equivalent
    code be try ... except ... else if an exception handler is
    defined, and try ... finally if not?  How can this be determined
    at compile time, in general?  The alternative is to define the
    code as expanding to a try ... except inside a try ... finally.
    But this may not do the right thing in real life.

    The only use case identified for exception handling is with
    transactional processing (commit on a clean finish, and rollback
    on an exception).  This is probably just as easy to handle with a
    conventional try ... except ... else block, and so the PEP does
    not include any support for exception handlers.
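    The conventional spelling of that transactional pattern looks like
    this (the connection object is an illustrative stand-in):

```python
class Connection:
    """Illustrative stand-in for a database connection."""
    def __init__(self):
        self.log = []
    def commit(self):
        self.log.append("commit")
    def rollback(self):
        self.log.append("rollback")

def transact(conn, work):
    try:
        work()
    except Exception:
        conn.rollback()    # roll back on any failure...
        raise
    else:
        conn.commit()      # ...commit only on a clean finish

good = Connection()
transact(good, lambda: None)
assert good.log == ["commit"]

bad = Connection()
try:
    transact(bad, lambda: 1 / 0)
except ZeroDivisionError:
    pass
assert bad.log == ["rollback"]
```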


Implementation Notes

    There is a potential race condition in the code specified as
    equivalent to the with statement.  For example, if a
    KeyboardInterrupt exception is raised between the completion of
    the __enter__ method call and the start of the try block, the
    __exit__ method will not be called.  This can lead to resource
    leaks, or to deadlocks.  [XXX Guido has stated that he cares about
    this sort of race condition, and intends to write some C magic to
    handle them.  The implementation of the 'with' statement should
    copy this.]


Open Issues

    Should existing classes (for example, file-like objects and locks)
    gain appropriate __enter__ and __exit__ methods?  The obvious
    reason in favour is convenience (no adapter needed).  The argument
    against is that if built-in files have this but (say) StringIO
    does not, then code that uses "with" on a file object can't be
    reused with a StringIO object.  So __exit__ = close becomes a part
    of the "file-like object" protocol, which user-defined classes may
    need to support.

    The __enter__ hook may be unnecessary - for many use cases, an
    adapter class is needed and in that case, the work done by the
    __enter__ hook can just as easily be done in the __init__ hook.

    If a way of controlling object lifetimes explicitly was available,
    the function of the __exit__ hook could be taken over by the
    existing __del__ hook.  An email exchange[1] with a proponent of
    this approach left one of the authors even more convinced that
    it isn't the right idea...

    It has been suggested[2] that the "__exit__" method be called
    "close", or that a "close" method should be considered if no
    __exit__ method is found, to increase the "out-of-the-box utility"
    of the "with ..." construct.

    There are some similarities in concept between 'with ...' blocks
    and generators, which have led to proposals that for loops could
    implement the with block functionality[3].  While neat on some
    levels, we think that for loops should stick to being loops.


Alternative Ideas

    IEXEC: Holger Krekel -- generalised approach with XML-like syntax
                            (no URL found...)

    Holger has much more far-reaching ideas about "execution monitors"
    that are informed about details of control flow in the monitored
    block.  While interesting, these ideas could change the language
    in deep and subtle ways and as such belong to a different PEP.

    Any Smalltalk/Ruby anonymous block style extension obviously
    subsumes this one.

    PEP 319 is in the same area, but did not win support when aired on
    python-dev.


Backwards Compatibility

    This PEP proposes a new keyword, so the __future__ game will need
    to be played.


Cost of Adoption

    Those who claim the language is getting larger and more
    complicated have something else to complain about.  It's something
    else to teach.

    For the proposal to be useful, many file-like and lock-like
    classes in the standard library and other code will have to have

        __exit__ = close

    or similar added.


Cost of Non-Adoption

    Writing correct code continues to be more effort than writing
    incorrect code.


References

    There are various python-list and python-dev discussions that
    could be mentioned here.

    [1] Off-list conversation between Michael Hudson and Bill Soudan
        (made public with permission)
        http://starship.python.net/crew/mwh/pep310/

    [2] Samuele Pedroni on python-dev
        http://mail.python.org/pipermail/python-dev/2003-August/037795.html

    [3] Thread on python-dev with subject

        [Python-Dev] pre-PEP: Resource-Release Support for Generators

        starting at

        http://mail.python.org/pipermail/python-dev/2003-August/037803.html

Copyright

    This document has been placed in the public domain.



pep-0311 Simplified Global Interpreter Lock Acquisition for Extensions

PEP: 311
Title: Simplified Global Interpreter Lock Acquisition for Extensions
Version: $Revision$
Last-Modified: $Date$
Author: Mark Hammond <mhammond at skippinet.com.au>
Status: Final
Type: Standards Track
Content-Type: text/plain
Created: 05-Feb-2003
Post-History: 05-Feb-2003 14-Feb-2003 19-Apr-2003

Abstract

    This PEP proposes a simplified API for access to the Global
    Interpreter Lock (GIL) for Python extension modules.
    Specifically, it provides a solution for authors of complex
    multi-threaded extensions, where the current state of Python
    (i.e., the state of the GIL) is unknown.

    This PEP proposes a new API, for platforms built with threading
    support, to manage the Python thread state.  An implementation
    strategy is proposed, along with an initial, platform independent
    implementation.


Rationale

    The current Python interpreter state API is suitable for simple,
    single-threaded extensions, but quickly becomes incredibly complex
    for non-trivial, multi-threaded extensions.

    Currently Python provides two mechanisms for dealing with the GIL:

    - Py_BEGIN_ALLOW_THREADS and Py_END_ALLOW_THREADS macros.
      These macros are provided primarily to allow a simple Python
      extension that already owns the GIL to temporarily release it
      while making an "external" (i.e., non-Python), generally
      expensive, call.  Any existing Python threads that are blocked
      waiting for the GIL are then free to run.  While this is fine
      for extensions making calls from Python into the outside world,
      it is no help for extensions that need to make calls into Python
      when the thread state is unknown.

    - PyThreadState and PyInterpreterState APIs.
      These API functions allow an extension/embedded application to
      acquire the GIL, but suffer from a serious boot-strapping
      problem - they require you to know the state of the Python
      interpreter and of the GIL before they can be used.  One
      particular problem is for extension authors that need to deal
      with threads never before seen by Python, but need to call
      Python from this thread.  It is very difficult, delicate and
      error prone to author an extension where these "new" threads
      always know the exact state of the GIL, and therefore can
      reliably interact with this API.

    For these reasons, the question of how such extensions should
    interact with Python is quickly becoming a FAQ.  The main impetus
    for this PEP, a thread on python-dev [1], immediately identified
    the following projects with this exact issue:

    - The win32all extensions
    - Boost
    - ctypes
    - Python-GTK bindings
    - Uno
    - PyObjC
    - Mac toolbox
    - PyXPCOM

    Currently, there is no reasonable, portable solution to this
    problem, forcing each extension author to implement their own
    hand-rolled version.  Further, the problem is complex, meaning
    many implementations are likely to be incorrect, leading to a
    variety of problems that will often manifest simply as "Python has
    hung".

    While the biggest problem in the existing thread-state API is the
    lack of the ability to query the current state of the lock, it is
    felt that a more complete, simplified solution should be offered
    to extension authors.  Such a solution should encourage authors to
    provide error-free, complex extension modules that take full
    advantage of Python's threading mechanisms.


Limitations and Exclusions

    This proposal identifies a solution for extension authors with
    complex multi-threaded requirements, but that only require a
    single "PyInterpreterState".  There is no attempt to cater for
    extensions that require multiple interpreter states.  At the time
    of writing, no extension has been identified that requires
    multiple PyInterpreterStates, and indeed it is not clear if that
    facility works correctly in Python itself.

    This API will not perform automatic initialization of Python, or
    initialize Python for multi-threaded operation.  Extension authors
    must continue to call Py_Initialize(), and for multi-threaded
    applications, PyEval_InitThreads().  The reason for this is that
    the first thread to call PyEval_InitThreads() is nominated as the
    "main thread" by Python, and so forcing the extension author to
    specify the main thread (by forcing her to make this first call)
    removes ambiguity.  As Py_Initialize() must be called before
    PyEval_InitThreads(), and as both of these functions currently
    support being called multiple times, the burden this places on
    extension authors is considered reasonable.

    It is intended that this API be all that is necessary to acquire
    the Python GIL.  Apart from the existing, standard
    Py_BEGIN_ALLOW_THREADS and Py_END_ALLOW_THREADS macros, it is
    assumed that no additional thread state API functions will be used
    by the extension.  Extensions with such complicated requirements
    are free to continue to use the existing thread state API.


Proposal

    This proposal recommends a new API be added to Python to simplify
    the management of the GIL.  This API will be available on all
    platforms built with WITH_THREAD defined.

    The intent is that assuming Python has correctly been initialized,
    an extension author be able to use a small, well-defined "prologue 
    dance", at any time and on any thread, which will ensure Python 
    is ready to be used on that thread.  After the extension has 
    finished with Python, it must also perform an "epilogue dance" to 
    release any resources previously acquired.  Ideally, these dances 
    can be expressed in a single line.

    Specifically, the following new APIs are proposed:

    /* Ensure that the current thread is ready to call the Python
       C API, regardless of the current state of Python, or of its
       thread lock.  This may be called as many times as desired
       by a thread so long as each call is matched with a call to 
       PyGILState_Release().  In general, other thread-state APIs may 
       be used between _Ensure() and _Release() calls, so long as the 
       thread-state is restored to its previous state before the Release().
       For example, normal use of the Py_BEGIN_ALLOW_THREADS/
       Py_END_ALLOW_THREADS macros is acceptable.
    
       The return value is an opaque "handle" to the thread state when
       PyGILState_Ensure() was called, and must be passed to
       PyGILState_Release() to ensure Python is left in the same state. Even
       though recursive calls are allowed, these handles can *not* be 
       shared - each unique call to PyGILState_Ensure must save the handle 
       for its call to PyGILState_Release.
    
       When the function returns, the current thread will hold the GIL.
    
       Failure is a fatal error.
    */
    PyAPI_FUNC(PyGILState_STATE) PyGILState_Ensure(void);

    /* Release any resources previously acquired.  After this call, Python's
       state will be the same as it was prior to the corresponding
       PyGILState_Ensure call (but generally this state will be unknown to
       the caller, hence the use of the GILState API.)
    
       Every call to PyGILState_Ensure must be matched by a call to 
       PyGILState_Release on the same thread.
    */
    PyAPI_FUNC(void) PyGILState_Release(PyGILState_STATE);

    Common usage will be:

    void SomeCFunction(void)
    {
        /* ensure we hold the lock */
        PyGILState_STATE state = PyGILState_Ensure();
        /* Use the Python API */
        ...
        /* Restore the state of Python */
        PyGILState_Release(state);
    }


Design and Implementation

    The general operation of PyGILState_Ensure() will be:

    - Assert Python is initialized.
    - Get a PyThreadState for the current thread, creating and saving
      one if necessary.
    - Remember the current state of the GIL (owned/not owned by this
      thread).
    - If the current thread does not own the GIL, acquire it.
    - Increment a counter of how many calls to PyGILState_Ensure
      have been made on the current thread.
    - Return.

    The general operation of PyGILState_Release() will be:

    - Assert that the current thread holds the GIL.
    - If the saved state indicates the GIL was not owned before the
      matching Ensure call, release it.
    - Decrement the PyGILState_Ensure counter for the thread.
    - If the counter reaches 0:
      - release and delete the PyThreadState;
      - forget the PyThreadState as being owned by the thread.
    - Return.

    It is assumed that it is an error if two discrete PyThreadStates
    are used for a single thread.  Comments in pystate.h ("State
    unique per thread") support this view, although it is never
    directly stated.  Thus, this will require some implementation of
    Thread Local Storage.  Fortunately, a platform independent
    implementation of Thread Local Storage already exists in the
    Python source tree, in the SGI threading port.  This code will be
    integrated into the platform independent Python core, but in such
    a way that platforms can provide a more optimal implementation if
    desired.


Implementation

    An implementation of this proposal can be found at
    http://www.python.org/sf/684256


References

    [1] http://mail.python.org/pipermail/python-dev/2002-December/031424.html


Copyright

    This document has been placed in the public domain.



pep-0312 Simple Implicit Lambda

PEP: 312
Title: Simple Implicit Lambda
Version: $Revision$
Last-Modified: $Date$
Author: Roman Suzi <rnd at onego.ru>, Alex Martelli <aleaxit at gmail.com>
Status: Deferred
Type: Standards Track
Content-Type: text/plain
Created: 11-Feb-2003
Python-Version: 2.4
Post-History: 

Abstract

    This PEP proposes to make argumentless lambda keyword optional in
    some cases where it is not grammatically ambiguous.

Deferral

    The BDFL hates the unary colon syntax.  This PEP needs to go back
    to the drawing board and find a more Pythonic syntax (perhaps an
    alternative unary operator).  See python-dev discussion on
    17 June 2005.

    Also, it is probably a good idea to eliminate the alternative
    propositions which have no chance at all.  The examples section
    is good and highlights the readability improvements.  It would
    carry more weight with additional examples and with real-world
    referents (instead of the abstracted dummy calls to :A and :B).

Motivation

    Lambdas are useful for defining anonymous functions, e.g. for use
    as callbacks or (pseudo)-lazy evaluation schemes.  Often, lambdas
    are not used when they would be appropriate, just because the
    keyword "lambda" makes code look complex.  Omitting lambda in some
    special cases is possible, with small and backwards compatible
    changes to the grammar, and provides a cheap cure against such
    "lambdaphobia".


Rationale

    Sometimes people do not use lambdas because they fear introducing
    a term with a theory behind it.  This proposal makes writing
    argumentless lambdas easier by omitting the "lambda" keyword
    itself.  Implementation can be done by simply changing the
    grammar so that the "lambda" keyword is implied in a few
    well-known cases.  In particular, adding surrounding brackets
    lets you specify a nullary lambda anywhere.


Syntax

    An argumentless "lambda" keyword can be omitted in the following
    cases:

      * immediately after "=" in named parameter assignment or default
        value assignment;

      * immediately after "(" in any expression;

      * immediately after a "," in a function argument list;

      * immediately after a ":" in a dictionary literal; (not
        implemented)

      * in an assignment statement; (not implemented)


Examples of Use

    1) Inline "if":

        def ifelse(cond, true_part, false_part):
            if cond:
                return true_part()
            else:
                return false_part()

        # old syntax:
        print ifelse(a < b, lambda:A, lambda:B)

        # new syntax:
        print ifelse(a < b, :A, :B)

        # parts A and B may require extensive processing, as in:
        print ifelse(a < b, :ext_proc1(A), :ext_proc2(B))

    2) Locking:

        def with(alock, acallable):
            alock.acquire()
            try:
                acallable()
            finally:
                alock.release()

        with(mylock, :x(y(), 23, z(), 'foo'))
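
    For comparison, the first example above can be run today by
    spelling the lambdas explicitly (the ":expr" shorthand was never
    adopted):

```python
# The ifelse() example from above, runnable with explicit lambdas.
def ifelse(cond, true_part, false_part):
    if cond:
        return true_part()
    else:
        return false_part()

A, B = "less", "not less"
a, b = 1, 2
result = ifelse(a < b, lambda: A, lambda: B)
print(result)  # the proposed syntax would read: ifelse(a < b, :A, :B)
```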


Implementation

    Implementation requires some tweaking of the Grammar/Grammar file
    in the Python sources, and some adjustment of
    Modules/parsermodule.c to make syntactic and pragmatic changes.

    (Some grammar/parser guru is needed to make a full
    implementation.)

    Here are the changes needed to Grammar to allow implicit lambda:

        varargslist: (fpdef ['=' imptest] ',')* ('*' NAME [',' '**'
        NAME] | '**' NAME) | fpdef ['=' imptest] (',' fpdef ['='
        imptest])* [',']

        imptest: test | implambdef

        atom: '(' [imptestlist] ')' | '[' [listmaker] ']' |
        '{' [dictmaker] '}' | '`' testlist1 '`' | NAME | NUMBER | STRING+

        implambdef: ':' test

        imptestlist: imptest (',' imptest)* [',']

        argument: [test '='] imptest

    Three new non-terminals are needed: imptest for the places where
    an implicit lambda may occur, implambdef for the implicit lambda
    definition itself, and imptestlist for the places where imptests
    may occur.

    This implementation is not complete.  First, some files in the
    Parser module need to be updated.  Second, some of the places
    listed in the Syntax section above are not implemented.


Discussion

    This feature is not a high-visibility one (the only novel part is
    the absence of lambda). The feature is intended to make nullary
    lambdas more appealing syntactically, to provide lazy evaluation
    of expressions in some simple cases. This proposal is not targeted
    at more advanced cases (demanding arguments for the lambda).

    There is an alternative proposition for implicit lambda: implicit
    lambda with unused arguments. In this case the function defined by
    such lambda can accept any parameters, i.e. be equivalent to:
    lambda *args: expr. This form would be more powerful.  Grep in the
    standard library revealed that such lambdas are indeed in use.

    One more extension can provide a way to have a list of parameters
    passed to a function defined by implicit lambda. However, such
    parameters need some special name to be accessed and are unlikely
    to be included in the language. Possible local names for such
    parameters are: _, __args__, __. For example:

        reduce(:_[0] + _[1], [1,2,3], 0)
        reduce(:__[0] + __[1], [1,2,3], 0)
        reduce(:__args__[0] + __args__[1], [1,2,3], 0)

    These forms do not look very nice, and in the PEP author's opinion
    do not justify the removal of the lambda keyword in such cases.


Credits

    The idea of dropping lambda was first coined by Paul Rubin at 08
    Feb 2003 16:39:30 -0800 in comp.lang.python while discussing the
    thread "For review: PEP 308 - If-then-else expression".


Copyright

    This document has been placed in the public domain.



pep-0313 Adding Roman Numeral Literals to Python

PEP: 313
Title: Adding Roman Numeral Literals to Python
Version: $Revision$
Last-Modified: $Date$
Author: Mike Meyer <mwm at mired.org>
Status: Rejected
Type: Standards Track
Content-Type: text/plain
Created: 01-Apr-2003
Python-Version: 2.4
Post-History: 

Abstract

    This PEP (also known as PEP CCCXIII) proposes adding Roman
    numerals as a literal type.  It also proposes the new built-in
    function "roman", which converts an object to an integer, then
    converts the integer to a string that is the Roman numeral literal
    equivalent to the integer.

BDFL Pronouncement

    This PEP is rejected.  While the majority of Python users deemed this
    to be a nice-to-have feature, the community was unable to reach a
    consensus on whether nine should be represented as IX, the modern
    form, or VIIII, the classic form.  Likewise, no agreement was
    reached on whether MXM or MCMXC would be considered a well-formed
    representation of 1990.  A vocal minority of users has also requested
    support for lower-cased numerals for use in (i) powerpoint slides,
    (ii) academic work, and (iii) Perl documentation.


Rationale

    Roman numerals are used in a number of areas, and adding them to
    Python as literals would make computations in those areas easier.
    For instance, Super Bowls are counted with Roman numerals, and many
    older movies have copyright dates in Roman numerals.  Further,
    LISP provides a Roman numerals literal package, so adding Roman
    numerals to Python will help ease the LISP-envy sometimes seen in
    comp.lang.python.  Besides, the author thinks this is the easiest
    way to get his name on a PEP.


Syntax for Roman literals

    Roman numeral literals will consist of the characters M, D, C, L,
    X, V and I, and only those characters.  They must be in upper
    case, and represent an integer with the following rules:

    1.  Except as noted below, they must appear in the order M, D, C,
    L, X, V then I.  Each occurrence of each character adds 1000, 500,
    100, 50, 10, 5 and 1 to the value of the literal, respectively.

    2.  Only one D, V or L may appear in any given literal.

    3.  At most three each of Is, Xs and Cs may appear consecutively
    in any given literal.

    4.  A single I may appear immediately to the left of the single V,
    followed by no Is, and adds 4 to the value of the literal.

    5.  A single I may likewise appear before the last X, followed by
    no Is or Vs, and adds 9 to the value.

    6.  X is to L and C as I is to V and X, except the values are 40
    and 90, respectively.

    7.  C is to D and M as I is to V and X, except the values are 400
    and 900, respectively.

    Any literal composed entirely of M, D, C, L, X, V and I characters
    that does not follow this format will raise a syntax error,
    because explicit is better than implicit.
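
    Taken together, rules 1 through 7 describe the familiar modern
    subtractive form, which can be checked mechanically.  A sketch of
    a validator (hypothetical; the PEP itself defines no such helper):

```python
import re

# Hypothetical validator for rules 1-7.  M may repeat freely (rule 1);
# I, X and C are capped at three in a row (rule 3); the pairs IV, IX,
# XL, XC, CD and CM encode the subtractive rules 4-7.
ROMAN_RE = re.compile(r"M*(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$")

def is_roman_literal(s):
    """Return True if s is a well-formed Roman numeral literal."""
    return bool(s) and ROMAN_RE.match(s) is not None
```

    Under these rules "MCMXC" is well formed, while "MXM" and "VIIII"
    would raise a syntax error.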


Built-In "roman" Function

    The new built-in function "roman" will aid the translation from
    integers to Roman numeral literals.  It will accept a single
    object as an argument, and return a string containing the literal
    of the same value.  If the argument is not an integer or a
    rational (see PEP 239 [1]) it will be passed through the existing
    built-in "int" to obtain the value.  This may cause a loss of
    information if the object was a float.  If the object is a
    rational, then the result will be formatted as a rational literal
    (see PEP 240 [2]) with the integers in the string being Roman
    numeral literals.
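
    A sketch of what such a built-in might look like, using the
    modern subtractive form (hypothetical, as the PEP was rejected):

```python
# Hypothetical sketch of the proposed built-in "roman", producing the
# modern subtractive form (9 -> "IX", 1990 -> "MCMXC").
_DIGITS = [(1000, "M"), (900, "CM"), (500, "D"), (400, "CD"),
           (100, "C"), (90, "XC"), (50, "L"), (40, "XL"),
           (10, "X"), (9, "IX"), (5, "V"), (4, "IV"), (1, "I")]

def roman(obj):
    n = int(obj)  # non-integers fall back to int(), as specified
    if n <= 0:
        raise ValueError("no Roman numeral literal for %d" % n)
    parts = []
    for value, digits in _DIGITS:
        count, n = divmod(n, value)
        parts.append(digits * count)
    return "".join(parts)
```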


Compatibility Issues

    No new keywords are introduced by this proposal.  Programs that
    use variable names that are all upper case and contain only the
    characters M, D, C, L, X, V and I will be affected by the new
    literals.  These programs will now have syntax errors when those
    variables are assigned, and either syntax errors or subtle bugs
    when those variables are referenced in expressions.  Since such
    variable names violate PEP 8 [3], the code is already broken, it
    just wasn't generating exceptions. This proposal corrects that
    oversight in the language.


References

    [1] PEP 239, Adding a Rational Type to Python
        http://www.python.org/dev/peps/pep-0239/

    [2] PEP 240, Adding a Rational Literal to Python
        http://www.python.org/dev/peps/pep-0240/

    [3] PEP 8, Style Guide for Python Code
        http://www.python.org/dev/peps/pep-0008/


Copyright

    This document has been placed in the public domain.



pep-0314 Metadata for Python Software Packages v1.1

PEP: 314
Title: Metadata for Python Software Packages v1.1
Version: $Revision$
Last-Modified: $Date$
Author: A.M. Kuchling, Richard Jones
Status: Final
Type: Standards Track
Content-Type: text/plain
Created: 12-Apr-2003
Python-Version: 2.5
Post-History: 29-Apr-2003
Replaces: 241

Introduction

   This PEP describes a mechanism for adding metadata to Python
   packages.  It includes specifics of the field names, and their
   semantics and usage.

   This document specifies version 1.1 of the metadata format.
   Version 1.0 is specified in PEP 241.


Including Metadata in Packages

    The Distutils 'sdist' command will extract the metadata fields
    from the arguments and write them to a file in the generated
    zipfile or tarball.  This file will be named PKG-INFO and will be
    placed in the top directory of the source distribution (where the
    README, INSTALL, and other files usually go).

    Developers must not provide their own PKG-INFO file.  The "sdist"
    command will, if it detects an existing PKG-INFO file, terminate
    with an appropriate error message.  This should prevent confusion
    caused by the PKG-INFO and setup.py files being out of sync.

    The PKG-INFO file format is a single set of RFC-822 headers
    parseable by the rfc822.py module.  The field names listed in the
    following section are used as the header names.  
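
    Since the file is a plain set of RFC-822 headers, it can also be
    read with the modern email package (the successor to rfc822.py);
    a minimal sketch:

```python
from email.parser import Parser

# Sketch: PKG-INFO is RFC-822 headers, so the standard email package
# (successor to the old rfc822.py module) can parse it directly.
PKG_INFO = """\
Metadata-Version: 1.1
Name: BeagleVote
Version: 1.0a2
Requires: zlib
Requires: psycopg
"""

msg = Parser().parsestr(PKG_INFO)
name = msg["Name"]                  # single-use field
requires = msg.get_all("Requires")  # multiple-use field -> list
```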
    

Fields

    This section specifies the names and semantics of each of the
    supported metadata fields.
 
    Fields marked with "(Multiple use)" may be specified multiple
    times in a single PKG-INFO file.  Other fields may only occur
    once in a PKG-INFO file.  Fields marked with "(optional)" are
    not required to appear in a valid PKG-INFO file; all other
    fields must be present.

    Metadata-Version

      Version of the file format; currently "1.0" and "1.1" are the
      only legal values here.

      Example: 

           Metadata-Version: 1.1

    Name

      The name of the package.  

      Example: 

          Name: BeagleVote
      
    Version

      A string containing the package's version number.  This
      field should be parseable by one of the Version classes
      (StrictVersion or LooseVersion) in the distutils.version
      module.

      Example: 

          Version: 1.0a2
      
    Platform (multiple use)

      A comma-separated list of platform specifications, summarizing
      the operating systems supported by the package which are not
      listed in the "Operating System" Trove classifiers. See
      "Classifier" below.

      Example: 

          Platform: ObscureUnix, RareDOS

    Supported-Platform (multiple use)

      Binary distributions containing a PKG-INFO file will use the
      Supported-Platform field in their metadata to specify the OS and
      CPU for which the binary package was compiled.  The semantics of
      the Supported-Platform field are not specified in this PEP.

      Example: 

          Supported-Platform: RedHat 7.2
          Supported-Platform: i386-win32-2791

    Summary

      A one-line summary of what the package does.

      Example: 

          Summary: A module for collecting votes from beagles.
      
    Description (optional)

      A longer description of the package that can run to several
      paragraphs.  Software that deals with metadata should not assume
      any maximum size for this field, though people shouldn't include
      their instruction manual as the description.  

      The contents of this field can be written using reStructuredText
      markup [1].  For programs that work with the metadata,
      supporting markup is optional; programs can also display the
      contents of the field as-is.  This means that authors should be
      conservative in the markup they use.

      Example: 
      
          Description: This module collects votes from beagles
                       in order to determine their electoral wishes.
                       Do *not* try to use this module with basset hounds;
                       it makes them grumpy.
      
    Keywords (optional)

      A list of additional keywords to be used to assist searching
      for the package in a larger catalog.

      Example: 

          Keywords: dog puppy voting election
      
    Home-page (optional)

      A string containing the URL for the package's home page.

      Example: 

          Home-page: http://www.example.com/~cschultz/bvote/
      
    Download-URL
    
      A string containing the URL from which this version of the package 
      can be downloaded.  (This means that the URL can't be something like
       ".../package-latest.tgz", but instead must be ".../package-0.45.tgz".)
      
    Author (optional)

      A string containing the author's name at a minimum; additional
      contact information may be provided.  

      Example: 

          Author: C. Schultz, Universal Features Syndicate,
                  Los Angeles, CA <cschultz@peanuts.example.com>
      
    Author-email

      A string containing the author's e-mail address.  It can contain
      a name and e-mail address in the legal forms for a RFC-822
      'From:' header.  It's not optional because cataloging systems
      can use the e-mail portion of this field as a unique key
      representing the author.  A catalog might provide authors the
      ability to store their GPG key, personal home page, and other
      additional metadata *about the author*, and optionally the
      ability to associate several e-mail addresses with the same
      person.  Author-related metadata fields are not covered by this
      PEP.  

      Example: 

          Author-email: "C. Schultz" <cschultz@example.com>
      
    License
      
      Text indicating the license covering the package where the license
      is not a selection from the "License" Trove classifiers. See
      "Classifier" below.

      Example: 

          License: This software may only be obtained by sending the
                   author a postcard, and then the user promises not
                   to redistribute it.

    Classifier (multiple use)

      Each entry is a string giving a single classification value
      for the package.  Classifiers are described in PEP 301 [2].

      Examples:

        Classifier: Development Status :: 4 - Beta
        Classifier: Environment :: Console (Text Based)

      
    Requires (multiple use)
      
      Each entry contains a string describing some other module or
      package required by this package. 

      The format of a requirement string is identical to that of a
      module or package name usable with the 'import' statement,
      optionally followed by a version declaration within parentheses.

      A version declaration is a series of conditional operators and
      version numbers, separated by commas.  Conditional operators
      must be one of "<", ">", "<=", ">=", "==", and "!=".  Version
      numbers must be in the format accepted by the
      distutils.version.StrictVersion class: two or three
      dot-separated numeric components, with an optional "pre-release"
      tag on the end consisting of the letter 'a' or 'b' followed by a
      number.  Example version numbers are "1.0", "2.3a2", and "1.3.99".

      Any number of conditional operators can be specified, e.g.
      the string ">1.0, !=1.3.4, <2.0" is a legal version declaration.

      All of the following are possible requirement strings: "rfc822",
      "zlib (>=1.1.4)", "zope".

      There's no canonical list of what strings should be used; the
      Python community is left to choose its own standards.

      Example: 

          Requires: re
          Requires: sys
          Requires: zlib
          Requires: xml.parsers.expat (>1.0)
          Requires: psycopg
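
      As an illustration, a hypothetical helper (not part of this PEP
      or of Distutils) that splits such requirement strings into a
      name and a list of (operator, version) conditions:

```python
import re

# Hypothetical parser for requirement strings such as "zlib (>=1.1.4)":
# a dotted name, optionally followed by a parenthesized,
# comma-separated list of version conditions.
_REQ = re.compile(r"\s*([\w.]+)\s*(?:\(([^)]*)\))?\s*$")
_COND = re.compile(r"\s*(<=|>=|==|!=|<|>)\s*([\w.]+)\s*$")

def parse_requirement(req):
    m = _REQ.match(req)
    if m is None:
        raise ValueError("malformed requirement: %r" % req)
    name, decl = m.group(1), m.group(2)
    conditions = []
    if decl:
        for clause in decl.split(","):
            cm = _COND.match(clause)
            if cm is None:
                raise ValueError("bad version condition: %r" % clause)
            conditions.append((cm.group(1), cm.group(2)))
    return name, conditions
```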
       
    Provides (multiple use)

      Each entry contains a string describing a package or module that
      will be provided by this package once it is installed.  These
      strings should match the ones used in Requires fields.  A
      version declaration may be supplied (without a comparison
      operator); the package's version number will be implied if none
      is specified.

      Example: 

          Provides: xml
          Provides: xml.utils
          Provides: xml.utils.iso8601
          Provides: xml.dom
          Provides: xmltools (1.3)

    Obsoletes (multiple use)

      Each entry contains a string describing a package or module
      that this package renders obsolete, meaning that the two packages
      should not be installed at the same time.  Version declarations
      can be supplied.  

      The most common use of this field will be in case a package name
      changes, e.g. Gorgon 2.3 gets subsumed into Torqued Python 1.0.
      When you install Torqued Python, the Gorgon package should be
      removed.
      
      Example:

          Obsoletes: Gorgon


Summary of Differences From PEP 241

    * Metadata-Version is now 1.1.

    * Added the Classifier field from PEP 301.

    * The License and Platform fields should now only be used if the
      platform or license can't be handled by an appropriate Classifier 
      value.

    * Added fields: Download-URL, Requires, Provides, Obsoletes.


Open issues

    None.


Acknowledgements

    None.


References

    [1] reStructuredText 
        http://docutils.sourceforge.net/

    [2] PEP 301
        http://www.python.org/dev/peps/pep-0301/


Copyright

    This document has been placed in the public domain.



pep-0315 Enhanced While Loop

PEP: 315
Title: Enhanced While Loop
Version: $Revision$
Last-Modified: $Date$
Author: Raymond Hettinger <python at rcn.com>
Status: Rejected
Type: Standards Track
Content-Type: text/plain
Created: 25-Apr-2003
Python-Version: 2.5
Post-History: 

Abstract

    This PEP proposes adding an optional "do" clause to the beginning
    of the while loop to make loop code clearer and reduce errors
    caused by code duplication.


Notice

    Rejected; see
    http://mail.python.org/pipermail/python-ideas/2013-June/021610.html

    This PEP has been deferred since 2006; see
    http://mail.python.org/pipermail/python-dev/2006-February/060718.html

    Subsequent efforts to revive the PEP in April 2009 did not
    meet with success because no syntax emerged that could
    compete with the following form:

        while True:
            <setup code>
            if not <condition>:
                break
            <loop body>

    A syntax alternative to the one proposed in the PEP was found for
    a basic do-while loop but it gained little support because the
    condition was at the top:

        do ... while <cond>:
            <loop body>

    Users of the language are advised to use the while-True form with
    an inner if-break when a do-while loop would have been appropriate.


Motivation

    It is often necessary for some code to be executed before each
    evaluation of the while loop condition.  This code is often
    duplicated outside the loop, as setup code that executes once
    before entering the loop:

        <setup code>
        while <condition>:
            <loop body>
            <setup code>

    The problem is that duplicated code can be a source of errors if
    one instance is changed but the other is not.  Also, the purpose
    of the second instance of the setup code is not clear because it
    comes at the end of the loop.

    It is possible to prevent code duplication by moving the loop
    condition into a helper function, or an if statement in the loop
    body.  However, separating the loop condition from the while
    keyword makes the behavior of the loop less clear:

        def helper(args):
            <setup code>
            return <condition>

        while helper(args):
            <loop body>

    This last form has the additional drawback of requiring the loop's
    else clause to be added to the body of the if statement, further
    obscuring the loop's behavior:

        while True:
            <setup code>
            if not <condition>: break
            <loop body>

    This PEP proposes to solve these problems by adding an optional
    clause to the while loop, which allows the setup code to be
    expressed in a natural way:

        do:
            <setup code>
        while <condition>:
            <loop body>

    This keeps the loop condition with the while keyword where it
    belongs, and does not require code to be duplicated.
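
    A concrete, runnable instance of the pattern (the helper name is
    invented for illustration), written in the while-True form that
    is available today:

```python
# Runnable example of the do-while pattern: setup code (fetching the
# next token) must run before each test of the loop condition.
def collect_until_sentinel(tokens, sentinel="END"):
    it = iter(tokens)
    collected = []
    while True:
        tok = next(it, None)                # <setup code>
        if tok is None or tok == sentinel:  # if not <condition>: break
            break
        collected.append(tok)               # <loop body>
    return collected
```

    Under this PEP's proposal, the setup line would move into a
    leading do: clause and the condition back onto the while keyword.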


Syntax

    The syntax of the while statement

        while_stmt : "while" expression ":" suite
                     ["else" ":" suite]

    is extended as follows:

        while_stmt : ["do" ":" suite]
                     "while" expression ":" suite
                     ["else" ":" suite]


Semantics of break and continue

    In the do-while loop the break statement will behave the same as
    in the standard while loop: It will immediately terminate the loop
    without evaluating the loop condition or executing the else
    clause.

    A continue statement in the do-while loop jumps to the while
    condition check.

    In general, when the while suite is empty (a pass statement),
    the do-while loop and break and continue statements should match
    the semantics of do-while in other languages.

    Likewise, when the do suite is empty, the do-while loop and
    break and continue statements should match behavior found
    in regular while loops.


Future Statement

    Because of the new keyword "do", the statement

        from __future__ import do_while

    will initially be required to use the do-while form.


Implementation

    The first implementation of this PEP can compile the do-while loop
    as an infinite loop with a test that exits the loop.


Copyright

    This document is placed in the public domain.



pep-0316 Programming by Contract for Python

PEP:316
Title:Programming by Contract for Python
Version:$Revision$
Last-Modified:$Date$
Author:Terence Way <terry at wayforward.net>
Status:Deferred
Type:Standards Track
Content-Type:text/x-rst
Created:02-May-2003
Python-Version:
Post-History:

Abstract

This submission describes programming by contract for Python. Eiffel's Design By Contract(tm) is perhaps the most popular use of programming contracts [2].

Programming contracts extend the language to include invariant expressions for classes and modules, and pre- and post-condition expressions for functions and methods.

These expressions (contracts) are similar to assertions: they must be true or the program is stopped, and run-time checking of the contracts is typically only enabled while debugging. Contracts are higher-level than straight assertions and are typically included in documentation.

Motivation

Python already has assertions, so why add extra machinery to the language to support something like contracts? The two best reasons are 1) better, more accurate documentation, and 2) easier testing.

Complex modules and classes never seem to be documented quite right. The documentation provided may be enough to convince a programmer to use a particular module or class over another, but the programmer almost always has to read the source code when the real debugging starts.

Contracts extend the excellent example provided by the doctest module [4]. Documentation is readable by programmers, yet has executable tests embedded in it.

Testing code with contracts is easier too. Comprehensive contracts are equivalent to unit tests [8]. Tests exercise the full range of pre-conditions, and fail if the post-conditions are triggered. Theoretically, a correctly specified function can be tested completely randomly.

So why add this to the language? Why not have several different implementations, or let programmers implement their own assertions? The answer is the behavior of contracts under inheritance.

Suppose Alice and Bob use different assertions packages. If Alice produces a class library protected by assertions, Bob cannot derive classes from Alice's library and expect proper checking of post-conditions and invariants. If they both use the same assertions package, then Bob can override Alice's methods yet still test against Alice's contract assertions. The natural place to find this assertions system is in the language's run-time library.

Specification

The docstring of any module or class can include invariant contracts marked off with a line that starts with the keyword inv followed by a colon (:). Whitespace at the start of the line and around the colon is ignored. The colon is either immediately followed by a single expression on the same line, or by a series of expressions on following lines indented past the inv keyword. The normal Python rules about implicit and explicit line continuations are followed here. Any number of invariant contracts can be in a docstring.

Some examples:

# state enumeration
START, CONNECTING, CONNECTED, CLOSING, CLOSED = range(5)

class conn:

    """A network connection

    inv: self.state in [START, CLOSED,       # closed states
                        CONNECTING, CLOSING, # transition states
                        CONNECTED]

    inv: 0 <= self.seqno < 256
    """

class circbuf:

    """A circular buffer.

    inv:
        # there can be from 0 to max items on the buffer
        0 <= self.len <= len(self.buf)

        # g is a valid index into buf
        0 <= self.g < len(self.buf)

        # p is also a valid index into buf
        0 <= self.p < len(self.buf)

        # there are len items between get and put
        (self.p - self.g) % len(self.buf) == \
              self.len % len(self.buf)
    """

Module invariants must be true after the module is loaded, and at the entry and exit of every public function within the module.

Class invariants must be true after the __init__ function returns, at the entry of the __del__ function, and at the entry and exit of every other public method of the class. Class invariants must use the self variable to access instance variables.

A method or function is public if its name doesn't start with an underscore (_), unless it starts and ends with '__' (two underscores).
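This naming rule is easy to express directly; a small helper like the following (illustrative, not part of the PEP's API) captures it:

```python
def is_public(name):
    """Return True if a function or method name counts as public
    under the rule above: no leading underscore, except that
    dunder names such as __init__ are public."""
    if name.startswith('__') and name.endswith('__'):
        return True
    return not name.startswith('_')
```

So is_public('get') and is_public('__init__') are true, while is_public('_helper') and is_public('__mangled') are false.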

The docstring of any function or method can have pre-conditions documented with the keyword pre following the same rules above. Post-conditions are documented with the keyword post optionally followed by a list of variables. The variables are in the same scope as the body of the function or method. This list declares the variables that the function/method is allowed to modify.

An example:

class circbuf:

    def __init__(self, leng):
        """Construct an empty circular buffer.

        pre: leng > 0
        post[self]:
            self.is_empty()
            len(self.buf) == leng
        """

A double-colon (::) can be used instead of a single colon (:) to support docstrings written using reStructuredText [7]. For example, the following two docstrings describe the same contract:

"""pre: leng > 0"""
"""pre:: leng > 0"""

Expressions in pre- and post-conditions are defined in the module namespace -- they have access to nearly all the variables that the function can access, except closure variables.

The contract expressions in post-conditions have access to two additional variables: __old__ which is filled with shallow copies of values declared in the variable list immediately following the post keyword, and __return__ which is bound to the return value of the function or method.

An example:

class circbuf:

    def get(self):
        """Pull an entry from a non-empty circular buffer.

        pre: not self.is_empty()
        post[self.g, self.len]:
            __return__ == self.buf[__old__.self.g]
            self.len == __old__.self.len - 1
        """

All contract expressions have access to some additional convenience functions. To make evaluating the truth of sequences easier, two functions forall and exists are defined as:

def forall(a, fn = bool):
    """Return True only if all elements in a are true.

    >>> forall([])
    True
    >>> even = lambda x: x % 2 == 0
    >>> forall([2, 4, 6, 8], even)
    True
    >>> forall('this is a test'.split(), lambda x: len(x) == 4)
    False
    """
    for item in a:
        if not fn(item):
            return False
    return True

def exists(a, fn = bool):
    """Return True if there is at least one true value in a.

    >>> exists([])
    False
    >>> exists('this is a test'.split(), lambda x: len(x) == 4)
    True
    """
    for item in a:
        if fn(item):
            return True
    return False

An example:

def sort(a):
    """Sort a list.

    pre: isinstance(a, list)
    post[a]:
        # array size is unchanged
        len(a) == len(__old__.a)

        # array is ordered
        forall([a[i] >= a[i-1] for i in range(1, len(a))])

        # all the old elements are still in the array
        forall(__old__.a, lambda e: __old__.a.count(e) == a.count(e))
    """

To make evaluating conditions easier, the function implies is defined. With two arguments, this is similar to the logical implies (=>) operator. With three arguments, this is similar to C's conditional expression (x?a:b). This is defined as:

implies(False, a) => True
implies(True, a) => a
implies(False, a, b) => b
implies(True, a, b) => a
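The truth table above corresponds to a one-line definition (a sketch; the argument names are illustrative):

```python
def implies(test, a, b=True):
    """implies(test, a) behaves like logical implication test => a;
    implies(test, a, b) behaves like the conditional test ? a : b."""
    if test:
        return a
    return b
```

In a contract, implies(x > 0, y > 0) then reads "if x is positive, y must be positive".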

On entry to a function, the function's pre-conditions are checked. An assertion error is raised if any pre-condition is false. If the function is public, then the class or module's invariants are also checked. Copies of variables declared in the post are saved, the function is called, and if the function exits without raising an exception, the post-conditions are checked.
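This checking sequence can be sketched as an ordinary decorator. This is not the PEP's implementation (which rewrites namespaces directly and reads contracts from docstrings); the pre and post callables are illustrative stand-ins for the docstring contracts:

```python
import copy
import functools

def checked(pre=None, post=None):
    """pre(*args) is evaluated on entry; shallow copies of the
    arguments play the role of __old__; post(old, result, *args)
    is evaluated only if the call returns without raising."""
    def decorate(fn):
        @functools.wraps(fn)
        def wrapper(*args):
            if pre is not None and not pre(*args):
                raise AssertionError('pre-condition failed')
            old = [copy.copy(a) for a in args]   # saved before the call
            result = fn(*args)                   # post skipped on exception
            if post is not None and not post(old, result, *args):
                raise AssertionError('post-condition failed')
            return result
        return wrapper
    return decorate

@checked(pre=lambda a: isinstance(a, list),
         post=lambda old, result, a: a == sorted(old[0]))
def sort_list(a):
    a.sort()
```

Calling sort_list([3, 1, 2]) succeeds and leaves the list sorted; calling it with a non-list fails the pre-condition before the body runs.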

Exceptions

Class/module invariants are checked even if a function or method exits by signalling an exception (post-conditions are not).

All failed contracts raise exceptions which are subclasses of the ContractViolationError exception, which is in turn a subclass of the AssertionError exception. Failed pre-conditions raise a PreconditionViolationError exception. Failed post-conditions raise a PostconditionViolationError exception, and failed invariants raise an InvariantViolationError exception.

The class hierarchy:

AssertionError
    ContractViolationError
        PreconditionViolationError
        PostconditionViolationError
        InvariantViolationError
        InvalidPreconditionError

The InvalidPreconditionError is raised when pre-conditions are illegally strengthened, see the next section on Inheritance.

Example:

try:
    some_func()
except contract.PreconditionViolationError:
    # failed pre-condition, ok
    pass

Inheritance

A class's invariants include all the invariants for all super-classes (class invariants are ANDed with super-class invariants). These invariants are checked in method-resolution order.

A method's post-conditions also include all overridden post-conditions (method post-conditions are ANDed with all overridden method post-conditions).

An overridden method's pre-conditions can be ignored if the overriding method's pre-conditions are met. However, if the overriding method's pre-conditions fail, all of the overridden method's pre-conditions must also fail. If not, a separate exception is raised, the InvalidPreconditionError. This supports weakening pre-conditions.

A somewhat contrived example:

class SimpleMailClient:

    def send(self, msg, dest):
        """Sends a message to a destination:

        pre: self.is_open() # we must have an open connection
        """

    def recv(self):
        """Gets the next unread mail message.

        Returns None if no message is available.

        pre: self.is_open() # we must have an open connection
        post: __return__ == None or isinstance(__return__, Message)
        """

class ComplexMailClient(SimpleMailClient):
    def send(self, msg, dest):
        """Sends a message to a destination.

        The message is sent immediately if currently connected.
        Otherwise, the message is queued locally until a
        connection is made.

        pre: True # weakens the pre-condition from SimpleMailClient
        """

    def recv(self):
        """Gets the next unread mail message.

        Waits until a message is available.

        pre: True # can always be called
        post: isinstance(__return__, Message)
        """

Because pre-conditions can only be weakened, a ComplexMailClient can replace a SimpleMailClient with no fear of breaking existing code.
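The rule reduces to a small dispatch on the two outcomes. The sketch below uses the exception names from the hierarchy above, with the boolean arguments standing in for evaluating the actual contract expressions:

```python
class ContractViolationError(AssertionError):
    pass

class PreconditionViolationError(ContractViolationError):
    pass

class InvalidPreconditionError(ContractViolationError):
    pass

def check_inherited_pre(overriding_holds, overridden_holds):
    """Outcome of calling an overriding method, given whether its
    own pre-condition and the overridden one hold."""
    if overriding_holds:
        return          # weakened pre-condition satisfied: proceed
    if overridden_holds:
        # the override is stricter than the method it replaces
        raise InvalidPreconditionError('pre-condition illegally strengthened')
    raise PreconditionViolationError('pre-condition failed')
```

Only when the overriding pre-condition fails while the overridden one would have passed is InvalidPreconditionError raised; an honest failure of both is an ordinary PreconditionViolationError.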

Rationale

Except for the following differences, programming-by-contract for Python mirrors the Eiffel DBC specification [3].

Embedding contracts in docstrings is patterned after the doctest module. It removes the need for extra syntax, ensures that programs with contracts are backwards-compatible, and no further work is necessary to have the contracts included in the docs.

The keywords pre, post, and inv were chosen instead of the Eiffel-style REQUIRE, ENSURE, and INVARIANT because they're shorter, more in line with mathematical notation, and for a more subtle reason: the word 'require' implies caller responsibilities, while 'ensure' implies provider guarantees. Yet pre-conditions can fail through no fault of the caller when using multiple inheritance, and post-conditions can fail through no fault of the function when using multiple threads.

Loop invariants as used in Eiffel are unsupported. They're a pain to implement, and not part of the documentation anyway.

The variable names __old__ and __return__ were picked to avoid conflicts with the return keyword and to stay consistent with Python naming conventions: they're public and provided by the Python implementation.

Having variable declarations after a post keyword describes exactly what the function or method is allowed to modify. This removes the need for the NoChange syntax in Eiffel, and makes the implementation of __old__ much easier. It also is more in line with Z schemas [9], which are divided into two parts: declaring what changes followed by limiting the changes.

Shallow copies of variables for the __old__ value prevent an implementation of contract programming from slowing down a system too much. If a function changes values that wouldn't be caught by a shallow copy, it can declare the changes like so:

post[self, self.obj, self.obj.p]

The forall, exists, and implies functions were added after spending some time documenting existing functions with contracts. These capture a majority of common specification idioms. It might seem that defining implies as a function might not work (the arguments are evaluated whether needed or not, in contrast with other boolean operators), but it works for contracts since there should be no side-effects for any expression in a contract.

Reference Implementation

A reference implementation is available [1]. It replaces existing functions with new functions that do contract checking, by directly changing the class' or module's namespace.

Other implementations exist that either hack __getattr__ [5] or use __metaclass__ [6].

References

[1] Implementation described in this document. (http://www.wayforward.net/pycontract/)
[2] Design By Contract is a registered trademark of Eiffel Software Inc. (http://archive.eiffel.com/doc/manuals/technology/contract/)
[3] Object-oriented Software Construction, Bertrand Meyer, ISBN 0-13-629031-0
[4] doctest -- Test docstrings represent reality (http://docs.python.org/library/doctest.html)
[5] Design by Contract for Python, R. Plosch, IEEE Proceedings of the Joint Asia Pacific Software Engineering Conference (APSEC97/ICSC97), Hong Kong, December 2-5, 1997 (http://www.swe.uni-linz.ac.at/publications/abstract/TR-SE-97.24.html)
[6] PyDBC -- Design by Contract for Python 2.2+, Daniel Arbuckle (http://www.nongnu.org/pydbc/)
[7] reStructuredText (http://docutils.sourceforge.net/rst.html)
[8] Extreme Programming Explained, Kent Beck, ISBN 0-201-61641-6
[9] The Z Notation, Second Edition, J.M. Spivey, ISBN 0-13-978529-9

pep-0317 Eliminate Implicit Exception Instantiation

PEP:317
Title:Eliminate Implicit Exception Instantiation
Version:$Revision$
Last-Modified:$Date$
Author:Steven Taschuk <staschuk at telusplanet.net>
Status:Rejected
Type:Standards Track
Content-Type:text/x-rst
Created:06-May-2003
Python-Version:2.4
Post-History:09-Jun-2003

Abstract

"For clarity in new code, the form raise class(argument, ...) is recommended (i.e. make an explicit call to the constructor)."

—Guido van Rossum, in 1997 [1]

This PEP proposes the formal deprecation and eventual elimination of forms of the raise statement which implicitly instantiate an exception. For example, statements such as

raise HullBreachError
raise KitchenError, 'all out of baked beans'

must under this proposal be replaced with their synonyms

raise HullBreachError()
raise KitchenError('all out of baked beans')

Note that these latter statements are already legal, and that this PEP does not change their meaning.

Eliminating these forms of raise makes it impossible to use string exceptions; accordingly, this PEP also proposes the formal deprecation and eventual elimination of string exceptions.

Adoption of this proposal breaks backwards compatibility. Under the proposed implementation schedule, Python 2.4 will introduce warnings about uses of raise which will eventually become incorrect, and Python 3.0 will eliminate them entirely. (It is assumed that this transition period -- 2.4 to 3.0 -- will be at least one year long, to comply with the guidelines of PEP 5 [2].)

Motivation

String Exceptions

It is assumed that removing string exceptions will be uncontroversial, since it has been intended since at least Python 1.5, when the standard exception types were changed to classes [1].

For the record: string exceptions should be removed because the presence of two kinds of exception complicates the language without any compensation. Instance exceptions are superior because, for example,

  • the class-instance relationship more naturally expresses the relationship between the exception type and value,
  • they can be organized naturally using superclass-subclass relationships, and
  • they can encapsulate error-reporting behaviour (for example).

Implicit Instantiation

Guido's 1997 essay [1] on changing the standard exceptions into classes makes clear why raise can instantiate implicitly:

"The raise statement has been extended to allow raising a class exception without explicit instantiation. The following forms, called the "compatibility forms" of the raise statement [...] The motivation for introducing the compatibility forms was to allow backward compatibility with old code that raised a standard exception."

For example, it was desired that pre-1.5 code which used string exception syntax such as

raise TypeError, 'not an int'

would work both on versions of Python in which TypeError was a string, and on versions in which it was a class.

When no such consideration obtains -- that is, when the desired exception type is not a string in any version of the software which the code must support -- there is no good reason to instantiate implicitly, and it is clearer not to. For example:

  1. In the code

    try:
        raise MyError, raised
    except MyError, caught:
        pass
    

    the syntactic parallel between the raise and except statements strongly suggests that raised and caught refer to the same object. For string exceptions this actually is the case, but for instance exceptions it is not.

  2. When instantiation is implicit, it is not obvious when it occurs, for example, whether it occurs when the exception is raised or when it is caught. Since it actually happens at the raise, the code should say so.

    (Note that at the level of the C API, an exception can be "raised" and "caught" without being instantiated; this is used as an optimization by, for example, PyIter_Next. But in Python, no such optimization is or should be available.)

  3. An implicitly instantiating raise statement with no arguments, such as

    raise MyError
    

    simply does not do what it says: it does not raise the named object.

  4. The equivalence of

    raise MyError
    raise MyError()
    

    conflates classes and instances, creating a possible source of confusion for beginners. (Moreover, it is not clear that the interpreter could distinguish between a new-style class and an instance of such a class, so implicit instantiation may be an obstacle to any future plan to let exceptions be new-style objects.)

In short, implicit instantiation has no advantages other than backwards compatibility, and so should be phased out along with what it exists to ensure compatibility with, namely, string exceptions.

Specification

The syntax of raise_stmt [3] is to be changed from

raise_stmt ::= "raise" [expression ["," expression ["," expression]]]

to

raise_stmt ::= "raise" [expression ["," expression]]

If no expressions are present, the raise statement behaves as it does presently: it re-raises the last exception that was active in the current scope, and if no exception has been active in the current scope, a TypeError is raised indicating that this is the problem.

Otherwise, the first expression is evaluated, producing the raised object. Then the second expression is evaluated, if present, producing the substituted traceback. If no second expression is present, the substituted traceback is None.

The raised object must be an instance. The class of the instance is the exception type, and the instance itself is the exception value. If the raised object is not an instance -- for example, if it is a class or string -- a TypeError is raised.

If the substituted traceback is not None, it must be a traceback object, and it is substituted instead of the current location as the place where the exception occurred. If it is neither a traceback object nor None, a TypeError is raised.

Backwards Compatibility

Migration Plan

Future Statement

Under the future statement [4]

from __future__ import raise_with_two_args

the syntax and semantics of the raise statement will be as described above. This future feature is to appear in Python 2.4; its effect is to become standard in Python 3.0.

As the examples below illustrate, this future statement is only needed for code which uses the substituted traceback argument to raise; simple exception raising does not require it.

Warnings

Three new warnings [5], all of category DeprecationWarning, are to be issued to point out uses of raise which will become incorrect under the proposed changes.

The first warning is issued when a raise statement is executed in which the first expression evaluates to a string. The message for this warning is:

raising strings will be impossible in the future

The second warning is issued when a raise statement is executed in which the first expression evaluates to a class. The message for this warning is:

raising classes will be impossible in the future

The third warning is issued when a raise statement with three expressions is compiled. (Not, note, when it is executed; this is important because the SyntaxError which this warning presages will occur at compile-time.) The message for this warning is:

raising with three arguments will be impossible in the future

These warnings are to appear in Python 2.4, and disappear in Python 3.0, when the conditions which cause them are simply errors.

Examples

Code Using Implicit Instantiation

Code such as

class MyError(Exception):
    pass

raise MyError, 'spam'

will issue a warning when the raise statement is executed. The raise statement should be changed to instantiate explicitly:

raise MyError('spam')

Code Using String Exceptions

Code such as

MyError = 'spam'
raise MyError, 'eggs'

will issue a warning when the raise statement is executed. The exception type should be changed to a class:

class MyError(Exception):
    pass

and, as in the previous example, the raise statement should be changed to instantiate explicitly

raise MyError('eggs')

Code Supplying a Traceback Object

Code such as

raise MyError, 'spam', mytraceback

will issue a warning when compiled. The statement should be changed to

raise MyError('spam'), mytraceback

and the future statement

from __future__ import raise_with_two_args

should be added at the top of the module. Note that adding this future statement also turns the other two warnings into errors, so the changes described in the previous examples must also be applied.

The special case

raise sys.exc_type, sys.exc_value, sys.exc_traceback

(which is intended to re-raise a previous exception) should be changed simply to

raise

A Failure of the Plan

It may occur that a raise statement which raises a string or implicitly instantiates is not executed in production or testing during the phase-in period for this PEP. In that case, it will not issue any warnings, but will instead suddenly fail one day in Python 3.0 or a subsequent version. (The failure is that the wrong exception gets raised, namely a TypeError complaining about the arguments to raise, instead of the exception intended.)

Such cases can be made rarer by prolonging the phase-in period; they cannot be made impossible short of issuing at compile-time a warning for every raise statement.

Rejection

If this PEP were accepted, nearly all existing Python code would need to be reviewed and probably revised; even if all the above arguments in favour of explicit instantiation are accepted, the improvement in clarity is too minor to justify the cost of doing the revision and the risk of new bugs introduced thereby.

This proposal has therefore been rejected [6].

Note that string exceptions are slated for removal independently of this proposal; what is rejected is the removal of implicit exception instantiation.

Summary of Discussion

A small minority of respondents were in favour of the proposal, but the dominant response was that any such migration would be costly out of proportion to the putative benefit. As noted above, this point is sufficient in itself to reject the PEP.

New-Style Exceptions

Implicit instantiation might conflict with future plans to allow instances of new-style classes to be used as exceptions. In order to decide whether to instantiate implicitly, the raise machinery must determine whether the first argument is a class or an instance -- but with new-style classes there is no clear and strong distinction.

Under this proposal, the problem would be avoided because the exception would already have been instantiated. However, there are two plausible alternative solutions:

  1. Require exception types to be subclasses of Exception, and instantiate implicitly if and only if

    issubclass(firstarg, Exception)
    
  2. Instantiate implicitly if and only if

    isinstance(firstarg, type)
    

Thus eliminating implicit instantiation entirely is not necessary to solve this problem.
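The second alternative amounts to a one-line test (a sketch; the function name is illustrative):

```python
def should_instantiate(firstarg):
    """Instantiate implicitly if and only if the first argument to
    raise is itself a type (alternative 2 above)."""
    return isinstance(firstarg, type)
```

Under this rule, raise ValueError would instantiate (ValueError is a type), while raise ValueError('x') would not (the argument is already an instance).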

Ugliness of Explicit Instantiation

Some respondents felt that the explicitly instantiating syntax is uglier, especially in cases when no arguments are supplied to the exception constructor:

raise TypeError()

The problem is particularly acute when the exception instance itself is not of interest, that is, when the only relevant point is the exception type:

try:
    # ... deeply nested search loop ...
        raise Found
except Found:
    # ...

In such cases the symmetry between raise and except can be more expressive of the intent of the code.

Guido opined that the implicitly instantiating syntax is "a tad prettier" even for cases with a single argument, since it has less punctuation.

Performance Penalty of Warnings

Experience with deprecating apply() shows that use of the warning framework can incur a significant performance penalty.

Code which instantiates explicitly would not be affected, since the run-time checks necessary to determine whether to issue a warning are exactly those which are needed to determine whether to instantiate implicitly in the first place. That is, such statements are already incurring the cost of these checks.

Code which instantiates implicitly would incur a large cost: timing trials indicate that issuing a warning (whether it is suppressed or not) takes about five times more time than simply instantiating, raising, and catching an exception.

This penalty is mitigated by the fact that raise statements are rarely on performance-critical execution paths.

Traceback Argument

As the proposal stands, it would be impossible to use the traceback argument to raise conveniently with all 2.x versions of Python.

For compatibility with versions < 2.4, the three-argument form must be used; but this form would produce warnings with versions >= 2.4. Those warnings could be suppressed, but doing so is awkward because the relevant type of warning is issued at compile-time.

If this PEP were still under consideration, this objection would be met by extending the phase-in period. For example, warnings could first be issued in 3.0, and become errors in some later release.

References

[1] "Standard Exception Classes in Python 1.5", Guido van Rossum. http://www.python.org/doc/essays/stdexceptions.html
[2] "Guidelines for Language Evolution", Paul Prescod. http://www.python.org/dev/peps/pep-0005/
[3] "Python Language Reference", Guido van Rossum. http://docs.python.org/reference/simple_stmts.html#raise
[4] PEP 236, "Back to the __future__", Tim Peters. http://www.python.org/dev/peps/pep-0236/
[5] PEP 230, "Warning Framework", Guido van Rossum. http://www.python.org/dev/peps/pep-0230/
[6] Guido van Rossum, 11 June 2003 post to python-dev. http://mail.python.org/pipermail/python-dev/2003-June/036176.html

pep-0318 Decorators for Functions and Methods

PEP:318
Title:Decorators for Functions and Methods
Version:$Revision$
Last-Modified:$Date$
Author:Kevin D. Smith <Kevin.Smith at theMorgue.org>, Jim J. Jewett, Skip Montanaro, Anthony Baxter
Status:Final
Type:Standards Track
Content-Type:text/x-rst
Created:05-Jun-2003
Python-Version:2.4
Post-History:09-Jun-2003, 10-Jun-2003, 27-Feb-2004, 23-Mar-2004, 30-Aug-2004, 2-Sep-2004

WarningWarningWarning

This document is meant to describe the decorator syntax and the process that resulted in the decisions that were made. It does not attempt to cover the huge number of potential alternative syntaxes, nor is it an attempt to exhaustively list all the positives and negatives of each form.

Abstract

The current method for transforming functions and methods (for instance, declaring them as a class or static method) is awkward and can lead to code that is difficult to understand. Ideally, these transformations should be made at the same point in the code where the declaration itself is made. This PEP introduces new syntax for transformations of a function or method declaration.

Motivation

The current method of applying a transformation to a function or method places the actual transformation after the function body. For large functions this separates a key component of the function's behavior from the definition of the rest of the function's external interface. For example:

def foo(self):
    perform method operation
foo = classmethod(foo)

This becomes less readable with longer methods. It also seems less than pythonic to name the function three times for what is conceptually a single declaration. A solution to this problem is to move the transformation of the method closer to the method's own declaration. The intent of the new syntax is to replace

def foo(cls):
    pass
foo = synchronized(lock)(foo)
foo = classmethod(foo)

with an alternative that places the decoration in the function's declaration:

@classmethod
@synchronized(lock)
def foo(cls):
    pass

Modifying classes in this fashion is also possible, though the benefits are not as immediately apparent. Almost certainly, anything which could be done with class decorators could be done using metaclasses, but using metaclasses is sufficiently obscure that there is some attraction to having an easier way to make simple modifications to classes. For Python 2.4, only function/method decorators are being added.

PEP 3129 proposes to add class decorators as of Python 2.6.

Why Is This So Hard?

Two decorators (classmethod() and staticmethod()) have been available in Python since version 2.2. It's been assumed since approximately that time that some syntactic support for them would eventually be added to the language. Given this assumption, one might wonder why it's been so difficult to arrive at a consensus. Discussions have raged off-and-on at times in both comp.lang.python and the python-dev mailing list about how best to implement function decorators. There is no one clear reason why this should be so, but a few problems seem to be most divisive.

  • Disagreement about where the "declaration of intent" belongs. Almost everyone agrees that decorating/transforming a function at the end of its definition is suboptimal. Beyond that there seems to be no clear consensus where to place this information.
  • Syntactic constraints. Python is a syntactically simple language with fairly strong constraints on what can and can't be done without "messing things up" (both visually and with regards to the language parser). There's no obvious way to structure this information so that people new to the concept will think, "Oh yeah, I know what you're doing." The best that seems possible is to keep new users from creating a wildly incorrect mental model of what the syntax means.
  • Overall unfamiliarity with the concept. For people who have a passing acquaintance with algebra (or even basic arithmetic) or have used at least one other programming language, much of Python is intuitive. Very few people will have had any experience with the decorator concept before encountering it in Python. There's just no strong preexisting meme that captures the concept.
  • Syntax discussions in general appear to cause more contention than almost anything else. Readers are pointed to the ternary operator discussions that were associated with PEP 308 for another example of this.

Background

There is general agreement that syntactic support is preferable to the current state of affairs. Guido mentioned syntactic support for decorators [2] in his DevDay keynote presentation at the 10th Python Conference [3], though he later said [5] it was only one of several extensions he proposed there "semi-jokingly". Michael Hudson raised the topic [4] on python-dev shortly after the conference, attributing the initial bracketed syntax to an earlier proposal on comp.lang.python by Gareth McCaughan [6].

Class decorations seem like an obvious next step because class definition and function definition are syntactically similar, however Guido remains unconvinced, and class decorators will almost certainly not be in Python 2.4.

The discussion continued on and off on python-dev from February 2002 through July 2004. Hundreds and hundreds of posts were made, with people proposing many possible syntax variations. Guido took a list of proposals to EuroPython 2004 [7], where a discussion took place. Subsequent to this, he decided that we'd have the Java-style [10] @decorator syntax, and this appeared for the first time in 2.4a2. Barry Warsaw named this the 'pie-decorator' syntax, in honor of the Pie-thon Parrot shootout which occurred around the same time as the decorator syntax debate, and because the @ looks a little like a pie. Guido outlined his case [8] on Python-dev, including this piece [9] on some of the (many) rejected forms.

On the name 'Decorator'

There's been a number of complaints about the choice of the name 'decorator' for this feature. The major one is that the name is not consistent with its use in the GoF book [11]. The name 'decorator' probably owes more to its use in the compiler area -- a syntax tree is walked and annotated. It's quite possible that a better name may turn up.

Design Goals

The new syntax should

  • work for arbitrary wrappers, including user-defined callables and the existing builtins classmethod() and staticmethod(). This requirement also means that a decorator syntax must support passing arguments to the wrapper constructor
  • work with multiple wrappers per definition
  • make it obvious what is happening; at the very least it should be obvious that new users can safely ignore it when writing their own code
  • be a syntax "that ... [is] easy to remember once explained"
  • not make future extensions more difficult
  • be easy to type; programs that use it are expected to use it very frequently
  • not make it more difficult to scan through code quickly. It should still be easy to search for all definitions, a particular definition, or the arguments that a function accepts
  • not needlessly complicate secondary support tools such as language-sensitive editors and other "toy parser tools out there [12]"
  • allow future compilers to optimize for decorators. With the hope of a JIT compiler for Python coming into existence at some point this tends to require the syntax for decorators to come before the function definition
  • move from the end of the function, where it's currently hidden, to the front where it is more in your face [13]

Andrew Kuchling has links to a bunch of the discussions about motivations and use cases in his blog [14]. Particularly notable is Jim Hugunin's list of use cases [15].

Current Syntax

The current syntax for function decorators as implemented in Python 2.4a2 is:

@dec2
@dec1
def func(arg1, arg2, ...):
    pass

This is equivalent to:

def func(arg1, arg2, ...):
    pass
func = dec2(dec1(func))

without the intermediate assignment to the variable func. The decorators are near the function declaration. The @ sign makes it clear that something new is going on here.

The rationale for the order of application [16] (bottom to top) is that it matches the usual order for function-application. In mathematics, composition of functions (g o f)(x) translates to g(f(x)). In Python, @g @f def foo() translates to foo = g(f(foo)).
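The equivalence can be checked directly; here f and g are illustrative wrappers that record the application order:

```python
def f(fn):
    # Innermost decorator: applied first (bottom of the stack).
    def wrapper():
        return "f(" + fn() + ")"
    return wrapper

def g(fn):
    # Outermost decorator: applied last (top of the stack).
    def wrapper():
        return "g(" + fn() + ")"
    return wrapper

@g
@f
def foo():
    return "foo"

# foo is now g(f(foo)), matching mathematical composition:
print(foo())  # g(f(foo))
```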

The decorator statement is limited in what it can accept -- arbitrary expressions will not work. Guido preferred this because of a gut feeling [17].

The current syntax also allows decorator declarations to call a function that returns a decorator:

@decomaker(argA, argB, ...)
def func(arg1, arg2, ...):
    pass

This is equivalent to:

func = decomaker(argA, argB, ...)(func)

The rationale for having a function that returns a decorator is that the part after the @ sign can be considered to be an expression (though syntactically restricted to just a function), and whatever that expression returns is called. See declaration arguments [16].
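A minimal sketch of such a decorator factory (decomaker is an illustrative name, following the PEP's example): the call after @ is evaluated first, and whatever it returns is applied to the function.

```python
def decomaker(tag):
    # decomaker(...) runs first and returns the actual decorator.
    def decorator(f):
        def wrapped(*args, **kwargs):
            # Illustrative behavior: tag the result of the call.
            return (tag, f(*args, **kwargs))
        return wrapped
    return decorator

@decomaker("traced")
def add(a, b):
    return a + b

print(add(2, 3))  # ('traced', 5)
```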

Syntax Alternatives

There have been a large number [18] of different syntaxes proposed -- rather than attempting to work through these individual syntaxes, it's worthwhile to break the syntax discussion down into a number of areas. Attempting to discuss each possible syntax [19] individually would be an act of madness, and produce a completely unwieldy PEP.

Decorator Location

The first syntax point is the location of the decorators. For the following examples, we use the @syntax used in 2.4a2.

Decorators before the def statement are the first alternative, and the syntax used in 2.4a2:

@classmethod
def foo(arg1,arg2):
    pass

@accepts(int,int)
@returns(float)
def bar(low,high):
    pass

There have been a number of objections raised to this location -- the primary one is that it's the first real Python case where a line of code has an effect on a following line. The syntax available in 2.4a3 requires one decorator per line (in a2, multiple decorators could be specified on the same line), and the final decision for 2.4 final stayed one decorator per line.

People also complained that the syntax quickly got unwieldy when multiple decorators were used. The point was made, though, that the chances of a large number of decorators being used on a single function were small and thus this was not a large worry.

Some of the advantages of this form are that the decorators live outside the method body -- they are obviously executed at the time the function is defined.

Another advantage is that a prefix to the function definition fits the idea of knowing about a change to the semantics of the code before the code itself, thus you know how to interpret the code's semantics properly without having to go back and change your initial perceptions if the syntax did not come before the function definition.

Guido decided he preferred [20] having the decorators on the line before the 'def', because it was felt that a long argument list would mean that the decorators would be 'hidden'.

The second form is the decorators between the def and the function name, or the function name and the argument list:

def @classmethod foo(arg1,arg2):
    pass

def @accepts(int,int),@returns(float) bar(low,high):
    pass

def foo @classmethod (arg1,arg2):
    pass

def bar @accepts(int,int),@returns(float) (low,high):
    pass

There are a couple of objections to this form. The first is that it easily breaks the 'greppability' of the source -- you can no longer search for 'def foo(' and find the definition of the function. The second, more serious, objection is that in the case of multiple decorators, the syntax would be extremely unwieldy.

The next form, which has had a number of strong proponents, is to have the decorators between the argument list and the trailing : in the 'def' line:

def foo(arg1,arg2) @classmethod:
    pass

def bar(low,high) @accepts(int,int),@returns(float):
    pass

Guido summarized the arguments [13] against this form (many of which also apply to the previous form) as:

  • it hides crucial information (e.g. that it is a static method) after the signature, where it is easily missed
  • it's easy to miss the transition between a long argument list and a long decorator list
  • it's cumbersome to cut and paste a decorator list for reuse, because it starts and ends in the middle of a line

The next form is that the decorator syntax goes inside the method body at the start, in the same place that docstrings currently live:

def foo(arg1,arg2):
    @classmethod
    pass

def bar(low,high):
    @accepts(int,int)
    @returns(float)
    pass

The primary objection to this form is that it requires "peeking inside" the method body to determine the decorators. In addition, even though the code is inside the method body, it is not executed when the method is run. Guido felt that docstrings were not a good counter-example, and that it was quite possible that a 'docstring' decorator could help move the docstring to outside the function body.

The final form is a new block that encloses the method's code. For this example, we'll use a 'decorate' keyword, as it makes no sense with the @syntax.

decorate:
    classmethod
    def foo(arg1,arg2):
        pass

decorate:
    accepts(int,int)
    returns(float)
    def bar(low,high):
        pass

This form would result in inconsistent indentation for decorated and undecorated methods. In addition, a decorated method's body would start three indent levels in.

Syntax forms

  • @decorator:

    @classmethod
    def foo(arg1,arg2):
        pass
    
    @accepts(int,int)
    @returns(float)
    def bar(low,high):
        pass
    

    The major objections against this syntax are that the @ symbol is not currently used in Python (and is used in both IPython and Leo), and that the @ symbol is not meaningful. Another objection is that this "wastes" a currently unused character (from a limited set) on something that is not perceived as a major use.

  • |decorator:

    |classmethod
    def foo(arg1,arg2):
        pass
    
    |accepts(int,int)
    |returns(float)
    def bar(low,high):
        pass
    

    This is a variant on the @decorator syntax -- it has the advantage that it does not break IPython and Leo. Its major disadvantage compared to the @syntax is that the | symbol looks like both a capital I and a lowercase l.

  • list syntax:

    [classmethod]
    def foo(arg1,arg2):
        pass
    
    [accepts(int,int), returns(float)]
    def bar(low,high):
        pass
    

    The major objection to the list syntax is that it's currently meaningful (when used in the form before the method). It's also lacking any indication that the expression is a decorator.

  • list syntax using other brackets (<...>, [[...]], ...):

    <classmethod>
    def foo(arg1,arg2):
        pass
    
    <accepts(int,int), returns(float)>
    def bar(low,high):
        pass
    

    None of these alternatives gained much traction. The alternatives which involve square brackets only serve to make it obvious that the decorator construct is not a list. They do nothing to make parsing any easier. The '<...>' alternative presents parsing problems because '<' and '>' already parse as un-paired. They present a further parsing ambiguity because a right angle bracket might be a greater than symbol instead of a closer for the decorators.

  • decorate()

    The decorate() proposal was that no new syntax be implemented -- instead a magic function would use introspection to manipulate the following function. Both Jp Calderone and Philip Eby produced implementations of functions that did this. Guido was pretty firmly against this -- with no new syntax, the magicness of a function like this is extremely high:

    Using functions with "action-at-a-distance" through sys.settraceback may be okay for an obscure feature that can't be had any other way yet doesn't merit changes to the language, but that's not the situation for decorators. The widely held view here is that decorators need to be added as a syntactic feature to avoid the problems with the postfix notation used in 2.2 and 2.3. Decorators are slated to be an important new language feature and their design needs to be forward-looking, not constrained by what can be implemented in 2.3.

  • new keyword (and block)

    This idea was the consensus alternate from comp.lang.python (more on this in Community Consensus below.) Robert Brewer wrote up a detailed J2 proposal [21] document outlining the arguments in favor of this form. The initial issues with this form are:

    • It requires a new keyword, and therefore a from __future__ import decorators statement.
    • The choice of keyword is contentious. However, using emerged as the consensus choice, and is used in the proposal and implementation.
    • The keyword/block form produces something that looks like a normal code block, but isn't. Attempts to use statements in this block will cause a syntax error, which may confuse users.

    A few days later, Guido rejected the proposal [22] on two main grounds, firstly:

    ... the syntactic form of an indented block strongly suggests that its contents should be a sequence of statements, but in fact it is not -- only expressions are allowed, and there is an implicit "collecting" of these expressions going on until they can be applied to the subsequent function definition. ...

    and secondly:

    ... the keyword starting the line that heads a block draws a lot of attention to it. This is true for "if", "while", "for", "try", "def" and "class". But the "using" keyword (or any other keyword in its place) doesn't deserve that attention; the emphasis should be on the decorator or decorators inside the suite, since those are the important modifiers to the function definition that follows. ...

    Readers are invited to read the full response [22].

  • Other forms

    There are plenty of other variants and proposals on the wiki page [18].
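The "postfix notation used in 2.2 and 2.3" that Guido refers to -- wrapping the function after its body -- still runs today, and illustrates the problem every alternative above tries to solve (class and names here are illustrative):

```python
class MathUtil:
    # Python 2.2/2.3 style: the wrapping call sits after the body,
    # far from the 'def' line whose semantics it changes.
    def double(n):
        return 2 * n
    double = staticmethod(double)

print(MathUtil.double(21))  # 42
```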

Why @?

There is some history in Java using @ initially as a marker in Javadoc comments [23] and later in Java 1.5 for annotations [10], which are similar to Python decorators. The fact that @ was previously unused as a token in Python also means it's clear there is no possibility of such code being parsed by an earlier version of Python, leading to possibly subtle semantic bugs. It also means that ambiguity of what is a decorator and what isn't is removed. That said, @ is still a fairly arbitrary choice. Some have suggested using | instead.

For syntax options which use a list-like syntax (no matter where it appears) to specify the decorators a few alternatives were proposed: [|...|], *[...]*, and <...>.

Current Implementation, History

Guido asked for a volunteer to implement his preferred syntax, and Mark Russell stepped up and posted a patch [24] to SF. This new syntax was available in 2.4a2.

@dec2
@dec1
def func(arg1, arg2, ...):
    pass

This is equivalent to:

def func(arg1, arg2, ...):
    pass
func = dec2(dec1(func))

though without the intermediate creation of a variable named func.

The version implemented in 2.4a2 allowed multiple @decorator clauses on a single line. In 2.4a3, this was tightened up to only allowing one decorator per line.

A previous patch [25] from Michael Hudson which implements the list-after-def syntax is also still kicking around.

After 2.4a2 was released, in response to community reaction, Guido stated that he'd re-examine a community proposal, if the community could come up with a community consensus, a decent proposal, and an implementation. After an amazing number of posts, collecting a vast number of alternatives in the Python wiki [18], a community consensus emerged (below). Guido subsequently rejected [22] this alternate form, but added:

In Python 2.4a3 (to be released this Thursday), everything remains as currently in CVS. For 2.4b1, I will consider a change of @ to some other single character, even though I think that @ has the advantage of being the same character used by a similar feature in Java. It's been argued that it's not quite the same, since @ in Java is used for attributes that don't change semantics. But Python's dynamic nature makes that its syntactic elements never mean quite the same thing as similar constructs in other languages, and there is definitely significant overlap. Regarding the impact on 3rd party tools: IPython's author doesn't think there's going to be much impact; Leo's author has said that Leo will survive (although it will cause him and his users some transitional pain). I actually expect that picking a character that's already used elsewhere in Python's syntax might be harder for external tools to adapt to, since parsing will have to be more subtle in that case. But I'm frankly undecided, so there's some wiggle room here. I don't want to consider further syntactic alternatives at this point: the buck has to stop at some point, everyone has had their say, and the show must go on.

Community Consensus

This section documents the rejected J2 syntax, and is included for historical completeness.

The consensus that emerged on comp.lang.python was the proposed J2 syntax (the "J2" was how it was referenced on the PythonDecorators wiki page): the new keyword using prefixing a block of decorators before the def statement. For example:

using:
    classmethod
    synchronized(lock)
def func(cls):
    pass

The main arguments for this syntax fall under the "readability counts" doctrine. In brief, they are:

  • A suite is better than multiple @lines. The using keyword and block transforms the single-block def statement into a multiple-block compound construct, akin to try/finally and others.
  • A keyword is better than punctuation for a new token. A keyword matches the existing use of tokens. No new token category is necessary. A keyword distinguishes Python decorators from Java annotations and .Net attributes, which are significantly different beasts.

Robert Brewer wrote a detailed proposal [21] for this form, and Michael Sparks produced a patch [26].

As noted previously, Guido rejected this form, outlining his problems with it in a message [22] to python-dev and comp.lang.python.

Examples

Much of the discussion on comp.lang.python and the python-dev mailing list focuses on the use of decorators as a cleaner way to use the staticmethod() and classmethod() builtins. This capability is much more powerful than that. This section presents some examples of use.

  1. Define a function to be executed at exit. Note that the function isn't actually "wrapped" in the usual sense.

    def onexit(f):
        import atexit
        atexit.register(f)
        return f
    
    @onexit
    def func():
        ...
    

    Note that this example is probably not suitable for real usage, but is for example purposes only.

  2. Define a class with a singleton instance. Note that once the class disappears enterprising programmers would have to be more creative to create more instances. (From Shane Hathaway on python-dev.)

    def singleton(cls):
        instances = {}
        def getinstance():
            if cls not in instances:
                instances[cls] = cls()
            return instances[cls]
        return getinstance
    
    @singleton
    class MyClass:
        ...
    
  3. Add attributes to a function. (Based on an example posted by Anders Munch on python-dev.)

    def attrs(**kwds):
        def decorate(f):
            for k in kwds:
                setattr(f, k, kwds[k])
            return f
        return decorate
    
    @attrs(versionadded="2.2",
           author="Guido van Rossum")
    def mymethod(f):
        ...
    
  4. Enforce function argument and return types. Note that this copies the func_name attribute from the old to the new function. func_name was made writable in Python 2.4a3:

    def accepts(*types):
        def check_accepts(f):
            assert len(types) == f.func_code.co_argcount
            def new_f(*args, **kwds):
                for (a, t) in zip(args, types):
                    assert isinstance(a, t), \
                           "arg %r does not match %s" % (a,t)
                return f(*args, **kwds)
            new_f.func_name = f.func_name
            return new_f
        return check_accepts
    
    def returns(rtype):
        def check_returns(f):
            def new_f(*args, **kwds):
                result = f(*args, **kwds)
                assert isinstance(result, rtype), \
                       "return value %r does not match %s" % (result,rtype)
                return result
            new_f.func_name = f.func_name
            return new_f
        return check_returns
    
    @accepts(int, (int,float))
    @returns((int,float))
    def func(arg1, arg2):
        return arg1 * arg2
    
  5. Declare that a class implements a particular (set of) interface(s). This is from a posting by Bob Ippolito on python-dev based on experience with PyProtocols [27].

    def provides(*interfaces):
         """
         An actual, working, implementation of provides for
         the current implementation of PyProtocols.  Not
         particularly important for the PEP text.
         """
         def provides(typ):
             declareImplementation(typ, instancesProvide=interfaces)
             return typ
         return provides
    
    class IBar(Interface):
         """Declare something about IBar here"""
    
    @provides(IBar)
    class Foo(object):
            """Implement something here..."""
    

Of course, all these examples are possible today, though without syntactic support.
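For reference, example 4's type-checking decorator translates directly to modern Python 3, where func_name and func_code are spelled __name__ and __code__ (a sketch; the function names are illustrative):

```python
def accepts(*types):
    # Same logic as example 4 above, with Python 3 attribute names.
    def check_accepts(f):
        assert len(types) == f.__code__.co_argcount
        def new_f(*args, **kwds):
            for (a, t) in zip(args, types):
                assert isinstance(a, t), \
                       "arg %r does not match %s" % (a, t)
            return f(*args, **kwds)
        new_f.__name__ = f.__name__   # was: new_f.func_name
        return new_f
    return check_accepts

@accepts(int, (int, float))
def scale(n, factor):
    return n * factor

print(scale(3, 2.0))  # 6.0
```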

(No longer) Open Issues

  1. It's not yet certain that class decorators will be incorporated into the language at a future point. Guido expressed skepticism about the concept, but various people have made some strong arguments [28] (search for PEP 318 -- posting draft) on their behalf in python-dev. It's exceedingly unlikely that class decorators will be in Python 2.4.

    PEP 3129 [1] proposes to add class decorators as of Python 2.6.

  2. The choice of the @ character will be re-examined before Python 2.4b1.

    In the end, the @ character was kept.

References

[1]PEP 3129, "Class Decorators", Winter http://www.python.org/dev/peps/pep-3129
[2]http://www.python.org/doc/essays/ppt/python10/py10keynote.pdf
[3]http://www.python.org/workshops/2002-02/
[4]http://mail.python.org/pipermail/python-dev/2002-February/020005.html
[5]http://mail.python.org/pipermail/python-dev/2002-February/020017.html
[6]http://groups.google.com/groups?hl=en&lr=&ie=UTF-8&oe=UTF-8&selm=slrna40k88.2h9o.Gareth.McCaughan%40g.local
[7]http://www.python.org/doc/essays/ppt/euro2004/euro2004.pdf
[8]http://mail.python.org/pipermail/python-dev/2004-August/author.html
[9]http://mail.python.org/pipermail/python-dev/2004-August/046672.html
[10](1, 2) http://java.sun.com/j2se/1.5.0/docs/guide/language/annotations.html
[11]http://patterndigest.com/patterns/Decorator.html
[12]http://groups.google.com/groups?hl=en&lr=&ie=UTF-8&oe=UTF-8&selm=mailman.1010809396.32158.python-list%40python.org
[13](1, 2) http://mail.python.org/pipermail/python-dev/2004-August/047112.html
[14]http://www.amk.ca/diary/archives/cat_python.html#003255
[15]http://mail.python.org/pipermail/python-dev/2004-April/044132.html
[16](1, 2) http://mail.python.org/pipermail/python-dev/2004-September/048874.html
[17]http://mail.python.org/pipermail/python-dev/2004-August/046711.html
[18](1, 2, 3) http://www.python.org/moin/PythonDecorators
[19]http://ucsu.colorado.edu/~bethard/py/decorators-output.py
[20]http://mail.python.org/pipermail/python-dev/2004-March/043756.html
[21](1, 2) http://www.aminus.org/rbre/python/pydec.html
[22](1, 2, 3, 4) http://mail.python.org/pipermail/python-dev/2004-September/048518.html
[23]http://java.sun.com/j2se/javadoc/writingdoccomments/
[24]http://www.python.org/sf/979728
[25]http://starship.python.net/crew/mwh/hacks/meth-syntax-sugar-3.diff
[26]http://www.python.org/sf/1013835
[27]http://peak.telecommunity.com/PyProtocols.html
[28]http://mail.python.org/pipermail/python-dev/2004-March/thread.html

pep-0319 Python Synchronize/Asynchronize Block

PEP: 319
Title: Python Synchronize/Asynchronize Block
Version: $Revision$
Last-Modified: $Date$
Author: Michel Pelletier <michel at users.sourceforge.net>
Status: Rejected
Type: Standards Track
Created: 24-Feb-2003
Python-Version: 2.4?
Post-History: 

Abstract

    This PEP proposes adding two new keywords to Python, `synchronize'
    and `asynchronize'.

Pronouncement

    This PEP is rejected in favor of PEP 343.

The `synchronize' Keyword

    The concept of code synchronization in Python is too low-level.
    To synchronize code a programmer must be aware of the details of
    the following pseudo-code pattern:

        initialize_lock()

        ...

        acquire_lock()
        try:
            change_shared_data()
        finally:
            release_lock()

    This synchronized block pattern is not the only pattern (more
    discussed below) but it is very common.  This PEP proposes
    replacing the above code with the following equivalent:

        synchronize:
            change_shared_data()

    The advantages of this scheme are simpler syntax and less room for
    user error.  Currently users are required to write code that
    acquires and releases thread locks in 'try/finally' blocks;
    errors in this code can cause notoriously difficult concurrent
    thread locking issues.
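For reference, the pattern this PEP targets is expressed today with the 'with' statement from PEP 343, in whose favor this PEP was rejected (a minimal sketch; names are illustrative):

```python
import threading

lock = threading.Lock()
shared_data = []

def change_shared_data():
    # 'with lock:' replaces the acquire/try/finally boilerplate the
    # PEP describes: the lock is released even if the body raises.
    with lock:
        shared_data.append("changed")

change_shared_data()
print(shared_data)  # ['changed']
```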


The `asynchronize' Keyword

    While executing a `synchronize' block of code a programmer may
    want to "drop back" to running asynchronously momentarily to run
    blocking input/output routines or something else that might take an
    indeterminate amount of time and does not require synchronization.
    This code usually follows the pattern:

        initialize_lock()

        ...

        acquire_lock()
        try:    
            change_shared_data()
            release_lock()             # become async
            do_blocking_io()
            acquire_lock()             # sync again
            change_shared_data2()

        finally:
            release_lock()

    The asynchronous section of the code is not very obvious visually,
    so it is marked up with comments.  Using the proposed
    'asynchronize' keyword this code becomes much cleaner, easier to
    understand, and less prone to error:

        synchronize:
            change_shared_data()

            asynchronize:
               do_blocking_io()

            change_shared_data2()
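For comparison, the manual release/re-acquire pattern above can be written today with threading.Lock (a sketch; the data and helper names are illustrative):

```python
import threading

lock = threading.Lock()
log = []

def do_blocking_io():
    log.append("io")    # stands in for real blocking I/O

def worker():
    # Manual rendering of the synchronize/asynchronize pattern:
    # the lock is dropped around the blocking call, then re-acquired.
    lock.acquire()
    try:
        log.append("change1")   # change_shared_data()
        lock.release()          # 'asynchronize': become async
        do_blocking_io()
        lock.acquire()          # sync again
        log.append("change2")   # change_shared_data2()
    finally:
        lock.release()

worker()
print(log)  # ['change1', 'io', 'change2']
```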

    Encountering an `asynchronize' keyword inside a non-synchronized
    block can either raise an error or issue a warning (as all code
    blocks are implicitly asynchronous anyway).  It is important to
    note that the above example is *not* the same as:

        synchronize:
            change_shared_data()

        do_blocking_io()

        synchronize:
            change_shared_data2()

    This is because both synchronized blocks of code may be running
    inside the same iteration of a loop.  Consider:

        while in_main_loop():
            synchronize:
                change_shared_data()

                asynchronize:
                   do_blocking_io()

                change_shared_data2()

    Many threads may be looping through this code.  Without the
    'asynchronize' keyword one thread cannot stay in the loop and
    release the lock at the same time while blocking IO is going on.
    This pattern of releasing locks inside a main loop to do blocking
    IO is used extensively inside the CPython interpreter itself.


Synchronization Targets

    As proposed the `synchronize' and `asynchronize' keywords
    synchronize a block of code.  However programmers may want to
    specify a target object that threads synchronize on.  Any object
    can be a synchronization target.

    Consider a two-way queue object: two different objects are used
    by the same `synchronize' code blocks to synchronize both queues
    separately in the 'put' and 'get' methods:

        class TwoWayQueue:
            def __init__(self):
                self.front = []
                self.rear = []

            def putFront(self, item):
                self.put(item, self.front)

            def getFront(self):
                item = self.get(self.front)
                return item

            def putRear(self, item):
                self.put(item, self.rear)

            def getRear(self):
                item = self.get(self.rear)
                return item

            def put(self, item, queue):
                synchronize queue:
                    queue.append(item)

            def get(self, queue):
                synchronize queue:
                    item = queue[0]
                    del queue[0]
                    return item

    Here is the equivalent code in Python as it is now without a
    `synchronize' keyword:

        import thread

        class LockableQueue:

            def __init__(self):
                self.queue = []
                self.lock = thread.allocate_lock()

        class TwoWayQueue:
            def __init__(self):
                self.front = LockableQueue()
                self.rear = LockableQueue()

            def putFront(self, item):
                self.put(item, self.front)

            def getFront(self):
                item = self.get(self.front)
                return item

            def putRear(self, item):
                self.put(item, self.rear)

            def getRear(self):
                item = self.get(self.rear)
                return item

            def put(self, item, queue):
                queue.lock.acquire()
                try:
                    queue.append(item)
                finally:
                    queue.lock.release()

            def get(self, queue):
                queue.lock.acquire()
                try:
                    item = queue[0]
                    del queue[0]
                    return item
                finally:
                    queue.lock.release()

    The last example had to define an extra class to associate a lock
    with the queue, whereas in the first example the `synchronize'
    keyword does this association internally and transparently.
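Today the same per-object association is written with threading.Lock and the 'with' statement; a condensed sketch of the LockableQueue idea above:

```python
import threading

class LockableQueue:
    # The lock lives on the object, and 'with' replaces the
    # acquire/try/finally boilerplate of the manual version.
    def __init__(self):
        self.queue = []
        self.lock = threading.Lock()

    def put(self, item):
        with self.lock:
            self.queue.append(item)

    def get(self):
        with self.lock:
            return self.queue.pop(0)

q = LockableQueue()
q.put("a")
q.put("b")
print(q.get())  # a
```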


Other Patterns that Synchronize

    There are some situations where the `synchronize' and
    `asynchronize' keywords cannot entirely replace the use of lock
    methods like `acquire' and `release'.  Some examples are if the
    programmer wants to provide arguments for `acquire' or if a lock
    is acquired in one code block but released in another, as shown
    below.

    Here is a class from Zope modified to use both the `synchronize'
    and `asynchronize' keywords and also uses a pool of explicit locks
    that are acquired and released in different code blocks and thus
    don't use `synchronize':

        import thread
        from ZServerPublisher import ZServerPublisher

        class ZRendevous:

            def __init__(self, n=1):
                pool=[]
                self._lists=pool, [], []

                synchronize:
                    while n > 0:
                        l=thread.allocate_lock()
                        l.acquire()
                        pool.append(l)
                        thread.start_new_thread(ZServerPublisher,
                                                (self.accept,))
                        n=n-1

            def accept(self):
                synchronize:
                    pool, requests, ready = self._lists
                    while not requests:
                        l=pool[-1]
                        del pool[-1]
                        ready.append(l)

                        asynchronize:
                            l.acquire()

                        pool.append(l)

                    r=requests[0]
                    del requests[0]
                    return r

            def handle(self, name, request, response):
                synchronize:
                    pool, requests, ready = self._lists
                    requests.append((name, request, response))
                    if ready:
                        l=ready[-1]
                        del ready[-1]
                        l.release()

    Here is the original class as found in the
    'Zope/ZServer/PubCore/ZRendevous.py' module.  The "convenience" of
    the '_a' and '_r' shortcut names obscures the code:

        import thread
        from ZServerPublisher import ZServerPublisher

        class ZRendevous:

            def __init__(self, n=1):
                sync=thread.allocate_lock()
                self._a=sync.acquire
                self._r=sync.release
                pool=[]
                self._lists=pool, [], []
                self._a()
                try:
                    while n > 0:
                        l=thread.allocate_lock()
                        l.acquire()
                        pool.append(l)
                        thread.start_new_thread(ZServerPublisher,
                                                (self.accept,))
                        n=n-1
                finally: self._r()

            def accept(self):
                self._a()
                try:
                    pool, requests, ready = self._lists
                    while not requests:
                        l=pool[-1]
                        del pool[-1]
                        ready.append(l)
                        self._r()
                        l.acquire()
                        self._a()
                        pool.append(l)

                    r=requests[0]
                    del requests[0]
                    return r
                finally: self._r()

            def handle(self, name, request, response):
                self._a()
                try:
                    pool, requests, ready = self._lists
                    requests.append((name, request, response))
                    if ready:
                        l=ready[-1]
                        del ready[-1]
                        l.release()
                finally: self._r()

    In particular the asynchronize section of the `accept' method is
    not very obvious.  To beginner programmers, `synchronize' and
    `asynchronize' remove many of the problems encountered when
    juggling multiple `acquire' and `release' methods on different
    locks in different `try/finally' blocks.


Formal Syntax

    Python syntax is defined in a modified BNF grammar notation
    described in the Python Language Reference [1].  This section
    describes the proposed synchronization syntax using this grammar:

        synchronize_stmt: 'synchronize' [test] ':' suite
        asynchronize_stmt: 'asynchronize' [test] ':' suite
        compound_stmt: ... | synchronize_stmt | asynchronize_stmt
        
    (The '...' indicates other compound statements elided).


Proposed Implementation

    The author of this PEP has not explored an implementation yet.
    There are several implementation issues that must be resolved.
    The main implementation issue is what exactly gets locked and
    unlocked during a synchronized block.

    During an unqualified synchronized block (the use of the
    `synchronize' keyword without a target argument) a lock could be
    created and associated with the synchronized code block object.
    Any threads that are to execute the block must first acquire the
    code block lock.

    When an `asynchronize' keyword is encountered in a `synchronize'
    block the code block lock is unlocked before the inner block is
    executed and re-locked when the inner block terminates.

    When a synchronized block target is specified the object is
    associated with a lock.  How this is implemented cleanly is
    probably the highest risk of this proposal.  Java Virtual Machines
    typically associate a special hidden lock object with the target
    object and use it to synchronize the block around the target
    only.
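    The locking and unlocking described above can be approximated today
    with explicit lock objects.  The sketch below is hypothetical: the
    names `_block_lock', `append_item' and `unlocked_work' are inventions
    for illustration, not part of the proposal.

```python
import threading

# Hypothetical stand-in for the hidden per-code-block lock described above.
_block_lock = threading.RLock()
shared = []

def unlocked_work():
    pass  # stands in for code that must not hold the lock

def append_item(item):
    _block_lock.acquire()          # entering the `synchronize:' block
    try:
        shared.append(item)
        _block_lock.release()      # entering the nested `asynchronize:' block
        try:
            unlocked_work()
        finally:
            _block_lock.acquire()  # re-lock when the inner block terminates
    finally:
        _block_lock.release()      # leaving the `synchronize:' block

append_item(1)
assert shared == [1]
```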


Backward Compatibility

    Backward compatibility is solved with the new `from __future__'
    Python syntax [2] and the new warning framework [3], which let the
    Python language phase out any conflicting names that use the new
    keywords `synchronize' and `asynchronize'.  To use the syntax now,
    a developer could use the statement:

        from __future__ import threadsync  # or whatever

    In addition, any code that uses the keyword `synchronize' or
    `asynchronize' as an identifier will be issued a warning from
    Python.  After the appropriate period of time, the syntax would
    become standard, the above import statement would do nothing, and
    any identifiers named `synchronize' or `asynchronize' would raise
    an exception.


PEP 310 Reliable Acquisition/Release Pairs

    PEP 310 [4] proposes the 'with' keyword that can serve the same
    function as 'synchronize' (but no facility for 'asynchronize').
    The pattern:

        initialize_lock()

        with the_lock:
            change_shared_data()

    is equivalent to the proposed:

        synchronize the_lock:
            change_shared_data()

    PEP 310 must synchronize on an existing lock, while this PEP
    proposes that unqualified 'synchronize' statements synchronize on
    a global, internal, transparent lock in addition to qualified
    'synchronize' statements.  The 'with' statement also requires lock
    initialization, while the 'synchronize' statement can synchronize
    on any target object *including* locks.
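    In today's Python the PEP 310 pattern does work directly on lock
    objects, because the `with' statement that was eventually adopted
    (PEP 343) uses the context-manager protocol that locks implement.  A
    minimal sketch:

```python
import threading

the_lock = threading.Lock()   # the explicit lock initialization 'with' needs
shared_data = []

def change_shared_data():
    shared_data.append(42)

with the_lock:                # acquired on entry, released on exit
    change_shared_data()

assert shared_data == [42]
```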

    While limited in this fashion, the 'with' statement is more
    abstract and serves more purposes than synchronization.  For
    example, transactions could be used with the 'with' keyword:

        initialize_transaction()

        with my_transaction:
            do_in_transaction()

        # when the block terminates, the transaction is committed.

    The 'synchronize' and 'asynchronize' keywords cannot serve this or
    any other general acquire/release pattern other than thread
    synchronization.


How Java Does It

    Java defines a 'synchronized' keyword (note the grammatical tense
    difference between the Java keyword and this PEP's 'synchronize')
    which must be qualified on any object.  The syntax is:

        synchronized (Expression) Block 

    Expression must yield a valid object (null raises an error and
    exceptions during 'Expression' terminate the 'synchronized' block
    for the same reason) upon which 'Block' is synchronized.


How Jython Does It

    Jython uses a 'synchronize' class with the static method
    'make_synchronized' that accepts one callable argument and returns
    a newly created, synchronized, callable "wrapper" around the
    argument.
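    A pure-Python analogue of that wrapper (an assumption for
    illustration only; this is not Jython's actual implementation) might
    look like:

```python
import threading

def make_synchronized(func):
    # Serialize all calls to `func' behind one reentrant lock, in the
    # spirit of Jython's synchronize.make_synchronized.
    lock = threading.RLock()
    def wrapper(*args, **kwargs):
        lock.acquire()
        try:
            return func(*args, **kwargs)
        finally:
            lock.release()
    return wrapper

counter = [0]

def bump():
    counter[0] += 1

bump = make_synchronized(bump)
bump()
bump()
assert counter[0] == 2
```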


Summary of Proposed Changes to Python

    Adding new `synchronize' and `asynchronize' keywords to the
    language.


Risks

    This PEP proposes adding two keywords to the Python language. This
    may break code.

    There is no implementation to test.

    It's not the most important problem facing Python programmers
    today (although it is a fairly notorious one).

    The equivalent Java keyword is the past participle 'synchronized'.
    This PEP proposes the present tense, 'synchronize' as being more
    in spirit with Python (there being less distinction between
    compile-time and run-time in Python than Java).


Dissenting Opinion

    This PEP has not been discussed on python-dev.
        

References

    [1] The Python Language Reference
        http://docs.python.org/reference/

    [2] PEP 236, Back to the __future__, Peters
        http://www.python.org/dev/peps/pep-0236/

    [3] PEP 230, Warning Framework, van Rossum
        http://www.python.org/dev/peps/pep-0230/

    [4] PEP 310, Reliable Acquisition/Release Pairs, Hudson, Moore
        http://www.python.org/dev/peps/pep-0310/


Copyright

    This document has been placed in the public domain.



pep-0320 Python 2.4 Release Schedule

PEP: 320
Title: Python 2.4 Release Schedule
Version: $Revision$
Last-Modified: $Date$
Author: Barry Warsaw, Raymond Hettinger, Anthony Baxter
Status: Final
Type: Informational
Created: 29-Jul-2003
Python-Version: 2.4
Post-History: 1-Dec-2004

Abstract

    This document describes the development and release schedule for
    Python 2.4.  The schedule primarily concerns itself with PEP-sized
    items.  Small features may be added up to and including the first
    beta release.  Bugs may be fixed until the final release.

    There will be at least two alpha releases, two beta releases, and
    one release candidate.  The release date was 30th November, 2004.

Release Manager

    Anthony Baxter

    Martin von Lowis is building the Windows installers, Fred the 
    doc packages, Sean the RPMs.

Release Schedule

    July 9: alpha 1 [completed]

    August 5/6: alpha 2 [completed]

    Sept 3: alpha 3 [completed]

    October 15: beta 1 [completed]

    November 3: beta 2 [completed]

    November 18: release candidate 1 [completed]

    November 30: final [completed]

Completed features for 2.4

    PEP 218 Builtin Set Objects.

    PEP 289 Generator expressions.

    PEP 292 Simpler String Substitutions to be implemented as a module.

    PEP 318: Function/method decorator syntax, using @syntax

    PEP 322 Reverse Iteration.

    PEP 327: A Decimal package for fixed precision arithmetic.

    PEP 328: Multi-line Imports

    Encapsulate the decorate-sort-undecorate pattern in a keyword for
    list.sort().

    Added a builtin called sorted() which may be used in expressions.

    The itertools module has two new functions, tee() and groupby().
    
    Add a collections module with a deque() object.

    Add two statistical/reduction functions, nlargest() and nsmallest()
    to the heapq module.

    Python's Windows installer now uses MSI.
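    Several of the library additions above can be exercised together;
    the snippet below is illustrative only and uses nothing beyond what
    2.4 shipped:

```python
from collections import deque
from heapq import nlargest, nsmallest
from itertools import groupby, tee

# PEP 289: generator expressions
assert sum(x * x for x in range(4)) == 14

# sorted() plus the key= argument (decorate-sort-undecorate, built in)
words = ['pear', 'Apple', 'fig', 'banana']
assert sorted(words, key=str.lower) == ['Apple', 'banana', 'fig', 'pear']

# collections.deque: cheap appends and pops at both ends
d = deque([1, 2, 3])
d.appendleft(0)
assert list(d) == [0, 1, 2, 3]

# heapq.nlargest()/nsmallest()
assert nlargest(2, [5, 1, 9, 3]) == [9, 5]
assert nsmallest(2, [5, 1, 9, 3]) == [1, 3]

# itertools.tee() and groupby()
a, b = tee(iter('abc'))
assert list(a) == list(b) == ['a', 'b', 'c']
assert [k for k, g in groupby([1, 1, 2, 2, 3])] == [1, 2, 3]
```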

Deferred until 2.5:

    - Deprecate and/or remove the modules listed in PEP 4 (posixfile,
      gopherlib, pre, others)

    - Remove support for platforms as described in PEP 11.

    - Finish implementing the Distutils bdist_dpkg command.  (AMK)

    - Add support for reading shadow passwords (www.python.org/sf/579435)

    - It would be nice if the built-in SSL socket type could be used
      for non-blocking SSL I/O.  Currently packages such as Twisted 
      which implement async servers using SSL have to require third-party
      packages such as pyopenssl.  

    - AST-based compiler: this branch was not completed in time for 
      2.4, but will land on the trunk some time after 2.4 final is 
      out, for inclusion in 2.5.

    - reST is going to be used a lot in Zope3.  Maybe it could become
      a standard library module?  (Since reST's author thinks it's too
      unstable, I'm inclined not to do this.)


Ongoing tasks

    The following are ongoing TO-DO items which we should attempt to
    work on without hoping for completion by any particular date.

    - Documentation: complete the distribution and installation
      manuals.

    - Documentation: complete the documentation for new-style
      classes.

    - Look over the Demos/ directory and update where required (Andrew
      Kuchling has done a lot of this)

    - New tests.

    - Fix doc bugs on SF.

    - Remove use of deprecated features in the core.

    - Document deprecated features appropriately.

    - Mark deprecated C APIs with Py_DEPRECATED.

    - Deprecate modules which are unmaintained, or perhaps make a new
      category for modules 'Unmaintained'

    - In general, lots of cleanup so it is easier to move forward.


Open issues

    None at this time.


Carryover features from Python 2.3

    - The import lock could use some redesign.  (SF 683658.)

    - A nicer API to open text files, replacing the ugly (in some
      people's eyes) "U" mode flag.  There's a proposal out there to
      have a new built-in type textfile(filename, mode, encoding).
      (Shouldn't it have a bufsize argument too?)

    - New widgets for Tkinter???

      Has anyone gotten the time for this?  *Are* there any new
      widgets in Tk 8.4?  Note that we've got better Tix support
      already (though not on Windows yet).

    - PEP 304 (Controlling Generation of Bytecode Files by Montanaro)
      seems to have lost steam.

    - For a class defined inside another class, the __name__ should be
      "outer.inner", and pickling should work.  (SF 633930.  I'm no
      longer certain this is easy or even right.)

    - Decide on a clearer deprecation policy (especially for modules)
      and act on it.  For a start, see this message from Neal Norwitz:
      http://mail.python.org/pipermail/python-dev/2002-April/023165.html
      There seems insufficient interest in moving this further in an
      organized fashion, and it's not particularly important.

    - Provide alternatives for common uses of the types module;
      Skip Montanaro has posted a proto-PEP for this idea:
      http://mail.python.org/pipermail/python-dev/2002-May/024346.html
      There hasn't been any progress on this, AFAICT.

    - Use pending deprecation for the types and string modules.  This
      requires providing alternatives for the parts that aren't
      covered yet (e.g. string.whitespace and types.TracebackType).
      It seems we can't get consensus on this.

    - PEP 262  Database of Installed Python Packages        Kuchling

      This turns out to be useful for Jack Jansen's Python installer,
      so the database is worth implementing.  Code will go in 
      sandbox/pep262.

    - PEP 269  Pgen Module for Python                       Riehl

      (Some necessary changes are in; the pgen module itself needs to
      mature more.)

    - PEP 266  Optimizing Global Variable/Attribute Access  Montanaro
      PEP 267  Optimized Access to Module Namespaces        Hylton
      PEP 280  Optimizing access to globals                 van Rossum

      These are basically three friendly competing proposals.  Jeremy
      has made a little progress with a new compiler, but it's going
      slowly and the compiler is only the first step.  Maybe we'll be
      able to refactor the compiler in this release.  I'm tempted to
      say we won't hold our breath. 

    - Lazily tracking tuples?
      http://mail.python.org/pipermail/python-dev/2002-May/023926.html
      http://www.python.org/sf/558745
      Not much enthusiasm I believe.

    - PEP 286  Enhanced Argument Tuples                     von Loewis

      I haven't had the time to review this thoroughly.  It seems a
      deep optimization hack (also makes better correctness guarantees
      though).

    - Make 'as' a keyword.  It has been a pseudo-keyword long enough.
      Too much effort to bother.


Copyright

    This document has been placed in the public domain.



pep-0321 Date/Time Parsing and Formatting

PEP:321
Title:Date/Time Parsing and Formatting
Version:$Revision$
Last-Modified:$Date$
Author:A.M. Kuchling <amk at amk.ca>
Status:Withdrawn
Type:Standards Track
Content-Type:text/x-rst
Created:16-Sep-2003
Python-Version:2.4
Post-History:

Abstract

Python 2.3 added a number of simple date and time types in the datetime module. There's no support for parsing strings in various formats and returning a corresponding instance of one of the types. This PEP proposes adding a family of predefined parsing functions for several commonly used date and time formats, and a facility for generic parsing.

The types provided by the datetime module all have .isoformat() and .ctime() methods that return string representations of a time, and the .strftime() method can be used to construct new formats. There are a number of additional commonly-used formats that would be useful to have as part of the standard library; this PEP also suggests how to add them.

Input Formats

Useful formats to support include:

  • ISO8601 [2]
  • ARPA/RFC2822 [1]
  • ctime [4]
  • Formats commonly written by humans such as the American "MM/DD/YYYY", the European "YYYY/MM/DD", and variants such as "DD-Month-YYYY".
  • CVS-style or tar-style dates ("tomorrow", "12 hours ago", etc.)

XXX The Perl ParseDate.pm [3] module supports many different input formats, both absolute and relative. Should we try to support them all?

Options:

  1. Add functions to the datetime module:

    import datetime
    d = datetime.parse_iso8601("2003-09-15T10:34:54")
    
  2. Add class methods to the various types. There are already various class methods such as .now(), so this would be pretty natural.:

    import datetime
    d = datetime.date.parse_iso8601("2003-09-15T10:34:54")
    
  3. Add a separate module (possible names: date, date_parse, parse_date) or subpackage (possible names: datetime.parser) containing parsing functions:

    import datetime
    d = datetime.parser.parse_iso8601("2003-09-15T10:34:54")
    

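One way option 3's parse_iso8601 might be sketched atop strptime (a hypothetical function handling only the exact "YYYY-MM-DDTHH:MM:SS" layout; a real implementation would cover ISO 8601's many optional parts):

```python
from datetime import datetime

def parse_iso8601(s):
    # Hypothetical sketch: accepts only the single fixed layout shown in
    # the examples above, with none of ISO 8601's variants.
    return datetime.strptime(s, "%Y-%m-%dT%H:%M:%S")

d = parse_iso8601("2003-09-15T10:34:54")
assert (d.year, d.month, d.day) == (2003, 9, 15)
assert (d.hour, d.minute, d.second) == (10, 34, 54)
```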
Unresolved questions:

  • Naming convention to use.
  • What exception to raise on errors? ValueError, or a specialized exception?
  • Should you know what type you're expecting, or should the parsing figure it out? (e.g. parse_iso8601("yyyy-mm-dd") returns a date instance, but parsing "yyyy-mm-ddThh:mm:ss" returns a datetime.) Should there be an option to signal an error if a time is provided where none is expected, or if no time is provided?
  • Anything special required for I18N? For time zones?

Generic Input Parsing

Is a strptime() implementation that returns datetime types sufficient?

XXX if yes, describe strptime here. Can the existing pure-Python implementation be easily retargeted?

Output Formats

Not all input formats need to be supported as output formats, because it's pretty trivial to get the strftime() argument right for simple things such as YYYY/MM/DD. Only complicated formats need to be supported; RFC2822 is currently the only one I can think of.

Options:

  1. Provide predefined format strings, so you could write this:

    import datetime
    d = datetime.datetime(...)
    print d.strftime(d.RFC2822_FORMAT) # or datetime.RFC2822_FORMAT?
    
  2. Provide new methods on all the objects:

    d = datetime.datetime(...)
    print d.rfc822_time()
    

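Option 1 could be sketched as below. RFC2822_FORMAT is a hypothetical constant; a real one would need proper time zone handling (as the email package's formatdate does) rather than a hard-coded offset:

```python
from datetime import datetime

# Hypothetical predefined format string; "+0000" is hard-coded only
# because naive datetimes carry no zone information.
RFC2822_FORMAT = "%a, %d %b %Y %H:%M:%S +0000"

d = datetime(2003, 9, 15, 10, 34, 54)
assert d.strftime(RFC2822_FORMAT) == "Mon, 15 Sep 2003 10:34:54 +0000"
```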
Relevant functionality in other languages includes the PHP date [5] function (Python implementation by Simon Willison at http://simon.incutio.com/archive/2003/10/07/dateInPython)

pep-0322 Reverse Iteration

PEP:322
Title:Reverse Iteration
Version:$Revision$
Last-Modified:$Date$
Author:Raymond Hettinger <python at rcn.com>
Status:Final
Type:Standards Track
Content-Type:text/x-rst
Created:24-Sep-2003
Python-Version:2.4
Post-History:24-Sep-2003

Abstract

This proposal is to add a builtin function to support reverse iteration over sequences.

Motivation

For indexable objects, current approaches for reverse iteration are error prone, unnatural, and not especially readable:

for i in xrange(n-1, -1, -1):
    print seqn[i]

One other current approach involves reversing a list before iterating over it. That technique wastes computer cycles, memory, and lines of code:

rseqn = list(seqn)
rseqn.reverse()
for value in rseqn:
    print value

Extended slicing is a third approach that minimizes the code overhead but does nothing for memory efficiency, beauty, or clarity.

Reverse iteration is much less common than forward iteration, but it does arise regularly in practice. See Real World Use Cases below.

Proposal

Add a builtin function called reversed() that makes a reverse iterator over sequence objects that support __getitem__() and __len__().

The above examples then simplify to:

for i in reversed(xrange(n)):
    print seqn[i]
for elem in reversed(seqn):
    print elem

The core idea is that the clearest, least error-prone way of specifying reverse iteration is to specify it in a forward direction and then say reversed.

The implementation could be as simple as:

def reversed(x):
    if hasattr(x, 'keys'):
        raise ValueError("mappings do not support reverse iteration")
    i = len(x)
    while i > 0:
        i -= 1
        yield x[i]

No language syntax changes are needed. The proposal is fully backwards compatible.

A C implementation and unit tests are at: http://www.python.org/sf/834422
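A quick sanity check of the pure-Python sketch (repeated here, shadowing the builtin, so the snippet is self-contained):

```python
def reversed(x):
    if hasattr(x, 'keys'):
        raise ValueError("mappings do not support reverse iteration")
    i = len(x)
    while i > 0:
        i -= 1
        yield x[i]

assert list(reversed([1, 2, 3])) == [3, 2, 1]
assert list(reversed('abc')) == ['c', 'b', 'a']
```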

BDFL Pronouncement

This PEP has been conditionally accepted for Py2.4. The condition means that if the function is found to be useless, it can be removed before Py2.4b1.

Alternative Method Names

  • reviter -- Jeremy Fincher's suggestion matches use of iter()
  • ireverse -- uses the itertools naming convention
  • inreverse -- no one seems to like this one except me

The name reverse is not a candidate because it duplicates the name of list.reverse(), which mutates the underlying list.

Discussion

The case against adoption of the PEP is a desire to keep the number of builtin functions small. This needs to be weighed against the simplicity and convenience of having it as a builtin instead of being tucked away in some other namespace.

Real World Use Cases

Here are some instances of reverse iteration taken from the standard library and comments on why reverse iteration was necessary:

  • atexit.exit_handlers() uses:

    while _exithandlers:
        func, targs, kargs = _exithandlers.pop()
            . . .
    

    In this application popping is required, so the new function would not help.

  • heapq.heapify() uses for i in xrange(n//2 - 1, -1, -1) because higher-level orderings are more easily formed from pairs of lower-level orderings. A forward version of this algorithm is possible; however, that would complicate the rest of the heap code which iterates over the underlying list in the opposite direction. The replacement code for i in reversed(xrange(n//2)) makes clear the range covered and how many iterations it takes.

  • mhlib.test() uses:

    testfolders.reverse();
    for t in testfolders:
        do('mh.deletefolder(%s)' % `t`)
    

    The need for reverse iteration arises because the tail of the underlying list is altered during iteration.

  • platform._dist_try_harder() uses for n in range(len(verfiles)-1,-1,-1) because the loop deletes selected elements from verfiles but needs to leave the rest of the list intact for further iteration.

  • random.shuffle() uses for i in xrange(len(x)-1, 0, -1) because the algorithm is most easily understood as randomly selecting elements from an ever diminishing pool. In fact, the algorithm can be run in a forward direction but is less intuitive and rarely presented that way in literature. The replacement code for i in reversed(xrange(1, len(x))) is much easier to verify visually.

  • rfc822.Message.__delitem__() uses:

    list.reverse()
    for i in list:
        del self.headers[i]
    

    The need for reverse iteration arises because the tail of the underlying list is altered during iteration.

Rejected Alternatives

Several variants were submitted that attempted to apply reversed() to all iterables by running the iterable to completion, saving the results, and then returning a reverse iterator over the results. While satisfying some notions of full generality, running the input to the end is contrary to the purpose of using iterators in the first place. Also, a small disaster ensues if the underlying iterator is infinite.

Putting the function in another module or attaching it to a type object is not being considered. Like its cousins, zip() and enumerate(), the function needs to be directly accessible in daily programming. Each solves a basic looping problem: lock-step iteration, loop counting, and reverse iteration. Requiring some form of dotted access would interfere with their simplicity, daily utility, and accessibility. They are core looping constructs, independent of any one application domain.

pep-0323 Copyable Iterators

PEP: 323
Title: Copyable Iterators
Version: $Revision$
Last-Modified: $Date$
Author: Alex Martelli <aleaxit at gmail.com>
Status: Deferred
Type: Standards Track
Content-Type: text/plain
Created: 25-Oct-2003
Python-Version: 2.5
Post-History: 29-Oct-2003

Deferral

  This PEP has been deferred. Copyable iterators are a nice idea, but after
  four years, no implementation or widespread interest has emerged.


Abstract

    This PEP suggests that some iterator types should support shallow
    copies of their instances by exposing a __copy__ method which meets
    some specific requirements, and indicates how code using an iterator
    might exploit such a __copy__ method when present.


Update and Comments

    Support for __copy__ was included in Py2.4's itertools.tee().

    Adding __copy__ methods to existing iterators will change the
    behavior under tee().  Currently, the copied iterators remain
    tied to the original iterator.  If the original advances, then
    so do all of the copies.  Good practice is to overwrite the
    original so that anomalies don't result:  a,b=tee(a).
    Code that doesn't follow that practice may observe a semantic
    change if a __copy__ method is added to an iterator.

Motivation

    In Python up to 2.3, most built-in iterator types don't let the user
    copy their instances.  User-coded iterators that do let their clients
    call copy.copy on their instances may, or may not, happen to return,
    as a result of the copy, a separate iterator object that may be
    iterated upon independently from the original.

    Currently, "support" for copy.copy in a user-coded iterator type is
    almost invariably "accidental" -- i.e., the standard machinery of the
    copy method in Python's standard library's copy module does build and
    return a copy.  However, the copy will be independently iterable with
    respect to the original only if calling .next() on an instance of that
    class happens to change instance state solely by rebinding some
    attributes to new values, and not by mutating some attributes'
    existing values.

    For example, an iterator whose "index" state is held as an integer
    attribute will probably give usable copies, since (integers being
    immutable) .next() presumably just rebinds that attribute.  On the
    other hand, another iterator whose "index" state is held as a list
    attribute will probably mutate the same list object when .next()
    executes, and therefore copies of such an iterator will not be
    iterable separately and independently from the original.

    Given this existing situation, copy.copy(it) on some iterator object
    isn't very useful, nor, therefore, is it at all widely used.  However,
    there are many cases in which being able to get a "snapshot" of an
    iterator, as a "bookmark", so as to be able to keep iterating along
    the sequence but later iterate again on the same sequence from the
    bookmark onwards, is useful.  To support such "bookmarking", module
    itertools, in 2.4, has grown a 'tee' function, to be used as:

        it, bookmark = itertools.tee(it)

    The previous value of 'it' must not be used again, which is why this
    typical usage idiom rebinds the name.  After this call, 'it' and
    'bookmark' are independently-iterable iterators on the same underlying
    sequence as the original value of 'it': this satisfies application
    needs for "iterator copying".

    However, when itertools.tee can make no hypotheses about the nature of
    the iterator it is passed as an argument, it must save in memory all
    items through which one of the two 'teed' iterators, but not yet both,
    have stepped.  This can be quite costly in terms of memory, if the two
    iterators get very far from each other in their stepping; indeed, in
    some cases it may be preferable to make a list from the iterator so as
    to be able to step repeatedly through the subsequence, or, if that is
    too costly in terms of memory, save items to disk, again in order to be
    able to iterate through them repeatedly.

    This PEP proposes another idea that will, in some important cases,
    allow itertools.tee to do its job with minimal cost in terms of
    memory; user code may also occasionally be able to exploit the idea in
    order to decide whether to copy an iterator, make a list from it, or
    use an auxiliary disk file.

    The key consideration is that some important iterators, such as those
    which built-in function iter builds over sequences, would be
    intrinsically easy to copy: just get another reference to the same
    sequence, and a copy of the integer index.  However, in Python 2.3,
    those iterators don't expose the state, and don't support copy.copy.

    The purpose of this PEP, therefore, is to have those iterator types
    expose a suitable __copy__ method.  Similarly, user-coded iterator
    types that can provide copies of their instances, suitable for
    separate and independent iteration, with limited costs in time and
    space, should also expose a suitable __copy__ method.  While
    copy.copy also supports other ways to let a type control the way
    its instances are copied, it is suggested, for simplicity, that
    iterator types that support copying always do so by exposing a
    __copy__ method, and not in the other ways copy.copy supports.

    Having iterators expose a suitable __copy__ when feasible will afford
    easy optimization of itertools.tee and similar user code, as in:

        def tee(it):
            it = iter(it)
            try: copier = it.__copy__
            except AttributeError:
                # non-copyable iterator, do all the needed hard work
                # [snipped!]
            else:
                return it, copier()

    Note that this function does NOT call "copy.copy(it)", which (even
    after this PEP is implemented) might well still "just happen to
    succeed" for some iterator type that is implemented as a user-coded
    class, without really supplying an adequate "independently iterable"
    copy object as its result.
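    A complete, runnable variant of that sketch is below.  Falling back
    to itertools.tee for the "hard work" branch is an assumption (the
    PEP elides it), and CountUp is a toy copyable iterator invented for
    the demonstration:

```python
import itertools

def tee(it):
    it = iter(it)
    try:
        copier = it.__copy__
    except AttributeError:
        # non-copyable iterator: fall back to the stdlib's buffering tee
        return itertools.tee(it)
    else:
        return it, copier()

class CountUp(object):
    # Toy copyable iterator: counts upward from n, cheaply snapshottable.
    def __init__(self, n=0):
        self.n = n
    def __iter__(self):
        return self
    def __next__(self):
        self.n += 1
        return self.n
    def __copy__(self):
        return CountUp(self.n)

a, b = tee(CountUp())
assert next(a) == 1
assert next(b) == 1            # the copy advances independently

c, d = tee(iter([1, 2]))       # plain iterators take the fallback path
assert list(c) == list(d) == [1, 2]
```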


Specification

    Any iterator type X may expose a method __copy__ that is callable
    without arguments on any instance x of X.  The method should be
    exposed if and only if the iterator type can provide copyability with
    reasonably little computational and memory effort.  Furthermore, the
    new object y returned by method __copy__ should be a new instance
    of X that is iterable independently and separately from x, stepping
    along the same "underlying sequence" of items.

    For example, suppose a class Iter essentially duplicated the
    functionality of the iter builtin for iterating on a sequence:

        class Iter(object):

            def __init__(self, sequence):
                self.sequence = sequence
                self.index = 0

            def __iter__(self):
                return self

            def next(self):
                try: result = self.sequence[self.index]
                except IndexError: raise StopIteration
                self.index += 1
                return result

    To make this Iter class compliant with this PEP, the following
    addition to the body of class Iter would suffice:

            def __copy__(self):
                result = self.__class__(self.sequence)
                result.index = self.index
                return result
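    Putting the pieces together (spelled with __next__ for modern
    Pythons; the PEP's text uses the 2.x name `next'):

```python
import copy

class Iter(object):
    def __init__(self, sequence):
        self.sequence = sequence
        self.index = 0

    def __iter__(self):
        return self

    def __next__(self):
        try:
            result = self.sequence[self.index]
        except IndexError:
            raise StopIteration
        self.index += 1
        return result

    next = __next__    # Python 2 spelling used in the PEP's text

    def __copy__(self):
        result = self.__class__(self.sequence)
        result.index = self.index
        return result

it = Iter('abc')
assert next(it) == 'a'
bookmark = copy.copy(it)              # copy.copy dispatches to __copy__
assert list(it) == ['b', 'c']
assert list(bookmark) == ['b', 'c']   # unaffected by exhausting the original
```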

    Note that __copy__, in this case, does not even try to copy the
    sequence; if the sequence is altered while either or both of the
    original and copied iterators are still stepping on it, the iteration
    behavior is quite likely to go awry anyway -- it is not __copy__'s
    responsibility to change this normal Python behavior for iterators
    which iterate on mutable sequences (that might, perhaps, be the
    specification for a __deepcopy__ method of iterators, which, however,
    this PEP does not deal with).

    Consider also a "random iterator", which provides a nonterminating
    sequence of results from some method of a random instance, called
    with given arguments:

        class RandomIterator(object):

            def __init__(self, bound_method, *args):
                self.call = bound_method
                self.args = args

            def __iter__(self):
                return self

            def next(self):
                return self.call(*self.args)

            def __copy__(self):
                import copy, new
                im_self = copy.copy(self.call.im_self)
                method = new.instancemethod(self.call.im_func, im_self)
                return self.__class__(method, *self.args)

    This iterator type is slightly more general than its name implies, as
    it supports calls to any bound method (or other callable, but if the
    callable is not a bound method, then __copy__ will fail).  But the
    intended use case is generating random streams, as in:

            import random

            def show5(it):
                for i, result in enumerate(it):
                    print '%6.3f'%result,
                    if i==4: break
                print

            normit = RandomIterator(random.Random().gauss, 0, 1)
            show5(normit)
            copit = normit.__copy__()
            show5(normit)
            show5(copit)

    which will display some output such as:

            -0.536  1.936 -1.182 -1.690 -1.184
             0.666 -0.701  1.214  0.348  1.373
             0.666 -0.701  1.214  0.348  1.373

    the key point being that the second and third lines are equal, because
    the normit and copit iterators will step along the same "underlying
    sequence".  (As an aside, note that to get a copy of self.call.im_self
    we must use copy.copy, NOT try getting at a __copy__ method directly,
    because for example instances of random.Random support copying via
    __getstate__ and __setstate__, NOT via __copy__; indeed, using
    copy.copy is the normal way to get a shallow copy of any object --
    copyable iterators are different because of the already-mentioned
    uncertainty about the result of copy.copy supporting these "copyable
    iterator" specs).


Details

    Besides adding to the Python docs a recommendation that user-coded
    iterator types support a __copy__ method (if and only if it can be
    implemented with small costs in memory and runtime, and produce an
    independently-iterable copy of an iterator object), this PEP's
    implementation will specifically include the addition of copyability
    to the iterators over sequences that built-in iter returns, and also
    to the iterators over a dictionary returned by the methods __iter__,
    iterkeys, itervalues, and iteritems of built-in type dict.

    Iterators produced by generator functions will not be copyable.
    However, iterators produced by the new "generator expressions" of
    Python 2.4 (PEP 289 [3]) should be copyable if their underlying
    iterator[s] are; the strict limitations on what is possible in a
    generator expression, compared to the much vaster generality of a
    generator, should make that feasible.  Similarly, the iterators
    produced by the built-in function enumerate, and certain functions
    supplied by module itertools, should be copyable if the underlying
    iterators are.

    The implementation of this PEP will also include the optimization of
    the new itertools.tee function mentioned in the Motivation section.


Rationale

    The main use case for (shallow) copying of an iterator is the same as
    for the function itertools.tee (new in 2.4).  User code will not
    directly attempt to copy an iterator, because it would have to deal
    separately with uncopyable cases; calling itertools.tee will
    internally perform the copy when appropriate, and implicitly fallback
    to a maximally efficient non-copying strategy for iterators that are
    not copyable.  (Occasionally, user code may want more direct control,
    specifically in order to deal with non-copyable iterators by other
    strategies, such as making a list or saving the sequence to disk).

    A tee'd iterator may serve as a "reference point", allowing processing
    of a sequence to continue or resume from a known point, while the
    other independent iterator can be freely advanced to "explore" a
    further part of the sequence as needed.  A simple example: a generator
    function which, given an iterator of numbers (assumed to be positive),
    returns a corresponding iterator, each of whose items is the fraction
    of the total represented by the corresponding item of the input
    iterator.  The caller may pass the total as a value, if known in
    advance; otherwise, the iterator returned by calling this generator
    function will first compute the total.

        def fractions(numbers, total=None):
            if total is None:
                numbers, aux = itertools.tee(numbers)
                total = sum(aux)
            total = float(total)
            for item in numbers:
                yield item / total

    The ability to tee the numbers iterator allows this generator to
    precompute the total, if needed, without necessarily requiring
    O(N) auxiliary memory if the numbers iterator is copyable.
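
    The generator above works unchanged with the itertools.tee that
    shipped in 2.4; a minimal demonstration (transcribed into modern
    Python syntax):

```python
import itertools

def fractions(numbers, total=None):
    # If the total is unknown, tee the iterator: one branch (aux) is
    # consumed to compute the sum, while `numbers` is re-bound to the
    # other branch for the main loop.
    if total is None:
        numbers, aux = itertools.tee(numbers)
        total = sum(aux)
    total = float(total)
    for item in numbers:
        yield item / total

result = list(fractions(iter([1, 2, 3, 4])))   # total computed via tee
```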

    As another example of "iterator bookmarking", consider a stream of
    numbers with an occasional string as a "postfix operator" now and
    then.  By far the most frequent such operator is a '+', whereupon we
    must sum all previous numbers (since the last previous operator if
    any, or
    sum all previous numbers (since the last previous operator if any, or
    else since the start) and yield the result.  Sometimes we find a '*'
    instead, which is the same except that the previous numbers must
    instead be multiplied, not summed.

        def filter_weird_stream(stream):
            it = iter(stream)
            while True:
                it, bookmark = itertools.tee(it)
                total = 0
                for item in it:
                    if item=='+':
                        yield total
                        break
                    elif item=='*':
                        product = 1
                        for item in bookmark:
                            if item=='*':
                                yield product
                                break
                            else:
                                product *= item
                    else:
                        total += item

    Similar use cases of itertools.tee can support such tasks as
    "undo" on a stream of commands represented by an iterator,
    "backtracking" on the parse of a stream of tokens, and so on.
    (Of course, in each case, one should also consider simpler
    possibilities such as saving relevant portions of the sequence
    into lists while stepping on the sequence with just one iterator,
    depending on the details of one's task).


    Here is an example, in pure Python, of how the 'enumerate'
    built-in could be extended to support __copy__ if its underlying
    iterator also supported __copy__:

        class enumerate(object):

            def __init__(self, it):
                self.it = iter(it)
                self.i = -1

            def __iter__(self):
                return self

            def next(self):
                self.i += 1
                return self.i, self.it.next()

            def __copy__(self):
                result = self.__class__.__new__(self.__class__)
                result.it = self.it.__copy__()
                result.i = self.i
                return result
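
    The same protocol can be exercised end to end with a small
    hand-written copyable iterator (a hypothetical helper class, shown
    in modern Python with __next__ in place of next):

```python
import copy

class CopyableListIter:
    """Sequence iterator supporting __copy__ (hypothetical example)."""

    def __init__(self, seq):
        self.seq = seq
        self.i = 0

    def __iter__(self):
        return self

    def __next__(self):
        if self.i >= len(self.seq):
            raise StopIteration
        item = self.seq[self.i]
        self.i += 1
        return item

    def __copy__(self):
        # Copy only the iteration state; the underlying sequence is shared.
        result = self.__class__.__new__(self.__class__)
        result.seq = self.seq
        result.i = self.i
        return result

it = CopyableListIter("abc")
first = next(it)               # consumes 'a'
dup = copy.copy(it)            # independent copy, positioned after 'a'
```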


    Here is an example of the kind of "fragility" produced by "accidental
    copyability" of an iterator -- the reason why one must NOT use
    copy.copy expecting, if it succeeds, to receive as a result an
    iterator which is iterable-on independently from the original.  Here
    is an iterator class that iterates (in preorder) on "trees" which, for
    simplicity, are just nested lists -- any item that's a list is treated
    as a subtree, any other item as a leaf.

    class ListreeIter(object):

        def __init__(self, tree):
            self.tree = [tree]
            self.indx = [-1]

        def __iter__(self):
            return self

        def next(self):
            if not self.indx:
                raise StopIteration
            self.indx[-1] += 1
            try:
                result = self.tree[-1][self.indx[-1]]
            except IndexError:
                self.tree.pop()
                self.indx.pop()
                return self.next()
            if type(result) is not list:
                return result
            self.tree.append(result)
            self.indx.append(-1)
            return self.next()

    Now, for example, the following code:

        import copy
        x = [ [1,2,3], [4, 5, [6, 7, 8], 9], 10, 11, [12] ]

        print 'showing all items:',
        it = ListreeIter(x)
        for i in it:
            print i,
            if i==6: cop = copy.copy(it)
        print

        print 'showing items >6 again:'
        for i in cop: print i,
        print

    does NOT work as intended -- the "cop" iterator gets consumed, and
    exhausted, step by step as the original "it" iterator is, because
    the accidental (rather than deliberate) copying performed by
    copy.copy shares, rather than duplicating the "index" list, which
    is the mutable attribute it.indx (a list of numerical indices).
    Thus, this "client code" of the iterator, which attempts to iterate
    twice over a portion of the sequence via a copy.copy on the
    iterator, is NOT correct.

    Some correct solutions include using itertools.tee, i.e., changing
    the first for loop into:

        for i in it:
            print i,
            if i==6:
                it, cop = itertools.tee(it)
                break
        for i in it: print i,

    (note that we MUST break the loop in two, otherwise we'd still
    be looping on the ORIGINAL value of it, which must NOT be used
    further after the call to tee!!!); or making a list, i.e.:

        for i in it:
            print i,
            if i==6:
                cop = lit = list(it)
                break
        for i in lit: print i,
    
    (again, the loop must be broken in two, since iterator 'it'
    gets exhausted by the call list(it)).

    Finally, all of these solutions would work if ListreeIter supplied
    a suitable __copy__ method, as this PEP recommends:

            def __copy__(self):
                result = self.__class__.__new__(self.__class__)
                result.tree = copy.copy(self.tree)
                result.indx = copy.copy(self.indx)
                return result

    There is no need to get any "deeper" in the copy, but the two
    mutable "index state" attributes must indeed be copied in order
    to achieve a "proper" (independently iterable) iterator-copy.

    The recommended solution is to have class ListreeIter supply this
    __copy__ method AND have client code use itertools.tee (with
    the split-in-two-parts loop as shown above).  This will make
    client code maximally tolerant of different iterator types it
    might be using AND achieve good performance for tee'ing of this
    specific iterator type at the same time.
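
    Putting the pieces together, here is the ListreeIter example from
    above with the recommended __copy__ method added (transcribed into
    modern Python, using __next__), showing that the previously broken
    client code now works:

```python
import copy

class ListreeIter:
    """Preorder iterator over nested lists, with a proper __copy__."""

    def __init__(self, tree):
        self.tree = [tree]
        self.indx = [-1]

    def __iter__(self):
        return self

    def __next__(self):
        if not self.indx:
            raise StopIteration
        self.indx[-1] += 1
        try:
            result = self.tree[-1][self.indx[-1]]
        except IndexError:
            self.tree.pop()
            self.indx.pop()
            return next(self)
        if type(result) is not list:
            return result
        self.tree.append(result)
        self.indx.append(-1)
        return next(self)

    def __copy__(self):
        # Shallow-copy the two mutable "index state" lists so the copy
        # iterates independently; the tree items themselves are shared.
        result = self.__class__.__new__(self.__class__)
        result.tree = copy.copy(self.tree)
        result.indx = copy.copy(self.indx)
        return result

x = [[1, 2, 3], [4, 5, [6, 7, 8], 9], 10, 11, [12]]
it = ListreeIter(x)
seen = []
for i in it:
    seen.append(i)
    if i == 6:
        cop = copy.copy(it)   # deliberate, correct copy
resumed = list(cop)           # resumes independently after the 6
```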


References

    [1] Discussion on python-dev starting at post:
        http://mail.python.org/pipermail/python-dev/2003-October/038969.html

    [2] Online documentation for the copy module of the standard library:
        http://docs.python.org/library/copy.html

    [3] PEP 289, Generator Expressions, Hettinger
        http://www.python.org/dev/peps/pep-0289/

Copyright

    This document has been placed in the public domain.



pep-0324 subprocess - New process module

PEP: 324
Title: subprocess - New process module
Version: $Revision$
Last-Modified: $Date$
Author: Peter Astrand <astrand at lysator.liu.se>
Status: Final
Type: Standards Track
Content-Type: text/plain
Created: 19-Nov-2003
Python-Version: 2.4
Post-History: 

Abstract

    This PEP describes a new module for starting and communicating
    with processes.


Motivation

    Starting new processes is a common task in any programming
    language, and very common in a high-level language like Python.
    Good support for this task is needed, because:

    - Inappropriate functions for starting processes could mean a
      security risk: If the program is started through the shell, and
      the arguments contain shell meta characters, the result can be
      disastrous. [1]

    - It makes Python an even better replacement language for
      over-complicated shell scripts.

    Currently, Python has a large number of different functions for
    process creation.  This makes it hard for developers to choose.

    The subprocess module provides the following enhancements over
    previous functions:

    - One "unified" module provides all functionality from previous
      functions.

    - Cross-process exceptions: Exceptions happening in the child
      before the new process has started to execute are re-raised in
      the parent.  This means that it's easy to handle exec()
      failures, for example.  With popen2, for example, it's
      impossible to detect if the execution failed.

    - A hook for executing custom code between fork and exec.  This
      can be used for, for example, changing uid.

    - No implicit call of /bin/sh.  This means that there is no need
      for escaping dangerous shell meta characters.

    - All combinations of file descriptor redirection are possible.
      For example, the "python-dialog" [2] module needs to spawn a
      process and redirect stderr, but not stdout.  This is not
      possible with current functions, without using temporary files.

    - With the subprocess module, it's possible to control if all open
      file descriptors should be closed before the new program is
      executed.

    - Support for connecting several subprocesses (shell "pipe").

    - Universal newline support.

    - A communicate() method, which makes it easy to send stdin data
      and read stdout and stderr data, without risking deadlocks.
      Most people are aware of the flow control issues involved with
      child process communication, but not all have the patience or
      skills to write a fully correct and deadlock-free select loop.
      This means that many Python applications contain race
      conditions.  A communicate() method in the standard library
      solves this problem.
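
    The communicate() method as it eventually shipped can be sketched
    like this (child processes are spawned via sys.executable for
    portability; modern Python, bytes I/O):

```python
import subprocess
import sys

# A child that echoes stdin to stdout and writes a marker to stderr.
child = [sys.executable, "-c",
         "import sys; sys.stdout.write(sys.stdin.read());"
         " sys.stderr.write('done')"]
p = subprocess.Popen(child, stdin=subprocess.PIPE,
                     stdout=subprocess.PIPE, stderr=subprocess.PIPE)
out, err = p.communicate(b"hello")   # send stdin, drain both pipes, wait
```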


Rationale

    The following points summarize the design:

    - subprocess was based on popen2, which is tried-and-tested.

    - The factory functions in popen2 have been removed, because I
      consider the class constructor equally easy to work with.

    - popen2 contains several factory functions and classes for
      different combinations of redirection.  subprocess, however,
      contains one single class.  Since the subprocess module supports
      12 different combinations of redirection, providing a class or
      function for each of them would be cumbersome and not very
      intuitive.  Even with popen2, this is a readability problem.
      For example, many people cannot tell the difference between
      popen2.popen2 and popen2.popen4 without using the documentation.

    - One small utility function is provided: subprocess.call(). It
      aims to be an enhancement over os.system(), while still very
      easy to use:

        - It does not use the Standard C function system(), which has
          limitations.

        - It does not call the shell implicitly.

        - No need for quoting, since an argument list is used.

        - The return value is easier to work with.

      The call() utility function accepts an 'args' argument, just
      like the Popen class constructor.  It waits for the command to
      complete, then returns the returncode attribute.  The
      implementation is very simple:

      def call(*args, **kwargs):
          return Popen(*args, **kwargs).wait()

      The motivation behind the call() function is simple: starting a
      process and waiting for it to finish is a common task.

      While Popen supports a wide range of options, many users have
      simple needs.  Many people are using os.system() today, mainly
      because it provides a simple interface.  Consider this example:

          os.system("stty sane -F " + device)

      With subprocess.call(), this would look like:

          subprocess.call(["stty", "sane", "-F", device])

      or, if executing through the shell:

          subprocess.call("stty sane -F " + device, shell=True)

    - The "preexec" functionality makes it possible to run arbitrary
      code between fork and exec.  One might ask why there are special
      arguments for setting the environment and current directory, but
      not for, for example, setting the uid.  The answer is:

        - Changing environment and working directory is considered
          fairly common.

        - Old functions like spawn() have support for an
          "env"-argument.

        - env and cwd are considered quite cross-platform: They make
          sense even on Windows.

    - On POSIX platforms, no extension module is required: the module
      uses os.fork(), os.execvp() etc.

    - On Windows platforms, the module requires either Mark Hammond's
      Windows extensions[5], or a small extension module called
      _subprocess.
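
    The call() utility described above can be sanity-checked directly
    (sys.executable is used so the example is portable):

```python
import subprocess
import sys

# call() starts the process, waits for it, and returns the exit status.
rc_ok = subprocess.call([sys.executable, "-c", "pass"])
rc_fail = subprocess.call([sys.executable, "-c", "import sys; sys.exit(3)"])
```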


Specification

    This module defines one class called Popen:

        class Popen(args, bufsize=0, executable=None,
                    stdin=None, stdout=None, stderr=None,
                    preexec_fn=None, close_fds=False, shell=False,
                    cwd=None, env=None, universal_newlines=False,
                    startupinfo=None, creationflags=0):


      Arguments are:

    - args should be a string, or a sequence of program arguments.
      The program to execute is normally the first item in the args
      sequence or string, but can be explicitly set by using the
      executable argument.
      
      On UNIX, with shell=False (default): In this case, the Popen
      class uses os.execvp() to execute the child program.  args
      should normally be a sequence.  A string will be treated as a
      sequence with the string as the only item (the program to
      execute).
      
      On UNIX, with shell=True: If args is a string, it specifies the
      command string to execute through the shell.  If args is a
      sequence, the first item specifies the command string, and any
      additional items will be treated as additional shell arguments.
      
      On Windows: the Popen class uses CreateProcess() to execute the
      child program, which operates on strings.  If args is a
      sequence, it will be converted to a string using the
      list2cmdline method.  Please note that not all MS Windows
      applications interpret the command line the same way:
      list2cmdline is designed for applications using the same rules
      as the MS C runtime.

    - bufsize, if given, has the same meaning as the corresponding
      argument to the built-in open() function: 0 means unbuffered, 1
      means line buffered, any other positive value means use a buffer
      of (approximately) that size.  A negative bufsize means to use
      the system default, which usually means fully buffered.  The
      default value for bufsize is 0 (unbuffered).

    - stdin, stdout and stderr specify the executed program's standard
      input, standard output and standard error file handles,
      respectively.  Valid values are PIPE, an existing file
      descriptor (a positive integer), an existing file object, and
      None.  PIPE indicates that a new pipe to the child should be
      created.  With None, no redirection will occur; the child's file
      handles will be inherited from the parent.  Additionally, stderr
      can be STDOUT, which indicates that the stderr data from the
      applications should be captured into the same file handle as for
      stdout.

    - If preexec_fn is set to a callable object, this object will be
      called in the child process just before the child is executed.

    - If close_fds is true, all file descriptors except 0, 1 and 2
      will be closed before the child process is executed.

    - If shell is true, the specified command will be executed through
      the shell.  

    - If cwd is not None, the current directory will be changed to cwd
      before the child is executed.

    - If env is not None, it defines the environment variables for the
      new process.

    - If universal_newlines is true, the file objects stdout and
      stderr are opened as text files, but lines may be terminated
      by any of '\n', the Unix end-of-line convention, '\r', the
      Macintosh convention or '\r\n', the Windows convention.  All of
      these external representations are seen as '\n' by the Python
      program.  Note: This feature is only available if Python is
      built with universal newline support (the default).  Also, the
      newlines attribute of the file objects stdout, stdin and stderr
      is not updated by the communicate() method.

    - The startupinfo and creationflags, if given, will be passed to
      the underlying CreateProcess() function.  They can specify
      things such as appearance of the main window and priority for
      the new process.  (Windows only)


      This module also defines two shortcut functions:

    - call(*args, **kwargs):
          Run command with arguments.  Wait for command to complete,
          then return the returncode attribute.

          The arguments are the same as for the Popen constructor.
          Example:

          retcode = call(["ls", "-l"])


    Exceptions
    ----------

    Exceptions raised in the child process, before the new program has
    started to execute, will be re-raised in the parent.
    Additionally, the exception object will have one extra attribute
    called 'child_traceback', which is a string containing traceback
    information from the child's point of view.

    The most common exception raised is OSError.  This occurs, for
    example, when trying to execute a non-existent file.  Applications
    should prepare for OSErrors.

    A ValueError will be raised if Popen is called with invalid
    arguments.
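
    A sketch of the exception behavior described above (the program
    name below is made up, so the lookup is guaranteed to fail):

```python
import subprocess

caught = False
try:
    # A missing executable surfaces as OSError in the parent process,
    # rather than failing silently in the child.
    subprocess.call(["no-such-program-for-pep324-example"])
except OSError:
    caught = True
```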


    Security
    --------

    Unlike some other popen functions, this implementation will never
    call /bin/sh implicitly.  This means that all characters,
    including shell meta-characters, can safely be passed to child
    processes.


    Popen objects
    -------------

    Instances of the Popen class have the following methods:

    poll()
        Check if child process has terminated.  Returns returncode
        attribute.

    wait()
        Wait for child process to terminate.  Returns returncode
        attribute.

    communicate(input=None)
        Interact with process: Send data to stdin.  Read data from
        stdout and stderr, until end-of-file is reached.  Wait for
        process to terminate.  The optional input argument should be a
        string to be sent to the child process, or None, if no data
        should be sent to the child.

        communicate() returns a tuple (stdout, stderr).

        Note: The data read is buffered in memory, so do not use this
        method if the data size is large or unlimited.

    The following attributes are also available:

    stdin
        If the stdin argument is PIPE, this attribute is a file object
        that provides input to the child process.  Otherwise, it is
        None.

    stdout
        If the stdout argument is PIPE, this attribute is a file
        object that provides output from the child process.
        Otherwise, it is None.

    stderr
        If the stderr argument is PIPE, this attribute is file object
        that provides error output from the child process.  Otherwise,
        it is None.

    pid
        The process ID of the child process.

    returncode
        The child return code.  A None value indicates that the
        process hasn't terminated yet.  A negative value -N indicates
        that the child was terminated by signal N (UNIX only).
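
    The methods and attributes above can be sketched in a few lines
    (modern Python, portable via sys.executable):

```python
import subprocess
import sys

p = subprocess.Popen([sys.executable, "-c", "import sys; sys.exit(7)"])
rc = p.wait()          # blocks until the child exits
polled = p.poll()      # child has terminated, so this returns returncode
pid_ok = isinstance(p.pid, int)
```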


Replacing older functions with the subprocess module

    In this section, "a ==> b" means that b can be used as a
    replacement for a.

    Note: All functions in this section fail (more or less) silently
    if the executed program cannot be found, whereas this module raises
    an OSError exception.

    In the following examples, we assume that the subprocess module is
    imported with "from subprocess import *".


    Replacing /bin/sh shell backquote
    ---------------------------------

    output=`mycmd myarg`
    ==>
    output = Popen(["mycmd", "myarg"], stdout=PIPE).communicate()[0]


    Replacing shell pipe line
    -------------------------

    output=`dmesg | grep hda`
    ==>
    p1 = Popen(["dmesg"], stdout=PIPE)
    p2 = Popen(["grep", "hda"], stdin=p1.stdout, stdout=PIPE)
    output = p2.communicate()[0]
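
    Since dmesg and grep are not available everywhere, the same pipe
    construction can be demonstrated with two Python children standing
    in for them (a portable sketch):

```python
import subprocess
import sys

# Producer prints three lines; the filter keeps those containing 'hda'.
producer = [sys.executable, "-c",
            "print('hda1'); print('sda1'); print('hda2')"]
filt = [sys.executable, "-c",
        "import sys; sys.stdout.write("
        "''.join(l for l in sys.stdin if 'hda' in l))"]

p1 = subprocess.Popen(producer, stdout=subprocess.PIPE)
p2 = subprocess.Popen(filt, stdin=p1.stdout, stdout=subprocess.PIPE)
output = p2.communicate()[0]
```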


    Replacing os.system()
    ---------------------

    sts = os.system("mycmd" + " myarg")
    ==>
    p = Popen("mycmd" + " myarg", shell=True)
    sts = os.waitpid(p.pid, 0)

    Note:

    * Calling the program through the shell is usually not required.

    * It's easier to look at the returncode attribute than the
      exit status.

    A more real-world example would look like this:

    try:
        retcode = call("mycmd" + " myarg", shell=True)
        if retcode < 0:
            print >>sys.stderr, "Child was terminated by signal", -retcode
        else:
            print >>sys.stderr, "Child returned", retcode
    except OSError, e:
        print >>sys.stderr, "Execution failed:", e


    Replacing os.spawn*
    -------------------

    P_NOWAIT example:

    pid = os.spawnlp(os.P_NOWAIT, "/bin/mycmd", "mycmd", "myarg")
    ==>
    pid = Popen(["/bin/mycmd", "myarg"]).pid


    P_WAIT example:

    retcode = os.spawnlp(os.P_WAIT, "/bin/mycmd", "mycmd", "myarg")
    ==>
    retcode = call(["/bin/mycmd", "myarg"])


    Vector example:

    os.spawnvp(os.P_NOWAIT, path, args)
    ==>
    Popen([path] + args[1:])


    Environment example:

    os.spawnlpe(os.P_NOWAIT, "/bin/mycmd", "mycmd", "myarg", env)
    ==>
    Popen(["/bin/mycmd", "myarg"], env={"PATH": "/usr/bin"})


    Replacing os.popen*
    -------------------

    pipe = os.popen(cmd, mode='r', bufsize)
    ==>
    pipe = Popen(cmd, shell=True, bufsize=bufsize, stdout=PIPE).stdout

    pipe = os.popen(cmd, mode='w', bufsize)
    ==>
    pipe = Popen(cmd, shell=True, bufsize=bufsize, stdin=PIPE).stdin


    (child_stdin, child_stdout) = os.popen2(cmd, mode, bufsize)
    ==>
    p = Popen(cmd, shell=True, bufsize=bufsize,
              stdin=PIPE, stdout=PIPE, close_fds=True)
    (child_stdin, child_stdout) = (p.stdin, p.stdout)


    (child_stdin,
     child_stdout,
     child_stderr) = os.popen3(cmd, mode, bufsize)
    ==>
    p = Popen(cmd, shell=True, bufsize=bufsize,
              stdin=PIPE, stdout=PIPE, stderr=PIPE, close_fds=True)
    (child_stdin,
     child_stdout,
     child_stderr) = (p.stdin, p.stdout, p.stderr)


    (child_stdin, child_stdout_and_stderr) = os.popen4(cmd, mode, bufsize)
    ==>
    p = Popen(cmd, shell=True, bufsize=bufsize,
              stdin=PIPE, stdout=PIPE, stderr=STDOUT, close_fds=True)
    (child_stdin, child_stdout_and_stderr) = (p.stdin, p.stdout)


    Replacing popen2.*
    ------------------

    Note: If the cmd argument to popen2 functions is a string, the
    command is executed through /bin/sh.  If it is a list, the command
    is directly executed.

    (child_stdout, child_stdin) = popen2.popen2("somestring", bufsize, mode)
    ==>
    p = Popen(["somestring"], shell=True, bufsize=bufsize,
              stdin=PIPE, stdout=PIPE, close_fds=True)
    (child_stdout, child_stdin) = (p.stdout, p.stdin)


    (child_stdout, child_stdin) = popen2.popen2(["mycmd", "myarg"], bufsize, mode)
    ==>
    p = Popen(["mycmd", "myarg"], bufsize=bufsize,
              stdin=PIPE, stdout=PIPE, close_fds=True)
    (child_stdout, child_stdin) = (p.stdout, p.stdin)

    The popen2.Popen3 and popen2.Popen4 classes basically work as
    subprocess.Popen, except that:

    * subprocess.Popen raises an exception if the execution fails
    * the capturestderr argument is replaced with the stderr argument.
    * stdin=PIPE and stdout=PIPE must be specified.
    * popen2 closes all file descriptors by default, but you have to
      specify close_fds=True with subprocess.Popen.


Open Issues

    Some features have been requested but are not yet implemented.
    These include:

    * Support for managing a whole flock of subprocesses

    * Support for managing "daemon" processes

    * Built-in method for killing subprocesses

    While these are useful features, it's expected that these can be
    added later without problems.

    * expect-like functionality, including pty support.

    pty support is highly platform-dependent, which is a
    problem.  Also, there are already other modules that provide this
    kind of functionality[6].


Backwards Compatibility

    Since this is a new module, no major backward compatibility issues
    are expected.  The module name "subprocess" might collide with
    other, previous modules[3] with the same name, but the name
    "subprocess" seems to be the best suggested name so far.  The
    first name of this module was "popen5", but this name was
    considered too unintuitive.  For a while, the module was called
    "process", but this name is already used by Trent Mick's
    module[4].

    The functions and modules that this new module is trying to
    replace (os.system, os.spawn*, os.popen*, popen2.*, commands.*)
    are expected to be available in future Python versions for a long
    time, to preserve backwards compatibility.


Reference Implementation

    A reference implementation is available from
    http://www.lysator.liu.se/~astrand/popen5/.


References

    [1] Secure Programming for Linux and Unix HOWTO, section 8.3.
        http://www.dwheeler.com/secure-programs/

    [2] Python Dialog
        http://pythondialog.sourceforge.net/

    [3] http://www.iol.ie/~padraiga/libs/subProcess.py

    [4] http://starship.python.net/crew/tmick/

    [5] http://starship.python.net/crew/mhammond/win32/

    [6] http://www.lysator.liu.se/~ceder/pcl-expect/


Copyright

    This document has been placed in the public domain.


pep-0325 Resource-Release Support for Generators

PEP: 325
Title: Resource-Release Support for Generators
Version: $Revision$
Last-Modified: $Date$
Author: Samuele Pedroni <pedronis at python.org>
Status: Rejected
Type: Standards Track
Content-Type: text/plain
Created: 25-Aug-2003
Python-Version: 2.4
Post-History: 

Abstract

    Generators allow for natural coding and abstraction of traversal
    over data.  Currently if external resources needing proper timely
    release are involved, generators are unfortunately not adequate.
    The typical idiom for timely release is not supported: a yield
    statement is not allowed in the try clause of a try-finally
    statement inside a generator.  Execution of the finally clause
    can be neither guaranteed nor enforced.

    This PEP proposes that the built-in generator type implement a
    close method and destruction semantics, such that the restriction
    on yield placement can be lifted, expanding the applicability of
    generators.

Pronouncement

    Rejected in favor of PEP 342 which includes substantially all of
    the requested behavior in a more refined form.

Rationale

    Python generators allow for natural coding of many data traversal
    scenarios.  Their instantiation produces iterators,
    i.e. first-class objects abstracting traversal (with all the
    advantages of first-classness).  In this respect they match in
    power and offer some advantages over the approach using iterator
    methods taking a (smalltalkish) block.  On the other hand, given
    current limitations (no yield allowed in a try clause of a
    try-finally inside a generator) the latter approach seems better
    suited to encapsulating not only traversal but also exception
    handling and proper resource acquisition and release.

    Let's consider an example (for simplicity, files in read-mode are
    used):

        def all_lines(index_path):
            for path in file(index_path, "r"):
                for line in file(path.strip(), "r"):
                    yield line

    this is short and to the point, but the try-finally for timely
    closing of the files cannot be added.  (While instead of a path, a
    file, whose closing then would be responsibility of the caller,
    could be passed in as argument, the same is not applicable for the
    files opened depending on the contents of the index).

    If we want timely release, we have to sacrifice the simplicity and
    directness of the generator-only approach: (e.g.)

        class AllLines:

            def __init__(self,index_path):
                self.index_path = index_path
                self.index = None
                self.document = None

            def __iter__(self):
                self.index = file(self.index_path,"r")
                for path in self.index:
                    self.document = file(path.strip(),"r")
                    for line in self.document:
                        yield line
                    self.document.close()
                    self.document = None

            def close(self):
                if self.index:
                   self.index.close()
                if self.document:
                   self.document.close()

    to be used as:

        all_lines = AllLines("index.txt")
        try:
            for line in all_lines:
                ...
        finally:
            all_lines.close()

    The more convoluted solution implementing timely release seems
    to offer a precious hint.  What we have done is encapsulate our
    traversal in an object (iterator) with a close method.

    This PEP proposes that generators should grow such a close method
    with such semantics that the example could be rewritten as:

        # Today this is not valid Python: yield is not allowed between
        # try and finally, and generator type instances support no
        # close method.

        def all_lines(index_path):
            index = file(index_path,"r")
            try:
                for path in index:
                    document = file(path.strip(),"r")
                    try:
                        for line in document:
                            yield line
                    finally:
                       document.close()
            finally:
                index.close()

        all = all_lines("index.txt")
        try:
            for line in all:
                ...
        finally:
            all.close() # close on generator

    Currently PEP 255 [1] disallows yield inside a try clause of a
    try-finally statement, because the execution of the finally clause
    cannot be guaranteed as required by try-finally semantics.

    The semantics of the proposed close method should be such that
    while the finally clause execution still cannot be guaranteed, it
    can be enforced when required.  Specifically, the close method
    behavior should trigger the execution of the finally clauses
    inside the generator, either by forcing a return in the generator
    frame or by throwing an exception in it.  In situations requiring
    timely resource release, close could then be explicitly invoked.

    The semantics of generator destruction on the other hand should be
    extended in order to implement a best-effort policy for the
    general case.  Specifically, destruction should invoke close().
    The best-effort limitation comes from the fact that the
    destructor's execution is not guaranteed in the first place.

    This seems to be a reasonable compromise, the resulting global
    behavior being similar to that of files and closing.
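
    For historical reference: the close() semantics Python eventually
    adopted via PEP 342 are essentially the Exception Semantics
    discussed below, with GeneratorExit as the special-purpose
    exception.  A sketch in modern Python:

```python
log = []

def all_items(items):
    try:
        for item in items:
            yield item
    finally:
        # Runs when the generator finishes normally, is destroyed,
        # or is explicitly closed.
        log.append("released")

g = all_items([1, 2, 3])
first = next(g)   # resume to the first yield
g.close()         # raises GeneratorExit at the yield; finally runs
```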


Possible Semantics

    The built-in generator type should have a close method
    implemented, which can then be invoked as:

       gen.close()

    where gen is an instance of the built-in generator type.
    Generator destruction should also invoke close method behavior.

    If a generator is already terminated, close should be a no-op.

    Otherwise, there are two alternative solutions, Return or
    Exception Semantics:

    A - Return Semantics: The generator should be resumed, and
    execution should continue as if the instruction at the re-entry
    point were a return.  Consequently, finally clauses surrounding
    the re-entry point would be executed, in the case of a
    then-allowed try-yield-finally pattern.

    Issues: is it important to be able to distinguish forced
    termination by close from normal termination, and from exception
    propagation out of the generator or generator-called code?  In
    the normal case it seems not: finally clauses should be written
    to work the same in all these cases.  Still, these semantics
    would make such a distinction hard.

    Except clauses, as with a normal return, are not executed.  Such
    clauses in legacy generators expect to be executed for exceptions
    raised by the generator or by code called from it; not executing
    them in the close case seems correct.

    B - Exception Semantics: The generator should be resumed and
    execution should continue as if a special-purpose exception
    (e.g. CloseGenerator) had been raised at the re-entry point.  The
    close implementation should consume this exception and not
    propagate it further.

    Issues: should StopIteration be reused for this purpose?  Probably
    not.  We would like close to be a harmless operation for legacy
    generators, which could contain code catching StopIteration to
    deal with other generators/iterators.

    In general, with exception semantics, it is unclear what to do if
    the generator does not terminate or the special exception is not
    propagated back.  Other, different exceptions should probably be
    propagated, but consider this possible legacy generator code:

        try:
            ...
            yield ...
            ...
        except: # or except Exception:, etc
            raise Exception("boom")

    If close is invoked with the generator suspended after the yield,
    the except clause would catch our special-purpose exception, and
    a different exception would be propagated back.  In this
    particular case that exception ought to be consumed and ignored,
    but in general it should be propagated; separating these
    scenarios seems hard.
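    This pitfall can be reproduced in later Python versions, where
    GeneratorExit plays the role of the special-purpose exception:
    the bare except converts it, and close() propagates the
    substituted exception instead.

```python
# A bare except in a legacy generator catches the special close
# exception (GeneratorExit in later Python) and substitutes another.

def legacy():
    try:
        yield 1
    except:                     # or except Exception:, etc.
        raise Exception("boom")

g = legacy()
next(g)                         # suspend the generator after the yield
try:
    g.close()                   # the close exception is caught and converted
except Exception as exc:
    caught = str(exc)           # close() propagates the substituted "boom"
```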

    The exception approach has the advantage of letting the generator
    distinguish between termination cases and of giving it more
    control.  On the other hand, clear-cut semantics seem harder to
    define.


Remarks

    If this proposal is accepted, it should become common practice to
    document whether a generator acquires resources, so that its close
    method ought to be called.  If a generator is no longer used,
    calling close should be harmless.

    On the other hand, in the typical scenario the code that
    instantiated the generator should call close if required by it.
    Generic code dealing with iterators/generators instantiated
    elsewhere should typically not be littered with close calls.

    The rare case of code that has acquired ownership of, and needs
    to deal properly with, all of iterators, generators, and
    resource-acquiring generators that need timely release, is easily
    solved:

        if hasattr(iterator, 'close'):
            iterator.close()
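    A runnable sketch of this dispatch, as it works in later Python
    versions (where generators do have close() but plain iterators do
    not):

```python
def release(iterator):
    # Close only those iterators that support it; harmless otherwise.
    if hasattr(iterator, 'close'):
        iterator.close()

gen = (line for line in ["a", "b"])   # generators support close()
release(gen)                          # the generator is now terminated

lst = iter(["a", "b"])                # list iterators have no close()
release(lst)                          # no-op; iteration still works
```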


Open Issues

    Definitive semantics ought to be chosen.  Currently Guido favors
    Exception Semantics.  If the generator yields a value instead of
    terminating or propagating back the special exception, the
    special exception should be raised again on the generator side.

    It is still unclear whether spuriously converted special
    exceptions (as discussed in Possible Semantics) are a problem and
    what to do about them.

    Implementation issues should be explored.


Alternative Ideas

    The idea that the yield placement limitation should be removed and
    that generator destruction should trigger execution of finally
    clauses has been proposed more than once.  Alone it cannot
    guarantee that timely release of resources acquired by a generator
    can be enforced.

    PEP 288 [2] proposes a more general solution, allowing custom
    exception passing to generators.  The proposal in this PEP
    addresses the problem of resource release more directly.  Were
    PEP 288 implemented, Exception Semantics for close could be
    layered on top of it; on the other hand, PEP 288 should make a
    separate case for the more general functionality.


References

    [1] PEP 255 Simple Generators
        http://www.python.org/dev/peps/pep-0255/

    [2] PEP 288 Generators Attributes and Exceptions
        http://www.python.org/dev/peps/pep-0288/


Copyright

    This document has been placed in the public domain.



pep-0326 A Case for Top and Bottom Values

PEP:326
Title:A Case for Top and Bottom Values
Version:$Revision$
Last-Modified:$Date$
Author:Josiah Carlson <jcarlson at uci.edu>, Terry Reedy <tjreedy at udel.edu>
Status:Rejected
Type:Standards Track
Content-Type:text/x-rst
Created:20-Dec-2003
Python-Version:2.4
Post-History:20-Dec-2003, 03-Jan-2004, 05-Jan-2004, 07-Jan-2004, 21-Feb-2004

Results

This PEP has been rejected by the BDFL [12]. As per the pseudo-sunset clause [13], PEP 326 is being updated one last time with the latest suggestions, code modifications, etc., and includes a link to a module [14] that implements the behavior described in the PEP. Users who desire the behavior listed in this PEP are encouraged to use the module for the reasons listed in Independent Implementations?.

Abstract

This PEP proposes two singleton constants that represent a top and bottom [3] value: Max and Min (or two similarly suggestive names [4]; see Open Issues).

As suggested by their names, Max and Min would compare higher or lower than any other object (respectively). Such behavior results in easier-to-understand code and fewer special cases in which a temporary minimum or maximum value is required but no actual numeric bound exists.

Rationale

While None can be used as an absolute minimum that any value can attain [1], this may be deprecated [4] in Python 3.0 and shouldn't be relied upon.

As a replacement for None as an absolute minimum, together with the introduction of an absolute maximum, the two singleton constants Max and Min address the concern that such constants be self-documenting.

What is commonly done to deal with absolute minimum or maximum values is to set a value that is larger than the script author ever expects the input to reach, and hope that it isn't reached.

Guido has brought up [2] the fact that there exist two constants that can be used in the interim for maximum values: sys.maxint and floating point positive infinity (1e309 will evaluate to positive infinity). However, each has its drawbacks.

  • On most architectures sys.maxint is arbitrarily small (2**31-1 or 2**63-1) and can be easily eclipsed by large 'long' integers or floating point numbers.

  • Comparing long integers larger than the largest floating point number representable against any float will result in an exception being raised:

    >>> cmp(1.0, 10**309)
    Traceback (most recent call last):
    File "<stdin>", line 1, in ?
    OverflowError: long int too large to convert to float
    

    Even when large integers are compared against positive infinity:

    >>> cmp(1e309, 10**309)
    Traceback (most recent call last):
    File "<stdin>", line 1, in ?
    OverflowError: long int too large to convert to float
    
  • These same drawbacks exist when numbers are negative.

Introducing Max and Min that work as described above does not take much effort. A sample Python reference implementation of both is included.

Motivation

There are hundreds of algorithms that begin by initializing some set of values to a logical (or numeric) infinity or negative infinity. Python lacks either kind of infinity that works consistently, i.e. a value that really is the most extreme value attainable. By adding Max and Min, Python would have a real maximum and minimum value, and such algorithms could become clearer due to the reduction of special cases.

Max Examples

When testing various kinds of servers, it is sometimes necessary to only serve a certain number of clients before exiting, which results in code like the following:

count = 5

def counts(stop):
    i = 0
    while i < stop:
        yield i
        i += 1

for client_number in counts(count):
    handle_one_client()

When using Max as the value assigned to count, our testing server becomes a production server with minimal effort.
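Using float('inf') as a stand-in for the proposed Max (an assumption for illustration; inf works here only because the comparison is numeric), the loop indeed never terminates on its own:

```python
import itertools

def counts(stop):
    i = 0
    while i < stop:
        yield i
        i += 1

count = float('inf')          # stand-in for the proposed Max singleton
clients = counts(count)       # i < inf is always true: serves forever
first_five = list(itertools.islice(clients, 5))
```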

As another example, consider Dijkstra's shortest path algorithm on a graph with weighted edges (all positive):

  1. Set distances to every node in the graph to infinity.
  2. Set the distance to the start node to zero.
  3. Set visited to be an empty mapping.
  4. While the shortest distance to an unvisited node is less than infinity and the destination has not been visited:
    1. Get the node with the shortest distance.
    2. Visit the node.
    3. Update neighbor distances and parent pointers if necessary for neighbors that have not been visited.
  5. If the destination has been visited, step back through parent pointers to find the reverse of the path to be taken.

Below is an example of Dijkstra's shortest path algorithm on a graph with weighted edges using a table. (A faster version that uses a heap is available, but this version is offered due to its similarity to the description above; the heap version can be found in older revisions of this document.)

def DijkstraSP_table(graph, S, T):
    table = {}                                                 #3
    for node in graph.iterkeys():
        #(visited, distance, node, parent)
        table[node] = (0, Max, node, None)                     #1
    table[S] = (0, 0, S, None)                                 #2
    cur = min(table.values())                                  #4a
    while (not cur[0]) and cur[1] < Max:                       #4
        (visited, distance, node, parent) = cur
        table[node] = (1, distance, node, parent)              #4b
        for cdist, child in graph[node]:                       #4c
            ndist = distance+cdist                             #|
            if not table[child][0] and ndist < table[child][1]:#|
                table[child] = (0, ndist, child, node)         #|_
        cur = min(table.values())                              #4a
    if not table[T][0]:
        return None
    cur = T                                                    #5
    path = [T]                                                 #|
    while table[cur][3] is not None:                           #|
        path.append(table[cur][3])                             #|
        cur = path[-1]                                         #|
    path.reverse()                                             #|
    return path                                                #|_

Readers should note that replacing Max in the above code with an arbitrarily large number does not guarantee that the shortest path distance to a node will never exceed that number. Well, with one caveat: one could certainly sum up the weights of every edge in the graph, and set the 'arbitrarily large number' to that total. However, doing so does not make the algorithm any easier to understand and has potential problems with numeric overflows.
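The table version above can be exercised in modern Python by substituting math.inf for Max (again only an illustration; inf lacks Max's generality beyond numeric comparisons):

```python
import math

def dijkstra_sp_table(graph, S, T):
    # graph maps node -> list of (edge_weight, neighbor) pairs;
    # math.inf stands in for the proposed Max singleton.
    table = {node: (0, math.inf, node, None) for node in graph}  #1, #3
    table[S] = (0, 0, S, None)                                   #2
    cur = min(table.values())                                    #4a
    while not cur[0] and cur[1] < math.inf:                      #4
        visited, distance, node, parent = cur
        table[node] = (1, distance, node, parent)                #4b
        for cdist, child in graph[node]:                         #4c
            ndist = distance + cdist
            if not table[child][0] and ndist < table[child][1]:
                table[child] = (0, ndist, child, node)
        cur = min(table.values())                                #4a
    if not table[T][0]:
        return None
    path = [T]                                                   #5
    while table[path[-1]][3] is not None:
        path.append(table[path[-1]][3])
    path.reverse()
    return path
```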

Gustavo Niemeyer [9] points out that using a more Pythonic data structure than tuples to store information about node distances increases readability. Two equivalent node structures (one using None, the other using Max) and their use in a suitably modified Dijkstra's shortest path algorithm are given below.

class SuperNode:
    def __init__(self, node, parent, distance, visited):
        self.node = node
        self.parent = parent
        self.distance = distance
        self.visited = visited

class MaxNode(SuperNode):
    def __init__(self, node, parent=None, distance=Max,
                 visited=False):
        SuperNode.__init__(self, node, parent, distance, visited)
    def __cmp__(self, other):
        return cmp((self.visited, self.distance),
                   (other.visited, other.distance))

class NoneNode(SuperNode):
    def __init__(self, node, parent=None, distance=None,
                 visited=False):
        SuperNode.__init__(self, node, parent, distance, visited)
    def __cmp__(self, other):
        pair = ((self.visited, self.distance),
                (other.visited, other.distance))
        if None in (self.distance, other.distance):
            return -cmp(*pair)
        return cmp(*pair)

def DijkstraSP_table_node(graph, S, T, Node):
    table = {}                                                 #3
    for node in graph.iterkeys():
        table[node] = Node(node)                               #1
    table[S] = Node(S, distance=0)                             #2
    cur = min(table.values())                                  #4a
    sentinel = Node(None).distance
    while not cur.visited and cur.distance != sentinel:        #4
        cur.visited = True                                     #4b
        for cdist, child in graph[cur.node]:               #4c
            ndist = cur.distance+cdist                     #|
            if not table[child].visited and\               #|
               ndist < table[child].distance:              #|
                table[child].distance = ndist              #|
                table[child].parent = cur.node             #|_
        cur = min(table.values())                              #4a
    if not table[T].visited:
        return None
    cur = T                                                    #5
    path = [T]                                                 #|
    while table[cur].parent is not None:                       #|
        path.append(table[cur].parent)                         #|
        cur = path[-1]                                         #|
    path.reverse()                                             #|
    return path                                                #|_

In the above, passing in either NoneNode or MaxNode would be sufficient to use either None or Max for the node distance 'infinity'. Note the additional special case required for None being used as a sentinel in NoneNode in the __cmp__ method.

This example highlights the special case handling where None is used as a sentinel value for maximum values "in the wild", even though None itself compares smaller than any other object in the standard distribution.

As an aside, it is not clear to the author that using Nodes as a replacement for tuples has increased readability significantly, if at all.

A Min Example

An example of usage for Min is an algorithm that solves the following problem [6]:

Suppose you are given a directed graph, representing a communication network. The vertices are the nodes in the network, and each edge is a communication channel. Each edge (u, v) has an associated value r(u, v), with 0 <= r(u, v) <= 1, which represents the reliability of the channel from u to v (i.e., the probability that the channel from u to v will not fail). Assume that the reliability probabilities of the channels are independent. (This implies that the reliability of any path is the product of the reliability of the edges along the path.) Now suppose you are given two nodes in the graph, A and B. Find the most reliable path from A to B.

Such an algorithm is a 7 line modification to the DijkstraSP_table algorithm given above (modified lines prefixed with *):

def DijkstraSP_table(graph, S, T):
    table = {}                                                 #3
    for node in graph.iterkeys():
        #(visited, distance, node, parent)
*       table[node] = (0, Min, node, None)                     #1
*   table[S] = (0, 1, S, None)                                 #2
*   cur = max(table.values())                                  #4a
*   while (not cur[0]) and cur[1] > Min:                       #4
        (visited, distance, node, parent) = cur
        table[node] = (1, distance, node, parent)              #4b
        for cdist, child in graph[node]:                       #4c
*           ndist = distance*cdist                             #|
*           if not table[child][0] and ndist > table[child][1]:#|
                table[child] = (0, ndist, child, node)         #|_
*       cur = max(table.values())                              #4a
    if not table[T][0]:
        return None
    cur = T                                                    #5
    path = [T]                                                 #|
    while table[cur][3] is not None:                           #|
        path.append(table[cur][3])                             #|
        cur = path[-1]                                         #|
    path.reverse()                                             #|
    return path                                                #|_

Note that there is a way of translating the graph so that it can be passed unchanged into the original DijkstraSP_table algorithm. There also exist a handful of easy methods for constructing Node objects that would work with DijkstraSP_table_node. Such translations are left as an exercise for the reader.

Other Examples

Andrew P. Lentvorski, Jr. [7] has pointed out that various data structures involving range searching have immediate use for Max and Min values. More specifically: segment trees, range trees, k-d trees, and database keys:

...The issue is that a range can be open on one side and does not always have an initialized case.

The solutions I have seen are to either overload None as the extremum or use an arbitrary large magnitude number. Overloading None means that the built-ins can't really be used without special case checks to work around the undefined (or "wrongly defined") ordering of None. These checks tend to swamp the nice performance of built-ins like max() and min().

Choosing a large magnitude number throws away the ability of Python to cope with arbitrarily large integers and introduces a potential source of overrun/underrun bugs.

Further use examples of both Max and Min are available in the realm of graph algorithms, range searching algorithms, computational geometry algorithms, and others.

Independent Implementations?

Independent implementations of the Min/Max concept by users desiring such functionality are not likely to be compatible, and certainly will produce inconsistent orderings. The following examples seek to show how inconsistent they can be.

  • Let us pretend we have created proper separate implementations of MyMax, MyMin, YourMax and YourMin with the same code as given in the sample implementation (with some minor renaming):

    >>> lst = [YourMin, MyMin, MyMin, YourMin, MyMax, YourMin, MyMax,
    YourMax, MyMax]
    >>> lst.sort()
    >>> lst
    [YourMin, YourMin, MyMin, MyMin, YourMin, MyMax, MyMax, YourMax,
    MyMax]
    

    Notice that while all the "Min"s are before the "Max"s, there is no guarantee that all instances of YourMin will come before MyMin, the reverse, or the equivalent MyMax and YourMax.

  • The problem is also evident when using the heapq module:

    >>> lst = [YourMin, MyMin, MyMin, YourMin, MyMax, YourMin, MyMax,
    YourMax, MyMax]
    >>> heapq.heapify(lst)  #not needed, but it can't hurt
    >>> while lst: print heapq.heappop(lst),
    ...
    YourMin MyMin YourMin YourMin MyMin MyMax MyMax YourMax MyMax
    
  • Furthermore, the findmin_Max code and both versions of Dijkstra could result in incorrect output by passing in secondary versions of Max.

It has been pointed out [9] that the reference implementation given below would be incompatible with independent implementations of Max/Min. The point of this PEP is the introduction of "The One True Implementation" of "The One True Maximum" and "The One True Minimum". User-based implementations of Max and Min objects would thus be discouraged, and use of "The One True Implementation" would obviously be encouraged. Ambiguous behavior resulting from mixing users' implementations of Max and Min with "The One True Implementation" should be easy to discover through variable and/or source code introspection.

Reference Implementation

class _ExtremeType(object):

    def __init__(self, cmpr, rep):
        object.__init__(self)
        self._cmpr = cmpr
        self._rep = rep

    def __cmp__(self, other):
        if isinstance(other, self.__class__) and\
           other._cmpr == self._cmpr:
            return 0
        return self._cmpr

    def __repr__(self):
        return self._rep

Max = _ExtremeType(1, "Max")
Min = _ExtremeType(-1, "Min")

Results of Test Run:

>>> max(Max, 2**65536)
Max
>>> min(Max, 2**65536)
20035299304068464649790...
(lines removed for brevity)
...72339445587895905719156736L
>>> min(Min, -2**65536)
Min
>>> max(Min, -2**65536)
-2003529930406846464979...
(lines removed for brevity)
...072339445587895905719156736L
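The reference implementation relies on __cmp__, which exists only in Python 2. A Python 3 sketch of the same idea using rich comparison methods (the structure here is an adaptation, not part of the original proposal):

```python
# Python 3 adaptation of the _ExtremeType reference implementation:
# rich comparisons replace the Python 2 __cmp__ protocol.
from functools import total_ordering

@total_ordering
class _ExtremeType:
    def __init__(self, sign, rep):
        self._sign = sign   # +1 for the maximum, -1 for the minimum
        self._rep = rep

    def __eq__(self, other):
        return (isinstance(other, _ExtremeType)
                and other._sign == self._sign)

    def __lt__(self, other):
        # Min is less than everything except itself; Max is less than nothing.
        return self._sign < 0 and self != other

    def __repr__(self):
        return self._rep

Max = _ExtremeType(+1, "Max")
Min = _ExtremeType(-1, "Min")
```

Because int comparisons return NotImplemented for unknown types, Python falls back to the reflected _ExtremeType methods, so Max and Min order correctly against arbitrarily large integers.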

Open Issues

As the PEP was rejected, all open issues are now closed and inconsequential. The module will use the names UniversalMaximum and UniversalMinimum, since it would be very difficult to mistake what each does. For those who require a shorter name, renaming the singletons during import is suggested:

from extremes import (UniversalMaximum as uMax,
                      UniversalMinimum as uMin)

References

[1]RE: [Python-Dev] Re: Got None. Maybe Some?, Peters, Tim (http://mail.python.org/pipermail/python-dev/2003-December/041374.html)
[2]Re: [Python-Dev] Got None. Maybe Some?, van Rossum, Guido (http://mail.python.org/pipermail/python-dev/2003-December/041352.html)
[3]RE: [Python-Dev] Got None. Maybe Some?, Peters, Tim (http://mail.python.org/pipermail/python-dev/2003-December/041332.html)
[4](1, 2) [Python-Dev] Re: PEP 326 now online, Reedy, Terry (http://mail.python.org/pipermail/python-dev/2004-January/041685.html)
[5][Python-Dev] PEP 326 now online, Chermside, Michael (http://mail.python.org/pipermail/python-dev/2004-January/041704.html)
[6]Homework 6, Problem 7, Dillencourt, Michael (link may not be valid in the future) (http://www.ics.uci.edu/~dillenco/ics161/hw/hw6.pdf)
[7]RE: [Python-Dev] PEP 326 now online, Lentvorski, Andrew P., Jr. (http://mail.python.org/pipermail/python-dev/2004-January/041727.html)
[8]Re: It's not really Some is it?, Ippolito, Bob (http://www.livejournal.com/users/chouyu_31/138195.html?thread=274643#t274643)
[9](1, 2) [Python-Dev] Re: PEP 326 now online, Niemeyer, Gustavo (http://mail.python.org/pipermail/python-dev/2004-January/042261.html); [Python-Dev] Re: PEP 326 now online, Carlson, Josiah (http://mail.python.org/pipermail/python-dev/2004-January/042272.html)
[11][Python-Dev] PEP 326 (quick location possibility), Carlson, Josiah (http://mail.python.org/pipermail/python-dev/2004-January/042275.html)
[12](1, 2) [Python-Dev] PEP 326 (quick location possibility), van Rossum, Guido (http://mail.python.org/pipermail/python-dev/2004-January/042306.html)
[13][Python-Dev] PEP 326 (quick location possibility), Carlson, Josiah (http://mail.python.org/pipermail/python-dev/2004-January/042300.html)
[14]Recommended standard implementation of PEP 326, extremes.py, Carlson, Josiah (http://www.ics.uci.edu/~jcarlson/pep326/extremes.py)

Changes

pep-0327 Decimal Data Type

PEP:327
Title:Decimal Data Type
Version:$Revision$
Last-Modified:$Date$
Author:Facundo Batista <facundo at taniquetil.com.ar>
Status:Final
Type:Standards Track
Content-Type:text/x-rst
Created:17-Oct-2003
Python-Version:2.4
Post-History:30-Nov-2003, 02-Jan-2004, 29-Jan-2004

Abstract

The idea is to have a Decimal data type, for every use where decimals are needed but binary floating point is too inexact.

The Decimal data type will support the Python standard functions and operations, and must comply with the decimal arithmetic ANSI standard X3.274-1996 [1].

Decimal will be floating point (as opposed to fixed point) and will have bounded precision (the precision is the upper limit on the number of significant digits in a result). However, precision is user-settable, and a notion of significant trailing zeroes is supported so that fixed-point usage is also possible.

This work is based on code and test functions written by Eric Price, Aahz and Tim Peters. Just before Python 2.4a1, the decimal.py reference implementation was moved into the standard library; along with the documentation and the test suite, this was the work of Raymond Hettinger. Much of the explanation in this PEP is taken from Cowlishaw's work [2], comp.lang.python and python-dev.

Motivation

Here I'll explain the reasons why I think a Decimal data type is needed and why other numeric data types are not enough.

I wanted a Money data type, and after proposing a pre-PEP in comp.lang.python, the community agreed to have a numeric data type with the needed arithmetic behaviour, and then build Money over it: all the considerations about quantity of digits after the decimal point, rounding, etc., will be handled through Money. It is not the purpose of this PEP to have a data type that can be used as Money without further effort.

One of the biggest advantages of implementing a standard is that someone already thought out all the creepy cases for you. And to a standard GvR redirected me: Mike Cowlishaw's General Decimal Arithmetic specification [2]. This document defines a general purpose decimal arithmetic. A correct implementation of this specification will conform to the decimal arithmetic defined in ANSI/IEEE standard 854-1987, except for some minor restrictions, and will also provide unrounded decimal arithmetic and integer arithmetic as proper subsets.

The problem with binary float

In decimal math, there are many numbers that can't be represented with a fixed number of decimal digits, e.g. 1/3 = 0.3333333333...

In base 2 (the way standard floating point numbers are represented), 1/2 = 0.1, 1/4 = 0.01, 1/8 = 0.001, etc. Decimal 0.2 equals 2/10 equals 1/5, resulting in the repeating binary fraction 0.001100110011001... As you can see, the problem is that some decimal numbers can't be represented exactly in binary, resulting in small roundoff errors.
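Today's decimal module (the eventual product of this PEP) makes the contrast easy to demonstrate:

```python
from decimal import Decimal

# Binary floats accumulate representation error:
binary_sum = 0.1 + 0.2                         # not exactly 0.3
# Decimal stores the decimal digits exactly:
decimal_sum = Decimal("0.1") + Decimal("0.2")  # exactly 0.3
```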

So we need a data type that represents decimal numbers exactly. Instead of a binary data type, we need a decimal one.

Why floating point?

So we go to decimal, but why floating point?

Floating point numbers use a fixed quantity of digits (precision) to represent a number, working with an exponent when the number gets too big or too small. For example, with a precision of 5:

  1234 ==>   1234e0
 12345 ==>  12345e0
123456 ==>  12346e1

(note that in the last line the number got rounded to fit in five digits).
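The rounding in the last line can be reproduced with the stdlib decimal module by setting the context precision to 5 (a sketch; the module itself postdates this part of the PEP):

```python
from decimal import Decimal, getcontext

getcontext().prec = 5          # five significant digits, as in the example
result = Decimal(123456) + 0   # arithmetic rounds to the context precision
# result corresponds to 12346e1 in the table above
```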

In contrast, we have the example of a long integer with infinite precision, meaning that you can have the number as big as you want, and you'll never lose any information.

In a fixed point number, the position of the decimal point is fixed. For a fixed point data type, check Tim Peters' FixedPoint at SourceForge [4]. I'll go for floating point because it's easier to implement the arithmetic behaviour of the standard, and a fixed point data type can then be implemented over Decimal.

But why can't we have a floating point number with infinite precision? It's not so easy, because of inexact divisions. E.g.: 1/3 = 0.3333333333333... ad infinitum. In this case you would need to store an infinite number of 3s, which takes too much memory. ;)

John Roth proposed to eliminate the division operator and force the user to use an explicit method, just to avoid this kind of trouble. This generated adverse reactions in comp.lang.python, as everybody wants to have support for the / operator in a numeric data type.

With this exposed, maybe you're thinking "Hey! Can we just store the 1 and the 3 as numerator and denominator?", which takes us to the next point.

Why not rational?

Rational numbers are stored using two integer numbers, the numerator and the denominator. This implies that the arithmetic operations can't be executed directly (e.g. to add two rational numbers you first need to calculate the common denominator).

Quoting Alex Martelli:

The performance implications of the fact that summing two rationals (which take O(M) and O(N) space respectively) gives a rational which takes O(M+N) memory space is just too troublesome. There are excellent Rational implementations in both pure Python and as extensions (e.g., gmpy), but they'll always be a "niche market" IMHO. Probably worth PEPping, not worth doing without Decimal -- which is the right way to represent sums of money, a truly major use case in the real world.

Anyway, if you're interested in this data type, you maybe will want to take a look at PEP 239: Adding a Rational Type to Python.

So, what do we have?

The result is a Decimal data type, with bounded precision and floating point.

Will it be useful? I can't say it better than Alex Martelli:

Python (out of the box) doesn't let you have binary floating point numbers with whatever precision you specify: you're limited to what your hardware supplies. Decimal, be it used as a fixed or floating point number, should suffer from no such limitation: whatever bounded precision you may specify on number creation (your memory permitting) should work just as well. Most of the expense of programming simplicity can be hidden from application programs and placed in a suitable decimal arithmetic type. As per http://speleotrove.com/decimal/, a single data type can be used for integer, fixed-point, and floating-point decimal arithmetic -- and for money arithmetic which doesn't drive the application programmer crazy.

There are several uses for such a data type. As I said before, I will use it as base for Money. In this case the bounded precision is not an issue; quoting Tim Peters:

A precision of 20 would be way more than enough to account for total world economic output, down to the penny, since the beginning of time.

General Decimal Arithmetic Specification

Here I'll include information and descriptions that are part of the specification [2] (the structure of the number, the context, etc.). All the requirements included in this section are not for discussion (barring typos or other mistakes), as they are in the standard, and the PEP is just for implementing the standard.

Because of copyright restrictions, I cannot copy here explanations taken from the specification, so I'll try to explain it in my own words. I firmly encourage you to read the original specification document [2] for details or if you have any doubt.

The Arithmetic Model

The specification is based on a decimal arithmetic model, as defined by the relevant standards: IEEE 854 [3], ANSI X3-274 [1], and the proposed revision [5] of IEEE 754 [6].

The model has three components:

  • Numbers: just the values that the operation uses as input or output.
  • Operations: addition, multiplication, etc.
  • Context: a set of parameters and rules that the user can select and which govern the results of operations (for example, the precision to be used).

Numbers

Numbers may be finite or special values. The former can be represented exactly. The latter are infinities and undefined results (such as 0/0).

Finite numbers are defined by three parameters:

  • Sign: 0 (positive) or 1 (negative).
  • Coefficient: a non-negative integer.
  • Exponent: a signed integer, the power of ten of the coefficient multiplier.

The numerical value of a finite number is given by:

(-1)**sign * coefficient * 10**exponent
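The three parameters are directly visible on today's Decimal objects via as_tuple() (a sketch using the stdlib module this PEP produced):

```python
from decimal import Decimal

d = Decimal("-2.50")
sign, digits, exponent = d.as_tuple()
# sign=1 (negative), digits=(2, 5, 0) -> coefficient 250, exponent=-2
coefficient = int("".join(map(str, digits)))
value = (-1) ** sign * coefficient * 10 ** exponent   # -2.5
```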

Special values are named as follows:

  • Infinity: a value which is infinitely large. It can be positive or negative.
  • Quiet NaN ("qNaN"): represents an undefined result (Not a Number). It does not cause an Invalid operation condition. The sign of a NaN has no meaning.
  • Signaling NaN ("sNaN"): also Not a Number, but it will cause an Invalid operation condition if used in any operation.

Context

The context is a set of parameters and rules that the user can select and which govern the results of operations (for example, the precision to be used).

The context gets that name because it surrounds the Decimal numbers, with parts of context acting as input to, and output of, operations. It's up to the application to work with one or several contexts, but definitely the idea is not to get a context per Decimal number. For example, a typical use would be to set the context's precision to 20 digits at the start of a program, and never explicitly use context again.

These definitions don't affect the internal storage of the Decimal numbers, just the way that the arithmetic operations are performed.

The context is mainly defined by the following parameters (see Context Attributes for all context attributes):

  • Precision: The maximum number of significant digits that can result from an arithmetic operation (integer > 0). There is no maximum for this value.
  • Rounding: The name of the algorithm to be used when rounding is necessary, one of "round-down", "round-half-up", "round-half-even", "round-ceiling", "round-floor", "round-half-down", and "round-up". See Rounding Algorithms below.
  • Flags and trap-enablers: Exceptional conditions are grouped into signals, controllable individually, each consisting of a flag (boolean, set when the signal occurs) and a trap-enabler (a boolean that controls behavior). The signals are: "clamped", "division-by-zero", "inexact", "invalid-operation", "overflow", "rounded", "subnormal" and "underflow".
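These parameters can be seen in action with the decimal module as it eventually shipped; a minimal sketch using localcontext() (a convenience added after this PEP) to bound the precision of operations:

```python
from decimal import Decimal, localcontext

# Temporarily lower the precision; only the results of operations are
# affected, never the numbers already stored.
with localcontext() as ctx:
    ctx.prec = 4
    third = Decimal(1) / Decimal(3)

print(third)  # rounded to 4 significant digits
```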

Default Contexts

The specification defines two default contexts, which should be easily selectable by the user.

Basic Default Context:

  • flags: all set to 0
  • trap-enablers: inexact, rounded, and subnormal are set to 0; all others are set to 1
  • precision: is set to 9
  • rounding: is set to round-half-up

Extended Default Context:

  • flags: all set to 0
  • trap-enablers: all set to 0
  • precision: is set to 9
  • rounding: is set to round-half-even
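Both defaults survive in the shipped module as ready-made Context instances; a quick check of the attributes listed above:

```python
from decimal import BasicContext, ExtendedContext, ROUND_HALF_UP, ROUND_HALF_EVEN

# Both default contexts use 9 digits of precision...
print(BasicContext.prec, ExtendedContext.prec)
# ...and differ in their rounding algorithm, matching the lists above.
print(BasicContext.rounding == ROUND_HALF_UP)
print(ExtendedContext.rounding == ROUND_HALF_EVEN)
```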

Exceptional Conditions

The table below lists the exceptional conditions that may arise during the arithmetic operations, the corresponding signal, and the defined result. For details, see the specification [2].

Condition           Signal              Result
------------------  ------------------  ----------------------------------------
Clamped             clamped             see spec [2]
Division by zero    division-by-zero    [sign,inf]
Inexact             inexact             unchanged
Invalid operation   invalid-operation   [0,qNaN] (or [s,qNaN] or [s,qNaN,d] when
                                        the cause is a signaling NaN)
Overflow            overflow            depends on the rounding mode
Rounded             rounded             unchanged
Subnormal           subnormal           unchanged
Underflow           underflow           see spec [2]

Note: the standard's "Insufficient storage" condition is implementation-specific behaviour, triggered when there is not enough storage to keep the internals of the number; this implementation will raise MemoryError in that case.

Regarding Overflow and Underflow, there's been a long discussion in python-dev about artificial limits. The general consensus is to keep the artificial limits only if there are important reasons to do so. Tim Peters gives us three:

...eliminating bounds on exponents effectively means overflow (and underflow) can never happen. But overflow is a valuable safety net in real life fp use, like a canary in a coal mine, giving danger signs early when a program goes insane.

Virtually all implementations of 854 use (and as IBM's standard even suggests) "forbidden" exponent values to encode non-finite numbers (infinities and NaNs). A bounded exponent can do this at virtually no extra storage cost. If the exponent is unbounded, then additional bits have to be used instead. This cost remains hidden until more time- and space- efficient implementations are attempted.

Big as it is, the IBM standard is a tiny start at supplying a complete numeric facility. Having no bound on exponent size will enormously complicate the implementations of, e.g., decimal sin() and cos() (there's then no a priori limit on how many digits of pi effectively need to be known in order to perform argument reduction).

Edward Loper gives us an example of when the limits would be crossed: probabilities.

That said, Robert Brewer and Andrew Lentvorski want the limits to be easily modifiable by the users. Actually, this is quite possible:

>>> d1 = Decimal("1e999999999")     # at the exponent limit
>>> d1
Decimal("1E+999999999")
>>> d1 * 10                         # exceed the limit, got infinity
Traceback (most recent call last):
  File "<pyshell#3>", line 1, in ?
    d1 * 10
  ...
  ...
Overflow: above Emax
>>> getcontext().Emax = 1000000000  # increase the limit
>>> d1 * 10                         # does not exceed any more
Decimal("1.0E+1000000000")
>>> d1 * 100                        # exceed again
Traceback (most recent call last):
  File "<pyshell#3>", line 1, in ?
    d1 * 100
  ...
  ...
Overflow: above Emax

Rounding Algorithms

round-down: The discarded digits are ignored; the result is unchanged (round toward 0, truncate):

1.123 --> 1.12
1.128 --> 1.12
1.125 --> 1.12
1.135 --> 1.13

round-half-up: If the discarded digits represent greater than or equal to half (0.5) then the result should be incremented by 1; otherwise the discarded digits are ignored:

1.123 --> 1.12
1.128 --> 1.13
1.125 --> 1.13
1.135 --> 1.14

round-half-even: If the discarded digits represent greater than half (0.5) then the result coefficient is incremented by 1; if they represent less than half, then the result is not adjusted; otherwise the result is unaltered if its rightmost digit is even, or incremented by 1 if its rightmost digit is odd (to make an even digit):

1.123 --> 1.12
1.128 --> 1.13
1.125 --> 1.12
1.135 --> 1.14

round-ceiling: If all of the discarded digits are zero or if the sign is negative the result is unchanged; otherwise, the result is incremented by 1 (round toward positive infinity):

 1.123 -->  1.13
 1.128 -->  1.13
-1.123 --> -1.12
-1.128 --> -1.12

round-floor: If all of the discarded digits are zero or if the sign is positive the result is unchanged; otherwise, the absolute value of the result is incremented by 1 (round toward negative infinity):

 1.123 -->  1.12
 1.128 -->  1.12
-1.123 --> -1.13
-1.128 --> -1.13

round-half-down: If the discarded digits represent greater than half (0.5) then the result is incremented by 1; otherwise the discarded digits are ignored:

1.123 --> 1.12
1.128 --> 1.13
1.125 --> 1.12
1.135 --> 1.13

round-up: If all of the discarded digits are zero the result is unchanged, otherwise the result is incremented by 1 (round away from 0):

1.123 --> 1.13
1.128 --> 1.13
1.125 --> 1.13
1.135 --> 1.14
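The differences between the modes show up exactly on the half-way case 1.125; a sketch exercising a few of them through the shipped module's quantize() method:

```python
from decimal import Decimal, ROUND_DOWN, ROUND_HALF_UP, ROUND_HALF_EVEN, ROUND_UP

d = Decimal('1.125')  # the discarded digit is exactly half
for mode in (ROUND_DOWN, ROUND_HALF_UP, ROUND_HALF_EVEN, ROUND_UP):
    # quantize() rounds to the exponent of the second operand (here 0.01)
    print(mode, d.quantize(Decimal('0.01'), rounding=mode))
```

Note how round-half-even keeps the rightmost digit even (1.12), while round-half-up always increments on the tie (1.13).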

Rationale

I must separate the requirements into two sections. The first is to comply with the ANSI standard. All the requirements for this are specified in Mike Cowlishaw's work [2]. He also provided a very large suite of test cases.

The second section of requirements (standard Python functions support, usability, etc.) is detailed from here, where I'll include all the decisions made and why, and all the subjects still being discussed.

Explicit construction

The explicit construction is not affected by the context (there is no rounding, no limits imposed by the precision, etc.), because the context affects only the results of operations. The only exception to this is when you're Creating from Context.

From int or long

There's no loss and no need to specify any other information:

Decimal(35)
Decimal(-124)

From string

Strings containing Python decimal integer literals and Python float literals will be supported. In this transformation there is no loss of information, as the string is directly converted to Decimal (there is no intermediate conversion through float):

Decimal("-12")
Decimal("23.2e-7")

Also, you can construct in this way all special values (Infinity and Not a Number):

Decimal("Inf")
Decimal("NaN")

From float

The initial discussion on this item was what should happen when passing floating point to the constructor:

  1. Decimal(1.1) == Decimal('1.1')
  2. Decimal(1.1) == Decimal('110000000000000008881784197001252...e-51')
  3. an exception is raised

Several people alleged that (1) is the better option here, because it's what you expect when writing Decimal(1.1). And quoting John Roth, it's easy to implement:

It's not at all difficult to find where the actual number ends and where the fuzz begins. You can do it visually, and the algorithms to do it are quite well known.

But if I really want my number to be Decimal('110000000000000008881784197001252...e-51'), why can't I write Decimal(1.1)? Why should I expect Decimal to be "rounding" it? Remember that 1.1 is binary floating point, so I can predict the result. It's not intuitive to a beginner, but that's the way it is.

Anyway, Paul Moore showed that (1) can't work, because:

(1) says  D(1.1) == D('1.1')
but       1.1 == 1.1000000000000001
so        D(1.1) == D(1.1000000000000001)
together: D(1.1000000000000001) == D('1.1')

which is wrong, because if I write Decimal('1.1') it is exact, not D(1.1000000000000001). He also proposed to have an explicit conversion to float. bokr says you need to put the precision in the constructor and mwilson agreed:

d = Decimal (1.1, 1)  # take float value to 1 decimal place
d = Decimal (1.1)  # gets `places` from pre-set context

But Alex Martelli says that:

Constructing with some specified precision would be fine. Thus, I think "construction from float with some default precision" runs a substantial risk of tricking naive users.

So, the solution accepted through c.l.p is that you cannot call Decimal with a float. Instead you must use a method: Decimal.from_float(). The syntax:

Decimal.from_float(floatNumber, [decimal_places])

where floatNumber is the float to be converted and decimal_places is the optional number of digits after the decimal point at which to apply a round-half-up rounding. In this way you can do, for example:

Decimal.from_float(1.1, 2): The same as doing Decimal('1.1').
Decimal.from_float(1.1, 16): The same as doing Decimal('1.1000000000000001').
Decimal.from_float(1.1): The same as doing Decimal('1100000000000000088817841970012523233890533447265625e-51').

Based on later discussions, it was decided to omit from_float() from the API for Py2.4. Several ideas contributed to the thought process:

  • Interactions between decimal and binary floating point force the user to deal with tricky issues of representation and round-off. Avoidance of those issues is a primary reason for having the module in the first place.

  • The first release of the module should focus on that which is safe, minimal, and essential.

  • While theoretically nice, real world use cases for interactions between floats and decimals are lacking. Java included float/decimal conversions to handle an obscure case where calculations are best performed in decimal even though a legacy data structure requires the inputs and outputs to be stored in binary floating point.

  • If the need arises, users can use string representations as an intermediate type. The advantage of this approach is that it makes explicit the assumptions about precision and representation (no wondering what is going on under the hood).

  • The Java docs for BigDecimal(double val) reflected their experiences with the constructor:

    The results of this constructor can be somewhat
    unpredictable and its use is generally not recommended.
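For historical perspective: from_float() did eventually land (in Python 2.7 and 3.1), and it performs the exact conversion of option (2) above; a quick sketch:

```python
from decimal import Decimal

# 0.5 is exactly representable in binary, so the conversion is short...
print(Decimal.from_float(0.5))
# ...while 1.1 is not, and the full binary expansion is exposed.
print(Decimal.from_float(1.1))
```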
    

From tuples

Aahz suggested constructing from tuples: it's easier to implement eval()'s round trip, and "someone who has numeric values representing a Decimal does not need to convert them to a string."

The structure will be a tuple of three elements: sign, number and exponent. The sign is 1 or 0, the number is a tuple of decimal digits and the exponent is a signed int or long:

Decimal((1, (3, 2, 2, 5), -2))     # for -32.25

Of course, you can construct in this way all special values:

Decimal( (0, (0,), 'F') )          # for Infinity
Decimal( (0, (0,), 'n') )          # for Not a Number

From Decimal

No mystery here, just a copy.

Syntax for All Cases

Decimal(value1)
Decimal.from_float(value2, [decimal_places])

where value1 can be int, long, string, 3-tuple or Decimal, value2 can only be float, and decimal_places is an optional non-negative int.

Creating from Context

This item arose in python-dev from two sources in parallel. Ka-Ping Yee proposes to pass the context as an argument at instance creation (he wants the context he passes to be used only at creation time: "It would not be persistent"). Tony Meyer asks from_string to honor the context if it receives a parameter "honour_context" with a True value. (I don't like that, because the spec requires the context to be honored, and whether the method complies with the specification should not depend on the value of an argument.)

Tim Peters gives us a reason to have a creation that uses context:

In general number-crunching, literals may be given to high precision, but that precision isn't free and usually isn't needed

Casey Duncan wants to use another method, not a bool arg:

I find boolean arguments a general anti-pattern, especially given we have class methods. Why not use an alternate constructor like Decimal.rounded_to_context("3.14159265").

In the process of deciding the syntax of that, Tim came up with a better idea: he proposes not to have a method in Decimal to create with a different context, but having instead a method in Context to create a Decimal instance. Basically, instead of:

D.using_context(number, context)

it will be:

context.create_decimal(number)

From Tim:

While all operations in the spec except for the two to-string operations use context, no operations in the spec support an optional local context. That the Decimal() constructor ignores context by default is an extension to the spec. We must supply a context-honoring from-string operation to meet the spec. I recommend against any concept of "local context" in any operation -- it complicates the model and isn't necessary.

So, we decided to use a context method to create a Decimal that will use (only to be created) that context in particular (for further operations it will use the context of the thread). But, a method with what name?

Tim Peters proposes three methods to create from diverse sources (from_string, from_int, from_float). I proposed to use one method, create_decimal(), without caring about the data type. Michael Chermside: "The name just fits my brain. The fact that it uses the context is obvious from the fact that it's a Context method".

The community agreed with that. I think that it's OK because a newbie will not be using the creation method from Context (the separate method in Decimal to construct from float is just to prevent newbies from encountering binary floating point issues).

So, in short, if you want to create a Decimal instance using a particular context (that will be used just at creation time and not any further), you'll have to use a method of that context:

# n is any datatype accepted in Decimal(n) plus float
mycontext.create_decimal(n)

Example:

>>> # create a standard decimal instance
>>> Decimal("11.2233445566778899")
Decimal("11.2233445566778899")
>>>
>>> # create a decimal instance using the thread context
>>> thread_context = getcontext()
>>> thread_context.prec
28
>>> thread_context.create_decimal("11.2233445566778899")
Decimal("11.2233445566778899")
>>>
>>> # create a decimal instance using other context
>>> other_context = thread_context.copy()
>>> other_context.prec = 4
>>> other_context.create_decimal("11.2233445566778899")
Decimal("11.22")

Implicit construction

Since implicit construction is the consequence of an operation, it is affected by the context, as detailed for each case below.

John Roth suggested that "The other type should be handled in the same way the decimal() constructor would handle it". But Alex Martelli thinks that

this total breach with Python tradition would be a terrible mistake. 23+"43" is NOT handled in the same way as 23+int("45"), and a VERY good thing that is too. It's a completely different thing for a user to EXPLICITLY indicate they want construction (conversion) and to just happen to sum two objects one of which by mistake could be a string.

So, here I define the behaviour again for each data type.

From int or long

An int or long is treated like a Decimal explicitly constructed from Decimal(str(x)) in the current context (meaning that the to-string rules for rounding are applied and the appropriate flags are set). This guarantees that expressions like Decimal('1234567') + 13579 match the mental model of Decimal('1234567') + Decimal('13579'). That model works because all integers are representable as strings without representation error.
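A quick check with the shipped module confirms that mental model:

```python
from decimal import Decimal

# Mixing Decimal and int is allowed: the int is converted exactly,
# so the mixed expression equals the all-Decimal one.
result = Decimal('1234567') + 13579
print(result)
assert result == Decimal('1234567') + Decimal('13579')
```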

From string

Everybody agrees to raise an exception here.

From float

Aahz is strongly opposed to interaction with floats, suggesting an explicit conversion:

The problem is that Decimal is capable of greater precision, accuracy, and range than float.

The example of the valid Python expression, 35 + 1.1, seems to suggest that Decimal(35) + 1.1 should also be valid. However, a closer look shows that it only demonstrates the feasibility of integer to floating point conversions. Hence, the correct analog for decimal floating point is 35 + Decimal(1.1). Both coercions, int-to-float and int-to-Decimal, can be done without incurring representation error.

The question of how to coerce between binary and decimal floating point is more complex. I proposed allowing the interaction with float, making an exact conversion and raising ValueError if it exceeds the precision in the current context (this is maybe too tricky, because for example with a precision of 9, Decimal(35) + 1.2 is OK but Decimal(35) + 1.1 raises an error).

This turned out to be too tricky. So tricky, in fact, that c.l.p agreed to raise TypeError in this case: you cannot mix Decimal and float.
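That decision held in the module as shipped: arithmetic between Decimal and float raises TypeError (later Python versions did relax equality comparison between the two types, though not arithmetic). A quick sketch:

```python
from decimal import Decimal

try:
    Decimal('35') + 1.1
    mixed_ok = True
except TypeError:
    # Mixing Decimal and float in arithmetic is rejected.
    mixed_ok = False

print('float mixing allowed?', mixed_ok)
```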

From Decimal

There isn't any issue here.

Use of Context

In the last pre-PEP I said that "The Context must be omnipresent, meaning that changes to it affect all the current and future Decimal instances". I was wrong. In response, John Roth said:

The context should be selectable for the particular usage. That is, it should be possible to have several different contexts in play at one time in an application.

In comp.lang.python, Aahz explained that the idea is to have a "context per thread". So, all the instances of a thread belong to a context, and you can change the context in thread A (and the behaviour of the instances of that thread) without changing anything in thread B.

Also, and again correcting me, he said:

(the) Context applies only to operations, not to Decimal instances; changing the Context does not affect existing instances if there are no operations on them.

Arguing about special cases, when there's a need to perform operations with rules other than those of the current context, Tim Peters said that the context will have the operations as methods. This way, the user "can create whatever private context object(s) it needs, and spell arithmetic as explicit method calls on its private context object(s), so that the default thread context object is neither consulted nor modified".

Python Usability

  • Decimal should support the basic arithmetic (+, -, *, /, //, **, %, divmod) and comparison (==, !=, <, >, <=, >=, cmp) operators in the following cases (check Implicit Construction to see what types could OtherType be, and what happens in each case):

    • Decimal op Decimal
    • Decimal op otherType
    • otherType op Decimal
    • Decimal op= Decimal
    • Decimal op= otherType
  • Decimal should support unary operators (-, +, abs).

  • repr() should round trip, meaning that:

    m = Decimal(...)
    m == eval(repr(m))
    
  • Decimal should be immutable.

  • Decimal should support the built-in methods:

    • min, max
    • float, int, long
    • str, repr
    • hash
    • bool (0 is false, otherwise true)

There's been some discussion in python-dev about the behaviour of hash(). The community agrees that if the values are the same, the hashes of those values should also be the same. So, while Decimal(25) == 25 is True, hash(Decimal(25)) should be equal to hash(25).

The subtlety is that you can NOT compare Decimal to floats or strings, so we should not worry about them giving the same hashes. In short:

hash(n) == hash(Decimal(n))   # Only if n is int, long, or Decimal
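The invariant can be checked directly with the shipped module:

```python
from decimal import Decimal

# Equal values hash equal: hash(n) == hash(Decimal(n)) for integers,
# so Decimal and int can coexist as dict keys.
for n in (25, -3, 10 ** 20):
    assert hash(n) == hash(Decimal(n))
print('hash invariant holds')
```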

Regarding str() and repr() behaviour, Ka-Ping Yee proposes that repr() have the same behaviour as str() and Tim Peters proposes that str() behave like the to-scientific-string operation from the Spec.

This is possible, because (from Aahz): "The string form already contains all the necessary information to reconstruct a Decimal object".

And it also complies with the Spec; Tim Peters:

There's no requirement to have a method named "to_sci_string", the only requirement is that some way to spell to-sci-string's functionality be supplied. The meaning of to-sci-string is precisely specified by the standard, and is a good choice for both str(Decimal) and repr(Decimal).
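In the module as shipped, str() does follow to-scientific-string and repr() wraps it so that eval() round-trips; a quick check:

```python
from decimal import Decimal

m = Decimal('1.23E+11')
print(str(m))    # to-scientific-string form
print(repr(m))   # wraps the same string in Decimal('...')
assert m == eval(repr(m))
```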

Documentation

This section explains all the public methods and attributes of Decimal and Context.

Decimal Attributes

Decimal has no public attributes. The internal information is stored in slots and should not be accessed by end users.

Decimal Methods

Following are the conversion and arithmetic operations defined in the Spec, and how that functionality can be achieved with the actual implementation.

  • to-scientific-string: Use builtin function str():

    >>> d = Decimal('1.23456789E+11')
    >>> str(d)
    '1.23456789E+11'
    
  • to-engineering-string: Use method to_eng_string():

    >>> d = Decimal('1.23456789E+11')
    >>> d.to_eng_string()
    '123.456789E+9'
    
  • to-number: Use Context method create_decimal(). The standard constructor or from_float() constructor cannot be used because these do not use the context (as is specified in the Spec for this conversion).

  • abs: Use builtin function abs():

    >>> d = Decimal('-15.67')
    >>> abs(d)
    Decimal('15.67')
    
  • add: Use operator +:

    >>> d = Decimal('15.6')
    >>> d + 8
    Decimal('23.6')
    
  • subtract: Use operator -:

    >>> d = Decimal('15.6')
    >>> d - 8
    Decimal('7.6')
    
  • compare: Use method compare(). Use this method (and not the built-in function cmp()) when dealing with special values:

    >>> d = Decimal('-15.67')
    >>> nan = Decimal('NaN')
    >>> d.compare(23)
    Decimal('-1')
    >>> d.compare(nan)
    Decimal('NaN')
    >>> cmp(d, 23)
    -1
    >>> cmp(d, nan)
    1
    
  • divide: Use operator /:

    >>> d = Decimal('-15.67')
    >>> d / 2
    Decimal('-7.835')
    
  • divide-integer: Use operator //:

    >>> d = Decimal('-15.67')
    >>> d // 2
    Decimal('-7')
    
  • max: Use method max(). Only use this method (and not the built-in function max()) when dealing with special values:

    >>> d = Decimal('15')
    >>> nan = Decimal('NaN')
    >>> d.max(8)
    Decimal('15')
    >>> d.max(nan)
    Decimal('NaN')
    
  • min: Use method min(). Only use this method (and not the built-in function min()) when dealing with special values:

    >>> d = Decimal('15')
    >>> nan = Decimal('NaN')
    >>> d.min(8)
    Decimal('8')
    >>> d.min(nan)
    Decimal('NaN')
    
  • minus: Use unary operator -:

    >>> d = Decimal('-15.67')
    >>> -d
    Decimal('15.67')
    
  • plus: Use unary operator +:

    >>> d = Decimal('-15.67')
    >>> +d
    Decimal('-15.67')
    
  • multiply: Use operator *:

    >>> d = Decimal('5.7')
    >>> d * 3
    Decimal('17.1')
    
  • normalize: Use method normalize():

    >>> d = Decimal('123.45000')
    >>> d.normalize()
    Decimal('123.45')
    >>> d = Decimal('120.00')
    >>> d.normalize()
    Decimal('1.2E+2')
    
  • quantize: Use method quantize():

    >>> d = Decimal('2.17')
    >>> d.quantize(Decimal('0.001'))
    Decimal('2.170')
    >>> d.quantize(Decimal('0.1'))
    Decimal('2.2')
    
  • remainder: Use operator %:

    >>> d = Decimal('10')
    >>> d % 3
    Decimal('1')
    >>> d % 6
    Decimal('4')
    
  • remainder-near: Use method remainder_near():

    >>> d = Decimal('10')
    >>> d.remainder_near(3)
    Decimal('1')
    >>> d.remainder_near(6)
    Decimal('-2')
    
  • round-to-integral-value: Use method to_integral():

    >>> d = Decimal('-123.456')
    >>> d.to_integral()
    Decimal('-123')
    
  • same-quantum: Use method same_quantum():

    >>> d = Decimal('123.456')
    >>> d.same_quantum(Decimal('0.001'))
    True
    >>> d.same_quantum(Decimal('0.01'))
    False
    
  • square-root: Use method sqrt():

    >>> d = Decimal('123.456')
    >>> d.sqrt()
    Decimal('11.1110756')
    
  • power: Use operator **:

    >>> d = Decimal('12.56')
    >>> d ** 2
    Decimal('157.7536')
    

Following are other methods and why they exist:

  • adjusted(): Returns the adjusted exponent. This concept is defined in the Spec: the adjusted exponent is the value of the exponent of a number when that number is expressed as though in scientific notation with one digit before any decimal point:

    >>> d = Decimal('12.56')
    >>> d.adjusted()
    1
    
  • from_float(): Class method to create instances from float data types:

    >>> d = Decimal.from_float(12.35)
    >>> d
    Decimal('12.3500000')
    
  • as_tuple(): Shows the internal structure of the Decimal (the triple tuple). This method is not required by the Spec, but Tim Peters proposed it and the community agreed to have it (it's useful for developing and debugging):

    >>> d = Decimal('123.4')
    >>> d.as_tuple()
    (0, (1, 2, 3, 4), -1)
    >>> d = Decimal('-2.34e5')
    >>> d.as_tuple()
    (1, (2, 3, 4), 3)
    

Context Attributes

These are the attributes that can be changed to modify the context.

  • prec (int): the precision:

    >>> c.prec
    9
    
  • rounding (str): rounding type (how to round):

    >>> c.rounding
    'half_even'
    
  • trap_enablers (dict): if trap_enablers[exception] is 1, an exception is raised when the corresponding condition occurs:

    >>> c.trap_enablers[Underflow]
    0
    >>> c.trap_enablers[Clamped]
    0
    
  • flags (dict): when an exceptional condition occurs, flags[exception] is incremented (whether or not the trap_enabler is set). The flags should be reset by the user of the Decimal instance:

    >>> c.flags[Underflow]
    0
    >>> c.flags[Clamped]
    0
    
  • Emin (int): minimum exponent:

    >>> c.Emin
    -999999999
    
  • Emax (int): maximum exponent:

    >>> c.Emax
    999999999
    
  • capitals (int): boolean flag to use 'E' (True/1) or 'e' (False/0) in the string (for example, '1.32e+2' or '1.32E+2'):

    >>> c.capitals
    1
    

Context Methods

The following methods comply with Decimal functionality from the Spec. Be aware that the operations that are called through a specific context use that context and not the thread context.

To use these methods, note that the syntax differs depending on whether the operation is binary or unary, for example:

>>> mycontext.abs(Decimal('-2'))
Decimal('2')
>>> mycontext.multiply(Decimal('2.3'), 5)
Decimal('11.5')

So, the following are the Spec operations and conversions and how to achieve them through a context (where d is a Decimal instance and n a number that can be used in an Implicit construction):

  • to-scientific-string: to_sci_string(d)
  • to-engineering-string: to_eng_string(d)
  • to-number: create_decimal(number), see Explicit construction for number.
  • abs: abs(d)
  • add: add(d, n)
  • subtract: subtract(d, n)
  • compare: compare(d, n)
  • divide: divide(d, n)
  • divide-integer: divide_int(d, n)
  • max: max(d, n)
  • min: min(d, n)
  • minus: minus(d)
  • plus: plus(d)
  • multiply: multiply(d, n)
  • normalize: normalize(d)
  • quantize: quantize(d, d)
  • remainder: remainder(d, n)
  • remainder-near: remainder_near(d, n)
  • round-to-integral-value: to_integral(d)
  • same-quantum: same_quantum(d, d)
  • square-root: sqrt(d)
  • power: power(d, n)

Additionally, the Context method divmod(d, n) combines the divide-integer and remainder operations in a single call.
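These context methods exist in the shipped module; a sketch showing that results honor the calling context's precision rather than the thread context's:

```python
from decimal import Context, Decimal

ctx = Context(prec=4)
# Operations invoked through ctx are rounded to 4 significant digits...
print(ctx.divide(Decimal(1), Decimal(3)))
# ...and divmod returns the (divide-integer, remainder) pair at once.
print(ctx.divmod(Decimal(10), Decimal(3)))
```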

These are methods that return useful information from the Context:

  • Etiny(): Minimum exponent considering precision.

    >>> c.Emin
    -999999999
    >>> c.Etiny()
    -1000000007
    
  • Etop(): Maximum exponent considering precision.

    >>> c.Emax
    999999999
    >>> c.Etop()
    999999991
    
  • copy(): Returns a copy of the context.
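Both values are derived from the exponent limits and the precision (Etiny = Emin - prec + 1, Etop = Emax - prec + 1), which the numbers above reflect for a precision of 9; a quick check:

```python
from decimal import Context

ctx = Context(prec=9, Emin=-999999999, Emax=999999999)
# Etiny: minimum exponent a subnormal result may have.
assert ctx.Etiny() == ctx.Emin - ctx.prec + 1
# Etop: maximum exponent once the precision is accounted for.
assert ctx.Etop() == ctx.Emax - ctx.prec + 1
print(ctx.Etiny(), ctx.Etop())
```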

Reference Implementation

As of Python 2.4-alpha, the code has been checked into the standard library. The latest version is available from:

http://svn.python.org/view/python/trunk/Lib/decimal.py

The test cases are here:

http://svn.python.org/view/python/trunk/Lib/test/test_decimal.py

References

[1] ANSI standard X3.274-1996 (Programming Language REXX): http://www.rexxla.org/Standards/ansi.html
[2] General Decimal Arithmetic specification (Cowlishaw): http://speleotrove.com/decimal/decarith.html (related documents and links at http://speleotrove.com/decimal/)
[3] ANSI/IEEE standard 854-1987 (Radix-Independent Floating-Point Arithmetic): http://www.cs.berkeley.edu/~ejr/projects/754/private/drafts/854-1987/dir.html (unofficial text; official copies can be ordered from http://standards.ieee.org/catalog/ordering.html)
[4] Tim Peters' FixedPoint at SourceForge: http://fixedpoint.sourceforge.net/
[5] IEEE 754 revision: http://grouper.ieee.org/groups/754/revision.html
[6] IEEE 754 references: http://babbage.cs.qc.edu/courses/cs341/IEEE-754references.html

pep-0328 Imports: Multi-Line and Absolute/Relative

PEP:328
Title:Imports: Multi-Line and Absolute/Relative
Version:$Revision$
Last-Modified:$Date$
Author:Aahz <aahz at pythoncraft.com>
Status:Final
Type:Standards Track
Content-Type:text/x-rst
Created:21-Dec-2003
Python-Version:2.4, 2.5, 2.6
Post-History:8-Mar-2004

Abstract

The import statement has two problems:

  • Long import statements can be difficult to write, requiring various contortions to fit Pythonic style guidelines.
  • Imports can be ambiguous in the face of packages; within a package, it's not clear whether import foo refers to a module within the package or some module outside the package. (More precisely, a local module or package can shadow another hanging directly off sys.path.)

For the first problem, it is proposed that parentheses be permitted to enclose multiple names, thus allowing Python's standard mechanisms for multi-line values to apply. For the second problem, it is proposed that all import statements be absolute by default (searching sys.path only) with special syntax (leading dots) for accessing package-relative imports.

Timeline

In Python 2.5, you must enable the new absolute import behavior with

from __future__ import absolute_import

You may use relative imports freely. In Python 2.6, any import statement that results in an intra-package import will raise DeprecationWarning (this also applies to from <> import that fails to use the relative import syntax).

Rationale for Parentheses

Currently, if you want to import a lot of names from a module or package, you have to choose one of several unpalatable options:

  • Write a long line with backslash continuations:

    from Tkinter import Tk, Frame, Button, Entry, Canvas, Text, \
        LEFT, DISABLED, NORMAL, RIDGE, END
    
  • Write multiple import statements:

    from Tkinter import Tk, Frame, Button, Entry, Canvas, Text
    from Tkinter import LEFT, DISABLED, NORMAL, RIDGE, END
    

(import * is not an option ;-)

Instead, it should be possible to use Python's standard grouping mechanism (parentheses) to write the import statement:

from Tkinter import (Tk, Frame, Button, Entry, Canvas, Text,
    LEFT, DISABLED, NORMAL, RIDGE, END)

This part of the proposal had BDFL approval from the beginning.

Parentheses support was added to Python 2.4.

Rationale for Absolute Imports

In Python 2.4 and earlier, if you're reading a module located inside a package, it is not clear whether

import foo

refers to a top-level module or to another module inside the package. As Python's library expands, more and more existing package internal modules suddenly shadow standard library modules by accident. It's a particularly difficult problem inside packages because there's no way to specify which module is meant. To resolve the ambiguity, it is proposed that foo will always be a module or package reachable from sys.path. This is called an absolute import.

The python-dev community chose absolute imports as the default because they're the more common use case and because absolute imports can provide all the functionality of relative (intra-package) imports -- albeit at the cost of difficulty when renaming package pieces higher up in the hierarchy or when moving one package inside another.

Because this represents a change in semantics, absolute imports will be optional in Python 2.5 and 2.6 through the use of

from __future__ import absolute_import

This part of the proposal had BDFL approval from the beginning.

Rationale for Relative Imports

With the shift to absolute imports, the question arose whether relative imports should be allowed at all. Several use cases were presented, the most important of which is being able to rearrange the structure of large packages without having to edit sub-packages. In addition, a module inside a package can't easily import itself without relative imports.

Guido approved of the idea of relative imports, but there has been a lot of disagreement on the spelling (syntax). There does seem to be agreement that relative imports will require listing specific names to import (that is, import foo as a bare term will always be an absolute import).

Here are the contenders:

  • One from Guido:

    from .foo import bar
    

    and

    from ...foo import bar
    

    These two forms have a couple of different suggested semantics. One semantic is to make each dot represent one level. There have been many complaints about the difficulty of counting dots. Another option is to only allow one level of relative import. That misses a lot of functionality, and people still complained about missing the dot in the one-dot form. The final option is to define an algorithm for finding relative modules and packages; the objection here is "Explicit is better than implicit". (The algorithm proposed is "search up from current package directory until the ultimate package parent gets hit".)

    Some people have suggested other punctuation as the separator, such as "-" or "^".

    Some people have suggested using "*":

    from *.foo import bar
    
  • The next set of options is conflated from several posters:

    from __pkg__.__pkg__ import
    

    and

    from .__parent__.__parent__ import
    

    Many people (Guido included) think these look ugly, but they are clear and explicit. Overall, more people prefer __pkg__ as the shorter option.

  • One suggestion was to allow only sibling references. In other words, you would not be able to use relative imports to refer to modules higher in the package tree. You would then be able to do either

    from .spam import eggs
    

    or

    import .spam.eggs
    
  • Some people favor allowing indexed parents:

    from -2.spam import eggs
    

    In this scenario, importing from the current directory would be a simple

    from .spam import eggs
    
  • Finally, some people dislike the way you have to change import to from ... import when you want to dig inside a package. They suggest completely rewriting the import syntax:

    from MODULE import NAMES as RENAME searching HOW
    

    or

    import NAMES as RENAME from MODULE searching HOW
        [from NAMES] [in WHERE] import ...
    

    However, this most likely could not be implemented for Python 2.5 (too big a change), and allowing relative imports is sufficiently critical that we need something now (given that the standard import will change to absolute import). More than that, this proposed syntax has several open questions:

    • What is the precise proposed syntax? (Which clauses are optional under which circumstances?)

    • How strongly does the searching clause bind? In other words, do you write:

      import foo as bar searching XXX, spam as ham searching XXX
      

      or:

      import foo as bar, spam as ham searching XXX
      

Guido's Decision

Guido has Pronounced [1] that relative imports will use leading dots. A single leading dot indicates a relative import, starting with the current package. Two or more leading dots give a relative import to the parent(s) of the current package, one level per dot after the first. Here's a sample package layout:

package/
    __init__.py
    subpackage1/
        __init__.py
        moduleX.py
        moduleY.py
    subpackage2/
        __init__.py
        moduleZ.py
    moduleA.py

Assuming that the current file is either moduleX.py or subpackage1/__init__.py, following are correct usages of the new syntax:

from .moduleY import spam
from .moduleY import spam as ham
from . import moduleY
from ..subpackage1 import moduleY
from ..subpackage2.moduleZ import eggs
from ..moduleA import foo
from ...package import bar
from ...sys import path

Note that while that last case is legal, it is certainly discouraged ("insane" was the word Guido used).
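In modern Python the same dot-counting rule is exposed as importlib.util.resolve_name(), which makes the table above easy to check mechanically (a sketch; note that current interpreters actually reject relative imports that climb past the top-level package, so the ...sys form now raises an error rather than being merely discouraged):

```python
from importlib.util import resolve_name

# __package__ for moduleX (or for subpackage1/__init__.py) in the layout above
pkg = 'package.subpackage1'

assert resolve_name('.moduleY', pkg) == 'package.subpackage1.moduleY'
assert resolve_name('..subpackage2.moduleZ', pkg) == 'package.subpackage2.moduleZ'
assert resolve_name('..moduleA', pkg) == 'package.moduleA'

# Three dots climb past the top-level package; modern interpreters refuse.
try:
    resolve_name('...sys', pkg)
except (ImportError, ValueError):  # ImportError since 3.9, ValueError before
    pass
```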

Relative imports must always use from <> import; import <> is always absolute. Of course, absolute imports can use from <> import by omitting the leading dots. The reason import .foo is prohibited is because after

import XXX.YYY.ZZZ

then

XXX.YYY.ZZZ

is usable in an expression. But

.moduleY

is not usable in an expression.

Relative Imports and __name__

Relative imports use a module's __name__ attribute to determine that module's position in the package hierarchy. If the module's name does not contain any package information (e.g. it is set to '__main__') then relative imports are resolved as if the module were a top level module, regardless of where the module is actually located on the file system.
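The lookup can be sketched in a few lines (anchor_package is our name for illustration; the real logic lives inside the import system):

```python
def anchor_package(module_name, is_package=False):
    """Derive the package that relative imports resolve against (a sketch)."""
    if is_package:
        return module_name               # an __init__ anchors at its own package
    head, _, _ = module_name.rpartition('.')
    return head or None                  # '' -> no package info (e.g. '__main__')

assert anchor_package('package.subpackage1.moduleX') == 'package.subpackage1'
assert anchor_package('package.subpackage1', is_package=True) == 'package.subpackage1'
assert anchor_package('__main__') is None
```

A None anchor is exactly the '__main__' situation described above: with no package information, there is nothing for the leading dots to anchor to.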

Relative Imports and Indirection Entries in sys.modules

When packages were introduced, the concept of an indirection entry in sys.modules came into existence [2]. When an entry in sys.modules for a module within a package had a value of None, it meant that the name actually referenced a top-level module. For instance, 'Sound.Effects.string' might have a value of None in sys.modules, meaning that any import resolving to that name was actually an import of the top-level 'string' module.

This introduced an optimization for when a relative import was meant to resolve to an absolute import. But since this PEP makes a very clear delineation between absolute and relative imports, this optimization is no longer needed. When absolute/relative imports become the only import semantics available then indirection entries in sys.modules will no longer be supported.

pep-0329 Treating Builtins as Constants in the Standard Library

PEP:329
Title:Treating Builtins as Constants in the Standard Library
Version:$Revision$
Last-Modified:$Date$
Author:Raymond Hettinger <python at rcn.com>
Status:Rejected
Type:Standards Track
Content-Type:text/x-rst
Created:18-Apr-2004
Python-Version:2.4
Post-History:18-Apr-2004

Abstract

The proposal is to add a function for treating builtin references as constants and to apply that function throughout the standard library.

Status

The PEP is self-rejected by the author. Though the ASPN recipe was well received, there was less willingness to consider this for inclusion in the core distribution.

The Jython implementation does not use byte codes, so its performance would suffer if the current _len=len optimizations were removed.

Also, altering byte codes is one of the least clean ways to improve performance and enable cleaner coding. A more robust solution would likely involve compiler pragma directives or metavariables indicating what can be optimized (similar to const/volatile declarations).

Motivation

The library contains code such as _len=len which is intended to create fast local references instead of slower global lookups. Though necessary for performance, these constructs clutter the code and are usually incomplete (missing many opportunities).
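The idiom being described looks like this (a minimal illustration with function names of our own, not code from the library):

```python
# Hand-written form the PEP wants to automate: bind the builtin once at
# definition time, so the loop does a fast local lookup on every pass.
def total_lengths(items, _len=len):
    total = 0
    for item in items:
        total += _len(item)
    return total

# Unoptimized form: len is looked up in globals, then builtins, per call.
def total_lengths_slow(items):
    total = 0
    for item in items:
        total += len(item)
    return total

assert total_lengths(['ab', 'cde']) == total_lengths_slow(['ab', 'cde']) == 5
```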

If the proposal is adopted, those constructs could be eliminated from the code base while at the same time improving on their performance benefits.

There are currently over a hundred instances of while 1 in the library. They were not replaced with the more readable while True for performance reasons (the compiler cannot eliminate the test because True is not known to always be a constant). Conversion of True to a constant will clarify the code while retaining performance.

Many other basic Python operations run much slower because of global lookups. In try/except statements, the trapped exceptions are dynamically looked up before testing whether they match. Similarly, simple identity tests such as while x is not None require the None variable to be re-looked up on every pass. Builtin lookups are especially egregious because the enclosing global scope must be checked first. These lookup chains devour cache space that is best used elsewhere.

In short, if the proposal is adopted, the code will become cleaner and performance will improve across the board.

Proposal

Add a module called codetweaks.py which contains two functions, bind_constants() and bind_all(). The first function performs constant binding and the second recursively applies it to every function and class in a target module.

For most modules in the standard library, add a pair of lines near the end of the script:

import codetweaks, sys
codetweaks.bind_all(sys.modules[__name__])

In addition to binding builtins, there are some modules (like sre_compile) where it also makes sense to bind module variables as well as builtins into constants.

Questions and Answers

  1. Will this make everyone divert their attention to optimization issues?

    Because it is done automatically, it reduces the need to think about optimizations.

  2. In a nutshell, how does it work?

    Every function has attributes containing its bytecodes (the language of the Python virtual machine) and a table of constants. The bind function scans the bytecode for LOAD_GLOBAL instructions and checks to see whether the value is already known. If so, it adds that value to the constants table and replaces the opcode with LOAD_CONST.

  3. When does it work?

    When a module is imported for the first time, Python compiles the source to bytecode and runs the binding optimization. Subsequent imports just re-use the previous work. Each session repeats this process (the results are not saved in pyc files).

  4. How do you know this works?

    I implemented it, applied it to every module in the library, and the test suite ran without exception.

  5. What if the module defines a variable shadowing a builtin?

    This does happen. For instance, True can be redefined at the module level as True = (1==1). The sample implementation below detects the shadowing and leaves the global lookup unchanged.

  6. Are you the first person to recognize that most global lookups are for values that never change?

    No, this has long been known. Skip Montanaro provides an eloquent explanation in [1].

  7. What if I want to replace the builtins module and supply my own implementations?

    Either do this before importing a module, or just reload the module, or disable codetweaks.py (it will have a disable flag).

  8. How susceptible is this module to changes in Python's byte coding?

    It imports opcode.py to protect against renumbering. Also, it uses LOAD_CONST and LOAD_GLOBAL which are fundamental and have been around forever. That notwithstanding, the coding scheme could change and this implementation would have to change along with modules like dis which also rely on the current coding scheme.

  9. What is the effect on startup time?

    I could not measure a difference. None of the startup modules are bound except for warnings.py. Also, the binding function is very fast, making just a single pass over the code string in search of the LOAD_GLOBAL opcode.
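The situation described in question 2 can be inspected with the dis module on a modern interpreter (illustration only; the opcode set has changed since Python 2.4, but LOAD_GLOBAL and LOAD_CONST survive):

```python
import dis

def f():
    return len('abc')

ops = [ins.opname for ins in dis.get_instructions(f)]

# Before binding, the builtin len is fetched dynamically on every call ...
assert 'LOAD_GLOBAL' in ops
# ... while 'abc' is already served from the constants table.  The proposal
# rewrites the LOAD_GLOBAL into a LOAD_CONST indexing a new co_consts entry.
assert 'LOAD_CONST' in ops
```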

Sample Implementation

Here is a sample implementation for codetweaks.py:

from types import ClassType, FunctionType
from opcode import opmap, HAVE_ARGUMENT, EXTENDED_ARG
LOAD_GLOBAL, LOAD_CONST = opmap['LOAD_GLOBAL'], opmap['LOAD_CONST']
ABORT_CODES = (EXTENDED_ARG, opmap['STORE_GLOBAL'])

def bind_constants(f, builtin_only=False, stoplist=[], verbose=False):
    """ Return a new function with optimized global references.

    Replaces global references with their currently defined values.
    If not defined, the dynamic (runtime) global lookup is left undisturbed.
    If builtin_only is True, then only builtins are optimized.
    Variable names in the stoplist are also left undisturbed.
    If verbose is True, prints each substitution as it occurs.

    """
    import __builtin__
    env = vars(__builtin__).copy()
    stoplist = dict.fromkeys(stoplist)
    if builtin_only:
        stoplist.update(f.func_globals)
    else:
        env.update(f.func_globals)

    co = f.func_code
    newcode = map(ord, co.co_code)
    newconsts = list(co.co_consts)
    codelen = len(newcode)

    i = 0
    while i < codelen:
        opcode = newcode[i]
        if opcode in ABORT_CODES:
            return f    # for simplicity, only optimize common cases
        if opcode == LOAD_GLOBAL:
            oparg = newcode[i+1] + (newcode[i+2] << 8)
            name = co.co_names[oparg]
            if name in env and name not in stoplist:
                value = env[name]
                try:
                    pos = newconsts.index(value)
                except ValueError:
                    pos = len(newconsts)
                    newconsts.append(value)
                newcode[i] = LOAD_CONST
                newcode[i+1] = pos & 0xFF
                newcode[i+2] = pos >> 8
                if verbose:
                    print name, '-->', value
        i += 1
        if opcode >= HAVE_ARGUMENT:
            i += 2

    codestr = ''.join(map(chr, newcode))
    codeobj = type(co)(co.co_argcount, co.co_nlocals, co.co_stacksize,
                    co.co_flags, codestr, tuple(newconsts), co.co_names,
                    co.co_varnames, co.co_filename, co.co_name,
                    co.co_firstlineno, co.co_lnotab, co.co_freevars,
                    co.co_cellvars)
    return type(f)(codeobj, f.func_globals, f.func_name, f.func_defaults,
                    f.func_closure)


def bind_all(mc, builtin_only=False, stoplist=[], verbose=False):
    """Recursively apply bind_constants() to functions in a module or class.

    Use as the last line of the module (after everything is defined, but
    before test code).

    In modules that need modifiable globals, set builtin_only to True.

    """
    for k, v in vars(mc).items():
        if type(v) is FunctionType:
            newv = bind_constants(v, builtin_only, stoplist, verbose)
            setattr(mc, k, newv)
        elif type(v) in (type, ClassType):
            bind_all(v, builtin_only, stoplist, verbose)


def f(): pass
try:
    f.func_code.co_code
except AttributeError:                  # detect non-CPython environments
    bind_all = lambda *args, **kwds: 0
del f

import sys
bind_all(sys.modules[__name__])         # Optimizer, optimize thyself!

Note the automatic detection of a non-CPython environment that does not have bytecodes [3]. In that situation, the bind functions would simply return the original function unchanged. This assures that the two line additions to library modules do not impact other implementations.

The final code should add a flag to make it easy to disable binding.

References

[1]Optimizing Global Variable/Attribute Access http://www.python.org/dev/peps/pep-0266/
[2]ASPN Recipe for a non-private implementation http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/277940
[3]Differences between CPython and Jython http://www.jython.org/cgi-bin/faqw.py?req=show&file=faq01.003.htp

pep-0330 Python Bytecode Verification

PEP: 330
Title: Python Bytecode Verification
Version: $Revision$
Last-Modified: $Date$
Author: Michel Pelletier <michel at users.sourceforge.net>
Status: Rejected
Type: Standards Track
Content-Type: text/plain
Created: 17-Jun-2004
Python-Version: 2.6?
Post-History: 

Abstract

    If Python Virtual Machine (PVM) bytecode is not "well-formed" it
    is possible to crash or exploit the PVM by causing various errors
    such as under/overflowing the value stack or reading/writing into
    arbitrary areas of the PVM program space.  Most of these kinds of
    errors can be eliminated by verifying that PVM bytecode does not
    violate a set of simple constraints before execution.

    This PEP proposes a set of constraints on the format and structure
    of Python Virtual Machine (PVM) bytecode and provides an
    implementation in Python of this verification process.

Pronouncement

    Guido believes that a verification tool has some value.  If
    someone wants to add it to Tools/scripts, no PEP is required.

    Such a tool may have value for validating the output from
    "bytecodehacks" or from direct edits of PYC files.  As a security
    measure, its value is somewhat limited because perfectly valid
    bytecode can still do horrible things.  That situation could
    change if the concept of restricted execution were to be
    successfully resurrected.

Motivation

    The Python Virtual Machine executes Python programs that have been
    compiled from the Python language into a bytecode representation.
    The PVM assumes that any bytecode being executed is "well-formed"
    with regard to a number of implicit constraints.  Some of these
    constraints are checked at run-time, but most of them are not due
    to the overhead they would create.

    When running in debug mode, the PVM does do several run-time checks
    to ensure that any particular bytecode cannot violate these
    constraints that, to a degree, prevent bytecode from crashing or
    exploiting the interpreter.  These checks add a measurable
    overhead to the interpreter, and are typically turned off in
    common use.

    Bytecode that is not well-formed and executed by a PVM not running
    in debug mode may create a variety of fatal and non-fatal errors.
    Typically, ill-formed code will cause the PVM to seg-fault and
    cause the OS to immediately and abruptly terminate the
    interpreter.

    Conceivably, ill-formed bytecode could exploit the interpreter and
    allow Python bytecode to execute arbitrary C-level machine
    instructions or to modify private, internal data structures in the
    interpreter.  If used cleverly this could subvert any form of
    security policy an application may want to apply to its objects.

    Practically, it would be difficult for a malicious user to
    "inject" invalid bytecode into a PVM for the purposes of
    exploitation, but not impossible.  Buffer overflow and memory
    overwrite attacks are commonly understood, particularly when the
    exploit payload is transmitted unencrypted over a network or when
    a file or network security permission weakness is used as a
    foothold for further attacks.

    Ideally, no bytecode should ever be allowed to read or write
    underlying C-level data structures to subvert the operation of the
    PVM, whether the bytecode was maliciously crafted or not.  A
    simple pre-execution verification step could ensure that bytecode
    cannot over/underflow the value stack or access other sensitive
    areas of PVM program space at run-time.

    This PEP proposes several validation steps that should be taken on
    Python bytecode before it is executed by the PVM so that it
    complies with static and structural constraints on its instructions
    and their operands.  These steps are simple and catch a large
    class of invalid bytecode that can cause crashes.  There is also
    some possibility that some run-time checks can be eliminated up
    front by a verification pass.

    There is, of course, no way to verify that bytecode is "completely
    safe", for every definition of complete and safe.  Even with
    bytecode verification, Python programs can and most likely in the
    future will seg-fault for a variety of reasons and continue to
    cause many different classes of run-time errors, fatal or not.
    The verification step proposed here simply plugs an easy hole that
    can cause a large class of fatal and subtle errors at the bytecode
    level.

    Currently, the Java Virtual Machine (JVM) verifies Java bytecode
    in a way very similar to what is proposed here.  The JVM
    Specification version 2 [1], Sections 4.8 and 4.9 were therefore
    used as a basis for some of the constraints explained below.  Any
    Python bytecode verification implementation at a minimum must
    enforce these constraints, but may not be limited to them.


Static Constraints on Bytecode Instructions

    1. The bytecode string must not be empty. (len(co_code) > 0).

    2. The bytecode string cannot exceed a maximum size
       (len(co_code) < sizeof(unsigned char) - 1).

    3. The first instruction in the bytecode string begins at index 0.

    4. Only valid byte-codes with the correct number of operands can
       be in the bytecode string.


Static Constraints on Bytecode Instruction Operands

    1. The target of a jump instruction must be within the code
       boundaries and must fall on an instruction, never between an
       instruction and its operands.

    2. The operand of a LOAD_* instruction must be a valid index into
       its corresponding data structure.

    3. The operand of a STORE_* instruction must be a valid index
       into its corresponding data structure.
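    As a rough modern-CPython sketch (not the PEP's reference
    implementation), the dis module can decode a code object and check
    several of the constraints above:

```python
import dis

def check_bytecode(code):
    """Check a few static constraints (a sketch; modern CPython).

    dis.get_instructions() itself rejects unknown opcodes while
    decoding, which covers the 'only valid byte-codes' constraint.
    """
    if len(code.co_code) == 0:
        raise ValueError("bytecode string must not be empty")
    instructions = list(dis.get_instructions(code))
    offsets = {ins.offset for ins in instructions}
    for ins in instructions:
        # Jump targets must land on an instruction boundary.
        if 'JUMP' in ins.opname and ins.argval not in offsets:
            raise ValueError("jump target falls between instructions")
        # A LOAD_CONST operand must be a valid index into co_consts.
        if ins.opname == 'LOAD_CONST':
            if not 0 <= ins.arg < len(code.co_consts):
                raise ValueError("LOAD_CONST operand out of range")
    return True

def sample(x):
    return [i for i in range(x) if i % 2]

assert check_bytecode(sample.__code__)
```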


Structural Constraints between Bytecode Instructions

    1. Each instruction must only be executed with the appropriate
       number of arguments in the value stack, regardless of the
       execution path that leads to its invocation.

    2. If an instruction can be executed along several different
       execution paths, the value stack must have the same depth prior
       to the execution of the instruction, regardless of the path
       taken.

    3. At no point during execution can the value stack grow to a
       depth greater than that implied by co_stacksize.

    4. Execution never falls off the bottom of co_code.
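    For straight-line code, the third structural constraint can be
    approximated by simulating each instruction's stack effect with
    dis.stack_effect() (a sketch; a real verifier must propagate
    depths along every branch, as the JVM's verifier does):

```python
import dis

def max_stack_depth(code):
    """Linear stack-depth simulation (sketch: straight-line code only)."""
    depth = max_depth = 0
    for ins in dis.get_instructions(code):
        depth += dis.stack_effect(ins.opcode, ins.arg)
        if depth < 0:
            raise ValueError("value stack underflow")
        max_depth = max(max_depth, depth)
    return max_depth

def f():
    a = 1
    b = 2
    return a + b

# The simulated maximum must never exceed what the compiler declared.
assert max_stack_depth(f.__code__) <= f.__code__.co_stacksize
```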


Implementation

    This PEP is the working document for a Python bytecode
    verification implementation written in Python.  This
    implementation is not used implicitly by the PVM before executing
    any bytecode, but is to be used explicitly by users concerned
    about possibly invalid bytecode with the following snippet:

        import verify
        verify.verify(object)

    The `verify` module provides a `verify` function which accepts the
    same kind of arguments as `dis.dis`: classes, methods, functions,
    or code objects.  It verifies that the object's bytecode is
    well-formed according to the specifications of this PEP.

    If the code is well-formed the call to `verify` returns silently
    without error.  If an error is encountered, it raises a
    VerificationError whose argument indicates the cause of the
    failure.  It is up to the programmer whether or not to handle the
    error in some way or execute the invalid code regardless.

    Phillip Eby has proposed a pseudo-code algorithm for bytecode
    stack depth verification used by the reference implementation.


Verification Issues

    This PEP describes only a small number of verifications.  While
    discussion and analysis will lead to many more, it is quite
    possible that further verifications, or custom project-specific
    ones, will need to be added later.  For this reason, it might be
    desirable to add a verifier registration interface to the test
    implementation for registering future verifiers.  The need for
    this is minimal, however, since custom verifiers can subclass and
    extend the current implementation for added behavior.


Required Changes

    Armin Rigo noted that several byte-codes will need modification in
    order for their stack effect to be statically analyzed.  These are
    END_FINALLY, POP_BLOCK, and MAKE_CLOSURE.  Armin and Guido have
    already agreed on how to correct the instructions.  Currently the
    Python implementation punts on these instructions.

    This PEP does not propose to add the verification step to the
    interpreter, but only to provide the Python implementation in the
    standard library for optional use.  Whether or not this
    verification procedure is translated into C, included with the PVM
    or enforced in any way is left for future discussion.


References

    [1] The Java Virtual Machine Specification 2nd Edition
        http://java.sun.com/docs/books/vmspec/2nd-edition/html/ClassFile.doc.html


Copyright

    This document has been placed in the public domain.


pep-0331 Locale-Independent Float/String Conversions

PEP: 331
Title: Locale-Independent Float/String Conversions
Version: $Revision$
Last-Modified: $Date$
Author: Christian R. Reis
Status: Final
Type: Standards Track
Content-Type: text/plain
Created: 19-Jul-2003
Python-Version: 2.4
Post-History: 21-Jul-2003, 13-Aug-2003, 18-Jun-2004

Abstract

    Support for the LC_NUMERIC locale category in Python 2.3 is
    implemented only in Python-space.  This causes inconsistent
    behavior and thread-safety issues for applications that use
    extension modules and libraries implemented in C that parse and
    generate floats from strings.  This document proposes a plan for
    removing this inconsistency by providing and using substitute
    locale-agnostic functions as necessary.


Introduction

    Python provides generic localization services through the locale
    module, which among other things allows localizing the display and
    conversion process of numeric types.  Locale categories, such as
    LC_TIME and LC_COLLATE, allow configuring precisely what aspects
    of the application are to be localized.

    The LC_NUMERIC category specifies formatting for non-monetary
    numeric information, such as the decimal separator in float and
    fixed-precision numbers.  Localization of the LC_NUMERIC category
    is currently implemented only in Python-space; C libraries invoked
    from the Python runtime are unaware of Python's LC_NUMERIC
    setting.  This is done to avoid changing the behavior of certain
    low-level functions that are used by the Python parser and related
    code [2].

    However, this presents a problem for extension modules that wrap C
    libraries.  Applications that use these extension modules will
    inconsistently display and convert floating-point values.

    James Henstridge, the author of PyGTK [3], has additionally
    pointed out that the setlocale() function also presents
    thread-safety issues, since a thread may call the C library
    setlocale() outside of the GIL, and cause Python to parse and
    generate floats incorrectly.


Rationale

    The inconsistency between Python and C library localization for
    LC_NUMERIC is a problem for any localized application using C
    extensions.  The exact nature of the problem will vary depending
    on the application, but it will most likely occur when parsing or
    formatting a floating-point value.


Example Problem

    The initial problem that motivated this PEP is related to the
    GtkSpinButton [4] widget in the GTK+ UI toolkit, wrapped by the
    PyGTK module.  The widget can be set to numeric mode, and when
    this occurs, characters typed into it are evaluated as a number.

    Problems occur when LC_NUMERIC is set to a locale with a float
    separator that differs from the C locale's standard (for instance,
    `,' instead of `.' for the Brazilian locale pt_BR).  Because
    LC_NUMERIC is not set at the libc level, float values are
    displayed incorrectly (using `.' as a separator) in the
    spinbutton's text entry, and it is impossible to enter fractional
    values using the `,' separator.

    This small example demonstrates reduced usability for localized
    applications using this toolkit when coded in Python.


Proposal

    Martin v. Löwis commented on the initial constraints for an
    acceptable solution to the problem on python-dev:

        - LC_NUMERIC can be set at the C library level without
          breaking the parser.
        - float() and str() stay locale-unaware.
        - locale-aware str() and atof() stay in the locale module.

    An analysis of the Python source suggests that the following
    functions currently depend on LC_NUMERIC being set to the C
    locale:

        - Python/compile.c:parsenumber()
        - Python/marshal.c:r_object()
        - Objects/complexobject.c:complex_to_buf()
        - Objects/complexobject.c:complex_subtype_from_string()
        - Objects/floatobject.c:PyFloat_FromString()
        - Objects/floatobject.c:format_float()
        - Objects/stringobject.c:formatfloat()
        - Modules/stropmodule.c:strop_atof()
        - Modules/cPickle.c:load_float()

    The proposed approach is to implement LC_NUMERIC-agnostic
    functions for converting from (strtod()/atof()) and to
    (snprintf()) float formats, using these functions where the
    formatting should not vary according to the user-specified locale.

    The locale module should also be changed to remove the
    special-casing for LC_NUMERIC.

    This change should also solve the aforementioned thread-safety
    problems.
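    This is in fact the split Python ended up with: float() and repr()
    never consult LC_NUMERIC, while locale-aware formatting stays in
    the locale module (a sketch; the pt_BR locale is exercised only if
    it is installed):

```python
import locale

# Locale-agnostic conversions: these never consult LC_NUMERIC.
assert repr(3.5) == '3.5'
assert float('3.5') == 3.5

# Locale-aware conversion lives in the locale module.
try:
    locale.setlocale(locale.LC_NUMERIC, 'pt_BR.UTF-8')
except locale.Error:
    pass  # locale data not installed; nothing to demonstrate
else:
    assert locale.format_string('%g', 3.5) == '3,5'  # Brazilian separator
    locale.setlocale(locale.LC_NUMERIC, 'C')
```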


Potential Code Contributions

    This problem was initially reported as a problem in the GTK+
    libraries [5]; since then it has been correctly diagnosed as an
    inconsistency in Python's implementation.  However, in a fortunate
    coincidence, the glib library (developed primarily for GTK+, not
    to be confused with the GNU C library) implements a number of
    LC_NUMERIC-agnostic functions (for an example, see [6]) for
    reasons similar to those presented in this PEP.

    In the same GTK+ problem report, Havoc Pennington suggested that
    the glib authors would be willing to contribute this code to the
    PSF, which would simplify implementation of this PEP considerably.
    Alex Larsson, the original author of the glib code, submitted a
    PSF Contributor Agreement [7] on 2003-08-20 [8] to ensure the code
    could be safely integrated; this agreement has been received and
    accepted.


Risks

    There may be cross-platform issues with the provided
    locale-agnostic functions, though this risk is low given that the
    code supplied simply reverses any locale-dependent changes made to
    floating-point numbers.

    Martin and Guido pointed out potential copyright issues with the
    contributed code.  I believe we will have no problems in this area
    as members of the GTK+ and glib teams have said they are fine with
    relicensing the code, and a PSF contributor agreement has been
    mailed in to ensure this safety.

    Tim Peters has pointed out [9] that there are situations involving
    threading in which the proposed change is insufficient to solve
    the problem completely.  A complete solution, however, does not
    currently exist.


Implementation

    An implementation was developed by Gustavo Carneiro <gjc at
    inescporto.pt>, and attached to Sourceforge.net bug 774665 [10]

    The final patch [11] was integrated into Python CVS by Martin v.
    Löwis on 2004-06-08, as stated in the bug report.


References

    [1] PEP 1, PEP Purpose and Guidelines, Warsaw, Hylton
        http://www.python.org/dev/peps/pep-0001/

    [2] Python locale documentation for embedding,
        http://docs.python.org/library/locale.html

    [3] PyGTK homepage, http://www.daa.com.au/~james/pygtk/

    [4] GtkSpinButton screenshot (demonstrating problem),
        http://www.async.com.br/~kiko/spin.png

    [5] GNOME bug report, http://bugzilla.gnome.org/show_bug.cgi?id=114132

    [6] Code submission of g_ascii_strtod and g_ascii_dtostr (later
        renamed g_ascii_formatd) by Alex Larsson,
        http://mail.gnome.org/archives/gtk-devel-list/2001-October/msg00114.html

    [7] PSF Contributor Agreement,
        http://www.python.org/psf/psf-contributor-agreement.html

    [8] Alex Larsson's email confirming his agreement was mailed in,
        http://mail.python.org/pipermail/python-dev/2003-August/037755.html

    [9] Tim Peters' email summarizing LC_NUMERIC trouble with Spambayes,
        http://mail.python.org/pipermail/python-dev/2003-September/037898.html

    [10] Python bug report, http://www.python.org/sf/774665

    [11] Integrated LC_NUMERIC-agnostic patch,
         https://sourceforge.net/tracker/download.php?group_id=5470&atid=305470&file_id=89685&aid=774665


Copyright

    This document has been placed in the public domain.


pep-0332 Byte vectors and String/Unicode Unification

PEP:332
Title:Byte vectors and String/Unicode Unification
Version:$Revision$
Last-Modified:$Date$
Author:Skip Montanaro <skip at pobox.com>
Status:Rejected
Type:Standards Track
Content-Type:text/x-rst
Created:11-Aug-2004
Python-Version:2.5
Post-History:

Abstract

This PEP outlines the introduction of a raw bytes sequence object and the unification of the current str and unicode objects.

Rejection Notice

This PEP is rejected in this form. The author has expressed lack of time to continue to shepherd it, and discussion on python-dev has moved to a slightly different proposal which will (eventually) be written up as a new PEP. See the thread starting at http://mail.python.org/pipermail/python-dev/2006-February/060930.html.

Rationale

Python's current string objects are overloaded. They serve both to hold ASCII and non-ASCII character data and to also hold sequences of raw bytes which have no reasonable interpretation as displayable character sequences. This overlap hasn't been a big problem in the past, but as Python moves closer to requiring source code to be properly encoded, the use of strings to represent raw byte sequences will be more problematic. In addition, as Python's Unicode support has improved, it's easier to consider strings as ASCII-encoded Unicode objects.

Proposed Implementation

The number in parentheses indicates the Python version in which the feature will be introduced.

  • Add a bytes builtin which is just a synonym for str. (2.5)
  • Add a b"..." string literal which is equivalent to raw string literals, with the exception that values which conflict with the source encoding of the containing file do not generate warnings. (2.5)
  • Warn about the use of variables named "bytes". (2.5 or 2.6)
  • Introduce a bytes builtin which refers to a sequence distinct from the str type. (2.6)
  • Make str a synonym for unicode. (3.0)

Issues

  • Can this be accomplished before Python 3.0?
  • Should bytes objects be mutable or immutable? (Guido seems to like them to be mutable.)

pep-0333 Python Web Server Gateway Interface v1.0

PEP:333
Title:Python Web Server Gateway Interface v1.0
Version:$Revision$
Last-Modified:$Date$
Author:Phillip J. Eby <pje at telecommunity.com>
Discussions-To:Python Web-SIG <web-sig at python.org>
Status:Final
Type:Informational
Content-Type:text/x-rst
Created:07-Dec-2003
Post-History:07-Dec-2003, 08-Aug-2004, 20-Aug-2004, 27-Aug-2004, 27-Sep-2010
Superseded-By:3333

Preface

Note: For an updated version of this spec that supports Python 3.x and includes community errata, addenda, and clarifications, please see PEP 3333 instead.

Abstract

This document specifies a proposed standard interface between web servers and Python web applications or frameworks, to promote web application portability across a variety of web servers.

Rationale and Goals

Python currently boasts a wide variety of web application frameworks, such as Zope, Quixote, Webware, SkunkWeb, PSO, and Twisted Web -- to name just a few [1]. This wide variety of choices can be a problem for new Python users, because generally speaking, their choice of web framework will limit their choice of usable web servers, and vice versa.

By contrast, although Java has just as many web application frameworks available, Java's "servlet" API makes it possible for applications written with any Java web application framework to run in any web server that supports the servlet API.

The availability and widespread use of such an API in web servers for Python -- whether those servers are written in Python (e.g. Medusa), embed Python (e.g. mod_python), or invoke Python via a gateway protocol (e.g. CGI, FastCGI, etc.) -- would separate choice of framework from choice of web server, freeing users to choose a pairing that suits them, while freeing framework and server developers to focus on their preferred area of specialization.

This PEP, therefore, proposes a simple and universal interface between web servers and web applications or frameworks: the Python Web Server Gateway Interface (WSGI).

But the mere existence of a WSGI spec does nothing to address the existing state of servers and frameworks for Python web applications. Server and framework authors and maintainers must actually implement WSGI for there to be any effect.

However, since no existing servers or frameworks support WSGI, there is little immediate reward for an author who implements WSGI support. Thus, WSGI must be easy to implement, so that an author's initial investment in the interface can be reasonably low.

Thus, simplicity of implementation on both the server and framework sides of the interface is absolutely critical to the utility of the WSGI interface, and is therefore the principal criterion for any design decisions.

Note, however, that simplicity of implementation for a framework author is not the same thing as ease of use for a web application author. WSGI presents an absolutely "no frills" interface to the framework author, because bells and whistles like response objects and cookie handling would just get in the way of existing frameworks' handling of these issues. Again, the goal of WSGI is to facilitate easy interconnection of existing servers and applications or frameworks, not to create a new web framework.

Note also that this goal precludes WSGI from requiring anything that is not already available in deployed versions of Python. Therefore, new standard library modules are not proposed or required by this specification, and nothing in WSGI requires a Python version greater than 2.2.2. (It would be a good idea, however, for future versions of Python to include support for this interface in web servers provided by the standard library.)

In addition to ease of implementation for existing and future frameworks and servers, it should also be easy to create request preprocessors, response postprocessors, and other WSGI-based "middleware" components that look like an application to their containing server, while acting as a server for their contained applications.

If middleware can be both simple and robust, and WSGI is widely available in servers and frameworks, it allows for the possibility of an entirely new kind of Python web application framework: one consisting of loosely-coupled WSGI middleware components. Indeed, existing framework authors may even choose to refactor their frameworks' existing services to be provided in this way, becoming more like libraries used with WSGI, and less like monolithic frameworks. This would then allow application developers to choose "best-of-breed" components for specific functionality, rather than having to commit to all the pros and cons of a single framework.

Of course, as of this writing, that day is doubtless quite far off. In the meantime, it is a sufficient short-term goal for WSGI to enable the use of any framework with any server.

Finally, it should be mentioned that the current version of WSGI does not prescribe any particular mechanism for "deploying" an application for use with a web server or server gateway. At the present time, this is necessarily implementation-defined by the server or gateway. After a sufficient number of servers and frameworks have implemented WSGI to provide field experience with varying deployment requirements, it may make sense to create another PEP, describing a deployment standard for WSGI servers and application frameworks.

Specification Overview

The WSGI interface has two sides: the "server" or "gateway" side, and the "application" or "framework" side. The server side invokes a callable object that is provided by the application side. The specifics of how that object is provided are up to the server or gateway. It is assumed that some servers or gateways will require an application's deployer to write a short script to create an instance of the server or gateway, and supply it with the application object. Other servers and gateways may use configuration files or other mechanisms to specify where an application object should be imported from, or otherwise obtained.

In addition to "pure" servers/gateways and applications/frameworks, it is also possible to create "middleware" components that implement both sides of this specification. Such components act as an application to their containing server, and as a server to a contained application, and can be used to provide extended APIs, content transformation, navigation, and other useful functions.

Throughout this specification, we will use the term "a callable" to mean "a function, method, class, or an instance with a __call__ method". It is up to the server, gateway, or application implementing the callable to choose the appropriate implementation technique for their needs. Conversely, a server, gateway, or application that is invoking a callable must not have any dependency on what kind of callable was provided to it. Callables are only to be called, not introspected upon.

The Application/Framework Side

The application object is simply a callable object that accepts two arguments. The term "object" should not be misconstrued as requiring an actual object instance: a function, method, class, or instance with a __call__ method are all acceptable for use as an application object. Application objects must be able to be invoked more than once, as virtually all servers/gateways (other than CGI) will make such repeated requests.

(Note: although we refer to it as an "application" object, this should not be construed to mean that application developers will use WSGI as a web programming API! It is assumed that application developers will continue to use existing, high-level framework services to develop their applications. WSGI is a tool for framework and server developers, and is not intended to directly support application developers.)

Here are two example application objects; one is a function, and the other is a class:

def simple_app(environ, start_response):
    """Simplest possible application object"""
    status = '200 OK'
    response_headers = [('Content-type', 'text/plain')]
    start_response(status, response_headers)
    return ['Hello world!\n']


class AppClass:
    """Produce the same output, but using a class

    (Note: 'AppClass' is the "application" here, so calling it
    returns an instance of 'AppClass', which is then the iterable
    return value of the "application callable" as required by
    the spec.)

    If we wanted to use *instances* of 'AppClass' as application
    objects instead, we would have to implement a '__call__'
    method, which would be invoked to execute the application,
    and we would need to create an instance for use by the
    server or gateway.
    """

    def __init__(self, environ, start_response):
        self.environ = environ
        self.start = start_response

    def __iter__(self):
        status = '200 OK'
        response_headers = [('Content-type', 'text/plain')]
        self.start(status, response_headers)
        yield "Hello world!\n"

The Server/Gateway Side

The server or gateway invokes the application callable once for each request it receives from an HTTP client that is directed at the application. To illustrate, here is a simple CGI gateway, implemented as a function taking an application object. Note that this simple example has limited error handling, because by default an uncaught exception will be dumped to sys.stderr and logged by the web server.

import os, sys

def run_with_cgi(application):

    environ = dict(os.environ.items())
    environ['wsgi.input']        = sys.stdin
    environ['wsgi.errors']       = sys.stderr
    environ['wsgi.version']      = (1, 0)
    environ['wsgi.multithread']  = False
    environ['wsgi.multiprocess'] = True
    environ['wsgi.run_once']     = True

    if environ.get('HTTPS', 'off') in ('on', '1'):
        environ['wsgi.url_scheme'] = 'https'
    else:
        environ['wsgi.url_scheme'] = 'http'

    headers_set = []
    headers_sent = []

    def write(data):
        if not headers_set:
             raise AssertionError("write() before start_response()")

        elif not headers_sent:
             # Before the first output, send the stored headers
             status, response_headers = headers_sent[:] = headers_set
             sys.stdout.write('Status: %s\r\n' % status)
             for header in response_headers:
                 sys.stdout.write('%s: %s\r\n' % header)
             sys.stdout.write('\r\n')

        sys.stdout.write(data)
        sys.stdout.flush()

    def start_response(status, response_headers, exc_info=None):
        if exc_info:
            try:
                if headers_sent:
                    # Re-raise original exception if headers sent
                    raise exc_info[0], exc_info[1], exc_info[2]
            finally:
                exc_info = None     # avoid dangling circular ref
        elif headers_set:
            raise AssertionError("Headers already set!")

        headers_set[:] = [status, response_headers]
        return write

    result = application(environ, start_response)
    try:
        for data in result:
            if data:    # don't send headers until body appears
                write(data)
        if not headers_sent:
            write('')   # send headers now if body was empty
    finally:
        if hasattr(result, 'close'):
            result.close()

Middleware: Components that Play Both Sides

Note that a single object may play the role of a server with respect to some application(s), while also acting as an application with respect to some server(s). Such "middleware" components can perform such functions as:

  • Routing a request to different application objects based on the target URL, after rewriting the environ accordingly.
  • Allowing multiple applications or frameworks to run side-by-side in the same process
  • Load balancing and remote processing, by forwarding requests and responses over a network
  • Performing content postprocessing, such as applying XSL stylesheets

The presence of middleware in general is transparent to both the "server/gateway" and the "application/framework" sides of the interface, and should require no special support. A user who desires to incorporate middleware into an application simply provides the middleware component to the server, as if it were an application, and configures the middleware component to invoke the application, as if the middleware component were a server. Of course, the "application" that the middleware wraps may in fact be another middleware component wrapping another application, and so on, creating what is referred to as a "middleware stack".

For the most part, middleware must conform to the restrictions and requirements of both the server and application sides of WSGI. In some cases, however, requirements for middleware are more stringent than for a "pure" server or application, and these points will be noted in the specification.

Here is a (tongue-in-cheek) example of a middleware component that converts text/plain responses to pig latin, using Joe Strout's piglatin.py. (Note: a "real" middleware component would probably use a more robust way of checking the content type, and should also check for a content encoding. Also, this simple example ignores the possibility that a word might be split across a block boundary.)

from piglatin import piglatin

class LatinIter:

    """Transform iterated output to piglatin, if it's okay to do so

    Note that the "okayness" can change until the application yields
    its first non-empty string, so 'transform_ok' has to be a mutable
    truth value.
    """

    def __init__(self, result, transform_ok):
        if hasattr(result, 'close'):
            self.close = result.close
        self._next = iter(result).next
        self.transform_ok = transform_ok

    def __iter__(self):
        return self

    def next(self):
        if self.transform_ok:
            return piglatin(self._next())
        else:
            return self._next()

class Latinator:

    # by default, don't transform output
    transform = False

    def __init__(self, application):
        self.application = application

    def __call__(self, environ, start_response):

        transform_ok = []

        def start_latin(status, response_headers, exc_info=None):

            # Reset ok flag, in case this is a repeat call
            del transform_ok[:]

            for name, value in response_headers:
                if name.lower() == 'content-type' and value == 'text/plain':
                    transform_ok.append(True)
                    # Strip content-length if present, else it'll be wrong
                    response_headers = [(name, value)
                        for name, value in response_headers
                            if name.lower() != 'content-length'
                    ]
                    break

            write = start_response(status, response_headers, exc_info)

            if transform_ok:
                def write_latin(data):
                    write(piglatin(data))
                return write_latin
            else:
                return write

        return LatinIter(self.application(environ, start_latin), transform_ok)


# Run foo_app under a Latinator's control, using the example CGI gateway
from foo_app import foo_app
run_with_cgi(Latinator(foo_app))

Specification Details

The application object must accept two positional arguments. For the sake of illustration, we have named them environ and start_response, but they are not required to have these names. A server or gateway must invoke the application object using positional (not keyword) arguments. (E.g. by calling result = application(environ, start_response) as shown above.)

The environ parameter is a dictionary object, containing CGI-style environment variables. This object must be a builtin Python dictionary (not a subclass, UserDict or other dictionary emulation), and the application is allowed to modify the dictionary in any way it desires. The dictionary must also include certain WSGI-required variables (described in a later section), and may also include server-specific extension variables, named according to a convention that will be described below.

The start_response parameter is a callable accepting two required positional arguments, and one optional argument. For the sake of illustration, we have named these arguments status, response_headers, and exc_info, but they are not required to have these names, and the application must invoke the start_response callable using positional arguments (e.g. start_response(status, response_headers)).

The status parameter is a status string of the form "999 Message here", and response_headers is a list of (header_name, header_value) tuples describing the HTTP response header. The optional exc_info parameter is described below in the sections on The start_response() Callable and Error Handling. It is used only when the application has trapped an error and is attempting to display an error message to the browser.

The start_response callable must return a write(body_data) callable that takes one positional parameter: a string to be written as part of the HTTP response body. (Note: the write() callable is provided only to support certain existing frameworks' imperative output APIs; it should not be used by new applications or frameworks if it can be avoided. See the Buffering and Streaming section for more details.)

When called by the server, the application object must return an iterable yielding zero or more strings. This can be accomplished in a variety of ways, such as by returning a list of strings, or by the application being a generator function that yields strings, or by the application being a class whose instances are iterable. Regardless of how it is accomplished, the application object must always return an iterable yielding zero or more strings.

The server or gateway must transmit the yielded strings to the client in an unbuffered fashion, completing the transmission of each string before requesting another one. (In other words, applications should perform their own buffering. See the Buffering and Streaming section below for more on how application output must be handled.)

The server or gateway should treat the yielded strings as binary byte sequences: in particular, it should ensure that line endings are not altered. The application is responsible for ensuring that the string(s) to be written are in a format suitable for the client. (The server or gateway may apply HTTP transfer encodings, or perform other transformations for the purpose of implementing HTTP features such as byte-range transmission. See Other HTTP Features, below, for more details.)

If a call to len(iterable) succeeds, the server must be able to rely on the result being accurate. That is, if the iterable returned by the application provides a working __len__() method, it must return an accurate result. (See the Handling the Content-Length Header section for information on how this would normally be used.)
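As a server-side illustration of this guarantee (a sketch only; the helper name is invented and not part of the spec), a gateway might attempt len() opportunistically and fall back to streaming when the iterable does not support it:

```python
def can_precompute_content_length(result):
    # Sketch of how a server might exploit the len() guarantee above:
    # if the application's iterable supports len() and yields exactly
    # one string, Content-Length can be computed before iteration
    # begins.  (Helper name is hypothetical.)
    try:
        return len(result) == 1    # accurate per the spec if it succeeds
    except TypeError:
        # e.g. a generator: length unknown, so stream the response
        return False
```

For example, the simple_app shown earlier returns a one-element list, so a server could safely emit Content-Length up front; a generator-based application would be streamed instead.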

If the iterable returned by the application has a close() method, the server or gateway must call that method upon completion of the current request, whether the request was completed normally, or terminated early due to an error. (This is to support resource release by the application. This protocol is intended to complement PEP 325's generator support, and other common iterables with close() methods.)

(Note: the application must invoke the start_response() callable before the iterable yields its first body string, so that the server can send the headers before any body content. However, this invocation may be performed by the iterable's first iteration, so servers must not assume that start_response() has been called before they begin iterating over the iterable.)

Finally, servers and gateways must not directly use any other attributes of the iterable returned by the application, unless it is an instance of a type specific to that server or gateway, such as a "file wrapper" returned by wsgi.file_wrapper (see Optional Platform-Specific File Handling). In the general case, only attributes specified here, or accessed via e.g. the PEP 234 iteration APIs are acceptable.

environ Variables

The environ dictionary is required to contain these CGI environment variables, as defined by the Common Gateway Interface specification [2]. The following variables must be present, unless their value would be an empty string, in which case they may be omitted, except as otherwise noted below.

REQUEST_METHOD
The HTTP request method, such as "GET" or "POST". This cannot ever be an empty string, and so is always required.
SCRIPT_NAME
The initial portion of the request URL's "path" that corresponds to the application object, so that the application knows its virtual "location". This may be an empty string, if the application corresponds to the "root" of the server.
PATH_INFO
The remainder of the request URL's "path", designating the virtual "location" of the request's target within the application. This may be an empty string, if the request URL targets the application root and does not have a trailing slash.
QUERY_STRING
The portion of the request URL that follows the "?", if any. May be empty or absent.
CONTENT_TYPE
The contents of any Content-Type fields in the HTTP request. May be empty or absent.
CONTENT_LENGTH
The contents of any Content-Length fields in the HTTP request. May be empty or absent.
SERVER_NAME, SERVER_PORT
When combined with SCRIPT_NAME and PATH_INFO, these variables can be used to complete the URL. Note, however, that HTTP_HOST, if present, should be used in preference to SERVER_NAME for reconstructing the request URL. See the URL Reconstruction section below for more detail. SERVER_NAME and SERVER_PORT can never be empty strings, and so are always required.
SERVER_PROTOCOL
The version of the protocol the client used to send the request. Typically this will be something like "HTTP/1.0" or "HTTP/1.1" and may be used by the application to determine how to treat any HTTP request headers. (This variable should probably be called REQUEST_PROTOCOL, since it denotes the protocol used in the request, and is not necessarily the protocol that will be used in the server's response. However, for compatibility with CGI we have to keep the existing name.)
HTTP_ Variables
Variables corresponding to the client-supplied HTTP request headers (i.e., variables whose names begin with "HTTP_"). The presence or absence of these variables should correspond with the presence or absence of the appropriate HTTP header in the request.
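The URL Reconstruction section referred to above gives the authoritative recipe; as a condensed, illustrative sketch of the idea (prefer HTTP_HOST over SERVER_NAME, and omit the scheme's default port), it might look like:

```python
try:
    from urllib import quote           # Python 2
except ImportError:
    from urllib.parse import quote     # Python 3

def reconstruct_url(environ):
    # Condensed sketch of request-URL reconstruction from the
    # variables above; see the URL Reconstruction section for the
    # full version.  (Function name is hypothetical.)
    url = environ['wsgi.url_scheme'] + '://'
    if environ.get('HTTP_HOST'):
        url += environ['HTTP_HOST']    # preferred over SERVER_NAME
    else:
        url += environ['SERVER_NAME']
        if environ['wsgi.url_scheme'] == 'https':
            default_port = '443'
        else:
            default_port = '80'
        if environ['SERVER_PORT'] != default_port:
            url += ':' + environ['SERVER_PORT']
    url += quote(environ.get('SCRIPT_NAME', ''))
    url += quote(environ.get('PATH_INFO', ''))
    if environ.get('QUERY_STRING'):
        url += '?' + environ['QUERY_STRING']
    return url
```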

A server or gateway should attempt to provide as many other CGI variables as are applicable. In addition, if SSL is in use, the server or gateway should also provide as many of the Apache SSL environment variables [5] as are applicable, such as HTTPS=on and SSL_PROTOCOL. Note, however, that an application that uses any CGI variables other than the ones listed above is necessarily non-portable to web servers that do not support the relevant extensions. (For example, web servers that do not publish files will not be able to provide a meaningful DOCUMENT_ROOT or PATH_TRANSLATED.)

A WSGI-compliant server or gateway should document what variables it provides, along with their definitions as appropriate. Applications should check for the presence of any variables they require, and have a fallback plan in the event such a variable is absent.

Note: missing variables (such as REMOTE_USER when no authentication has occurred) should be left out of the environ dictionary. Also note that CGI-defined variables must be strings, if they are present at all. It is a violation of this specification for a CGI variable's value to be of any type other than str.

In addition to the CGI-defined variables, the environ dictionary may also contain arbitrary operating-system "environment variables", and must contain the following WSGI-defined variables:

wsgi.version
The tuple (1, 0), representing WSGI version 1.0.
wsgi.url_scheme
A string representing the "scheme" portion of the URL at which the application is being invoked. Normally, this will have the value "http" or "https", as appropriate.
wsgi.input
An input stream (file-like object) from which the HTTP request body can be read. (The server or gateway may perform reads on-demand as requested by the application, or it may pre-read the client's request body and buffer it in-memory or on disk, or use any other technique for providing such an input stream, according to its preference.)
wsgi.errors
An output stream (file-like object) to which error output can be written, for the purpose of recording program or other errors in a standardized and possibly centralized location. This should be a "text mode" stream; i.e., applications should use "\n" as a line ending, and assume that it will be converted to the correct line ending by the server/gateway.

For many servers, wsgi.errors will be the server's main error log. Alternatively, this may be sys.stderr, or a log file of some sort. The server's documentation should include an explanation of how to configure this or where to find the recorded output. A server or gateway may supply different error streams to different applications, if this is desired.

wsgi.multithread
This value should evaluate true if the application object may be simultaneously invoked by another thread in the same process, and should evaluate false otherwise.
wsgi.multiprocess
This value should evaluate true if an equivalent application object may be simultaneously invoked by another process, and should evaluate false otherwise.
wsgi.run_once
This value should evaluate true if the server or gateway expects (but does not guarantee!) that the application will only be invoked this one time during the life of its containing process. Normally, this will only be true for a gateway based on CGI (or something similar).

Finally, the environ dictionary may also contain server-defined variables. These variables should be named using only lower-case letters, numbers, dots, and underscores, and should be prefixed with a name that is unique to the defining server or gateway. For example, mod_python might define variables with names like mod_python.some_variable.
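A gateway author might sanity-check an environ before handing it to an application. The following is purely illustrative (the helper is not part of the spec) and simply restates the requirements above as assertions:

```python
def check_wsgi_environ(environ):
    # Illustrative sanity check (not part of the spec): verify that
    # the WSGI-defined keys listed above are present with plausible
    # values.  A real gateway would construct, not check, these keys.
    for key in ('wsgi.version', 'wsgi.url_scheme', 'wsgi.input',
                'wsgi.errors', 'wsgi.multithread', 'wsgi.multiprocess',
                'wsgi.run_once'):
        assert key in environ, 'missing %s' % key
    assert environ['wsgi.version'] == (1, 0)
    assert environ['wsgi.url_scheme'] in ('http', 'https')
```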

Input and Error Streams

The input and error streams provided by the server must support the following methods:

Method Stream Notes
read(size) input 1
readline() input 1, 2
readlines(hint) input 1, 3
__iter__() input  
flush() errors 4
write(str) errors  
writelines(seq) errors  

The semantics of each method are as documented in the Python Library Reference, except for these notes as listed in the table above:

  1. The server is not required to read past the client's specified Content-Length, and is allowed to simulate an end-of-file condition if the application attempts to read past that point. The application should not attempt to read more data than is specified by the CONTENT_LENGTH variable.
  2. The optional "size" argument to readline() is not supported, as it may be complex for server authors to implement, and is not often used in practice.
  3. Note that the hint argument to readlines() is optional for both caller and implementer. The application is free not to supply it, and the server or gateway is free to ignore it.
  4. Since the errors stream may not be rewound, servers and gateways are free to forward write operations immediately, without buffering. In this case, the flush() method may be a no-op. Portable applications, however, cannot assume that output is unbuffered or that flush() is a no-op. They must call flush() if they need to ensure that output has in fact been written. (For example, to minimize intermingling of data from multiple processes writing to the same error log.)
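Note 1 above suggests a common application-side idiom: bound all reads by CONTENT_LENGTH rather than reading to end-of-file. A minimal sketch (the helper name is invented for illustration):

```python
def read_request_body(environ):
    # Sketch of note 1 above (helper name hypothetical): portable
    # applications should bound their reads by CONTENT_LENGTH rather
    # than reading wsgi.input to EOF, since the server need not
    # simulate end-of-file at that point.
    try:
        length = int(environ.get('CONTENT_LENGTH') or 0)
    except ValueError:
        length = 0          # absent or malformed: treat as no body
    return environ['wsgi.input'].read(length)
```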

The methods listed in the table above must be supported by all servers conforming to this specification. Applications conforming to this specification must not use any other methods or attributes of the input or errors objects. In particular, applications must not attempt to close these streams, even if they possess close() methods.

The start_response() Callable

The second parameter passed to the application object is a callable of the form start_response(status, response_headers, exc_info=None). (As with all WSGI callables, the arguments must be supplied positionally, not by keyword.) The start_response callable is used to begin the HTTP response, and it must return a write(body_data) callable (see the Buffering and Streaming section, below).

The status argument is an HTTP "status" string like "200 OK" or "404 Not Found". That is, it is a string consisting of a Status-Code and a Reason-Phrase, in that order and separated by a single space, with no surrounding whitespace or other characters. (See RFC 2616, Section 6.1.1 for more information.) The string must not contain control characters, and must not be terminated with a carriage return, linefeed, or combination thereof.

The response_headers argument is a list of (header_name, header_value) tuples. It must be a Python list; i.e. type(response_headers) is ListType, and the server may change its contents in any way it desires. Each header_name must be a valid HTTP header field-name (as defined by RFC 2616, Section 4.2), without a trailing colon or other punctuation.

Each header_value must not include any control characters, including carriage returns or linefeeds, either embedded or at the end. (These requirements are to minimize the complexity of any parsing that must be performed by servers, gateways, and intermediate response processors that need to inspect or modify response headers.)
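These constraints lend themselves to mechanical checking. The following sketch illustrates the checks described above; the function name and error messages are illustrative only, and not part of this specification:

```python
def check_status_and_headers(status, response_headers):
    # Illustrative checks only; not part of the specification.
    # Status must be "NNN Reason-Phrase", separated by one space.
    code, sep, reason = status.partition(' ')
    assert sep and len(code) == 3 and code.isdigit(), \
        "status must be of the form 'NNN Reason-Phrase'"
    assert not any(ord(c) < 32 or ord(c) == 127 for c in status), \
        "status must not contain control characters"
    # response_headers must be a plain list of (name, value) tuples.
    assert type(response_headers) is list, \
        "response_headers must be a plain list"
    for name, value in response_headers:
        assert not name.endswith(':'), \
            "header names must not have a trailing colon"
        assert not any(ord(c) < 32 or ord(c) == 127 for c in value), \
            "header values must not contain control characters"
```

A conforming server need not perform such checks, but doing so during development can surface nonconforming applications early.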

In general, the server or gateway is responsible for ensuring that correct headers are sent to the client: if the application omits a header required by HTTP (or other relevant specifications that are in effect), the server or gateway must add it. For example, the HTTP Date: and Server: headers would normally be supplied by the server or gateway.

(A reminder for server/gateway authors: HTTP header names are case-insensitive, so be sure to take that into consideration when examining application-supplied headers!)
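A case-insensitive lookup over a response_headers list might be sketched as follows (the helper name is illustrative, not part of the specification):

```python
def find_header(response_headers, name, default=None):
    # Case-insensitive lookup in a WSGI response_headers list.
    # Returns the first matching value, or `default` if absent.
    name = name.lower()
    for header_name, value in response_headers:
        if header_name.lower() == name:
            return value
    return default
```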

Applications and middleware are forbidden from using HTTP/1.1 "hop-by-hop" features or headers, any equivalent features in HTTP/1.0, or any headers that would affect the persistence of the client's connection to the web server. These features are the exclusive province of the actual web server, and a server or gateway should consider it a fatal error for an application to attempt sending them, and raise an error if they are supplied to start_response(). (For more specifics on "hop-by-hop" features and headers, please see the Other HTTP Features section below.)

The start_response callable must not actually transmit the response headers. Instead, it must store them for the server or gateway to transmit only after the first iteration of the application return value that yields a non-empty string, or upon the application's first invocation of the write() callable. In other words, response headers must not be sent until there is actual body data available, or until the application's returned iterable is exhausted. (The only possible exception to this rule is if the response headers explicitly include a Content-Length of zero.)
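A gateway might satisfy this requirement with closures such as the following sketch, which records the status and headers and emits them only when write() is first invoked. (The `out` stream and the header wire format here are stand-ins for whatever transport the gateway actually uses; this is not a complete gateway.)

```python
def make_start_response(out):
    # Gateway-side sketch: headers are stored, not sent, until the
    # first invocation of write() (or, in a full gateway, the first
    # non-empty string yielded by the application's iterable).
    state = {'headers': None, 'sent': False}

    def send_headers():
        status, headers = state['headers']
        out.write('Status: %s\r\n' % status)
        for name, value in headers:
            out.write('%s: %s\r\n' % (name, value))
        out.write('\r\n')
        state['sent'] = True

    def write(data):
        if not state['sent']:
            send_headers()
        out.write(data)

    def start_response(status, response_headers, exc_info=None):
        state['headers'] = (status, response_headers)
        return write

    return start_response, state

```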

This delaying of response header transmission is to ensure that buffered and asynchronous applications can replace their originally intended output with error output, up until the last possible moment. For example, the application may need to change the response status from "200 OK" to "500 Internal Error", if an error occurs while the body is being generated within an application buffer.

The exc_info argument, if supplied, must be a Python sys.exc_info() tuple. This argument should be supplied by the application only if start_response is being called by an error handler. If exc_info is supplied, and no HTTP headers have been output yet, start_response should replace the currently-stored HTTP response headers with the newly-supplied ones, thus allowing the application to "change its mind" about the output when an error has occurred.

However, if exc_info is provided, and the HTTP headers have already been sent, start_response must raise an error, and should raise the exc_info tuple. That is:

raise exc_info[0], exc_info[1], exc_info[2]

This will re-raise the exception trapped by the application, and in principle should abort the application. (It is not safe for the application to attempt error output to the browser once the HTTP headers have already been sent.) The application must not trap any exceptions raised by start_response, if it called start_response with exc_info. Instead, it should allow such exceptions to propagate back to the server or gateway. See Error Handling below, for more details.

The application may call start_response more than once, if and only if the exc_info argument is provided. More precisely, it is a fatal error to call start_response without the exc_info argument if start_response has already been called within the current invocation of the application. (See the example CGI gateway above for an illustration of the correct logic.)
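A server or gateway can enforce this rule by wrapping its own start_response implementation, as in this sketch (the wrapper name is illustrative):

```python
def make_checked_start_response(inner):
    # Wraps a gateway's start_response to enforce the calling rule
    # described above: a second call is a fatal error unless
    # exc_info is supplied. (Illustrative; not part of the spec.)
    calls = []

    def start_response(status, response_headers, exc_info=None):
        if calls and exc_info is None:
            raise AssertionError(
                "start_response() called a second time without exc_info")
        calls.append(status)
        return inner(status, response_headers, exc_info)

    return start_response
```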

Note: servers, gateways, or middleware implementing start_response should ensure that no reference is held to the exc_info parameter beyond the duration of the function's execution, to avoid creating a circular reference through the traceback and frames involved. The simplest way to do this is something like:

def start_response(status, response_headers, exc_info=None):
    if exc_info:
        try:
            # do stuff w/exc_info here
            pass
        finally:
            exc_info = None    # Avoid circular ref.

The example CGI gateway provides another illustration of this technique.

Handling the Content-Length Header

If the application does not supply a Content-Length header, a server or gateway may choose one of several approaches to handling it. The simplest of these is to close the client connection when the response is completed.

Under some circumstances, however, the server or gateway may be able to either generate a Content-Length header, or at least avoid the need to close the client connection. If the application does not call the write() callable, and returns an iterable whose len() is 1, then the server can automatically determine Content-Length by taking the length of the first string yielded by the iterable.
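That strategy might look like the following sketch, which assumes the single element is a byte string whose len() gives its size (the helper name is illustrative):

```python
def maybe_content_length(result, response_headers):
    # If the application supplied no Content-Length and returned a
    # one-element sized sequence, a gateway may compute the header
    # itself. (Sketch only; assumes the element is a byte string.)
    names = [name.lower() for name, value in response_headers]
    if 'content-length' not in names:
        try:
            if len(result) == 1:
                response_headers.append(
                    ('Content-Length', str(len(result[0]))))
        except TypeError:
            pass  # not a sized sequence; length can't be cheaply known
    return response_headers
```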

And, if the server and client both support HTTP/1.1 "chunked encoding" [3], then the server may use chunked encoding to send a chunk for each write() call or string yielded by the iterable, thus generating a Content-Length header for each chunk. This allows the server to keep the client connection alive, if it wishes to do so. Note that the server must comply fully with RFC 2616 when doing this, or else fall back to one of the other strategies for dealing with the absence of Content-Length.
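The chunk framing itself is simple; a sketch:

```python
def make_chunk(data):
    # Frame one non-empty block as an HTTP/1.1 chunk: the length in
    # hex, CRLF, the data, CRLF. A real gateway must skip empty
    # blocks (a zero-size chunk terminates the body) and must send a
    # final zero-size chunk after the last block, per RFC 2616.
    return b'%x\r\n' % len(data) + data + b'\r\n'
```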

(Note: applications and middleware must not apply any kind of Transfer-Encoding to their output, such as chunking or gzipping; as "hop-by-hop" operations, these encodings are the province of the actual web server/gateway. See Other HTTP Features below, for more details.)

Buffering and Streaming

Generally speaking, applications will achieve the best throughput by buffering their (modestly-sized) output and sending it all at once. This is a common approach in existing frameworks such as Zope: the output is buffered in a StringIO or similar object, then transmitted all at once, along with the response headers.

The corresponding approach in WSGI is for the application to simply return a single-element iterable (such as a list) containing the response body as a single string. This is the recommended approach for the vast majority of application functions that render HTML pages whose text easily fits in memory.

For large files, however, or for specialized uses of HTTP streaming (such as multipart "server push"), an application may need to provide output in smaller blocks (e.g. to avoid loading a large file into memory). It's also sometimes the case that part of a response may be time-consuming to produce, but it would be useful to send ahead the portion of the response that precedes it.

In these cases, applications will usually return an iterator (often a generator-iterator) that produces the output in a block-by-block fashion. These blocks may be broken to coincide with multipart boundaries (for "server push"), or just before time-consuming tasks (such as reading another block of an on-disk file).

WSGI servers, gateways, and middleware must not delay the transmission of any block; they must either fully transmit the block to the client, or guarantee that they will continue transmission even while the application is producing its next block. A server/gateway or middleware may provide this guarantee in one of three ways:

  1. Send the entire block to the operating system (and request that any O/S buffers be flushed) before returning control to the application, OR
  2. Use a different thread to ensure that the block continues to be transmitted while the application produces the next block, OR
  3. (Middleware only) send the entire block to its parent gateway/server.

By providing this guarantee, WSGI allows applications to ensure that transmission will not become stalled at an arbitrary point in their output data. This is critical for proper functioning of e.g. multipart "server push" streaming, where data between multipart boundaries should be transmitted in full to the client.

Middleware Handling of Block Boundaries

In order to better support asynchronous applications and servers, middleware components must not block iteration waiting for multiple values from an application iterable. If the middleware needs to accumulate more data from the application before it can produce any output, it must yield an empty string.

To put this requirement another way, a middleware component must yield at least one value each time its underlying application yields a value. If the middleware cannot yield any other value, it must yield an empty string.
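A middleware generator that accumulates blocks can satisfy this requirement as in the following sketch, where `ready` is a hypothetical predicate (supplied by the middleware) deciding when the buffered data can be flushed:

```python
def buffering_middleware_iter(app_iterable, ready):
    # Accumulate blocks, but yield exactly once per underlying
    # yield, producing '' while more input is still needed.
    # `ready` is a hypothetical flush predicate, not part of WSGI.
    buffered = []
    for data in app_iterable:
        buffered.append(data)
        if ready(buffered):
            yield ''.join(buffered)
            buffered = []
        else:
            yield ''            # placeholder: no output available yet
    if buffered:
        yield ''.join(buffered)
```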

This requirement ensures that asynchronous applications and servers can conspire to reduce the number of threads that are required to run a given number of application instances simultaneously.

Note also that this requirement means that middleware must return an iterable as soon as its underlying application returns an iterable. It is also forbidden for middleware to use the write() callable to transmit data that is yielded by an underlying application. Middleware may only use their parent server's write() callable to transmit data that the underlying application sent using a middleware-provided write() callable.

The write() Callable

Some existing application framework APIs support unbuffered output in a different manner than WSGI. Specifically, they provide a "write" function or method of some kind to write an unbuffered block of data, or else they provide a buffered "write" function and a "flush" mechanism to flush the buffer.

Unfortunately, such APIs cannot be implemented in terms of WSGI's "iterable" application return value, unless threads or other special mechanisms are used.

Therefore, to allow these frameworks to continue using an imperative API, WSGI includes a special write() callable, returned by the start_response callable.

New WSGI applications and frameworks should not use the write() callable if it is possible to avoid doing so. The write() callable is strictly a hack to support imperative streaming APIs. In general, applications should produce their output via their returned iterable, as this makes it possible for web servers to interleave other tasks in the same Python thread, potentially providing better throughput for the server as a whole.

The write() callable is returned by the start_response() callable, and it accepts a single parameter: a string to be written as part of the HTTP response body, that is treated exactly as though it had been yielded by the output iterable. In other words, before write() returns, it must guarantee that the passed-in string was either completely sent to the client, or that it is buffered for transmission while the application proceeds onward.

An application must return an iterable object, even if it uses write() to produce all or part of its response body. The returned iterable may be empty (i.e. yield no non-empty strings), but if it does yield non-empty strings, that output must be treated normally by the server or gateway (i.e., it must be sent or queued immediately). Applications must not invoke write() from within their return iterable, and therefore any strings yielded by the iterable are transmitted after all strings passed to write() have been sent to the client.

Unicode Issues

HTTP does not directly support Unicode, and neither does this interface. All encoding/decoding must be handled by the application; all strings passed to or from the server must be standard Python byte strings, not Unicode objects. The result of using a Unicode object where a string object is required, is undefined.

Note also that strings passed to start_response() as a status or as response headers must follow RFC 2616 with respect to encoding. That is, they must either be ISO-8859-1 characters, or use RFC 2047 MIME encoding.

On Python platforms where the str or StringType type is in fact Unicode-based (e.g. Jython, IronPython, Python 3000, etc.), all "strings" referred to in this specification must contain only code points representable in ISO-8859-1 encoding (\u0000 through \u00FF, inclusive). It is a fatal error for an application to supply strings containing any other Unicode character or code point. Similarly, servers and gateways must not supply strings to an application containing any other Unicode characters.

Again, all strings referred to in this specification must be of type str or StringType, and must not be of type unicode or UnicodeType. And, even if a given platform allows for more than 8 bits per character in str/StringType objects, only the lower 8 bits may be used, for any value referred to in this specification as a "string".
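On Unicode-based platforms, this restriction can be checked with a simple helper; a sketch (the function name is illustrative):

```python
def check_wsgi_string(s):
    # All "strings" in this specification must contain only code
    # points representable in ISO-8859-1 (\u0000 through \u00FF).
    if not all(ord(c) <= 0xFF for c in s):
        raise AssertionError(
            'string contains code points above \\u00FF')
    return s
```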

Error Handling

In general, applications should try to trap their own, internal errors, and display a helpful message in the browser. (It is up to the application to decide what "helpful" means in this context.)

However, to display such a message, the application must not have actually sent any data to the browser yet, or else it risks corrupting the response. WSGI therefore provides a mechanism to either allow the application to send its error message, or be automatically aborted: the exc_info argument to start_response. Here is an example of its use:

try:
    # regular application code here
    status = "200 Froody"
    response_headers = [("content-type", "text/plain")]
    start_response(status, response_headers)
    return ["normal body goes here"]
except:
    # XXX should trap runtime issues like MemoryError, KeyboardInterrupt
    #     in a separate handler before this bare 'except:'...
    status = "500 Oops"
    response_headers = [("content-type", "text/plain")]
    start_response(status, response_headers, sys.exc_info())
    return ["error body goes here"]

If no output has been written when an exception occurs, the call to start_response will return normally, and the application will return an error body to be sent to the browser. However, if any output has already been sent to the browser, start_response will reraise the provided exception. This exception should not be trapped by the application, and so the application will abort. The server or gateway can then trap this (fatal) exception and abort the response.

Servers should trap and log any exception that aborts an application or the iteration of its return value. If a partial response has already been written to the browser when an application error occurs, the server or gateway may attempt to add an error message to the output, if the already-sent headers indicate a text/* content type that the server knows how to modify cleanly.

Some middleware may wish to provide additional exception handling services, or intercept and replace application error messages. In such cases, middleware may choose to not re-raise the exc_info supplied to start_response, but instead raise a middleware-specific exception, or simply return without an exception after storing the supplied arguments. This will then cause the application to return its error body iterable (or invoke write()), allowing the middleware to capture and modify the error output. These techniques will work as long as application authors:

  1. Always provide exc_info when beginning an error response
  2. Never trap errors raised by start_response when exc_info is being provided

HTTP 1.1 Expect/Continue

Servers and gateways that implement HTTP 1.1 must provide transparent support for HTTP 1.1's "expect/continue" mechanism. This may be done in any of several ways:

  1. Respond to requests containing an Expect: 100-continue header with an immediate "100 Continue" response, and proceed normally.
  2. Proceed with the request normally, but provide the application with a wsgi.input stream that will send the "100 Continue" response if/when the application first attempts to read from the input stream. The read request must then remain blocked until the client responds.
  3. Wait until the client decides that the server does not support expect/continue, and sends the request body on its own. (This is suboptimal, and is not recommended.)

Note that these behavior restrictions do not apply for HTTP 1.0 requests, or for requests that are not directed to an application object. For more information on HTTP 1.1 Expect/Continue, see RFC 2616, sections 8.2.3 and 10.1.1.

Other HTTP Features

In general, servers and gateways should "play dumb" and allow the application complete control over its output. They should only make changes that do not alter the effective semantics of the application's response. It is always possible for the application developer to add middleware components to supply additional features, so server/gateway developers should be conservative in their implementation. In a sense, a server should consider itself to be like an HTTP "gateway server", with the application being an HTTP "origin server". (See RFC 2616, section 1.3, for the definition of these terms.)

However, because WSGI servers and applications do not communicate via HTTP, what RFC 2616 calls "hop-by-hop" headers do not apply to WSGI internal communications. WSGI applications must not generate any "hop-by-hop" headers [4], attempt to use HTTP features that would require them to generate such headers, or rely on the content of any incoming "hop-by-hop" headers in the environ dictionary. WSGI servers must handle any supported inbound "hop-by-hop" headers on their own, such as by decoding any inbound Transfer-Encoding, including chunked encoding if applicable.

Applying these principles to a variety of HTTP features, it should be clear that a server may handle cache validation via the If-None-Match and If-Modified-Since request headers and the Last-Modified and ETag response headers. However, it is not required to do this, and the application should perform its own cache validation if it wants to support that feature, since the server/gateway is not required to do such validation.
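A server choosing to offer such validation might compare the request's If-None-Match header against the response ETag, roughly as follows (a sketch only; full RFC 2616 entity-tag comparison rules, such as weak validators, are glossed over):

```python
def not_modified(environ, etag):
    # Sketch of optional server-side cache validation: True means
    # the server may answer "304 Not Modified" for this ETag.
    inm = environ.get('HTTP_IF_NONE_MATCH')
    if inm is None:
        return False
    candidates = [tag.strip() for tag in inm.split(',')]
    return etag in candidates or '*' in candidates
```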

Similarly, a server may re-encode or transport-encode an application's response, but the application should use a suitable content encoding on its own, and must not apply a transport encoding. A server may transmit byte ranges of the application's response if requested by the client, and the application doesn't natively support byte ranges. Again, however, the application should perform this function on its own if desired.

Note that these restrictions on applications do not necessarily mean that every application must reimplement every HTTP feature; many HTTP features can be partially or fully implemented by middleware components, thus freeing both server and application authors from implementing the same features over and over again.

Thread Support

Thread support, or lack thereof, is also server-dependent. Servers that can run multiple requests in parallel, should also provide the option of running an application in a single-threaded fashion, so that applications or frameworks that are not thread-safe may still be used with that server.

Implementation/Application Notes

Server Extension APIs

Some server authors may wish to expose more advanced APIs that application or framework authors can use for specialized purposes. For example, a gateway based on mod_python might wish to expose part of the Apache API as a WSGI extension.

In the simplest case, this requires nothing more than defining an environ variable, such as mod_python.some_api. But, in many cases, the possible presence of middleware can make this difficult. For example, an API that offers access to the same HTTP headers that are found in environ variables, might return different data if environ has been modified by middleware.

In general, any extension API that duplicates, supplants, or bypasses some portion of WSGI functionality runs the risk of being incompatible with middleware components. Server/gateway developers should not assume that nobody will use middleware, because some framework developers specifically intend to organize or reorganize their frameworks to function almost entirely as middleware of various kinds.

So, to provide maximum compatibility, servers and gateways that provide extension APIs that replace some WSGI functionality, must design those APIs so that they are invoked using the portion of the API that they replace. For example, an extension API to access HTTP request headers must require the application to pass in its current environ, so that the server/gateway may verify that HTTP headers accessible via the API have not been altered by middleware. If the extension API cannot guarantee that it will always agree with environ about the contents of HTTP headers, it must refuse service to the application, e.g. by raising an error, returning None instead of a header collection, or whatever is appropriate to the API.

Similarly, if an extension API provides an alternate means of writing response data or headers, it should require the start_response callable to be passed in, before the application can obtain the extended service. If the object passed in is not the same one that the server/gateway originally supplied to the application, it cannot guarantee correct operation and must refuse to provide the extended service to the application.

These guidelines also apply to middleware that adds information such as parsed cookies, form variables, sessions, and the like to environ. Specifically, such middleware should provide these features as functions which operate on environ, rather than simply stuffing values into environ. This helps ensure that information is calculated from environ after any middleware has done any URL rewrites or other environ modifications.
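For example, rather than parsing cookies eagerly and stuffing the result into environ, middleware can expose a function that computes them on demand from the current environ, caching under its own key. In this sketch, the `example.cookies` key and the function name are illustrative, not part of any standard:

```python
def get_cookies(environ):
    # Recommended pattern: derive data from environ on demand, so it
    # reflects any URL rewrites or other modifications made by
    # intervening middleware. The cache key is hypothetical.
    cached = environ.get('example.cookies')
    if cached is None:
        raw = environ.get('HTTP_COOKIE', '')
        cached = {}
        for part in raw.split(';'):
            if '=' in part:
                name, _, value = part.strip().partition('=')
                cached[name] = value
        environ['example.cookies'] = cached
    return cached
```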

It is very important that these "safe extension" rules be followed by both server/gateway and middleware developers, in order to avoid a future in which middleware developers are forced to delete any and all extension APIs from environ to ensure that their mediation isn't being bypassed by applications using those extensions!

Application Configuration

This specification does not define how a server selects or obtains an application to invoke. These and other configuration options are highly server-specific matters. It is expected that server/gateway authors will document how to configure the server to execute a particular application object, and with what options (such as threading options).

Framework authors, on the other hand, should document how to create an application object that wraps their framework's functionality. The user, who has chosen both the server and the application framework, must connect the two together. However, since both the framework and the server now have a common interface, this should be merely a mechanical matter, rather than a significant engineering effort for each new server/framework pair.

Finally, some applications, frameworks, and middleware may wish to use the environ dictionary to receive simple string configuration options. Servers and gateways should support this by allowing an application's deployer to specify name-value pairs to be placed in environ. In the simplest case, this support can consist merely of copying all operating system-supplied environment variables from os.environ into the environ dictionary, since the deployer in principle can configure these externally to the server, or in the CGI case they may be able to be set via the server's configuration files.

Applications should try to keep such required variables to a minimum, since not all servers will support easy configuration of them. Of course, even in the worst case, persons deploying an application can create a script to supply the necessary configuration values:

from the_app import application

def new_app(environ, start_response):
    environ['the_app.configval1'] = 'something'
    return application(environ, start_response)

But, most existing applications and frameworks will probably only need a single configuration value from environ, to indicate the location of their application or framework-specific configuration file(s). (Of course, applications should cache such configuration, to avoid having to re-read it upon each invocation.)

URL Reconstruction

If an application wishes to reconstruct a request's complete URL, it may do so using the following algorithm, contributed by Ian Bicking:

from urllib import quote
url = environ['wsgi.url_scheme']+'://'

if environ.get('HTTP_HOST'):
    url += environ['HTTP_HOST']
else:
    url += environ['SERVER_NAME']

    if environ['wsgi.url_scheme'] == 'https':
        if environ['SERVER_PORT'] != '443':
            url += ':' + environ['SERVER_PORT']
    else:
        if environ['SERVER_PORT'] != '80':
            url += ':' + environ['SERVER_PORT']

url += quote(environ.get('SCRIPT_NAME', ''))
url += quote(environ.get('PATH_INFO', ''))
if environ.get('QUERY_STRING'):
    url += '?' + environ['QUERY_STRING']

Note that such a reconstructed URL may not be precisely the same URI as requested by the client. Server rewrite rules, for example, may have modified the client's originally requested URL to place it in a canonical form.

Supporting Older (<2.2) Versions of Python

Some servers, gateways, or applications may wish to support older (<2.2) versions of Python. This is especially important if Jython is a target platform, since as of this writing a production-ready version of Jython 2.2 is not yet available.

For servers and gateways, this is relatively straightforward: servers and gateways targeting pre-2.2 versions of Python must simply restrict themselves to using only a standard "for" loop to iterate over any iterable returned by an application. This is the only way to ensure source-level compatibility with both the pre-2.2 iterator protocol (discussed further below) and "today's" iterator protocol (see PEP 234).

(Note that this technique necessarily applies only to servers, gateways, or middleware that are written in Python. Discussion of how to use iterator protocol(s) correctly from other languages is outside the scope of this PEP.)

For applications, supporting pre-2.2 versions of Python is slightly more complex:

  • You may not return a file object and expect it to work as an iterable, since before Python 2.2, files were not iterable. (In general, you shouldn't do this anyway, because it will perform quite poorly most of the time!) Use wsgi.file_wrapper or an application-specific file wrapper class. (See Optional Platform-Specific File Handling for more on wsgi.file_wrapper, and an example class you can use to wrap a file as an iterable.)
  • If you return a custom iterable, it must implement the pre-2.2 iterator protocol. That is, provide a __getitem__ method that accepts an integer key, and raises IndexError when exhausted. (Note that built-in sequence types are also acceptable, since they also implement this protocol.)

Finally, middleware that wishes to support pre-2.2 versions of Python, and iterates over application return values or itself returns an iterable (or both), must follow the appropriate recommendations above.

(Note: It should go without saying that to support pre-2.2 versions of Python, any server, gateway, application, or middleware must also use only language features available in the target version, use 1 and 0 instead of True and False, etc.)

Optional Platform-Specific File Handling

Some operating environments provide special high-performance file-transmission facilities, such as the Unix sendfile() call. Servers and gateways may expose this functionality via an optional wsgi.file_wrapper key in the environ. An application may use this "file wrapper" to convert a file or file-like object into an iterable that it then returns, e.g.:

if 'wsgi.file_wrapper' in environ:
    return environ['wsgi.file_wrapper'](filelike, block_size)
else:
    return iter(lambda: filelike.read(block_size), '')

If the server or gateway supplies wsgi.file_wrapper, it must be a callable that accepts one required positional parameter, and one optional positional parameter. The first parameter is the file-like object to be sent, and the second parameter is an optional block size "suggestion" (which the server/gateway need not use). The callable must return an iterable object, and must not perform any data transmission until and unless the server/gateway actually receives the iterable as a return value from the application. (To do otherwise would prevent middleware from being able to interpret or override the response data.)

To be considered "file-like", the object supplied by the application must have a read() method that takes an optional size argument. It may have a close() method, and if so, the iterable returned by wsgi.file_wrapper must have a close() method that invokes the original file-like object's close() method. If the "file-like" object has any other methods or attributes with names matching those of Python built-in file objects (e.g. fileno()), the wsgi.file_wrapper may assume that these methods or attributes have the same semantics as those of a built-in file object.

The actual implementation of any platform-specific file handling must occur after the application returns, and the server or gateway checks to see if a wrapper object was returned. (Again, because of the presence of middleware, error handlers, and the like, it is not guaranteed that any wrapper created will actually be used.)

Apart from the handling of close(), the semantics of returning a file wrapper from the application should be the same as if the application had returned iter(filelike.read, ''). In other words, transmission should begin at the current position within the "file" at the time that transmission begins, and continue until the end is reached.

Of course, platform-specific file transmission APIs don't usually accept arbitrary "file-like" objects. Therefore, a wsgi.file_wrapper has to introspect the supplied object for things such as a fileno() (Unix-like OSes) or a java.nio.FileChannel (under Jython) in order to determine if the file-like object is suitable for use with the platform-specific API it supports.

Note that even if the object is not suitable for the platform API, the wsgi.file_wrapper must still return an iterable that wraps read() and close(), so that applications using file wrappers are portable across platforms. Here's a simple platform-agnostic file wrapper class, suitable for old (pre 2.2) and new Pythons alike:

class FileWrapper:

    def __init__(self, filelike, blksize=8192):
        self.filelike = filelike
        self.blksize = blksize
        if hasattr(filelike, 'close'):
            self.close = filelike.close

    def __getitem__(self, key):
        data = self.filelike.read(self.blksize)
        if data:
            return data
        raise IndexError

and here is a snippet from a server/gateway that uses it to provide access to a platform-specific API:

environ['wsgi.file_wrapper'] = FileWrapper
result = application(environ, start_response)

try:
    if isinstance(result, FileWrapper):
        # Check whether result.filelike is usable with the
        # platform-specific API, and if so, use that API to
        # transmit the result; otherwise, fall through to the
        # normal iterable-handling loop below.
        pass

    for data in result:
        pass  # transmit each block normally

finally:
    if hasattr(result, 'close'):
        result.close()

Questions and Answers

  1. Why must environ be a dictionary? What's wrong with using a subclass?

    The rationale for requiring a dictionary is to maximize portability between servers. The alternative would be to define some subset of a dictionary's methods as being the standard and portable interface. In practice, however, most servers will probably find a dictionary adequate to their needs, and thus framework authors will come to expect the full set of dictionary features to be available, since they will be there more often than not. But, if some server chooses not to use a dictionary, then there will be interoperability problems despite that server's "conformance" to spec. Therefore, making a dictionary mandatory simplifies the specification and guarantees interoperability.

    Note that this does not prevent server or framework developers from offering specialized services as custom variables inside the environ dictionary. This is the recommended approach for offering any such value-added services.
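    Such a value-added service might look like the following sketch; the `example.db_connection` key and its value are purely illustrative, not part of the specification:

```python
# Server side: standard CGI/WSGI keys plus one server-chosen
# extension key (the 'example.' name is hypothetical).
environ = {
    'REQUEST_METHOD': 'GET',
    'PATH_INFO': '/albums',
    'wsgi.version': (1, 0),
    'example.db_connection': object(),  # stand-in for a real handle
}

# Application side: feature-test for the extension rather than
# assuming it is present, so the code stays portable.
db = environ.get('example.db_connection')
has_service = db is not None
```

    An application written this way still runs on servers that do not offer the extension; it simply finds `has_service` false there.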

  2. Why can you call write() and yield strings/return an iterable? Shouldn't we pick just one way?

    If we supported only the iteration approach, then current frameworks that assume the availability of "push" suffer. But, if we only support pushing via write(), then server performance suffers for transmission of e.g. large files (if a worker thread can't begin work on a new request until all of the output has been sent). Thus, this compromise allows an application framework to support both approaches, as appropriate, but with only a little more burden to the server implementor than a push-only approach would require.

  3. What's the close() for?

    When writes are done during the execution of an application object, the application can ensure that resources are released using a try/finally block. But, if the application returns an iterable, any resources used will not be released until the iterable is garbage collected. The close() idiom allows an application to release critical resources at the end of a request, and it's forward-compatible with the support for try/finally in generators that's proposed by PEP 325.
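    The idiom looks roughly like this sketch (the AppIter class is invented for illustration; the try/finally on the server side mirrors the snippet earlier in this section):

```python
# An application iterable that owns a resource and releases it in
# close(); the server guarantees close() runs via try/finally.
class AppIter:

    def __init__(self):
        self.closed = False          # stands in for a file/cursor/etc.

    def __iter__(self):
        yield b"Hello, "
        yield b"world!"

    def close(self):
        self.closed = True           # release the resource here

result = AppIter()
try:
    body = b"".join(result)          # server iterates the response
finally:
    if hasattr(result, 'close'):
        result.close()               # resource freed even on error
```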

  4. Why is this interface so low-level? I want feature X! (e.g. cookies, sessions, persistence, ...)

    This isn't Yet Another Python Web Framework. It's just a way for frameworks to talk to web servers, and vice versa. If you want these features, you need to pick a web framework that provides the features you want. And if that framework lets you create a WSGI application, you should be able to run it in most WSGI-supporting servers. Also, some WSGI servers may offer additional services via objects provided in their environ dictionary; see the applicable server documentation for details. (Of course, applications that use such extensions will not be portable to other WSGI-based servers.)

  5. Why use CGI variables instead of good old HTTP headers? And why mix them in with WSGI-defined variables?

    Many existing web frameworks are built heavily upon the CGI spec, and existing web servers know how to generate CGI variables. In contrast, alternative ways of representing inbound HTTP information are fragmented and lack market share. Thus, using the CGI "standard" seems like a good way to leverage existing implementations. As for mixing them with WSGI variables, separating them would just require two dictionary arguments to be passed around, while providing no real benefits.

  6. What about the status string? Can't we just use the number, passing in 200 instead of "200 OK"?

    Doing this would complicate the server or gateway, by requiring them to have a table of numeric statuses and corresponding messages. By contrast, it is easy for an application or framework author to type the extra text to go with the specific response code they are using, and existing frameworks often already have a table containing the needed messages. So, on balance it seems better to make the application/framework responsible, rather than the server or gateway.
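    To see the machinery a numeric-only status would force into every server, note that reconstructing the reason phrase requires exactly such a lookup table. Modern Python happens to ship one in `http.client.responses` (shown here purely for illustration; it postdates this spec):

```python
from http.client import responses

# The table a numeric-only API would require every server/gateway
# to carry; the spec avoids this by having the application supply
# the full "200 OK"-style string itself.
def status_line(code):
    return "%d %s" % (code, responses.get(code, "Unknown"))
```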

  7. Why is wsgi.run_once not guaranteed to run the app only once?

    Because it's merely a suggestion to the application that it should "rig for infrequent running". This is intended for application frameworks that have multiple modes of operation for caching, sessions, and so forth. In a "multiple run" mode, such frameworks may preload caches, and may not write e.g. logs or session data to disk after each request. In "single run" mode, such frameworks avoid preloading and flush all necessary writes after each request.

    However, in order to test an application or framework to verify correct operation in the latter mode, it may be necessary (or at least expedient) to invoke it more than once. Therefore, an application should not assume that it will definitely not be run again, just because it is called with wsgi.run_once set to True.

  8. Feature X (dictionaries, callables, etc.) is ugly for use in application code; why don't we use objects instead?

    All of these implementation choices of WSGI are specifically intended to decouple features from one another; recombining these features into encapsulated objects makes it somewhat harder to write servers or gateways, and an order of magnitude harder to write middleware that replaces or modifies only small portions of the overall functionality.

    In essence, middleware wants to have a "Chain of Responsibility" pattern, whereby it can act as a "handler" for some functions, while allowing others to remain unchanged. This is difficult to do with ordinary Python objects, if the interface is to remain extensible. For example, one must use __getattr__ or __getattribute__ overrides, to ensure that extensions (such as attributes defined by future WSGI versions) are passed through.

    This type of code is notoriously difficult to get 100% correct, and few people will want to write it themselves. They will therefore copy other people's implementations, but fail to update them when the person they copied from corrects yet another corner case.
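    The pass-through boilerplate in question looks roughly like this sketch (the class names are invented):

```python
# Middleware that overrides one attribute of a wrapped object and
# forwards everything else, including attributes it has never heard
# of (e.g. ones added by a future version of the interface).
class PassThrough:

    def __init__(self, wrapped):
        self._wrapped = wrapped
        self.handled = "overridden"      # the one thing we change

    def __getattr__(self, name):
        # Called only for attributes not found normally; delegate
        # everything unknown to the wrapped object.
        return getattr(self._wrapped, name)

class Original:
    handled = "original"
    future_extension = "still visible"

m = PassThrough(Original())
```

    Even this small example omits the corner cases (special methods, descriptors, attribute errors from the delegate) that make such code hard to get fully right.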

    Further, this necessary boilerplate would be pure excise, a developer tax paid by middleware developers to support a slightly prettier API for application framework developers. But, application framework developers will typically only be updating one framework to support WSGI, and in a very limited part of their framework as a whole. It will likely be their first (and maybe their only) WSGI implementation, and thus they will likely implement with this specification ready to hand. Thus, the effort of making the API "prettier" with object attributes and suchlike would likely be wasted for this audience.

    We encourage those who want a prettier (or otherwise improved) WSGI interface for use in direct web application programming (as opposed to web framework development) to develop APIs or frameworks that wrap WSGI for convenient use by application developers. In this way, WSGI can remain conveniently low-level for server and middleware authors, while not being "ugly" for application developers.

Proposed/Under Discussion

These items are currently being discussed on the Web-SIG and elsewhere, or are on the PEP author's "to-do" list:

  • Should wsgi.input be an iterator instead of a file? This would help for asynchronous applications and chunked-encoding input streams.
  • Optional extensions are being discussed for pausing iteration of an application's output until input is available or until a callback occurs.
  • Add a section about synchronous vs. asynchronous apps and servers, the relevant threading models, and issues/design goals in these areas.

Acknowledgements

Thanks go to the many folks on the Web-SIG mailing list whose thoughtful feedback made this revised draft possible. Especially:

  • Gregory "Grisha" Trubetskoy, author of mod_python, who beat up on the first draft as not offering any advantages over "plain old CGI", thus encouraging me to look for a better approach.
  • Ian Bicking, who helped nag me into properly specifying the multithreading and multiprocess options, as well as badgering me to provide a mechanism for servers to supply custom extension data to an application.
  • Tony Lownds, who came up with the concept of a start_response function that took the status and headers, returning a write function. His input also guided the design of the exception handling facilities, especially in the area of allowing for middleware that overrides application error messages.
  • Alan Kennedy, whose courageous attempts to implement WSGI-on-Jython (well before the spec was finalized) helped to shape the "supporting older versions of Python" section, as well as the optional wsgi.file_wrapper facility.
  • Mark Nottingham, who reviewed the spec extensively for issues with HTTP RFC compliance, especially with regard to HTTP/1.1 features that I didn't even know existed until he pointed them out.

References

[1]The Python Wiki "Web Programming" topic (http://www.python.org/cgi-bin/moinmoin/WebProgramming)
[2]The Common Gateway Interface Specification, v 1.1, 3rd Draft (http://ken.coar.org/cgi/draft-coar-cgi-v11-03.txt)
[3]"Chunked Transfer Coding" -- HTTP/1.1, section 3.6.1 (http://www.w3.org/Protocols/rfc2616/rfc2616-sec3.html#sec3.6.1)
[4]"End-to-end and Hop-by-hop Headers" -- HTTP/1.1, Section 13.5.1 (http://www.w3.org/Protocols/rfc2616/rfc2616-sec13.html#sec13.5.1)
[5]mod_ssl Reference, "Environment Variables" (http://www.modssl.org/docs/2.8/ssl_reference.html#ToC25)

pep-0334 Simple Coroutines via SuspendIteration

PEP:334
Title:Simple Coroutines via SuspendIteration
Version:$Revision$
Last-Modified:$Date$
Author:Clark C. Evans <cce at clarkevans.com>
Status:Withdrawn
Type:Standards Track
Content-Type:text/x-rst
Created:26-Aug-2004
Python-Version:3.0
Post-History:

Abstract

Asynchronous application frameworks such as Twisted [1] and Peak [2] are based on cooperative multitasking via event queues or deferred execution. While this approach to application development does not involve threads and thus avoids a whole class of problems [3], it creates a different sort of programming challenge. When an I/O operation would block, a user request must be suspended so that other requests can proceed. The concept of a coroutine [4] promises to help the application developer grapple with this state management difficulty.

This PEP proposes a limited approach to coroutines based on an extension to the iterator protocol [5]. Currently, an iterator may raise a StopIteration exception to indicate that it is done producing values. This proposal adds another exception to this protocol, SuspendIteration, which indicates that the given iterator may have more values to produce, but is unable to do so at this time.

Rationale

There are two current approaches to bringing co-routines to Python. Christian Tismer's Stackless [6] involves a ground-up restructuring of Python's execution model by hacking the 'C' stack. While this approach works, its operation is hard to describe and keep portable. A related approach is to compile Python code to Parrot [7], a register-based virtual machine, which has coroutines. Unfortunately, neither of these solutions is portable to IronPython (CLR) or Jython (JavaVM).

It is thought that a more limited approach, based on iterators, could provide a coroutine facility to application programmers and still be portable across runtimes.

  • Iterators keep their state in local variables that are not on the "C" stack. Iterators can be viewed as classes, with state stored in member variables that are persistent across calls to its next() method.
  • While an uncaught exception may terminate a function's execution, an uncaught exception need not invalidate an iterator. The proposed exception, SuspendIteration, uses this feature. In other words, the fact that one call to next() results in an exception does not imply that the iterator itself is no longer capable of producing values.

There are four places where this new exception would have an impact:

  • The simple generator [8] mechanism could be extended to safely 'catch' this SuspendIteration exception, stuff away its current state, and pass the exception on to the caller.
  • Various iterator filters [9] in the standard library, such as itertools.izip, should be made aware of this exception so that they can transparently propagate SuspendIteration.
  • Iterators generated from I/O operations, such as a file or socket reader, could be modified to have a non-blocking variety. This option would raise a subclass of SuspendIteration if the requested operation would block.
  • The asyncore library could be updated to provide a basic 'runner' that pulls from an iterator; if the SuspendIteration exception is caught, then it moves on to the next iterator in its runlist [10]. External frameworks like Twisted would provide alternative implementations, perhaps based on FreeBSD's kqueue or Linux's epoll.

While these may seem dramatic changes, it is a very small amount of work compared with the utility provided by continuations.

Semantics

This section explains, at a high level, how the proposed SuspendIteration exception would behave.

Simple Iterators

The current functionality of iterators is best seen with a simple example which produces two values 'one' and 'two'.

class States:

    def __iter__(self):
        self._next = self.state_one
        return self

    def next(self):
        return self._next()

    def state_one(self):
        self._next = self.state_two
        return "one"

    def state_two(self):
        self._next = self.state_stop
        return "two"

    def state_stop(self):
        raise StopIteration

print list(States())

An equivalent iteration could, of course, be created by the following generator:

def States():
    yield 'one'
    yield 'two'

print list(States())

Introducing SuspendIteration

Suppose that between producing 'one' and 'two', the generator above could block on a socket read. In this case, we would want to raise SuspendIteration to signal that the iterator is not done producing, but is unable to provide a value at the current moment.

from random import randint
from time import sleep

class SuspendIteration(Exception):
    pass

class NonBlockingResource:

    """Randomly unable to produce the second value"""

    def __iter__(self):
        self._next = self.state_one
        return self

    def next(self):
        return self._next()

    def state_one(self):
        self._next = self.state_suspend
        return "one"

    def state_suspend(self):
        rand = randint(1,10)
        if 2 == rand:
            self._next = self.state_two
            return self.state_two()
        raise SuspendIteration()

    def state_two(self):
        self._next = self.state_stop
        return "two"

    def state_stop(self):
        raise StopIteration

def sleeplist(iterator, timeout = .1):
    """
    Do other things (e.g. sleep) while resource is
    unable to provide the next value
    """
    it = iter(iterator)
    retval = []
    while True:
        try:
            retval.append(it.next())
        except SuspendIteration:
            sleep(timeout)
            continue
        except StopIteration:
            break
    return retval

print sleeplist(NonBlockingResource())

In a real-world situation, the NonBlockingResource would be a file iterator, socket handle, or other I/O based producer. The sleeplist would instead be an async reactor, such as those found in asyncore or Twisted. The non-blocking resource could, of course, be written as a generator:

def NonBlockingResource():
    yield "one"
    while True:
        rand = randint(1,10)
        if 2 == rand:
            break
        raise SuspendIteration()
    yield "two"

It is not necessary to add a keyword such as 'suspend', since most real content generators will not live in application code but in low-level I/O based operations. Because most programmers need never be exposed to the SuspendIteration() mechanism directly, a keyword is not needed.

Application Iterators

The previous example is rather contrived; a more 'real-world' example would be a web page generator which yields HTML content and pulls from a database. Note that this is an example of neither the 'producer' nor the 'consumer', but rather of a filter.

def ListAlbums(cursor):
    cursor.execute("SELECT title, artist FROM album")
    yield '<html><body><table><tr><td>Title</td><td>Artist</td></tr>'
    for (title, artist) in cursor:
        yield '<tr><td>%s</td><td>%s</td></tr>' % (title, artist)
    yield '</table></body></html>'

The problem, of course, is that the database may block for some time before any rows are returned, and that during execution, rows may be returned in blocks of 10 or 100 at a time. Ideally, if the database blocks for the next set of rows, another user connection could be serviced. Note the complete absence of SuspendIteration in the above code. If done correctly, application developers would be able to focus on functionality rather than concurrency issues.

The iterator created by the above generator should do the magic necessary to maintain state, yet pass the exception through to a lower-level async framework. Here is an example of what the corresponding iterator would look like if coded up as a class:

class ListAlbums:

    def __init__(self, cursor):
        self.cursor = cursor

    def __iter__(self):
        self.cursor.execute("SELECT title, artist FROM album")
        self._iter = iter(self.cursor)
        self._next = self.state_head
        return self

    def next(self):
        return self._next()

    def state_head(self):
        self._next = self.state_cursor
        return "<html><body><table><tr><td>\
                Title</td><td>Artist</td></tr>"

    def state_tail(self):
        self._next = self.state_stop
        return "</table></body></html>"

    def state_cursor(self):
        try:
            (title,artist) = self._iter.next()
            return '<tr><td>%s</td><td>%s</td></tr>' % (title, artist)
        except StopIteration:
            self._next = self.state_tail
            return self.next()
        except SuspendIteration:
            # just pass-through
            raise

    def state_stop(self):
        raise StopIteration

Complicating Factors

While the above example is straightforward, things are a bit more complicated if the intermediate generator 'condenses' values, that is, pulls in two or more values for each value it produces. For example,

def pair(iterLeft, iterRight):
    rhs = iter(iterRight)
    lhs = iter(iterLeft)
    while True:
        yield (rhs.next(), lhs.next())

In this case, the corresponding iterator behavior has to be a bit more subtle to handle the case of either the right or left iterator raising SuspendIteration. It seems to be a matter of decomposing the generator to recognize intermediate states where a SuspendIteration exception from the producing context could occur.

class pair:

    def __init__(self, iterLeft, iterRight):
        self.iterLeft = iterLeft
        self.iterRight = iterRight

    def __iter__(self):
        self.rhs = iter(self.iterRight)
        self.lhs = iter(self.iterLeft)
        self._temp_rhs = None
        self._temp_lhs = None
        self._next = self.state_rhs
        return self

    def next(self):
        return self._next()

    def state_rhs(self):
        self._temp_rhs = self.rhs.next()
        self._next = self.state_lhs
        return self.next()

    def state_lhs(self):
        self._temp_lhs = self.lhs.next()
        self._next = self.state_pair
        return self.next()

    def state_pair(self):
        self._next = self.state_rhs
        return (self._temp_rhs, self._temp_lhs)

This proposal assumes that a corresponding iterator written using this class-based method is possible for existing generators. The challenge seems to be the identification of distinct states within the generator where suspension could occur.

Resource Cleanup

The current generator mechanism has a strange interaction with exceptions, in that a 'yield' statement is not allowed within a try/finally block. The SuspendIteration exception raises a similar issue, and its impact is not yet clear. However, rewriting the generator into a state machine, as the previous section did, may resolve it: the situation would then be no worse than today, and the rewrite might even remove the yield/finally restriction. More investigation is needed in this area.

API and Limitations

This proposal only covers 'suspending' a chain of iterators, and does not cover (of course) suspending general functions, methods, or "C" extension functions. While there would be no direct support for creating generators in "C" code, native "C" iterators which comply with the SuspendIteration semantics are certainly possible.

Low-Level Implementation

The author of the PEP is not yet familiar enough with the Python execution model to comment in this area.

pep-0335 Overloadable Boolean Operators

PEP:335
Title:Overloadable Boolean Operators
Version:$Revision$
Last-Modified:$Date$
Author:Gregory Ewing <greg.ewing at canterbury.ac.nz>
Status:Rejected
Type:Standards Track
Content-Type:text/x-rst
Created:29-Aug-2004
Python-Version:3.3
Post-History:05-Sep-2004, 30-Sep-2011, 25-Oct-2011

Abstract

This PEP proposes an extension to permit objects to define their own meanings for the boolean operators 'and', 'or' and 'not', and suggests an efficient strategy for implementation. A prototype of this implementation is available for download.

Background

Python does not currently provide any '__xxx__' special methods corresponding to the 'and', 'or' and 'not' boolean operators. In the case of 'and' and 'or', the most likely reason is that these operators have short-circuiting semantics, i.e. the second operand is not evaluated if the result can be determined from the first operand. The usual technique of providing special methods for these operators therefore would not work.

There is no such difficulty in the case of 'not', however, and it would be straightforward to provide a special method for this operator. The rest of this proposal will therefore concentrate mainly on providing a way to overload 'and' and 'or'.

Motivation

There are many applications in which it is natural to provide custom meanings for Python operators, and in some of these, having boolean operators excluded from those able to be customised can be inconvenient. Examples include:

  1. NumPy, in which almost all the operators are defined on arrays so as to perform the appropriate operation between corresponding elements, and return an array of the results. For consistency, one would expect a boolean operation between two arrays to return an array of booleans, but this is not currently possible.

    There is a precedent for an extension of this kind: comparison operators were originally restricted to returning boolean results, and rich comparisons were added so that comparisons of NumPy arrays could return arrays of booleans.

  2. A symbolic algebra system, in which a Python expression is evaluated in an environment which results in it constructing a tree of objects corresponding to the structure of the expression.

  3. A relational database interface, in which a Python expression is used to construct an SQL query.

A workaround often suggested is to use the bitwise operators '&', '|' and '~' in place of 'and', 'or' and 'not', but this has some drawbacks:

  • The precedence of these is different in relation to the other operators, and they may already be in use for other purposes (as in example 1).
  • It is aesthetically displeasing to force users to use something other than the most obvious syntax for what they are trying to express. This would be particularly acute in the case of example 3, considering that boolean operations are a staple of SQL queries.
  • Bitwise operators do not provide a solution to the problem of chained comparisons such as 'a < b < c' which involve an implicit 'and' operation. Such expressions currently cannot be used at all on data types such as NumPy arrays where the result of a comparison cannot be treated as having normal boolean semantics; they must be expanded into something like (a < b) & (b < c), losing a considerable amount of clarity.

Rationale

The requirements for a successful solution to the problem of allowing boolean operators to be customised are:

  1. In the default case (where there is no customisation), the existing short-circuiting semantics must be preserved.
  2. There must not be any appreciable loss of speed in the default case.
  3. Ideally, the customisation mechanism should allow the object to provide either short-circuiting or non-short-circuiting semantics, at its discretion.

One obvious strategy, that has been previously suggested, is to pass into the special method the first argument and a function for evaluating the second argument. This would satisfy requirements 1 and 3, but not requirement 2, since it would incur the overhead of constructing a function object and possibly a Python function call on every boolean operation. Therefore, it will not be considered further here.

The following section proposes a strategy that addresses all three requirements. A prototype implementation [1] of this strategy is available for download.

Specification

Special Methods

At the Python level, objects may define the following special methods.

Unary:

  • __not__(self)

Binary, phase 1:

  • __and1__(self)
  • __or1__(self)

Binary, phase 2:

  • __and2__(self, other)
  • __or2__(self, other)
  • __rand2__(self, other)
  • __ror2__(self, other)

The __not__ method, if defined, implements the 'not' operator. If it is not defined, or it returns NotImplemented, existing semantics are used.

To permit short-circuiting, processing of the 'and' and 'or' operators is split into two phases. Phase 1 occurs after evaluation of the first operand but before the second. If the first operand defines the relevant phase 1 method, it is called with the first operand as argument. If that method can determine the result without needing the second operand, it returns the result, and further processing is skipped.

If the phase 1 method determines that the second operand is needed, it returns the special value NeedOtherOperand. This triggers the evaluation of the second operand, and the calling of a relevant phase 2 method. During phase 2, the __and2__/__rand2__ and __or2__/__ror2__ method pairs work as for other binary operators.

Processing falls back to existing semantics if at any stage a relevant special method is not found or returns NotImplemented.

As a special case, if the first operand defines a phase 2 method but no corresponding phase 1 method, the second operand is always evaluated and the phase 2 method called. This allows an object which does not want short-circuiting semantics to simply implement the phase 2 methods and ignore phase 1.
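The two-phase evaluation of 'x and y' can be modelled in plain Python as follows. This is an illustrative, non-normative sketch: NeedOtherOperand and the special-method names come from the spec, but the driver function and the Both class are invented for this example (and the real mechanism would live in the interpreter's bytecode, not in a Python function):

```python
# Sentinel returned by a phase 1 method to request the second operand.
NeedOtherOperand = object()

def logical_and(first, second_thunk):
    """Evaluate 'first and <second>', where second_thunk() lazily
    evaluates the second operand (preserving short-circuiting)."""
    cls = type(first)
    phase1 = getattr(cls, '__and1__', None)
    if phase1 is not None:
        outcome = phase1(first)
        if outcome is not NeedOtherOperand:
            return outcome               # short-circuit: decided early
    elif not hasattr(cls, '__and2__'):
        # No customisation at all: existing 'and' semantics.
        return first if not first else second_thunk()
    # Phase 2: evaluate the second operand, then combine.
    return cls.__and2__(first, second_thunk())

class Both:
    """Non-short-circuiting operand: always combines both sides."""
    def __init__(self, value):
        self.value = value
    def __and1__(self):
        return NeedOtherOperand          # always request the other operand
    def __and2__(self, other):
        return Both((self.value, other.value))
```

With uncustomised objects the function degenerates to today's behaviour (a falsy first operand is returned without the second operand ever being evaluated), while `logical_and(Both(1), lambda: Both(2))` combines both operands.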

Bytecodes

The patch adds four new bytecodes, LOGICAL_AND_1, LOGICAL_AND_2, LOGICAL_OR_1 and LOGICAL_OR_2. As an example of their use, the bytecode generated for an 'and' expression looks like this:

     .
     .
     .
     evaluate first operand
     LOGICAL_AND_1  L
     evaluate second operand
     LOGICAL_AND_2
L:   .
     .
     .

The LOGICAL_AND_1 bytecode performs phase 1 processing. If it determines that the second operand is needed, it leaves the first operand on the stack and continues with the following code. Otherwise it pops the first operand, pushes the result and branches to L.

The LOGICAL_AND_2 bytecode performs phase 2 processing, popping both operands and pushing the result.

Type Slots

At the C level, the new special methods are manifested as five new slots in the type object. In the patch, they are added to the tp_as_number substructure, since this allows making use of some existing code for dealing with unary and binary operators. Their existence is signalled by a new type flag, Py_TPFLAGS_HAVE_BOOLEAN_OVERLOAD.

The new type slots are:

unaryfunc nb_logical_not;
unaryfunc nb_logical_and_1;
unaryfunc nb_logical_or_1;
binaryfunc nb_logical_and_2;
binaryfunc nb_logical_or_2;

Python/C API Functions

There are also five new Python/C API functions corresponding to the new operations:

PyObject *PyObject_LogicalNot(PyObject *);
PyObject *PyObject_LogicalAnd1(PyObject *);
PyObject *PyObject_LogicalOr1(PyObject *);
PyObject *PyObject_LogicalAnd2(PyObject *, PyObject *);
PyObject *PyObject_LogicalOr2(PyObject *, PyObject *);

Alternatives and Optimisations

This section discusses some possible variations on the proposal, and ways in which the bytecode sequences generated for boolean expressions could be optimised.

Reduced special method set

For completeness, the full version of this proposal includes a mechanism for types to define their own customised short-circuiting behaviour. However, the full mechanism is not needed to address the main use cases put forward here, and it would be possible to define a simplified version that only includes the phase 2 methods. There would then only be 5 new special methods (__and2__, __rand2__, __or2__, __ror2__, __not__) with 3 associated type slots and 3 API functions.

This simplified version could be expanded to the full version later if desired.

Additional bytecodes

As defined here, the bytecode sequence for code that branches on the result of a boolean expression would be slightly longer than it currently is. For example, in Python 2.7,

if a and b:
    statement1
else:
    statement2

generates

    LOAD_GLOBAL         a
    POP_JUMP_IF_FALSE   false_branch
    LOAD_GLOBAL         b
    POP_JUMP_IF_FALSE   false_branch
    <code for statement1>
    JUMP_FORWARD        end_branch
false_branch:
    <code for statement2>
end_branch:

Under this proposal as described so far, it would become something like

    LOAD_GLOBAL         a
    LOGICAL_AND_1       test
    LOAD_GLOBAL         b
    LOGICAL_AND_2
test:
    POP_JUMP_IF_FALSE   false_branch
    <code for statement1>
    JUMP_FORWARD        end_branch
false_branch:
    <code for statement2>
end_branch:

This involves executing one extra bytecode in the short-circuiting case and two extra bytecodes in the non-short-circuiting case.

However, by introducing extra bytecodes that combine the logical operations with testing and branching on the result, it can be reduced to the same number of bytecodes as the original:

    LOAD_GLOBAL         a
    AND1_JUMP           true_branch, false_branch
    LOAD_GLOBAL         b
    AND2_JUMP_IF_FALSE  false_branch
true_branch:
    <code for statement1>
    JUMP_FORWARD        end_branch
false_branch:
    <code for statement2>
end_branch:

Here, AND1_JUMP performs phase 1 processing as above, and then examines the result. If there is a result, it is popped from the stack, its truth value is tested and a branch taken to one of two locations.

Otherwise, the first operand is left on the stack and execution continues to the next bytecode. The AND2_JUMP_IF_FALSE bytecode performs phase 2 processing, pops the result and branches if it tests false.

For the 'or' operator, there would be corresponding OR1_JUMP and OR2_JUMP_IF_TRUE bytecodes.

If the simplified version without phase 1 methods is used, then early exiting can only occur if the first operand is false for 'and' and true for 'or'. Consequently, the two-target AND1_JUMP and OR1_JUMP bytecodes can be replaced with AND1_JUMP_IF_FALSE and OR1_JUMP_IF_TRUE, these being ordinary branch instructions with only one target.

Optimisation of 'not'

Recent versions of Python implement a simple optimisation in which branching on a negated boolean expression is implemented by reversing the sense of the branch, saving a UNARY_NOT opcode.

Taking a strict view, this optimisation should no longer be performed, because the 'not' operator may be overridden to produce quite different results from usual. However, in typical use cases, it is not envisaged that expressions involving customised boolean operations will be used for branching -- it is much more likely that the result will be used in some other way.

Therefore, it would probably do little harm to specify that the compiler is allowed to use the laws of boolean algebra to simplify any expression that appears directly in a boolean context. If this is inconvenient, the result can always be assigned to a temporary name first.

This would allow the existing 'not' optimisation to remain, and would permit future extensions of it such as using De Morgan's laws to extend it deeper into the expression.

Usage Examples

Example 1: NumPy Arrays

#-----------------------------------------------------------------
#
#   This example creates a subclass of numpy array to which
#   'and', 'or' and 'not' can be applied, producing an array
#   of booleans.
#
#-----------------------------------------------------------------

from numpy import array, ndarray

class BArray(ndarray):

    def __str__(self):
        return "barray(%s)" % ndarray.__str__(self)

    def __and2__(self, other):
        return (self & other)

    def __or2__(self, other):
        return (self | other)

    def __not__(self):
        return (self == 0)

def barray(*args, **kwds):
    return array(*args, **kwds).view(type = BArray)

a0 = barray([0, 1, 2, 4])
a1 = barray([1, 2, 3, 4])
a2 = barray([5, 6, 3, 4])
a3 = barray([5, 1, 2, 4])

print "a0:", a0
print "a1:", a1
print "a2:", a2
print "a3:", a3
print "not a0:", not a0
print "a0 == a1 and a2 == a3:", a0 == a1 and a2 == a3
print "a0 == a1 or a2 == a3:", a0 == a1 or a2 == a3

Example 1 Output

a0: barray([0 1 2 4])
a1: barray([1 2 3 4])
a2: barray([5 6 3 4])
a3: barray([5 1 2 4])
not a0: barray([ True False False False])
a0 == a1 and a2 == a3: barray([False False False  True])
a0 == a1 or a2 == a3: barray([ True False False  True])

Example 2: Database Queries

#-----------------------------------------------------------------
#
#   This example demonstrates the creation of a DSL for database
#   queries allowing 'and' and 'or' operators to be used to
#   formulate the query.
#
#-----------------------------------------------------------------

class SQLNode(object):

    def __and2__(self, other):
        return SQLBinop("and", self, other)

    def __rand2__(self, other):
        return SQLBinop("and", other, self)

    def __eq__(self, other):
        return SQLBinop("=", self, other)


class Table(SQLNode):

    def __init__(self, name):
        self.__tablename__ = name

    def __getattr__(self, name):
        return SQLAttr(self, name)

    def __sql__(self):
        return self.__tablename__


class SQLBinop(SQLNode):

    def __init__(self, op, opnd1, opnd2):
        self.op = op.upper()
        self.opnd1 = opnd1
        self.opnd2 = opnd2

    def __sql__(self):
        return "(%s %s %s)" % (sql(self.opnd1), self.op, sql(self.opnd2))


class SQLAttr(SQLNode):

    def __init__(self, table, name):
        self.table = table
        self.name = name

    def __sql__(self):
        return "%s.%s" % (sql(self.table), self.name)


class SQLSelect(SQLNode):

    def __init__(self, targets):
        self.targets = targets
        self.where_clause = None

    def where(self, expr):
        self.where_clause = expr
        return self

    def __sql__(self):
        result = "SELECT %s" % ", ".join([sql(target) for target in self.targets])
        if self.where_clause:
            result = "%s WHERE %s" % (result, sql(self.where_clause))
        return result


def sql(expr):
    if isinstance(expr, SQLNode):
        return expr.__sql__()
    elif isinstance(expr, str):
        return "'%s'" % expr.replace("'", "''")
    else:
        return str(expr)


def select(*targets):
    return SQLSelect(targets)

#-----------------------------------------------------------------

dishes = Table("dishes")
customers = Table("customers")
orders = Table("orders")

query = select(customers.name, dishes.price, orders.amount).where(
    customers.cust_id == orders.cust_id and orders.dish_id == dishes.dish_id
    and dishes.name == "Spam, Eggs, Sausages and Spam")

print repr(query)
print sql(query)

Example 2 Output

<__main__.SQLSelect object at 0x1cc830>
SELECT customers.name, dishes.price, orders.amount WHERE
(((customers.cust_id = orders.cust_id) AND (orders.dish_id =
dishes.dish_id)) AND (dishes.name = 'Spam, Eggs, Sausages and Spam'))

pep-0336 Make None Callable

PEP: 336
Title: Make None Callable
Version: $Revision$
Last-Modified: $Date$
Author: Andrew McClelland <eternalsquire at comcast.net>
Status: Rejected
Type: Standards Track
Content-Type: text/plain
Created: 28-Oct-2004
Post-History: 

Abstract

    None should be a callable object that when called with any
    arguments has no side effect and returns None.

BDFL Pronouncement

    This PEP is rejected.  It is considered a feature that None raises
    an error when called.  The proposal falls short in tests for
    obviousness, clarity, explicitness, and necessity.  The provided Switch
    example is nice but easily handled by a simple lambda definition.
    See python-dev discussion on 17 June 2005.


Motivation

    To allow a programming style for selectable actions that is more
    in accordance with the minimalistic functional programming goals
    of the Python language.


Rationale

    Allow the use of None in method tables as a universal no effect
    rather than either (1) checking a method table entry against None
    before calling, or (2) writing a local no effect method with
    arguments similar to other functions in the table.

    The semantics would be effectively,

        class None:

            def __call__(self, *args):
                pass


How To Use

    Before, checking function table entry against None:

        class Select:

            def a(self, input):
                print 'a'

            def b(self, input):
                print 'b'

            def c(self, input):
                print 'c'

            def __call__(self, input):
                function = { 1 : self.a,
                             2 : self.b,
                             3 : self.c
                           }.get(input, None)
                if function:  return function(input)

    Before, using a local no effect method:

        class Select:

            def a(self, input):
                print 'a'

            def b(self, input):
                print 'b'

            def c(self, input):
                print 'c'

            def nop(self, input):
                pass

            def __call__(self, input):
                return { 1 : self.a,
                         2 : self.b,
                         3 : self.c
                       }.get(input, self.nop)(input)

    After:

        class Select:

            def a(self, input):
                print 'a'

            def b(self, input):
                print 'b'

            def c(self, input):
                print 'c'

            def __call__(self, input):
                return { 1 : self.a,
                         2 : self.b,
                         3 : self.c
                       }.get(input, None)(input)
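As the pronouncement above notes, current Python already handles this case with a simple lambda as the dictionary default. A sketch adapting the same Select example (with the methods returning values rather than printing, for brevity):

```python
class Select:

    def a(self, input):
        return 'a'

    def b(self, input):
        return 'b'

    def c(self, input):
        return 'c'

    def __call__(self, input):
        # A no-op lambda as the default plays the role the PEP
        # wanted a callable None to play.
        return {1: self.a,
                2: self.b,
                3: self.c}.get(input, lambda input: None)(input)

s = Select()
print(s(2))    # b
print(s(99))   # None
```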


References

    [1] Python Reference Manual, Section 3.2,
        http://docs.python.org/reference/


Copyright

    This document has been placed in the public domain.



pep-0337 Logging Usage in the Standard Library

PEP: 337
Title: Logging Usage in the Standard Library
Version: $Revision$
Last-Modified: $Date$
Author: Michael P. Dubner <dubnerm at mindless.com>
Status: Deferred
Type: Standards Track
Content-Type: text/plain
Created: 02-Oct-2004
Python-Version: 2.5
Post-History: 10-Nov-2004

Abstract

    This PEP defines a standard for using the logging system (PEP 282
    [1]) in the standard library.

    Implementing this PEP will simplify development of daemon
    applications.  As a downside this PEP requires slight
    modifications (however in a back-portable way) to a large number
    of standard modules.

    After implementing this PEP one can use the following filtering
    scheme:

        logging.getLogger('py.BaseHTTPServer').setLevel(logging.FATAL)

PEP Deferral

    Further exploration of the concepts covered in this PEP has been deferred
    for lack of a current champion interested in promoting the goals of the
    PEP and collecting and incorporating feedback, and with sufficient
    available time to do so effectively.


Rationale

    There are a couple of situations when output to stdout or stderr
    is impractical:

    - Daemon applications where the framework doesn't allow the
      redirection of standard output to some file, but assumes use of
      some other form of logging.  Examples are syslog under *nix'es
      and EventLog under WinNT+.

    - GUI applications which want to output every new log entry in a
      separate pop-up window (e.g. a fading OSD).

    Also sometimes applications want to filter output entries based on
    their source or severity.  This requirement can't be implemented
    using simple redirection.

    Finally sometimes output needs to be marked with event timestamps,
    which can be accomplished with ease using the logging system.


Proposal

    Every module usable for daemon and GUI applications should be
    rewritten to use the logging system instead of 'print' or
    'sys.stdout.write'.
    
    There should be code like this included in the beginning of every
    modified module:

        import logging

        _log = logging.getLogger('py.<module-name>')

    A prefix of 'py.' [2] must be used by all modules included in the
    standard library distributed along with Python, and only by such
    modules (unverifiable).  The use of "_log" is intentional as we
    don't want to auto-export it.  For modules that use log only in
    one class a logger can be created inside the class definition as
    follows:

        class XXX:

            __log = logging.getLogger('py.<module-name>')

    Then this class can create access methods to log to this private
    logger.

    So "print" and "sys.std{out|err}.write" statements should be
    replaced with "_log.{debug|info}", and "traceback.print_exception"
    with "_log.exception" or sometimes "_log.debug('...', exc_info=1)".


Module List

    Here is a (possibly incomplete) list of modules to be reworked:

    - asyncore (dispatcher.log, dispatcher.log_info)

    - BaseHTTPServer (BaseHTTPRequestHandler.log_request,
      BaseHTTPRequestHandler.log_error,
      BaseHTTPRequestHandler.log_message)

    - cgi (possibly - is cgi.log used by somebody?)

    - ftplib (if FTP.debugging)

    - gopherlib (get_directory)

    - httplib (HTTPResponse, HTTPConnection)

    - ihooks (_Verbose)

    - imaplib (IMAP4._mesg)

    - mhlib (MH.error)

    - nntplib (NNTP)

    - pipes (Template.makepipeline)

    - pkgutil (extend_path)

    - platform (_syscmd_ver)

    - poplib (if POP3._debugging)

    - profile (if Profile.verbose)

    - robotparser (_debug)

    - sgmllib (if SGMLParser.verbose)

    - shlex (if shlex.debug)

    - smtpd (SMTPChannel/PureProxy where print >> DEBUGSTREAM)

    - smtplib (if SMTP.debuglevel)

    - SocketServer (BaseServer.handle_error)

    - telnetlib (if Telnet.debuglevel)

    - threading? (_Verbose._note, Thread.__bootstrap)

    - timeit (Timer.print_exc)

    - trace

    - uu (decode)

    Additionally there are a couple of modules with commented debug
    output or modules where debug output should be added.  For
    example:

    - urllib

    Finally possibly some modules should be extended to provide more
    debug information.


Doubtful Modules

    Listed here are modules that the community may propose for
    addition to the module list, and modules that the community says
    should be removed from the module list.

    - tabnanny (check)


Guidelines for Logging Usage

    Also we can provide some recommendations to authors of library
    modules so they all follow the same format of naming loggers.  I
    propose that non-standard library modules should use loggers named
    after their full names, so a module "spam" in sub-package "junk"
    of package "dummy" will be named "dummy.junk.spam" and, of course,
    the "__init__" module of the same sub-package will have the logger
    name "dummy.junk".
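Because the logging module already treats dotted names as a hierarchy, this naming scheme gives configuration for free: a level set on the sub-package's logger is inherited by the module loggers below it. A small sketch using the hypothetical package names from the text:

```python
import logging

pkg_log = logging.getLogger('dummy.junk')        # the sub-package's logger
mod_log = logging.getLogger('dummy.junk.spam')   # the module's logger

pkg_log.setLevel(logging.ERROR)
# mod_log has no level of its own, so it inherits ERROR from its
# parent through the dotted-name hierarchy.
print(mod_log.getEffectiveLevel() == logging.ERROR)  # True
```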


References

    [1] PEP 282, A Logging System, Vinay Sajip, Trent Mick
        http://www.python.org/dev/peps/pep-0282/

    [2] http://mail.python.org/pipermail/python-dev/2004-October/049282.html


Copyright

    This document has been placed in the public domain.



pep-0338 Executing modules as scripts

PEP:338
Title:Executing modules as scripts
Version:$Revision$
Last-Modified:$Date$
Author:Nick Coghlan <ncoghlan at gmail.com>
Status:Final
Type:Standards Track
Content-Type:text/x-rst
Created:16-Oct-2004
Python-Version:2.5
Post-History:8-Nov-2004, 11-Feb-2006, 12-Feb-2006, 18-Feb-2006

Abstract

This PEP defines semantics for executing any Python module as a script, either with the -m command line switch, or by invoking it via runpy.run_module(modulename).

The -m switch implemented in Python 2.4 is quite limited. This PEP proposes making use of the PEP 302 [4] import hooks to allow any module which provides access to its code object to be executed.

Rationale

Python 2.4 adds the command line switch -m to allow modules to be located using the Python module namespace for execution as scripts. The motivating examples were standard library modules such as pdb and profile, and the Python 2.4 implementation is fine for this limited purpose.

A number of users and developers have requested extension of the feature to also support running modules located inside packages. One example provided is pychecker's pychecker.checker module. This capability was left out of the Python 2.4 implementation because the implementation of this was significantly more complicated, and the most appropriate strategy was not at all clear.

The opinion on python-dev was that it was better to postpone the extension to Python 2.5, and go through the PEP process to help make sure we got it right.

Since that time, it has also been pointed out that the current version of -m does not support zipimport or any other kind of alternative import behaviour (such as frozen modules).

Providing this functionality as a Python module is significantly easier than writing it in C, and makes the functionality readily available to all Python programs, rather than being specific to the CPython interpreter. CPython's command line switch can then be rewritten to make use of the new module.

Scripts which execute other scripts (e.g. profile, pdb) also have the option to use the new module to provide -m style support for identifying the script to be executed.

Scope of this proposal

In Python 2.4, a module located using -m is executed just as if its filename had been provided on the command line. The goal of this PEP is to get as close as possible to making that statement also hold true for modules inside packages, or accessed via alternative import mechanisms (such as zipimport).

Prior discussions suggest it should be noted that this PEP is not about changing the idiom for making Python modules also useful as scripts (see PEP 299 [1]). That issue is considered orthogonal to the specific feature addressed by this PEP.

Current Behaviour

Before describing the new semantics, it's worth covering the existing semantics for Python 2.4 (as they are currently defined only by the source code and the command line help).

When -m is used on the command line, it immediately terminates the option list (like -c). The argument is interpreted as the name of a top-level Python module (i.e. one which can be found on sys.path).

If the module is found, and is of type PY_SOURCE or PY_COMPILED, then the command line is effectively reinterpreted from python <options> -m <module> <args> to python <options> <filename> <args>. This includes setting sys.argv[0] correctly (some scripts rely on this - Python's own regrtest.py is one example).

If the module is not found, or is not of the correct type, an error is printed.

Proposed Semantics

The semantics proposed are fairly simple: if -m is used to execute a module the PEP 302 import mechanisms are used to locate the module and retrieve its compiled code, before executing the module in accordance with the semantics for a top-level module. The interpreter does this by invoking a new standard library function runpy.run_module.

This is necessary due to the way Python's import machinery locates modules inside packages. A package may modify its own __path__ variable during initialisation. In addition, paths may be affected by *.pth files, and some packages will install custom loaders on sys.meta_path. Accordingly, the only way for Python to reliably locate the module is by importing the containing package and using the PEP 302 import hooks to gain access to the Python code.

Note that the process of locating the module to be executed may require importing the containing package. The effects of such a package import that will be visible to the executed module are:

  • the containing package will be in sys.modules
  • any external effects of the package initialisation (e.g. installed import hooks, loggers, atexit handlers, etc.)

Reference Implementation

A reference implementation is available on SourceForge ([2]), along with documentation for the library reference ([5]). There are two parts to this implementation. The first is a proposed standard library module runpy. The second is a modification to the code implementing the -m switch to always delegate to runpy.run_module instead of trying to run the module directly. The delegation has the form:

runpy.run_module(sys.argv[0], run_name="__main__", alter_sys=True)

run_module is the only function runpy exposes in its public API.

run_module(mod_name[, init_globals][, run_name][, alter_sys])

Execute the code of the specified module and return the resulting module globals dictionary. The module's code is first located using the standard import mechanism (refer to PEP 302 for details) and then executed in a fresh module namespace.

The optional dictionary argument init_globals may be used to pre-populate the globals dictionary before the code is executed. The supplied dictionary will not be modified. If any of the special global variables below are defined in the supplied dictionary, those definitions are overridden by the run_module function.

The special global variables __name__, __file__, __loader__ and __builtins__ are set in the globals dictionary before the module code is executed.

__name__ is set to run_name if this optional argument is supplied, and the original mod_name argument otherwise.

__loader__ is set to the PEP 302 module loader used to retrieve the code for the module (This loader may be a wrapper around the standard import mechanism).

__file__ is set to the name provided by the module loader. If the loader does not make filename information available, this argument is set to None.

__builtins__ is automatically initialised with a reference to the top level namespace of the __builtin__ module.

If the argument alter_sys is supplied and evaluates to True, then sys.argv[0] is updated with the value of __file__ and sys.modules[__name__] is updated with a temporary module object for the module being executed. Both sys.argv[0] and sys.modules[__name__] are restored to their original values before this function returns.

When invoked as a script, the runpy module finds and executes the module supplied as the first argument. It adjusts sys.argv by deleting sys.argv[0] (which refers to the runpy module itself) and then invokes run_module(sys.argv[0], run_name="__main__", alter_sys=True).
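runpy has shipped in the standard library since Python 2.5; a minimal illustration of run_module returning the executed module's globals (using the stdlib calendar module, with run_name left at its default so the module's "if __name__ == '__main__':" guard does not fire):

```python
import runpy

# Execute the stdlib 'calendar' module in a fresh namespace. run_name
# defaults to the module name, so __main__-guarded code is not run.
globs = runpy.run_module('calendar')
print(globs['__name__'])        # calendar
print('TextCalendar' in globs)  # True
```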

Import Statements and the Main Module

The release of 2.5b1 showed a surprising (although obvious in retrospect) interaction between this PEP and PEP 328 - explicit relative imports don't work from a main module. This is due to the fact that relative imports rely on __name__ to determine the current module's position in the package hierarchy. In a main module, the value of __name__ is always '__main__', so explicit relative imports will always fail (as they only work for a module inside a package).

Investigation into why implicit relative imports appear to work when a main module is executed directly but fail when executed using -m showed that such imports are actually always treated as absolute imports. Because of the way direct execution works, the package containing the executed module is added to sys.path, so its sibling modules are actually imported as top level modules. This can easily lead to multiple copies of the sibling modules in the application if implicit relative imports are used in modules that may be directly executed (e.g. test modules or utility scripts).

For the 2.5 release, the recommendation is to always use absolute imports in any module that is intended to be used as a main module. The -m switch provides a benefit here, as it inserts the current directory into sys.path, instead of the directory containing the main module. This means that it is possible to run a module from inside a package using -m so long as the current directory contains the top level directory for the package. Absolute imports will work correctly even if the package isn't installed anywhere else on sys.path. If the module is executed directly and uses absolute imports to retrieve its sibling modules, then the top level package directory needs to be installed somewhere on sys.path (since the current directory won't be added automatically).

Here's an example file layout:

devel/
    pkg/
        __init__.py
        moduleA.py
        moduleB.py
        test/
            __init__.py
            test_A.py
            test_B.py

So long as the current directory is devel, or devel is already on sys.path, and the test modules use absolute imports (such as import pkg.moduleA to retrieve the module under test), PEP 338 allows the tests to be run as:

python -m pkg.test.test_A
python -m pkg.test.test_B

The question of whether or not relative imports should be supported when a main module is executed with -m is something that will be revisited for Python 2.6. Permitting it would require changes to either Python's import semantics or the semantics used to indicate when a module is the main module, so it is not a decision to be made hastily.

Resolved Issues

There were some key design decisions that influenced the development of the runpy module. These are listed below.

  • The special variables __name__, __file__ and __loader__ are set in a module's global namespace before the module is executed. As run_module alters these values, it does not mutate the supplied dictionary. If it did, then passing globals() to this function could have nasty side effects.
  • Sometimes, the information needed to populate the special variables simply isn't available. Rather than trying to be too clever, these variables are simply set to None when the relevant information cannot be determined.
  • There is no special protection on the alter_sys argument. This may result in sys.argv[0] being set to None if file name information is not available.
  • The import lock is NOT used to avoid potential threading issues that arise when alter_sys is set to True. Instead, it is recommended that threaded code simply avoid using this flag.

Alternatives

The first alternative implementation considered ignored packages' __path__ variables, and looked only in the main package directory. A Python script with this behaviour can be found in the discussion of the execmodule cookbook recipe [3].

The execmodule cookbook recipe itself was the proposed mechanism in an earlier version of this PEP (before the PEP's author read PEP 302).

Both approaches were rejected as they do not meet the main goal of the -m switch -- to allow the full Python namespace to be used to locate modules for execution from the command line.

An earlier version of this PEP included some mistaken assumptions about the way exec handled locals dictionaries and code from function objects. These mistaken assumptions led to some unneeded design complexity which has now been removed - run_code shares all of the quirks of exec.

Earlier versions of the PEP also exposed a broader API than just the single run_module() function needed to implement the updates to the -m switch. In the interests of simplicity, those extra functions have been dropped from the proposed API.

After the original implementation in SVN, it became clear that holding the import lock when executing the initial application script was not correct (e.g. python -m test.regrtest test_threadedimport failed). So the run_module function only holds the import lock during the actual search for the module, and releases it before execution, even if alter_sys is set.

References

[1]Special __main__() function in modules (http://www.python.org/dev/peps/pep-0299/)
[2]PEP 338 implementation (runpy module and -m update) (http://www.python.org/sf/1429601)
[3]execmodule Python Cookbook Recipe (http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/307772)
[4]New import hooks (http://www.python.org/dev/peps/pep-0302/)
[5]PEP 338 documentation (for runpy module) (http://www.python.org/sf/1429605)

pep-0339 Design of the CPython Compiler

PEP:339
Title:Design of the CPython Compiler
Version:$Revision$
Last-Modified:$Date$
Author:Brett Cannon <brett at python.org>
Status:Withdrawn
Type:Informational
Content-Type:text/x-rst
Created:02-Feb-2005
Post-History:

Note

This PEP has been withdrawn and moved to the Python developer's guide.

Abstract

Historically (through 2.4), compilation from source code to bytecode involved two steps:

  1. Parse the source code into a parse tree (Parser/pgen.c)
  2. Emit bytecode based on the parse tree (Python/compile.c)

This is not how a standard compiler works. The usual steps for compilation are:

  1. Parse source code into a parse tree (Parser/pgen.c)
  2. Transform parse tree into an Abstract Syntax Tree (Python/ast.c)
  3. Transform AST into a Control Flow Graph (Python/compile.c)
  4. Emit bytecode based on the Control Flow Graph (Python/compile.c)

Starting with Python 2.5, the above steps are now used. This change was done to simplify compilation by breaking it into three steps. The purpose of this document is to outline how the latter three steps of the process work.

This document does not touch on how parsing works beyond what is needed to explain compilation. It is also not exhaustive in terms of how the entire system works. You will most likely need to read some source code to gain an exact understanding of all details.
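In current CPython the first and last of these stages can be observed from Python itself, since compile() accepts either source text or an AST object (the CFG stage is internal and not exposed). A small sketch:

```python
import ast
import dis

source = "x = 1 + 2"
tree = ast.parse(source)                    # source -> parse tree -> AST
code = compile(tree, "<example>", "exec")   # AST -> CFG -> bytecode
ns = {}
exec(code, ns)
print(ns['x'])    # 3
dis.dis(code)     # inspect the emitted bytecode
```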

Parse Trees

Python's parser is an LL(1) parser mostly based off of the implementation laid out in the Dragon Book [Aho86].

The grammar file for Python can be found in Grammar/Grammar, with the numeric values of grammar rules stored in Include/graminit.h. The numeric values for types of tokens (literal tokens, such as :, numbers, etc.) are kept in Include/token.h. The parse tree is made up of node * structs (as defined in Include/node.h).

Querying data from the node structs can be done with the following macros (which are all defined in Include/node.h):

  • CHILD(node *, int)

    Returns the nth child of the node using zero-offset indexing

  • RCHILD(node *, int)

    Returns the nth child of the node from the right side; use negative numbers!

  • NCH(node *)

    Number of children the node has

  • STR(node *)

    String representation of the node; e.g., will return : for a COLON token

  • TYPE(node *)

    The type of node as specified in Include/graminit.h

  • REQ(node *, TYPE)

    Assert that the node is the type that is expected

  • LINENO(node *)

    retrieve the line number of the source code that led to the creation of the parse rule; defined in Python/ast.c

To tie all of this together, consider the rule for 'while':

while_stmt: 'while' test ':' suite ['else' ':' suite]

The node representing this will have TYPE(node) == while_stmt, and the number of children will be 4 or 7 depending on whether there is an 'else' clause. To access what should be the first ':' and require that it be an actual ':' token, use REQ(CHILD(node, 2), COLON).

Abstract Syntax Trees (AST)

The abstract syntax tree (AST) is a high-level representation of the program structure without the necessity of containing the source code; it can be thought of as an abstract representation of the source code. The AST nodes are specified using the Zephyr Abstract Syntax Definition Language (ASDL) [Wang97].

The definition of the AST nodes for Python is found in the file Parser/Python.asdl.

Each AST node (representing statements, expressions, and several specialized types, like list comprehensions and exception handlers) is defined by the ASDL. Most definitions in the AST correspond to a particular source construct, such as an 'if' statement or an attribute lookup. The definition is independent of its realization in any particular programming language.

The following fragment of the Python ASDL construct demonstrates the approach and syntax:

module Python
{
      stmt = FunctionDef(identifier name, arguments args, stmt* body,
                          expr* decorators)
            | Return(expr? value) | Yield(expr value)
            attributes (int lineno)
}

The preceding example describes three different kinds of statements: function definitions, return statements, and yield statements. All three kinds are considered of type stmt, as shown by the '|' separating the various kinds. They all take arguments of various kinds and amounts.

Modifiers on the argument type specify the number of values needed; '?' means it is optional, '*' means 0 or more, no modifier means only one value for the argument and it is required. FunctionDef, for instance, takes an identifier for the name, 'arguments' for args, zero or more stmt arguments for 'body', and zero or more expr arguments for 'decorators'.

Do notice that something like 'arguments', which is a node type, is represented as a single AST node and not as a sequence of nodes as with stmt as one might expect.

All three kinds also have an 'attributes' argument; this is shown by the fact that 'attributes' lacks a '|' before it.

The statement definitions above generate the following C structure type:

typedef struct _stmt *stmt_ty;

struct _stmt {
      enum { FunctionDef_kind=1, Return_kind=2, Yield_kind=3 } kind;
      union {
              struct {
                      identifier name;
                      arguments_ty args;
                      asdl_seq *body;
              } FunctionDef;

              struct {
                      expr_ty value;
              } Return;

              struct {
                      expr_ty value;
              } Yield;
      } v;
      int lineno;
};

Also generated are a series of constructor functions that allocate (in this case) a stmt_ty struct with the appropriate initialization. The 'kind' field specifies which component of the union is initialized. The FunctionDef() constructor function sets 'kind' to FunctionDef_kind, initializes the 'name', 'args' and 'body' fields, and sets the 'lineno' attribute.
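These same node kinds are visible from Python through the ast module, which is generated from the same ASDL definitions; dumping a small function shows the FunctionDef/Return structure described above:

```python
import ast

tree = ast.parse("def f():\n    return 1")
func = tree.body[0]

print(type(func).__name__)           # FunctionDef
print(func.name)                     # f      (the 'identifier name' field)
print(type(func.body[0]).__name__)   # Return (the 'stmt* body' sequence)
print(func.lineno)                   # 1      (the 'attributes (int lineno)')
```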

Memory Management

Before discussing the actual implementation of the compiler, a discussion of how memory is handled is in order. To make memory management simple, an arena is used: memory is pooled in a single location for easy allocation and removal, which removes the need for explicit memory deallocation. Because all memory allocation in the compiler registers that memory with the arena, a single call to free the arena is all that is needed to completely free all memory used by the compiler.

In general, unless you are working on the critical core of the compiler, memory management can be completely ignored. But if you are working at either the very beginning of the compiler or the end, you need to care about how the arena works. All code relating to the arena is in either Include/pyarena.h or Python/pyarena.c .

PyArena_New() will create a new arena. The returned PyArena structure will store pointers to all memory given to it. This does the bookkeeping of what memory needs to be freed when the compiler is finished with the memory it used. That freeing is done with PyArena_Free(). This needs to only be called in strategic areas where the compiler exits.

As stated above, in general you should not have to worry about memory management when working on the compiler. The technical details have been designed to be hidden from you for most cases.

The only exception comes about when managing a PyObject. Since the rest of Python uses reference counting, there is extra support added to the arena to clean up each PyObject that was allocated. These cases are very rare. However, if you've allocated a PyObject, you must tell the arena about it by calling PyArena_AddPyObject().

Parse Tree to AST

The AST is generated from the parse tree (see Python/ast.c) using the function PyAST_FromNode().

The function begins a tree walk of the parse tree, creating various AST nodes as it goes along. It does this by allocating all new nodes it needs, calling the proper AST node creation functions for any required supporting functions, and connecting them as needed.

Do realize that there is no automated or symbolic connection between the grammar specification and the nodes in the parse tree. No help is directly provided by the parse tree as in yacc.

For instance, one must keep track of which node in the parse tree one is working with (e.g., if you are working with an 'if' statement you need to watch out for the ':' token to find the end of the conditional).

The functions called to generate AST nodes from the parse tree all have the name ast_for_xx, where xx is the grammar rule that the function handles (alias_for_import_name is the exception to this). These in turn call the constructor functions as defined by the ASDL grammar and contained in Python/Python-ast.c (which was generated by Parser/asdl_c.py) to create the nodes of the AST. This all leads to a sequence of AST nodes stored in asdl_seq structs.

Functions and macros for creating and using asdl_seq * types, as found in Python/asdl.c and Include/asdl.h:

  • asdl_seq_new()

    Allocate memory for an asdl_seq for the specified length

  • asdl_seq_GET()

    Get item held at a specific position in an asdl_seq

  • asdl_seq_SET()

    Set a specific index in an asdl_seq to the specified value

  • asdl_seq_LEN(asdl_seq *)

    Return the length of an asdl_seq

If you are working with statements, you must also worry about keeping track of the line number that generated the statement. Currently the line number is passed as the last parameter to each stmt_ty function.
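
These AST nodes can be inspected from Python itself through the ast module, which exposes the node types generated from the ASDL definition (a sketch using the modern Python 3 API):

```python
import ast

# ast.parse drives CPython's parser and AST construction, returning
# instances of the node types generated from the ASDL definition.
tree = ast.parse("if x:\n    y = 1\nelse:\n    y = 2\n")
if_stmt = tree.body[0]   # an ast.If node ('kind' is If_kind in C)
```

Note that each statement node carries the line number it came from (lineno), matching the last parameter passed to the stmt_ty constructors.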

Control Flow Graphs

A control flow graph (often referenced by its acronym, CFG) is a directed graph that models the flow of a program using basic blocks that contain the intermediate representation (abbreviated "IR"; in this case, Python bytecode). A basic block is a sequence of IR that has a single entry point but possibly multiple exit points. The single entry point is the key to basic blocks; it all has to do with jumps. An entry point is the target of something that changes control flow (such as a function call or a jump), while exit points are instructions that would change the flow of the program (such as jumps and 'return' statements). What this means is that a basic block is a chunk of code that starts at its entry point and runs to an exit point or the end of the block.

As an example, consider an 'if' statement with an 'else' block. The guard on the 'if' is a basic block which is pointed to by the basic block containing the code leading to the 'if' statement. The 'if' statement block contains jumps (which are exit points) to the true body of the 'if' and the 'else' body (which may be NULL), each of which are their own basic blocks. Both of those blocks in turn point to the basic block representing the code following the entire 'if' statement.

CFGs are usually one step away from final code output. Code is directly generated from the basic blocks (with jump targets adjusted based on the output order) by doing a post-order depth-first search on the CFG following the edges.
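
While CPython does not expose its CFG directly, the jumps that delimit basic blocks are visible in the final bytecode via the dis module (a sketch; exact opcode names vary between Python versions, hence the substring check):

```python
import dis

def branch(x):
    # The guard, the true body, the else body, and the code after the
    # 'if' each become their own basic block in the compiler's CFG.
    if x:
        return "true body"
    else:
        return "else body"

# The conditional jump is an exit point of the guard's basic block;
# its target begins the else block.
jump_ops = [ins.opname for ins in dis.Bytecode(branch)
            if "JUMP" in ins.opname]
```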

AST to CFG to Bytecode

With the AST created, the next step is to create the CFG. The first step is to convert the AST to Python bytecode without having jump targets resolved to specific offsets (this is calculated when the CFG goes to final bytecode). Essentially, this transforms the AST into Python bytecode with control flow represented by the edges of the CFG.

Conversion is done in two passes. The first creates the namespace (variables can be classified as local, free/cell for closures, or global). With that done, the second pass essentially flattens the CFG into a list and calculates jump offsets for final output of bytecode.

The conversion process is initiated by a call to the function PyAST_Compile() in Python/compile.c . This function handles both the conversion of the AST to a CFG and the output of final bytecode from the CFG. The AST-to-CFG step is handled mostly by two functions called by PyAST_Compile(): PySymtable_Build() and compiler_mod() . The former is in Python/symtable.c while the latter is in Python/compile.c .

PySymtable_Build() begins by entering the starting code block for the AST (passed in) and then calling the proper symtable_visit_xx function (with xx being the AST node type). Next, the AST is walked, with symtable_enter_block() and symtable_exit_block() called as the code blocks that delineate the reach of a local variable are entered and exited.
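
The symbol table built here is observable from Python through the symtable module, which wraps the same machinery (a sketch using the modern Python 3 API):

```python
import symtable

# A module with a global, and a function with a parameter and a local.
code = "g = 0\ndef f(a):\n    b = a + g\n    return b\n"
top = symtable.symtable(code, "<example>", "exec")
f_scope = top.get_children()[0]   # the code block for f()
```

Looking up names in f's block shows how each variable is classified: 'a' is a parameter, 'b' is local, and 'g' resolves as global.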

Once the symbol table is created, it is time for CFG creation, whose code is in Python/compile.c . This is handled by several functions that break the task down by various AST node types. The functions are all named compiler_visit_xx where xx is the name of the node type (such as stmt, expr, etc.). Each function receives a struct compiler * and xx_ty where xx is the AST node type. Typically these functions consist of a large 'switch' statement, branching based on the kind of node type passed to it. Simple things are handled inline in the 'switch' statement with more complex transformations farmed out to other functions named compiler_xx with xx being a descriptive name of what is being handled.

When transforming an arbitrary AST node, use the VISIT() macro. The appropriate compiler_visit_xx function is called, based on the value passed in for <node type> (so VISIT(c, expr, node) calls compiler_visit_expr(c, node)). The VISIT_SEQ macro is very similar, but is called on AST node sequences (those values that were created as arguments to a node that used the '*' modifier). There is also VISIT_SLICE() just for handling slices.

Emission of bytecode is handled by the following macros:

  • ADDOP()

    add a specified opcode

  • ADDOP_I()

    add an opcode that takes an argument

  • ADDOP_O(struct compiler *c, int op, PyObject *type, PyObject *obj)

    add an opcode with the proper argument based on the position of the specified PyObject in PyObject sequence object, but with no handling of mangled names; used for when you need to do named lookups of objects such as globals, consts, or parameters where name mangling is not possible and the scope of the name is known

  • ADDOP_NAME()

    just like ADDOP_O, but name mangling is also handled; used for attribute loading or importing based on name

  • ADDOP_JABS()

    create an absolute jump to a basic block

  • ADDOP_JREL()

    create a relative jump to a basic block

Several helper functions emit bytecode; they are named compiler_xx(), where xx is what the function helps with (list, boolop, etc.). A rather useful one is compiler_nameop(). This function looks up the scope of a variable and, based on the expression context, emits the proper opcode to load, store, or delete the variable.
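
The effect of compiler_nameop() can be observed with the dis module: a local name and a global name in the same function compile to different load opcodes (a sketch; opcode names vary slightly across Python versions, hence the prefix checks):

```python
import dis

b = 2  # module-level, so global from add()'s point of view

def add(a):
    return a + b   # 'a' is a parameter (local); 'b' resolves as global

# compiler_nameop() chose a LOAD_FAST-style opcode for the local and a
# LOAD_GLOBAL-style opcode for the global when this function was compiled.
uses_fast = any(ins.opname.startswith("LOAD_FAST")
                for ins in dis.Bytecode(add))
uses_global = any(ins.opname.startswith("LOAD_GLOBAL")
                  for ins in dis.Bytecode(add))
```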

The line number on which a statement is defined is handled by compiler_visit_stmt() and thus is not a worry.

In addition to emitting bytecode based on the AST node, the creation of basic blocks must be handled. Below are the macros and functions used for managing basic blocks:

  • NEW_BLOCK()

    create block and set it as current

  • NEXT_BLOCK()

    basically NEW_BLOCK() plus jump from current block

  • compiler_new_block()

    create a block but don't use it (used for generating jumps)

Once the CFG is created, it must be flattened and then final emission of bytecode occurs. Flattening is handled using a post-order depth-first search. Once flattened, jump offsets are backpatched based on the flattened order and then a PyCodeObject is created. All of this is handled by calling assemble() .

Introducing New Bytecode

Sometimes a new feature requires a new opcode. But adding new bytecode is not as simple as just suddenly introducing new bytecode in the AST -> bytecode step of the compiler. Several pieces of code throughout Python depend on having correct information about what bytecode exists.

First, you must choose a name and a unique identifier number. The official list of bytecode can be found in Include/opcode.h . If the opcode is to take an argument, it must be given a unique number greater than that assigned to HAVE_ARGUMENT (as found in Include/opcode.h).

Once the name/number pair has been chosen and entered in Include/opcode.h, you must also enter it into Lib/opcode.py and Doc/library/dis.rst .

With a new bytecode you must also change what is called the magic number for .pyc files. The variable MAGIC in Python/import.c contains the number. Changing this number causes all .pyc files with the old MAGIC to be recompiled by the interpreter on import.

Finally, you need to introduce the use of the new bytecode. Altering Python/compile.c and Python/ceval.c will be the primary places to change. But you will also need to change the 'compiler' package. The key files to do that are Lib/compiler/pyassem.py and Lib/compiler/pycodegen.py .

If you make a change here that can affect the output of bytecode that already exists, and you are not changing the magic number every time, make sure to delete your old .py(c|o) files! Even though you will end up changing the magic number if you change the bytecode, while you are debugging your work you will be changing the bytecode output without constantly bumping the magic number. This means you end up with stale .pyc files that will not be recreated. Running find . -name '*.py[co]' -exec rm -f {} ';' should delete all .pyc files you have, forcing new ones to be created and thus allowing you to test your new bytecode properly.

Code Objects

The result of PyAST_Compile() is a PyCodeObject, which is defined in Include/code.h . And with that you now have executable Python bytecode!

Code objects (bytecode) are executed in Python/ceval.c . This file will also need a new case statement for the new opcode in the big switch statement in PyEval_EvalFrameEx().
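
From Python, the built-in compile() returns the same kind of code object, whose attributes mirror the PyCodeObject fields (a sketch using the modern Python 3 API):

```python
# compile() produces a code object (a PyCodeObject at the C level).
code = compile("answer = 6 * 7", "<example>", "exec")

# The names the bytecode stores to are recorded on the object itself
# (co_names); exec() hands the object to the eval loop in ceval.c.
namespace = {}
exec(code, namespace)
```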

Important Files

  • Parser/

    • Python.asdl

      ASDL syntax file

    • asdl.py

      "An implementation of the Zephyr Abstract Syntax Definition Language." Uses SPARK [5] to parse the ASDL files.

    • asdl_c.py

      "Generate C code from an ASDL description." Generates Python/Python-ast.c and Include/Python-ast.h .

    • spark.py

      SPARK [5] parser generator

  • Python/

    • Python-ast.c

      Creates C structs corresponding to the ASDL types. Also contains code for marshaling AST nodes (core ASDL types have marshaling code in asdl.c). "File automatically generated by Parser/asdl_c.py". This file must be committed separately after every grammar change is committed since the __version__ value is set to the latest grammar change revision number.

    • asdl.c

Contains code to handle the ASDL sequence type. Also has code to handle marshalling the core ASDL types, such as number and identifier. Used by Python-ast.c for marshaling AST nodes.

    • ast.c

      Converts Python's parse tree into the abstract syntax tree.

    • ceval.c

      Executes byte code (aka, eval loop).

    • compile.c

      Emits bytecode based on the AST.

    • symtable.c

      Generates a symbol table from AST.

    • pyarena.c

      Implementation of the arena memory manager.

    • import.c

      Home of the magic number (named MAGIC) for bytecode versioning

  • Include/

    • Python-ast.h

      Contains the actual definitions of the C structs as generated by Python/Python-ast.c . "Automatically generated by Parser/asdl_c.py".

    • asdl.h

Header for the corresponding Python/asdl.c .

    • ast.h

      Declares PyAST_FromNode() external (from Python/ast.c).

    • code.h

      Header file for Objects/codeobject.c; contains definition of PyCodeObject.

    • symtable.h

      Header for Python/symtable.c . struct symtable and PySTEntryObject are defined here.

    • pyarena.h

      Header file for the corresponding Python/pyarena.c .

    • opcode.h

      Master list of bytecode; if this file is modified you must modify several other files accordingly (see "Introducing New Bytecode")

  • Objects/

    • codeobject.c

      Contains PyCodeObject-related code (originally in Python/compile.c).

  • Lib/

    • opcode.py

      One of the files that must be modified if Include/opcode.h is.

    • compiler/

      • pyassem.py

        One of the files that must be modified if Include/opcode.h is changed.

      • pycodegen.py

        One of the files that must be modified if Include/opcode.h is changed.

References

[Aho86]Alfred V. Aho, Ravi Sethi, Jeffrey D. Ullman. Compilers: Principles, Techniques, and Tools, http://www.amazon.com/exec/obidos/tg/detail/-/0201100886/104-0162389-6419108
[Wang97]Daniel C. Wang, Andrew W. Appel, Jeff L. Korn, and Chris S. Serra. The Zephyr Abstract Syntax Description Language. [4] In Proceedings of the Conference on Domain-Specific Languages, pp. 213--227, 1997.
[1]Skip Montanaro's Peephole Optimizer Paper (http://www.foretec.com/python/workshops/1998-11/proceedings/papers/montanaro/montanaro.html)
[2]Bytecodehacks Project (http://bytecodehacks.sourceforge.net/bch-docs/bch/index.html)
[3]CALL_ATTR opcode (http://www.python.org/sf/709744)
[4]http://www.cs.princeton.edu/research/techreps/TR-554-97
[5](1, 2) http://pages.cpsc.ucalgary.ca/~aycock/spark/

pep-0340 Anonymous Block Statements

PEP: 340
Title: Anonymous Block Statements
Version: $Revision$
Last-Modified: $Date$
Author: Guido van Rossum
Status: Rejected
Type: Standards Track
Content-Type: text/plain
Created: 27-Apr-2005
Post-History: 

Introduction

    This PEP proposes a new type of compound statement which can be
    used for resource management purposes.  The new statement type
    is provisionally called the block-statement because the keyword
    to be used has not yet been chosen.

    This PEP competes with several other PEPs: PEP 288 (Generators
    Attributes and Exceptions; only the second part), PEP 310
    (Reliable Acquisition/Release Pairs), and PEP 325
    (Resource-Release Support for Generators).

    I should clarify that using a generator to "drive" a block
    statement is really a separable proposal; with just the definition
    of the block statement from the PEP you could implement all the
    examples using a class (similar to example 6, which is easily
    turned into a template).  But the key idea is using a generator to
    drive a block statement; the rest is elaboration, so I'd like to
    keep these two parts together.

    (PEP 342, Enhanced Iterators, was originally a part of this PEP;
    but the two proposals are really independent and with Steven
    Bethard's help I have moved it to a separate PEP.)

Rejection Notice

    I am rejecting this PEP in favor of PEP 343.  See the motivational
    section in that PEP for the reasoning behind this rejection.  GvR.

Motivation and Summary

    (Thanks to Shane Hathaway -- Hi Shane!)

    Good programmers move commonly used code into reusable functions.
    Sometimes, however, patterns arise in the structure of the
    functions rather than the actual sequence of statements.  For
    example, many functions acquire a lock, execute some code specific
    to that function, and unconditionally release the lock.  Repeating
    the locking code in every function that uses it is error prone and
    makes refactoring difficult.

    Block statements provide a mechanism for encapsulating patterns of
    structure.  Code inside the block statement runs under the control
    of an object called a block iterator.  Simple block iterators
    execute code before and after the code inside the block statement.
    Block iterators also have the opportunity to execute the
    controlled code more than once (or not at all), catch exceptions,
    or receive data from the body of the block statement.

    A convenient way to write block iterators is to write a generator
    (PEP 255).  A generator looks a lot like a Python function, but
    instead of returning a value immediately, generators pause their
    execution at "yield" statements.  When a generator is used as a
    block iterator, the yield statement tells the Python interpreter
    to suspend the block iterator, execute the block statement body,
    and resume the block iterator when the body has executed.

    The Python interpreter behaves as follows when it encounters a
    block statement based on a generator.  First, the interpreter
    instantiates the generator and begins executing it.  The generator
    does setup work appropriate to the pattern it encapsulates, such
    as acquiring a lock, opening a file, starting a database
    transaction, or starting a loop.  Then the generator yields
    execution to the body of the block statement using a yield
    statement.  When the block statement body completes, raises an
    uncaught exception, or sends data back to the generator using a
    continue statement, the generator resumes.  At this point, the
    generator can either clean up and stop or yield again, causing the
    block statement body to execute again.  When the generator
    finishes, the interpreter leaves the block statement.

Use Cases

    See the Examples section near the end.

Specification: the __exit__() Method

    An optional new method for iterators is proposed, called
    __exit__().  It takes up to three arguments which correspond to
    the three "arguments" to the raise-statement: type, value, and
    traceback.  If all three arguments are None, sys.exc_info() may be
    consulted to provide suitable default values.

Specification: the Anonymous Block Statement

    A new statement is proposed with the syntax

        block EXPR1 as VAR1:
            BLOCK1

    Here, 'block' and 'as' are new keywords; EXPR1 is an arbitrary
    expression (but not an expression-list) and VAR1 is an arbitrary
    assignment target (which may be a comma-separated list).

    The "as VAR1" part is optional; if omitted, the assignments to
    VAR1 in the translation below are omitted (but the expressions
    assigned are still evaluated!).

    The choice of the 'block' keyword is contentious; many
    alternatives have been proposed, including not to use a keyword at
    all (which I actually like).  PEP 310 uses 'with' for similar
    semantics, but I would like to reserve that for a with-statement
    similar to the one found in Pascal and VB.  (Though I just found
    that the C# designers don't like 'with' [2], and I have to agree
    with their reasoning.)  To sidestep this issue momentarily I'm
    using 'block' until we can agree on the right keyword, if any.

    Note that the 'as' keyword is not contentious (it will finally be
    elevated to proper keyword status).

    Note that it is up to the iterator to decide whether a
    block-statement represents a loop with multiple iterations; in the
    most common use case BLOCK1 is executed exactly once.  To the
    parser, however, it is always a loop; break, continue, and return
    transfer control to the block's iterator (see below for details).

    The translation is subtly different from a for-loop: iter() is
    not called, so EXPR1 should already be an iterator (not just an
    iterable); and the iterator is guaranteed to be notified when
    the block-statement is left, regardless if this is due to a
    break, return or exception:

        itr = EXPR1  # The iterator
        ret = False  # True if a return statement is active
        val = None   # Return value, if ret == True
        exc = None   # sys.exc_info() tuple if an exception is active
        while True:
            try:
                if exc:
                    ext = getattr(itr, "__exit__", None)
                    if ext is not None:
                        VAR1 = ext(*exc)   # May re-raise *exc
                    else:
                        raise exc[0], exc[1], exc[2]
                else:
                    VAR1 = itr.next()  # May raise StopIteration
            except StopIteration:
                if ret:
                    return val
                break
            try:
                ret = False
                val = exc = None
                BLOCK1
            except:
                exc = sys.exc_info()

    (However, the variables 'itr' etc. are not user-visible and the
    built-in names used cannot be overridden by the user.)

    Inside BLOCK1, the following special translations apply:

    - "break" is always legal; it is translated into:

        exc = (StopIteration, None, None)
        continue

    - "return EXPR3" is only legal when the block-statement is
      contained in a function definition; it is translated into:

        exc = (StopIteration, None, None)
        ret = True
        val = EXPR3
        continue

    The net effect is that break and return behave much the same as
    if the block-statement were a for-loop, except that the iterator
    gets a chance at resource cleanup before the block-statement is
    left, through the optional __exit__() method. The iterator also
    gets a chance if the block-statement is left through raising an
    exception.  If the iterator doesn't have an __exit__() method,
    there is no difference with a for-loop (except that a for-loop
    calls iter() on EXPR1).

    Note that a yield-statement in a block-statement is not treated
    differently.  It suspends the function containing the block
    *without* notifying the block's iterator.  The block's iterator is
    entirely unaware of this yield, since the local control flow
    doesn't actually leave the block.  In other words, it is *not*
    like a break or return statement.  When the loop that was resumed
    by the yield calls next(), the block is resumed right after the
    yield.  (See example 7 below.)  The generator finalization
    semantics described below guarantee (within the limitations of all
    finalization semantics) that the block will be resumed eventually.

    Unlike the for-loop, the block-statement does not have an
    else-clause.  I think it would be confusing, and emphasize the
    "loopiness" of the block-statement, while I want to emphasize its
    *difference* from a for-loop.  In addition, there are several
    possible semantics for an else-clause, and only a very weak use
    case.

Specification: Generator Exit Handling

    Generators will implement the new __exit__() method API.

    Generators will be allowed to have a yield statement inside a
    try-finally statement.

    The expression argument to the yield-statement will become
    optional (defaulting to None).

    When __exit__() is called, the generator is resumed but at the
    point of the yield-statement the exception represented by the
    __exit__ argument(s) is raised.  The generator may re-raise this
    exception, raise another exception, or yield another value,
    except that if the exception passed in to __exit__() was
    StopIteration, it ought to raise StopIteration (otherwise the
    effect would be that a break is turned into continue, which is
    unexpected at least).  When the *initial* call resuming the
    generator is an __exit__() call instead of a next() call, the
    generator's execution is aborted and the exception is re-raised
    without passing control to the generator's body.

    When a generator that has not yet terminated is garbage-collected
    (either through reference counting or by the cyclical garbage
    collector), its __exit__() method is called once with
    StopIteration as its first argument.  Together with the
    requirement that a generator ought to raise StopIteration when
    __exit__() is called with StopIteration, this guarantees the
    eventual activation of any finally-clauses that were active when
    the generator was last suspended.  Of course, under certain
    circumstances the generator may never be garbage-collected.  This
    is no different than the guarantees that are made about finalizers
    (__del__() methods) of other objects.

Alternatives Considered and Rejected

    - Many alternatives have been proposed for 'block'.  I haven't
      seen a proposal for another keyword that I like better than
      'block' yet.  Alas, 'block' is also not a good choice; it is a
      rather popular name for variables, arguments and methods.
      Perhaps 'with' is the best choice after all?

    - Instead of trying to pick the ideal keyword, the block-statement
      could simply have the form:

        EXPR1 as VAR1:
            BLOCK1

      This is at first attractive because, together with a good choice
      of function names (like those in the Examples section below)
      used in EXPR1, it reads well, and feels like a "user-defined
      statement".  And yet, it makes me (and many others)
      uncomfortable; without a keyword the syntax is very "bland",
      difficult to look up in a manual (remember that 'as' is
      optional), and it makes the meaning of break and continue in the
      block-statement even more confusing.

    - Phillip Eby has proposed to have the block-statement use
      an entirely different API than the for-loop, to differentiate
      between the two.  A generator would have to be wrapped in a
      decorator to make it support the block API.  IMO this adds more
      complexity with very little benefit; and we can't really deny
      that the block-statement is conceptually a loop -- it supports
      break and continue, after all.

    - This keeps getting proposed: "block VAR1 = EXPR1" instead of
      "block EXPR1 as VAR1".  That would be very misleading, since
      VAR1 does *not* get assigned the value of EXPR1; EXPR1 results
      in a generator which is assigned to an internal variable, and
      VAR1 is the value returned by successive calls to the __next__()
      method of that iterator.

    - Why not change the translation to apply iter(EXPR1)?  All the
      examples would continue to work.  But this makes the
      block-statement *more* like a for-loop, while the emphasis ought
      to be on the *difference* between the two.  Not calling iter()
      catches a bunch of misunderstandings, like using a sequence as
      EXPR1.

Comparison to Thunks

    Alternative semantics proposed for the block-statement turn the
    block into a thunk (an anonymous function that blends into the
    containing scope).

    The main advantage of thunks that I can see is that you can save
    the thunk for later, like a callback for a button widget (the
    thunk then becomes a closure).  You can't use a yield-based block
    for that (except in Ruby, which uses yield syntax with a
    thunk-based implementation).  But I have to say that I almost see
    this as an advantage: I think I'd be slightly uncomfortable seeing
    a block and not knowing whether it will be executed in the normal
    control flow or later.  Defining an explicit nested function for
    that purpose doesn't have this problem for me, because I already
    know that the 'def' keyword means its body is executed later.

    The other problem with thunks is that once we think of them as the
    anonymous functions they are, we're pretty much forced to say that
    a return statement in a thunk returns from the thunk rather than
    from the containing function.  Doing it any other way would cause
    major weirdness when the thunk were to survive its containing
    function as a closure (perhaps continuations would help, but I'm
    not about to go there :-).

    But then an IMO important use case for the resource cleanup
    template pattern is lost.  I routinely write code like this:

       def findSomething(self, key, default=None):
           self.lock.acquire()
           try:
                for item in self.elements:
                    if item.matches(key):
                        return item
                return default
           finally:
              self.lock.release()

    and I'd be bummed if I couldn't write this as:

       def findSomething(self, key, default=None):
           block locking(self.lock):
                for item in self.elements:
                    if item.matches(key):
                        return item
                return default

    This particular example can be rewritten using a break:

       def findSomething(self, key, default=None):
           block locking(self.lock):
                for item in self.elements:
                    if item.matches(key):
                        break
                else:
                    item = default
           return item

    but it looks forced and the transformation isn't always that easy;
    you'd be forced to rewrite your code in a single-return style
    which feels too restrictive.

    Also note the semantic conundrum of a yield in a thunk -- the only
    reasonable interpretation is that this turns the thunk into a
    generator!

    Greg Ewing believes that thunks "would be a lot simpler, doing
    just what is required without any jiggery pokery with exceptions
    and break/continue/return statements.  It would be easy to explain
    what it does and why it's useful."

    But in order to obtain the required local variable sharing between
    the thunk and the containing function, every local variable used
    or set in the thunk would have to become a 'cell' (our mechanism
    for sharing variables between nested scopes).  Cells slow down
    access compared to regular local variables: access involves an
    extra C function call (PyCell_Get() or PyCell_Set()).
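
    For reference, the cells discussed above are visible from modern
    Python through a function's __closure__ attribute (a sketch using
    nonlocal, which postdates this PEP):

```python
def make_counter():
    n = 0                 # shared between make_counter() and bump()
    def bump():
        nonlocal n        # forces n into a cell
        n += 1
        return n
    return bump

counter = make_counter()
first = counter()
cell = counter.__closure__[0]   # the cell object holding n
```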

    Perhaps not entirely coincidentally, the last example above
    (findSomething() rewritten to avoid a return inside the block)
    shows that, unlike for regular nested functions, we'll want
    variables *assigned to* by the thunk also to be shared with the
    containing function, even if they are not assigned to outside the
    thunk.

    Greg Ewing again: "generators have turned out to be more powerful,
    because you can have more than one of them on the go at once. Is
    there a use for that capability here?"

    I believe there are definitely uses for this; several people have
    already shown how to do asynchronous light-weight threads using
    generators (e.g. David Mertz quoted in PEP 288, and Fredrik
    Lundh[3]).

    And finally, Greg says: "a thunk implementation has the potential
    to easily handle multiple block arguments, if a suitable syntax
    could ever be devised. It's hard to see how that could be done in
    a general way with the generator implementation."

    However, the use cases for multiple blocks seem elusive.

    (Proposals have since been made to change the implementation of
    thunks to remove most of these objections, but the resulting
    semantics are fairly complex to explain and to implement, so IMO
    that defeats the purpose of using thunks in the first place.)

Examples

    (Several of these examples contain "yield None".  If PEP 342 is
    accepted, these can be changed to just "yield" of course.)

    1. A template for ensuring that a lock, acquired at the start of a
       block, is released when the block is left:

        def locking(lock):
            lock.acquire()
            try:
                yield None
            finally:
                lock.release()

       Used as follows:

        block locking(myLock):
            # Code here executes with myLock held.  The lock is
            # guaranteed to be released when the block is left (even
            # if via return or by an uncaught exception).

    2. A template for opening a file that ensures the file is closed
       when the block is left:

        def opening(filename, mode="r"):
            f = open(filename, mode)
            try:
                yield f
            finally:
                f.close()

       Used as follows:

        block opening("/etc/passwd") as f:
            for line in f:
                print line.rstrip()

    3. A template for committing or rolling back a database
       transaction:

        def transactional(db):
            try:
                yield None
            except:
                db.rollback()
                raise
            else:
                db.commit()

    4. A template that tries something up to n times:

        def auto_retry(n=3, exc=Exception):
            for i in range(n):
                try:
                    yield None
                    return
                except exc, err:
                    # perhaps log exception here
                    continue
            raise # re-raise the exception we caught earlier

       Used as follows:

        block auto_retry(3, IOError):
            f = urllib.urlopen("http://www.python.org/dev/peps/pep-0340/")
            print f.read()

    5. It is possible to nest blocks and combine templates:

        def locking_opening(lock, filename, mode="r"):
            block locking(lock):
                block opening(filename) as f:
                    yield f

       Used as follows:

        block locking_opening(myLock, "/etc/passwd") as f:
            for line in f:
                print line.rstrip()

       (If this example confuses you, consider that it is equivalent
       to using a for-loop with a yield in its body in a regular
       generator which is invoking another iterator or generator
       recursively; see for example the source code for os.walk().)

    6. It is possible to write a regular iterator with the
       semantics of example 1:

        class locking:
           def __init__(self, lock):
               self.lock = lock
               self.state = 0
           def __next__(self, arg=None):
               # ignores arg
               if self.state:
                   assert self.state == 1
                   self.lock.release()
                   self.state += 1
                   raise StopIteration
               else:
                   self.lock.acquire()
                   self.state += 1
                   return None
           def __exit__(self, type, value=None, traceback=None):
               assert self.state in (0, 1, 2)
               if self.state == 1:
                   self.lock.release()
               raise type, value, traceback

       (This example is easily modified to implement the other
       examples; it shows how much simpler generators are for the same
       purpose.)

    7. Redirect stdout temporarily:

        def redirecting_stdout(new_stdout):
            save_stdout = sys.stdout
            try:
                sys.stdout = new_stdout
                yield None
            finally:
                sys.stdout = save_stdout

       Used as follows:

        block opening(filename, "w") as f:
            block redirecting_stdout(f):
                print "Hello world"

    8. A variant on opening() that also returns an error condition:

        def opening_w_error(filename, mode="r"):
            try:
                f = open(filename, mode)
            except IOError, err:
                yield None, err
            else:
                try:
                    yield f, None
                finally:
                    f.close()

       Used as follows:

        block opening_w_error("/etc/passwd", "a") as f, err:
            if err:
                print "IOError:", err
            else:
                f.write("guido::0:0::/:/bin/sh\n")

Acknowledgements

    In no useful order: Alex Martelli, Barry Warsaw, Bob Ippolito,
    Brett Cannon, Brian Sabbey, Chris Ryland, Doug Landauer, Duncan
    Booth, Fredrik Lundh, Greg Ewing, Holger Krekel, Jason Diamond,
    Jim Jewett, Josiah Carlson, Ka-Ping Yee, Michael Chermside,
    Michael Hudson, Neil Schemenauer, Nick Coghlan, Paul Moore,
    Phillip Eby, Raymond Hettinger, Georg Brandl, Samuele
    Pedroni, Shannon Behrens, Skip Montanaro, Steven Bethard, Terry
    Reedy, Tim Delaney, Aahz, and others.  Thanks all for the valuable
    contributions!

References

    [1] http://mail.python.org/pipermail/python-dev/2005-April/052821.html

    [2] http://msdn.microsoft.com/vcsharp/programming/language/ask/withstatement/

    [3] http://effbot.org/zone/asyncore-generators.htm

Copyright

    This document has been placed in the public domain.

pep-0341 Unifying try-except and try-finally

PEP: 341
Title: Unifying try-except and try-finally
Version: $Revision$
Last-Modified: $Date$
Author: Georg Brandl <georg at python.org>
Status: Final
Type: Standards Track
Content-Type: text/plain
Created: 04-May-2005
Post-History: 

Abstract

    This PEP proposes a change in the syntax and semantics of try
    statements to allow combined try-except-finally blocks. This
    means in short that it would be valid to write

        try:
            <do something>
        except Exception:
            <handle the error>
        finally:
            <cleanup>


Rationale/Proposal

    There are many use cases for the try-except statement and
    for the try-finally statement per se; however, often one needs
    to catch exceptions and execute some cleanup code afterwards.
    It is slightly annoying and not very intelligible that
    one has to write

        f = None
        try:
            try:
                f = open(filename)
                text = f.read()
            except IOError:
                print 'An error occurred'
        finally:
            if f:
                f.close()

    So it is proposed that a construction like this

        try:
            <suite 1>
        except Ex1:
            <suite 2>
        <more except: clauses>
        else:
            <suite 3>
        finally:
            <suite 4>

    be exactly the same as the legacy

        try:
            try:
                <suite 1>
            except Ex1:
                <suite 2>
            <more except: clauses>
            else:
                <suite 3>
        finally:
            <suite 4>

    This is backwards compatible, and every try statement that is
    legal today would continue to work.
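
    A minimal runnable sketch of the unified statement (the function
    name and logged strings below are illustrative, not taken from the
    PEP):

```python
# Illustrative sketch of the unified try-except-else-finally statement
# proposed above (legal since Python 2.5).
log = []

def process(fail):
    try:
        if fail:
            raise IOError("boom")
        log.append("body")
    except IOError:
        log.append("handled")
    else:
        log.append("no error")   # runs only when no exception occurred
    finally:
        log.append("cleanup")    # runs in every case

process(False)   # body, no error, cleanup
process(True)    # handled, cleanup
```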


Changes to the grammar

    The grammar for the try statement, which is currently

        try_stmt: ('try' ':' suite (except_clause ':' suite)+
                   ['else' ':' suite] | 'try' ':' suite 'finally' ':' suite)

    would have to become

        try_stmt: 'try' ':' suite
                  (
                    (except_clause ':' suite)+
                    ['else' ':' suite]
                    ['finally' ':' suite]
                  |
                    'finally' ':' suite
                  )

Implementation

    As the PEP author currently does not have sufficient knowledge
    of the CPython implementation, he is unfortunately not able
    to deliver one.  Thomas Lee has submitted a patch [2].

    However, according to Guido, it should be a piece of cake to
    implement[1] -- at least for a core hacker.

    This patch was committed 17 December 2005, SVN revision 41740 [3].


References

    [1] http://mail.python.org/pipermail/python-dev/2005-May/053319.html
    [2] http://python.org/sf/1355913
    [3] http://mail.python.org/pipermail/python-checkins/2005-December/048457.html


Copyright

    This document has been placed in the public domain.



pep-0342 Coroutines via Enhanced Generators

PEP: 342
Title: Coroutines via Enhanced Generators
Version: $Revision$
Last-Modified: $Date$
Author: Guido van Rossum, Phillip J. Eby
Status: Final
Type: Standards Track
Content-Type: text/plain
Created: 10-May-2005
Python-Version: 2.5
Post-History: 

Introduction

    This PEP proposes some enhancements to the API and syntax of
    generators, to make them usable as simple coroutines.  It is
    basically a combination of ideas from these two PEPs, which
    may be considered redundant if this PEP is accepted:

    - PEP 288, Generators Attributes and Exceptions.  The current PEP
      covers its second half, generator exceptions (in fact the
      throw() method name was taken from PEP 288).  PEP 342 replaces
      generator attributes, however, with a concept from an earlier
      revision of PEP 288, the "yield expression".

    - PEP 325, Resource-Release Support for Generators.  PEP 342
      ties up a few loose ends in the PEP 325 spec, to make it suitable
      for actual implementation.

Motivation

    Coroutines are a natural way of expressing many algorithms, such as
    simulations, games, asynchronous I/O, and other forms of event-
    driven programming or co-operative multitasking.  Python's generator
    functions are almost coroutines -- but not quite -- in that they
    allow pausing execution to produce a value, but do not provide for
    values or exceptions to be passed in when execution resumes.  They
    also do not allow execution to be paused within the "try" portion of
    try/finally blocks, and therefore make it difficult for an aborted
    coroutine to clean up after itself.

    Also, generators cannot yield control while other functions are
    executing, unless those functions are themselves expressed as
    generators, and the outer generator is written to yield in response
    to values yielded by the inner generator.  This complicates the
    implementation of even relatively simple use cases like asynchronous
    communications, because calling any functions either requires the
    generator to "block" (i.e. be unable to yield control), or else a
    lot of boilerplate looping code must be added around every needed
    function call.

    However, if it were possible to pass values or exceptions *into* a
    generator at the point where it was suspended, a simple co-routine
    scheduler or "trampoline function" would let coroutines "call" each
    other without blocking -- a tremendous boon for asynchronous
    applications.  Such applications could then write co-routines to
    do non-blocking socket I/O by yielding control to an I/O scheduler
    until data has been sent or becomes available.  Meanwhile, code that
    performs the I/O would simply do something like this:

         data = (yield nonblocking_read(my_socket, nbytes))

    in order to pause execution until the nonblocking_read() coroutine
    produced a value.

    In other words, with a few relatively minor enhancements to the
    language and to the implementation of the generator-iterator type,
    Python will be able to support performing asynchronous operations
    without needing to write the entire application as a series of
    callbacks, and without requiring the use of resource-intensive threads
    for programs that need hundreds or even thousands of co-operatively
    multitasking pseudothreads.  Thus, these enhancements will give
    standard Python many of the benefits of the Stackless Python fork,
    without requiring any significant modification to the CPython core
    or its APIs.  In addition, these enhancements should be readily
    implementable by any Python implementation (such as Jython) that
    already supports generators.

Specification Summary

    By adding a few simple methods to the generator-iterator type, and
    with two minor syntax adjustments, Python developers will be able
    to use generator functions to implement co-routines and other forms
    of co-operative multitasking.  These methods and adjustments are:

    1. Redefine "yield" to be an expression, rather than a statement.
       The current yield statement would become a yield expression
       whose value is thrown away.  A yield expression's value is
       None whenever the generator is resumed by a normal next() call.

    2. Add a new send() method for generator-iterators, which resumes
       the generator and "sends" a value that becomes the result of the
       current yield-expression.  The send() method returns the next
       value yielded by the generator, or raises StopIteration if the
       generator exits without yielding another value.

    3. Add a new throw() method for generator-iterators, which raises
       an exception at the point where the generator was paused, and
       which returns the next value yielded by the generator, raising
       StopIteration if the generator exits without yielding another
       value.  (If the generator does not catch the passed-in exception,
       or raises a different exception, then that exception propagates
       to the caller.)

    4. Add a close() method for generator-iterators, which raises
       GeneratorExit at the point where the generator was paused.  If
       the generator then raises StopIteration (by exiting normally, or
       due to already being closed) or GeneratorExit (by not catching
       the exception), close() returns to its caller.  If the generator
       yields a value, a RuntimeError is raised.  If the generator
       raises any other exception, it is propagated to the caller.
       close() does nothing if the generator has already exited due to
       an exception or normal exit.

    5. Add support to ensure that close() is called when a generator
       iterator is garbage-collected.

    6. Allow "yield" to be used in try/finally blocks, since garbage
       collection or an explicit close() call would now allow the
       finally clause to execute.

    A prototype patch implementing all of these changes against the
    current Python CVS HEAD is available as SourceForge patch #1223381
    (http://python.org/sf/1223381).


Specification: Sending Values into Generators

  New generator method: send(value)

    A new method for generator-iterators is proposed, called send().  It
    takes exactly one argument, which is the value that should be "sent
    in" to the generator.  Calling send(None) is exactly equivalent to
    calling a generator's next() method.  Calling send() with any other
    value is the same, except that the value produced by the generator's
    current yield expression will be different.

    Because generator-iterators begin execution at the top of the
    generator's function body, there is no yield expression to receive
    a value when the generator has just been created.  Therefore,
    calling send() with a non-None argument is prohibited when the
    generator iterator has just started, and a TypeError is raised if
    this occurs (presumably due to a logic error of some kind).  Thus,
    before you can communicate with a coroutine you must first call
    next() or send(None) to advance its execution to the first yield
    expression.

    As with the next() method, the send() method returns the next value
    yielded by the generator-iterator, or raises StopIteration if the
    generator exits normally, or has already exited.  If the generator
    raises an uncaught exception, it is propagated to send()'s caller.
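
    These rules can be sketched in present-day Python, where the
    next() method became the next() built-in (the accumulator
    coroutine below is illustrative, not part of the specification):

```python
# Illustrative coroutine: each send() delivers a value to the
# suspended yield-expression and receives the running total back.
def accumulator():
    total = 0
    while True:
        value = (yield total)   # send(v) makes this expression return v
        total += value

gen = accumulator()
first = next(gen)    # same as gen.send(None); runs to the first yield
a = gen.send(10)     # the paused yield evaluates to 10 -> yields 10
b = gen.send(5)      # -> yields 15

# send() with a non-None argument before the first yield is an error:
fresh = accumulator()
try:
    fresh.send(1)
except TypeError:
    too_early = True
```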

  New syntax: Yield Expressions

    The yield-statement will be allowed to be used on the right-hand
    side of an assignment; in that case it is referred to as
    yield-expression.  The value of this yield-expression is None
    unless send() was called with a non-None argument; see below.

    A yield-expression must always be parenthesized except when it
    occurs as the top-level expression on the right-hand side of an
    assignment.  So

        x = yield 42
        x = yield
        x = 12 + (yield 42)
        x = 12 + (yield)
        foo(yield 42)
        foo(yield)

    are all legal, but

        x = 12 + yield 42
        x = 12 + yield
        foo(yield 42, 12)
        foo(yield, 12)

    are all illegal.  (Some of the edge cases are motivated by the
    current legality of "yield 12, 42".)

    Note that a yield-statement or yield-expression without an
    expression is now legal.  This makes sense: when the information
    flow in the next() call is reversed, it should be possible to
    yield without passing an explicit value ("yield" is of course
    equivalent to "yield None").

    When send(value) is called, the yield-expression that it resumes
    will return the passed-in value.  When next() is called, the resumed
    yield-expression will return None.  If the yield-expression is a
    yield-statement, this returned value is ignored, similar to ignoring
    the value returned by a function call used as a statement.

    In effect, a yield-expression is like an inverted function call; the
    argument to yield is in fact returned (yielded) from the currently
    executing function, and the "return value" of yield is the argument
    passed in via send().

    Note: the syntactic extensions to yield make its use very similar
    to that in Ruby.  This is intentional.  Do note that in Python the
    block passes a value to the generator using "send(EXPR)" rather
    than "return EXPR", and the underlying mechanism whereby control
    is passed between the generator and the block is completely
    different.  Blocks in Python are not compiled into thunks; rather,
    yield suspends execution of the generator's frame.  Some edge
    cases work differently; in Python, you cannot save the block for
    later use, and you cannot test whether there is a block or not.
    (XXX - this stuff about blocks seems out of place now, perhaps
    Guido can edit to clarify.)

Specification: Exceptions and Cleanup

    Let a generator object be the iterator produced by calling a
    generator function.  Below, 'g' always refers to a generator
    object.

  New syntax: yield allowed inside try-finally

    The syntax for generator functions is extended to allow a
    yield-statement inside a try-finally statement.

  New generator method: throw(type, value=None, traceback=None)

    g.throw(type, value, traceback) causes the specified exception to
    be thrown at the point where the generator g is currently
    suspended (i.e. at a yield-statement, or at the start of its
    function body if next() has not been called yet).  If the
    generator catches the exception and yields another value, that is
    the return value of g.throw().  If it doesn't catch the exception,
    the throw() appears to raise the same exception passed it (it
    "falls through").  If the generator raises another exception (this
    includes the StopIteration produced when it returns) that
    exception is raised by the throw() call.  In summary, throw()
    behaves like next() or send(), except it raises an exception at the
    suspension point.  If the generator is already in the closed
    state, throw() just raises the exception it was passed without
    executing any of the generator's code.

    The effect of raising the exception is exactly as if the
    statement:

        raise type, value, traceback

    was executed at the suspension point.  The type argument must
    not be None, and the type and value must be compatible.  If the
    value is not an instance of the type, a new exception instance
    is created using the value, following the same rules that the raise
    statement uses to create an exception instance.  The traceback, if
    supplied, must be a valid Python traceback object, or a TypeError
    occurs.

    Note: The name of the throw() method was selected for several
    reasons.  Raise is a keyword and so cannot be used as a method
    name.  Unlike raise (which immediately raises an exception from the
    current execution point), throw() first resumes the generator, and
    only then raises the exception.  The word throw is suggestive of
    putting the exception in another location, and is already associated
    with exceptions in other languages.

    Alternative method names were considered: resolve(), signal(),
    genraise(), raiseinto(), and flush().  None of these seem to fit
    as well as throw().
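
    A short illustrative sketch of these semantics, using the
    single-argument instance form of throw() (which current Python
    versions also accept):

```python
# Illustrative generator that catches one exception type at its yield.
def resilient():
    while True:
        try:
            yield "ok"
        except ValueError as err:
            yield "caught: %s" % err   # throw() returns this value

g = resilient()
assert next(g) == "ok"
result = g.throw(ValueError("bad"))    # raised at the suspended yield
# result == "caught: bad"

# An exception the generator does not catch "falls through" to the
# throw() caller:
try:
    g.throw(KeyError("nope"))
except KeyError:
    fell_through = True
```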

  New standard exception: GeneratorExit

    A new standard exception is defined, GeneratorExit, inheriting
    from Exception.  A generator should handle this by re-raising it
    (or just not catching it) or by raising StopIteration.

  New generator method: close()

    g.close() is defined by the following pseudo-code:

        def close(self):
            try:
                self.throw(GeneratorExit)
            except (GeneratorExit, StopIteration):
                pass
            else:
                raise RuntimeError("generator ignored GeneratorExit")
            # Other exceptions are not caught
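
    Both branches of this pseudo-code can be exercised with a short
    sketch (names illustrative): a generator that finalizes cleanly,
    and one that swallows GeneratorExit and triggers the RuntimeError:

```python
def well_behaved(log):
    try:
        while True:
            yield
    except GeneratorExit:
        log.append("finalized")   # cleanup; falling off the end is fine

log = []
g = well_behaved(log)
next(g)
g.close()                         # log is now ["finalized"]

def stubborn():
    while True:
        try:
            yield
        except GeneratorExit:
            pass                  # ignores GeneratorExit, yields again

s = stubborn()
next(s)
try:
    s.close()                     # generator ignored GeneratorExit
except RuntimeError:
    misbehaved = True
```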

  New generator method: __del__()

    g.__del__() is a wrapper for g.close().  This will be called when
    the generator object is garbage-collected (in CPython, this is
    when its reference count goes to zero).  If close() raises an
    exception, a traceback for the exception is printed to sys.stderr
    and further ignored; it is not propagated back to the place that
    triggered the garbage collection.  This is consistent with the
    handling of exceptions in __del__() methods on class instances.

    If the generator object participates in a cycle, g.__del__() may
    not be called.  This is the behavior of CPython's current garbage
    collector.  The reason for the restriction is that the GC code
    needs to "break" a cycle at an arbitrary point in order to collect
    it, and from then on no Python code should be allowed to see the
    objects that formed the cycle, as they may be in an invalid state.
    Objects "hanging off" a cycle are not subject to this restriction.

    Note that generator objects are unlikely to participate in a
    cycle in practice.  However, storing a generator object in a
    global variable creates a cycle via the generator frame's
    f_globals pointer.  Another way to create a cycle would be to
    store a reference to the generator object in a data structure that
    is passed to the generator as an argument (e.g., if an object has
    a method that's a generator, and keeps a reference to a running
    iterator created by that method).  Neither of these cases
    is very likely given the typical patterns of generator use.

    Also, in the CPython implementation of this PEP, the frame object
    used by the generator should be released whenever its execution is
    terminated due to an error or normal exit.  This will ensure that
    generators that cannot be resumed do not remain part of an
    uncollectable reference cycle.  This allows other code to
    potentially use close() in a try/finally or "with" block (per PEP
    343) to ensure that a given generator is properly finalized.

Optional Extensions

  The Extended 'continue' Statement

     An earlier draft of this PEP proposed a new "continue EXPR"
     syntax for use in for-loops (carried over from PEP 340), that
     would pass the value of EXPR into the iterator being looped over.
     This feature has been withdrawn for the time being, because the
     scope of this PEP has been narrowed to focus only on passing values
     into generator-iterators, and not other kinds of iterators.  It
     was also felt by some on the Python-Dev list that adding new syntax
     for this particular feature would be premature at best.

Open Issues

    Discussion on python-dev has revealed some open issues.  I list
    them here, with my preferred resolution and its motivation.  The
    PEP as currently written reflects this preferred resolution.

    1. What exception should be raised by close() when the generator
       yields another value as a response to the GeneratorExit
       exception?

       I originally chose TypeError because it represents gross
       misbehavior of the generator function, which should be fixed by
       changing the code.  But the with_template decorator class in
       PEP 343 uses RuntimeError for similar offenses.  Arguably they
       should all use the same exception.  I'd rather not introduce a
       new exception class just for this purpose, since it's not an
       exception that I want people to catch: I want it to turn into a
       traceback which is seen by the programmer who then fixes the
       code.  So now I believe they should both raise RuntimeError.
       There are some precedents for that: it's raised by the core
       Python code in situations where endless recursion is detected,
       and for uninitialized objects (and for a variety of
       miscellaneous conditions).

    2. Oren Tirosh has proposed renaming the send() method to feed(),
       for compatibility with the "consumer interface" (see
       http://effbot.org/zone/consumer.htm for the specification).

       However, looking more closely at the consumer interface, it seems
       that the desired semantics for feed() are different than for
       send(), because send() can't be meaningfully called on a just-
       started generator.  Also, the consumer interface as currently
       defined doesn't include handling for StopIteration.

       Therefore, it seems like it would probably be more useful to
       create a simple decorator that wraps a generator function to make
       it conform to the consumer interface.  For example, it could
       "warm up" the generator with an initial next() call, trap
       StopIteration, and perhaps even provide reset() by re-invoking
       the generator function.

Examples

    1. A simple "consumer" decorator that makes a generator function
       automatically advance to its first yield point when initially
       called:

        def consumer(func):
            def wrapper(*args,**kw):
                gen = func(*args, **kw)
                gen.next()
                return gen
            wrapper.__name__ = func.__name__
            wrapper.__dict__ = func.__dict__
            wrapper.__doc__  = func.__doc__
            return wrapper
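
        (The decorator above targets Python 2.4: gen.next() and manual
        attribute copying.  A hedged sketch of the same decorator in
        present-day Python, using the next() built-in and
        functools.wraps:)

```python
import functools

def consumer(func):
    # Same idea as the decorator above, in current Python: advance a
    # new generator to its first yield so send() works immediately.
    @functools.wraps(func)   # replaces the manual __name__/__dict__/__doc__ copies
    def wrapper(*args, **kw):
        gen = func(*args, **kw)
        next(gen)            # gen.next() in the original
        return gen
    return wrapper
```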

    2. An example of using the "consumer" decorator to create a
       "reverse generator" that receives images and creates thumbnail
       pages, sending them on to another consumer.  Functions like
       this can be chained together to form efficient processing
       pipelines of "consumers" that each can have complex internal
       state:

        @consumer
        def thumbnail_pager(pagesize, thumbsize, destination):
            while True:
                page = new_image(pagesize)
                rows, columns = pagesize / thumbsize
                pending = False
                try:
                    for row in xrange(rows):
                        for column in xrange(columns):
                            thumb = create_thumbnail((yield), thumbsize)
                            page.write(
                                thumb, column*thumbsize.x, row*thumbsize.y
                            )
                            pending = True
                except GeneratorExit:
                    # close() was called, so flush any pending output
                    if pending:
                        destination.send(page)

                    # then close the downstream consumer, and exit
                    destination.close()
                    return
                else:
                    # we finished a page full of thumbnails, so send it
                    # downstream and keep on looping
                    destination.send(page)

        @consumer
        def jpeg_writer(dirname):
            fileno = 1
            while True:
                filename = os.path.join(dirname,"page%04d.jpg" % fileno)
                write_jpeg((yield), filename)
                fileno += 1


        # Put them together to make a function that makes thumbnail
        # pages from a list of images and other parameters.      
        #
        def write_thumbnails(pagesize, thumbsize, images, output_dir):
            pipeline = thumbnail_pager(
                pagesize, thumbsize, jpeg_writer(output_dir)
            )

            for image in images:
                pipeline.send(image)

            pipeline.close()

    3. A simple co-routine scheduler or "trampoline" that lets
       coroutines "call" other coroutines by yielding the coroutine
       they wish to invoke.  Any non-generator value yielded by
       a coroutine is returned to the coroutine that "called" the
       one yielding the value.  Similarly, if a coroutine raises an
       exception, the exception is propagated to its "caller".  In
       effect, this example emulates simple tasklets as are used
       in Stackless Python, as long as you use a yield expression to
       invoke routines that would otherwise "block".  This is only
       a very simple example, and far more sophisticated schedulers
       are possible.  (For example, the existing GTasklet framework
       for Python (http://www.gnome.org/~gjc/gtasklet/gtasklets.html)
       and the peak.events framework (http://peak.telecommunity.com/)
       already implement similar scheduling capabilities, but must
       currently use awkward workarounds for the inability to pass
       values or exceptions into generators.)

        import collections, sys, types

        class Trampoline:
            """Manage communications between coroutines"""

            running = False

            def __init__(self):
                self.queue = collections.deque()

            def add(self, coroutine):
                """Request that a coroutine be executed"""
                self.schedule(coroutine)

            def run(self):
                result = None
                self.running = True
                try:
                    while self.running and self.queue:
                        func = self.queue.popleft()
                        result = func()
                    return result
                finally:
                    self.running = False

            def stop(self):
                self.running = False

            def schedule(self, coroutine, stack=(), val=None, *exc):
                def resume():
                    value = val
                    try:
                        if exc:
                            value = coroutine.throw(value,*exc)
                        else:
                            value = coroutine.send(value)
                    except:
                        if stack:
                            # send the error back to the "caller"
                            self.schedule(
                                stack[0], stack[1], *sys.exc_info()
                            )
                        else:
                            # Nothing left in this pseudothread to
                            # handle it, let it propagate to the
                            # run loop
                            raise

                    if isinstance(value, types.GeneratorType):
                        # Yielded to a specific coroutine, push the
                        # current one on the stack, and call the new
                        # one with no args
                        self.schedule(value, (coroutine,stack))

                    elif stack:
                        # Yielded a result, pop the stack and send the
                        # value to the caller
                        self.schedule(stack[0], stack[1], value)

                    # else: this pseudothread has ended

                self.queue.append(resume)

    4. A simple "echo" server, and code to run it using a trampoline
       (presumes the existence of "nonblocking_read",
       "nonblocking_write", and other I/O coroutines, that e.g. raise
       ConnectionLost if the connection is closed):

           # coroutine function that echoes data back on a connected
           # socket
           #
           def echo_handler(sock):
               while True:
                   try:
                       data = yield nonblocking_read(sock)
                       yield nonblocking_write(sock, data)
                   except ConnectionLost:
                       pass  # exit normally if connection lost

           # coroutine function that listens for connections on a
           # socket, and then launches a service "handler" coroutine
           # to service the connection
           #
           def listen_on(trampoline, sock, handler):
               while True:
                   # get the next incoming connection
                   connected_socket = yield nonblocking_accept(sock)

                   # start another coroutine to handle the connection
                   trampoline.add( handler(connected_socket) )

           # Create a scheduler to manage all our coroutines
           t = Trampoline()

           # Create a coroutine instance to run the echo_handler on
           # incoming connections
           #
           server = listen_on(
               t, listening_socket("localhost","echo"), echo_handler
           )

           # Add the coroutine to the scheduler
           t.add(server)

           # loop forever, accepting connections and servicing them
           # "in parallel"
           #
           t.run()


Reference Implementation

    A prototype patch implementing all of the features described in this
    PEP is available as SourceForge patch #1223381
    (http://python.org/sf/1223381).

    This patch was committed to CVS 01-02 August 2005.


Acknowledgements

    Raymond Hettinger (PEP 288) and Samuele Pedroni (PEP 325) first
    formally proposed the ideas of communicating values or exceptions
    into generators, and the ability to "close" generators.  Timothy
    Delaney suggested the title of this PEP, and Steven Bethard helped
    edit a previous version.  See also the Acknowledgements section
    of PEP 340.

References

    TBD.

Copyright

    This document has been placed in the public domain.

pep-0343 The "with" Statement

PEP: 343
Title: The "with" Statement
Version: $Revision$
Last-Modified: $Date$
Author: Guido van Rossum, Nick Coghlan
Status: Final
Type: Standards Track
Content-Type: text/plain
Created: 13-May-2005
Python-Version: 2.5
Post-History: 2-Jun-2005, 16-Oct-2005, 29-Oct-2005, 23-Apr-2006, 1-May-2006, 30-Jul-2006

Abstract

    This PEP adds a new statement "with" to the Python language to make
    it possible to factor out standard uses of try/finally statements.

    In this PEP, context managers provide __enter__() and __exit__()
    methods that are invoked on entry to and exit from the body of the
    with statement.

Author's Note

    This PEP was originally written in first person by Guido, and
    subsequently updated by Nick Coghlan to reflect later discussion
    on python-dev. Any first person references are from Guido's
    original.

    Python's alpha release cycle revealed terminology problems in this
    PEP and in the associated documentation and implementation [14].
    The PEP stabilised around the time of the first Python 2.5 beta
    release.

    Yes, the verb tense is messed up in a few places. We've been
    working on this PEP for over a year now, so things that were
    originally in the future are now in the past :)

Introduction

    After a lot of discussion about PEP 340 and alternatives, I
    decided to withdraw PEP 340 and proposed a slight variant on PEP
    310.  After more discussion, I have added back a mechanism for
    raising an exception in a suspended generator using a throw()
    method, and a close() method which throws a new GeneratorExit
    exception; these additions were first proposed on python-dev in
    [2] and universally approved of.  I'm also changing the keyword to
    'with'.

    After acceptance of this PEP, the following PEPs were rejected due
    to overlap:

    - PEP 310, Reliable Acquisition/Release Pairs.  This is the
      original with-statement proposal.

    - PEP 319, Python Synchronize/Asynchronize Block.  Its use cases
      can be covered by the current PEP by providing suitable
      with-statement controllers: for 'synchronize' we can use the
      "locking" template from example 1; for 'asynchronize' we can use
      a similar "unlocking" template.  I don't think having an
      "anonymous" lock associated with a code block is all that
      important; in fact it may be better to always be explicit about
      the mutex being used.

    PEP 340 and PEP 346 also overlapped with this PEP, but were
    voluntarily withdrawn when this PEP was submitted.

    Some discussion of earlier incarnations of this PEP took place on
    the Python Wiki [3].

Motivation and Summary

    PEP 340, Anonymous Block Statements, combined many powerful ideas:
    using generators as block templates, adding exception handling and
    finalization to generators, and more.  Besides praise, it received
    a lot of opposition from people who didn't like the fact that it
    was, under the covers, a (potential) looping construct.  This
    meant that break and continue in a block-statement would break or
    continue the block-statement, even if it was used as a non-looping
    resource management tool.

    But the final blow came when I read Raymond Chen's rant about
    flow-control macros [1].  Raymond argues convincingly that hiding
    flow control in macros makes your code inscrutable, and I find
    that his argument applies to Python as well as to C.  I realized
    that PEP 340 templates can hide all sorts of control flow; for
    example, its example 4 (auto_retry()) catches exceptions and
    repeats the block up to three times.

    However, the with-statement of PEP 310 does *not* hide control
    flow, in my view: while a finally-suite temporarily suspends the
    control flow, in the end, the control flow resumes as if the
    finally-suite wasn't there at all.

    Remember, PEP 310 proposes roughly this syntax (the "VAR =" part is
    optional):

        with VAR = EXPR:
            BLOCK

    which roughly translates into this:

        VAR = EXPR
        VAR.__enter__()
        try:
            BLOCK
        finally:
            VAR.__exit__()

    Now consider this example:

        with f = open("/etc/passwd"):
            BLOCK1
        BLOCK2

    Here, just as if the first line was "if True" instead, we know
    that if BLOCK1 completes without an exception, BLOCK2 will be
    reached; and if BLOCK1 raises an exception or executes a non-local
    goto (a break, continue or return), BLOCK2 is *not* reached.  The
    magic added by the with-statement at the end doesn't affect this.

    (You may ask, what if a bug in the __exit__() method causes an
    exception?  Then all is lost -- but this is no worse than with
    other exceptions; the nature of exceptions is that they can happen
    *anywhere*, and you just have to live with that.  Even if you
    write bug-free code, a KeyboardInterrupt exception can still cause
    it to exit between any two virtual machine opcodes.)

    This argument almost led me to endorse PEP 310, but I had one idea
    left from the PEP 340 euphoria that I wasn't ready to drop: using
    generators as "templates" for abstractions like acquiring and
    releasing a lock or opening and closing a file is a powerful idea,
    as can be seen by looking at the examples in that PEP.

    Inspired by a counter-proposal to PEP 340 by Phillip Eby I tried
    to create a decorator that would turn a suitable generator into an
    object with the necessary __enter__() and __exit__() methods.
    Here I ran into a snag: while it wasn't too hard for the locking
    example, it was impossible to do this for the opening example.
    The idea was to define the template like this:

        @contextmanager
        def opening(filename):
            f = open(filename)
            try:
                yield f
            finally:
                f.close()

    and used it like this:

        with f = opening(filename):
            ...read data from f...

    The problem is that in PEP 310, the result of calling EXPR is
    assigned directly to VAR, and then VAR's __exit__() method is
    called upon exit from BLOCK1.  But here, VAR clearly needs to
    receive the opened file, and that would mean that __exit__() would
    have to be a method on the file.

    While this can be solved using a proxy class, this is awkward and
    made me realize that a slightly different translation would make
    writing the desired decorator a piece of cake: let VAR receive the
    result from calling the __enter__() method, and save the value of
    EXPR to call its __exit__() method later.  Then the decorator can
    return an instance of a wrapper class whose __enter__() method
    calls the generator's next() method and returns whatever next()
    returns; the wrapper instance's __exit__() method calls next()
    again but expects it to raise StopIteration.  (Details below in
    the section Optional Generator Decorator.)

    So now the final hurdle was that the PEP 310 syntax:

        with VAR = EXPR:
            BLOCK1

    would be deceptive, since VAR does *not* receive the value of
    EXPR.  Borrowing from PEP 340, it was an easy step to:

        with EXPR as VAR:
            BLOCK1

    Additional discussion showed that people really liked being able
    to "see" the exception in the generator, even if it was only to
    log it; the generator is not allowed to yield another value, since
    the with-statement should not be usable as a loop (raising a
    different exception is marginally acceptable).  To enable this, a
    new throw() method for generators is proposed, which takes one to
    three arguments representing an exception in the usual fashion
    (type, value, traceback) and raises it at the point where the
    generator is suspended.

    Once we have this, it is a small step to proposing another
    generator method, close(), which calls throw() with a special
    exception, GeneratorExit.  This tells the generator to exit, and
    from there it's another small step to proposing that close() be
    called automatically when the generator is garbage-collected.
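
    Both methods can be exercised directly on any generator.  Here is
    a minimal runnable sketch (the function names are illustrative,
    and the builtin next() used here dates from Python 2.6):

```python
def echoer():
    try:
        yield "ready"
    except ValueError:
        # The generator "sees" the thrown exception (e.g. to log it),
        # but must not yield another value; re-raising is fine.
        raise

g = echoer()
assert next(g) == "ready"        # run up to the yield
try:
    g.throw(ValueError("boom"))  # raised at the suspension point
except ValueError:
    thrown_propagated = True

def quiet():
    try:
        yield 1
    except GeneratorExit:
        raise                    # close() throws GeneratorExit inside

g2 = quiet()
next(g2)
g2.close()                       # generator exits; close() returns None
closed_ok = True
```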

    Then, finally, we can allow a yield-statement inside a try-finally
    statement, since we can now guarantee that the finally-clause will
    (eventually) be executed.  The usual cautions about finalization
    apply -- the process may be terminated abruptly without finalizing
    any objects, and objects may be kept alive forever by cycles or
    memory leaks in the application (as opposed to cycles or leaks in
    the Python implementation, which are taken care of by GC).

    Note that we're not guaranteeing that the finally-clause is
    executed immediately after the generator object becomes unused,
    even though this is how it will work in CPython.  This is similar
    to auto-closing files: while a reference-counting implementation
    like CPython deallocates an object as soon as the last reference
    to it goes away, implementations that use other GC algorithms do
    not make the same guarantee.  This applies to Jython, IronPython,
    and probably to Python running on Parrot.

    (The details of the changes made to generators can now be found in
     PEP 342 rather than in the current PEP)

Use Cases

    See the Examples section near the end.

Specification: The 'with' Statement

    A new statement is proposed with the syntax:

        with EXPR as VAR:
            BLOCK

    Here, 'with' and 'as' are new keywords; EXPR is an arbitrary
    expression (but not an expression-list) and VAR is a single
    assignment target.  It can *not* be a comma-separated sequence of
    variables, but it *can* be a *parenthesized* comma-separated
    sequence of variables.  (This restriction makes possible a future
    extension of the syntax to multiple comma-separated resources,
    each with its own optional as-clause.)

    The "as VAR" part is optional.

    The translation of the above statement is:

        mgr = (EXPR)
        exit = type(mgr).__exit__  # Not calling it yet
        value = type(mgr).__enter__(mgr)
        exc = True
        try:
            try:
                VAR = value  # Only if "as VAR" is present
                BLOCK
            except:
                # The exceptional case is handled here
                exc = False
                if not exit(mgr, *sys.exc_info()):
                    raise
                # The exception is swallowed if exit() returns true
        finally:
            # The normal and non-local-goto cases are handled here
            if exc:
                exit(mgr, None, None, None)

    Here, the lowercase variables (mgr, exit, value, exc) are internal
    variables and not accessible to the user; they will most likely be
    implemented as special registers or stack positions.

    The details of the above translation are intended to prescribe the
    exact semantics.  If either of the relevant methods is not found
    as expected, the interpreter will raise AttributeError, in the
    order that they are tried (__exit__, __enter__).
    Similarly, if any of the calls raises an exception, the effect is
    exactly as it would be in the above code.  Finally, if BLOCK
    contains a break, continue or return statement, the __exit__()
    method is called with three None arguments just as if BLOCK
    completed normally.  (I.e. these "pseudo-exceptions" are not seen
    as exceptions by __exit__().)

    If the "as VAR" part of the syntax is omitted, the "VAR =" part of
    the translation is omitted (but mgr.__enter__() is still called).
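
    The translation can be checked by writing it out as a helper
    function.  A minimal runnable sketch (the Tracker class is
    hypothetical, and the expansion is packaged as a function rather
    than generated inline by the compiler):

```python
import sys

class Tracker(object):
    """Hypothetical manager that records the calls it receives."""
    def __init__(self):
        self.calls = []
    def __enter__(self):
        self.calls.append("enter")
        return "resource"
    def __exit__(self, typ, val, tb):
        self.calls.append(("exit", typ))
        return False             # never swallow

def manual_with(mgr, block):
    """Hand-expanded form of 'with mgr as VAR: block(VAR)'."""
    exit = type(mgr).__exit__    # looked up (not called) first
    value = type(mgr).__enter__(mgr)
    exc = True
    try:
        try:
            block(value)         # stands in for BLOCK
        except:
            exc = False
            if not exit(mgr, *sys.exc_info()):
                raise
    finally:
        if exc:
            exit(mgr, None, None, None)

t = Tracker()
manual_with(t, lambda var: None)           # normal completion
assert t.calls == ["enter", ("exit", None)]

t2 = Tracker()
try:
    manual_with(t2, lambda var: 1 // 0)    # exceptional completion
except ZeroDivisionError:
    pass
assert t2.calls == ["enter", ("exit", ZeroDivisionError)]
```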

    The calling convention for mgr.__exit__() is as follows.  If the
    finally-suite was reached through normal completion of BLOCK or
    through a non-local goto (a break, continue or return statement in
    BLOCK), mgr.__exit__() is called with three None arguments.  If
    the finally-suite was reached through an exception raised in
    BLOCK, mgr.__exit__() is called with three arguments representing
    the exception type, value, and traceback.

    IMPORTANT: if mgr.__exit__() returns a "true" value, the exception
    is "swallowed".  That is, if it returns "true", execution
    continues at the next statement after the with-statement, even if
    an exception happened inside the with-statement.  However, if the
    with-statement was left via a non-local goto (break, continue or
    return), this non-local return is resumed when mgr.__exit__()
    returns regardless of the return value.  The motivation for this
    detail is to make it possible for mgr.__exit__() to swallow
    exceptions, without making it too easy (since the default return
    value, None, is false and this causes the exception to be
    re-raised).  The main use case for swallowing exceptions is to
    make it possible to write the @contextmanager decorator so
    that a try/except block in a decorated generator behaves exactly
    as if the body of the generator were expanded in-line at the place
    of the with-statement.
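
    The swallowing behaviour can be demonstrated with a small sketch
    (the Suppress class is hypothetical):

```python
class Suppress(object):
    """Hypothetical manager: a true __exit__() return swallows errors."""
    def __init__(self, exc_type):
        self.exc_type = exc_type
    def __enter__(self):
        return self
    def __exit__(self, typ, val, tb):
        # True => swallow; the default return of None (false) would
        # cause the exception to be re-raised instead.
        return typ is not None and issubclass(typ, self.exc_type)

reached = False
with Suppress(KeyError):
    {}["missing"]    # raises KeyError inside the block
reached = True       # reached only because __exit__() returned True
```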

    The motivation for passing the exception details to __exit__(), as
    opposed to the argument-less __exit__() from PEP 310, was given by
    the transactional() use case, example 3 below.  The template in
    that example must commit or roll back the transaction depending on
    whether an exception occurred or not.  Rather than just having a
    boolean flag indicating whether an exception occurred, we pass the
    complete exception information, for the benefit of an
    exception-logging facility for example.  Relying on sys.exc_info()
    to get at the exception information was rejected; sys.exc_info()
    has very complex semantics and it is perfectly possible that it
    returns the exception information for an exception that was caught
    ages ago.  It was also proposed to add an additional boolean to
    distinguish between reaching the end of BLOCK and a non-local
    goto.  This was rejected as too complex and unnecessary; a
    non-local goto should be considered unexceptional for the purposes
    of a database transaction roll-back decision.
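
    That a non-local goto is "unexceptional" can be observed directly.
    In this sketch (the Probe class is hypothetical), leaving the
    block via break still hands __exit__() three None arguments:

```python
class Probe(object):
    """Hypothetical manager that records what __exit__() receives."""
    def __init__(self):
        self.exit_args = None
    def __enter__(self):
        return self
    def __exit__(self, typ, val, tb):
        self.exit_args = (typ, val, tb)
        return False

p = Probe()
for _ in range(1):
    with p:
        break        # leave the block via a non-local goto
# break looks like normal completion to the manager:
assert p.exit_args == (None, None, None)
```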

    To facilitate chaining of contexts in Python code that directly
    manipulates context managers, __exit__() methods should *not*
    re-raise the error that is passed in to them. It is always the
    responsibility of the *caller* of the __exit__() method to do any
    reraising in that case.

    That way, if the caller needs to tell whether the __exit__() 
    invocation *failed* (as opposed to successfully cleaning up before
    propagating the original error), it can do so.

    If __exit__() returns without an error, this can then be
    interpreted as success of the __exit__() method itself (regardless
    of whether or not the original error is to be propagated or
    suppressed).

    However, if __exit__() propagates an exception to its caller, this
    means that __exit__() *itself* has failed.  Thus, __exit__()
    methods should avoid raising errors unless they have actually 
    failed.  (And allowing the original error to proceed isn't a 
    failure.)

Transition Plan

    In Python 2.5, the new syntax will only be recognized if a future
    statement is present:

        from __future__ import with_statement

    This will make both 'with' and 'as' keywords.  Without the future
    statement, using 'with' or 'as' as an identifier will cause a
    Warning to be issued to stderr.

    In Python 2.6, the new syntax will always be recognized; 'with'
    and 'as' are always keywords.

Generator Decorator

    With PEP 342 accepted, it is possible to write a decorator
    that makes it possible to use a generator that yields exactly once
    to control a with-statement.  Here's a sketch of such a decorator:

        class GeneratorContextManager(object):

           def __init__(self, gen):
               self.gen = gen

           def __enter__(self):
               try:
                   return self.gen.next()
               except StopIteration:
                   raise RuntimeError("generator didn't yield")

           def __exit__(self, type, value, traceback):
               if type is None:
                   try:
                       self.gen.next()
                   except StopIteration:
                       return
                   else:
                       raise RuntimeError("generator didn't stop")
               else:
                   try:
                       self.gen.throw(type, value, traceback)
                       raise RuntimeError("generator didn't stop after throw()")
                   except StopIteration:
                       return True
                   except:
                       # only re-raise if it's *not* the exception that was
                       # passed to throw(), because __exit__() must not raise
                       # an exception unless __exit__() itself failed.  But
                       # throw() has to raise the exception to signal
                       # propagation, so this fixes the impedance mismatch 
                       # between the throw() protocol and the __exit__()
                       # protocol.
                       #
                       if sys.exc_info()[1] is not value:
                           raise

        def contextmanager(func):
           def helper(*args, **kwds):
               return GeneratorContextManager(func(*args, **kwds))
           return helper

    This decorator could be used as follows:

        @contextmanager
        def opening(filename):
           f = open(filename) # IOError is untouched by GeneratorContextManager
           try:
               yield f
           finally:
               f.close() # Ditto for errors here (however unlikely)

    A robust implementation of this decorator will be made
    part of the standard library.

Context Managers in the Standard Library

    It would be possible to endow certain objects, like files,
    sockets, and locks, with __enter__() and __exit__() methods so
    that instead of writing:

        with locking(myLock):
            BLOCK

    one could write simply:

        with myLock:
            BLOCK

    I think we should be careful with this; it could lead to mistakes
    like:

        f = open(filename)
        with f:
            BLOCK1
        with f:
            BLOCK2

    which does not do what one might think (f is closed before BLOCK2
    is entered).

    OTOH such mistakes are easily diagnosed; for example, the
    generator context decorator above raises RuntimeError when a
    second with-statement calls f.__enter__() again. A similar error
    can be raised if __enter__ is invoked on a closed file object.
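
    In modern CPython the closed-file case does raise (the exact
    behaviour in Python 2.5 differed); a minimal sketch:

```python
import os
import tempfile

fd, path = tempfile.mkstemp()
os.close(fd)

f = open(path, "w")
with f:
    f.write("BLOCK1")        # f is closed when this block exits

reuse_failed = False
try:
    with f:                  # __enter__() on a closed file
        f.write("BLOCK2")
except ValueError:           # "I/O operation on closed file"
    reuse_failed = True
os.remove(path)
```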

    For Python 2.5, the following types have been identified as
    context managers:
        - file
        - thread.LockType
        - threading.Lock
        - threading.RLock
        - threading.Condition
        - threading.Semaphore
        - threading.BoundedSemaphore

    A context manager will also be added to the decimal module to
    support using a local decimal arithmetic context within the body
    of a with statement, automatically restoring the original context
    when the with statement is exited.

Standard Terminology

    This PEP proposes that the protocol consisting of the __enter__()
    and __exit__() methods be known as the "context management protocol",
    and that objects that implement that protocol be known as "context
    managers". [4]

    The expression immediately following the with keyword in the
    statement is a "context expression" as that expression provides the
    main clue as to the runtime environment the context manager
    establishes for the duration of the statement body.

    The code in the body of the with statement and the variable name
    (or names) after the as keyword don't really have special terms at
    this point in time. The general terms "statement body" and "target
    list" can be used, prefixing with "with" or "with statement" if the
    terms would otherwise be unclear.

    Given the existence of objects such as the decimal module's
    arithmetic context, the term "context" is unfortunately ambiguous.
    If necessary, it can be made more specific by using the terms
    "context manager" for the concrete object created by the context
    expression and "runtime context" or (preferably) "runtime
    environment" for the actual state modifications made by the context
    manager. When simply discussing use of the with statement, the
    ambiguity shouldn't matter too much as the context expression fully
    defines the changes made to the runtime environment.
    The distinction is more important when discussing the mechanics of
    the with statement itself and how to go about actually implementing
    context managers.

Caching Context Managers

    Many context managers (such as files and generator-based contexts)
    will be single-use objects. Once the __exit__() method has been
    called, the context manager will no longer be in a usable state
    (e.g. the file has been closed, or the underlying generator has
    finished execution).

    Requiring a fresh manager object for each with statement is the
    easiest way to avoid problems with multi-threaded code and nested
    with statements trying to use the same context manager. It isn't
    coincidental that all of the standard library context managers
    that support reuse come from the threading module - they're all
    already designed to deal with the problems created by threaded
    and nested usage.

    This means that in order to save a context manager with particular
    initialisation arguments to be used in multiple with statements, it
    will typically be necessary to store it in a zero-argument callable
    that is then called in the context expression of each statement
    rather than caching the context manager directly.

    When this restriction does not apply, the documentation of the
    affected context manager should make that clear.
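
    A convenient zero-argument callable is a partially applied manager
    factory.  This sketch uses contextlib.contextmanager (the robust
    stdlib form of the decorator described above) together with
    functools.partial; the file path is illustrative:

```python
import os
import tempfile
from contextlib import contextmanager
from functools import partial

@contextmanager
def opened(filename, mode="r"):
    f = open(filename, mode)
    try:
        yield f
    finally:
        f.close()

path = os.path.join(tempfile.mkdtemp(), "demo.txt")
with opened(path, "w") as f:
    f.write("hello")

# Cache the zero-argument callable, not the single-use manager:
open_demo = partial(opened, path)

with open_demo() as f:       # a fresh manager for each statement
    first = f.read()
with open_demo() as f:       # ...so repeated use is safe
    second = f.read()
assert first == second == "hello"
```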


Resolved Issues

    The following issues were resolved by BDFL approval (and a lack
    of any major objections on python-dev).

    1. What exception should GeneratorContextManager raise when the
       underlying generator-iterator misbehaves? The following quote is
       the reason behind Guido's choice of RuntimeError for both this
       and for the generator close() method in PEP 342 (from [8]):

       "I'd rather not introduce a new exception class just for this
       purpose, since it's not an exception that I want people to catch:
       I want it to turn into a traceback which is seen by the
       programmer who then fixes the code.  So now I believe they
       should both raise RuntimeError.
       There are some precedents for that: it's raised by the core
       Python code in situations where endless recursion is detected,
       and for uninitialized objects (and for a variety of
       miscellaneous conditions)."
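
       The misbehaviour is easy to provoke with the stdlib decorator
       (a sketch; the generator name is illustrative):

```python
from contextlib import contextmanager

@contextmanager
def misbehaving():
    yield 1
    yield 2           # a second yield: the generator "didn't stop"

caught = False
try:
    with misbehaving():
        pass
except RuntimeError:  # deliberately not a new exception class
    caught = True
```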

    2. It is fine to raise AttributeError instead of TypeError if the
       relevant methods aren't present on a class involved in a with
       statement. The fact that the abstract object C API raises
       TypeError rather than AttributeError is an accident of history,
       rather than a deliberate design decision [11].

    3. Objects with __enter__/__exit__ methods are called "context
       managers" and the decorator to convert a generator function
       into a context manager factory is ``contextlib.contextmanager``.
       There were some other suggestions [16] during the 2.5 release
       cycle but no compelling arguments for switching away from the
       terms that had been used in the PEP implementation were made.


Rejected Options

    For several months, the PEP prohibited suppression of exceptions
    in order to avoid hidden flow control. Implementation
    revealed this to be a right royal pain, so Guido restored the
    ability [13].

    Another aspect of the PEP that caused no end of questions and
    terminology debates was providing a __context__() method that
    was analogous to an iterable's __iter__() method [5, 7, 9].
    The ongoing problems [10, 13] with explaining what it was, why it
    existed, and how it was meant to work eventually led to Guido
    killing the concept outright [15] (and there was much rejoicing!).

    The notion of using the PEP 342 generator API directly to define
    the with statement was also briefly entertained [6], but quickly
    dismissed as making it too difficult to write non-generator
    based context managers.


Examples

    The generator based examples rely on PEP 342. Also, some of the
    examples are unnecessary in practice, as the appropriate objects,
    such as threading.RLock, are able to be used directly in with
    statements.

    The tense used in the names of the example contexts is not
    arbitrary. Past tense ("-ed") is used when the name refers to an
    action which is done in the __enter__ method and undone in the
    __exit__ method. Progressive tense ("-ing") is used when the name
    refers to an action which is to be done in the __exit__ method.

    1. A template for ensuring that a lock, acquired at the start of a
       block, is released when the block is left:

        @contextmanager
        def locked(lock):
            lock.acquire()
            try:
                yield
            finally:
                lock.release()

       Used as follows:

        with locked(myLock):
            # Code here executes with myLock held.  The lock is
            # guaranteed to be released when the block is left (even
            # if via return or by an uncaught exception).

    2. A template for opening a file that ensures the file is closed
       when the block is left:

        @contextmanager
        def opened(filename, mode="r"):
            f = open(filename, mode)
            try:
                yield f
            finally:
                f.close()

       Used as follows:

        with opened("/etc/passwd") as f:
            for line in f:
                print line.rstrip()

    3. A template for committing or rolling back a database
       transaction:

        @contextmanager
        def transaction(db):
            db.begin()
            try:
                yield None
            except:
                db.rollback()
                raise
            else:
                db.commit()
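
       Used with a database object, the template commits on normal
       exit and rolls back when the block raises.  A runnable sketch
       against a hypothetical stub database (StubDB merely records the
       calls made to it):

```python
from contextlib import contextmanager

@contextmanager
def transaction(db):
    db.begin()
    try:
        yield None
    except:
        db.rollback()
        raise
    else:
        db.commit()

class StubDB(object):
    """Hypothetical stand-in that records the calls made to it."""
    def __init__(self):
        self.log = []
    def begin(self):
        self.log.append("begin")
    def commit(self):
        self.log.append("commit")
    def rollback(self):
        self.log.append("rollback")

db = StubDB()
with transaction(db):
    pass                      # normal exit: commit
assert db.log == ["begin", "commit"]

try:
    with transaction(db):
        raise RuntimeError("boom")
except RuntimeError:
    pass                      # exception: rollback, then re-raised
assert db.log[2:] == ["begin", "rollback"]
```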

    4. Example 1 rewritten without a generator:

        class locked:
           def __init__(self, lock):
               self.lock = lock
           def __enter__(self):
               self.lock.acquire()
           def __exit__(self, type, value, tb):
               self.lock.release()

       (This example is easily modified to implement the other
       relatively stateless examples; it shows that it is easy to avoid
       the need for a generator if no special state needs to be
       preserved.)

    5. Redirect stdout temporarily:

        @contextmanager
        def stdout_redirected(new_stdout):
            save_stdout = sys.stdout
            sys.stdout = new_stdout
            try:
                yield None
            finally:
                sys.stdout = save_stdout

       Used as follows:

        with opened(filename, "w") as f:
            with stdout_redirected(f):
                print "Hello world"

       This isn't thread-safe, of course, but neither is doing this
       same dance manually.  In single-threaded programs (for example,
       in scripts) it is a popular way of doing things.

    6. A variant on opened() that also returns an error condition:

        @contextmanager
        def opened_w_error(filename, mode="r"):
            try:
                f = open(filename, mode)
            except IOError, err:
                yield None, err
            else:
                try:
                    yield f, None
                finally:
                    f.close()

       Used as follows:

        with opened_w_error("/etc/passwd", "a") as (f, err):
            if err:
                print "IOError:", err
            else:
                f.write("guido::0:0::/:/bin/sh\n")

    7. Another useful example would be an operation that blocks
       signals.  The use could be like this:

        import signal

        with signal.blocked():
            # code executed without worrying about signals

       An optional argument might be a list of signals to be blocked;
       by default all signals are blocked.  The implementation is left
       as an exercise to the reader.

    8. Another use for this feature is the Decimal context.  Here's a
       simple example, after one posted by Michael Chermside:

        import decimal

        @contextmanager
        def extra_precision(places=2):
            c = decimal.getcontext()
            saved_prec = c.prec
            c.prec += places
            try:
                yield None
            finally:
                c.prec = saved_prec

       Sample usage (adapted from the Python Library Reference):

        def sin(x):
            "Return the sine of x as measured in radians."
            with extra_precision():
                i, lasts, s, fact, num, sign = 1, 0, x, 1, x, 1
                while s != lasts:
                    lasts = s
                    i += 2
                    fact *= i * (i-1)
                    num *= x * x
                    sign *= -1
                    s += num / fact * sign
            # The "+s" rounds back to the original precision,
            # so this must be outside the with-statement:
            return +s

     9. Here's a simple context manager for the decimal module:

         @contextmanager
         def localcontext(ctx=None):
             """Set a new local decimal context for the block"""
             # Default to using the current context
             if ctx is None:
                 ctx = getcontext()
             # We set the thread context to a copy of this context
             # to ensure that changes within the block are kept
             # local to the block.
             newctx = ctx.copy()
             oldctx = decimal.getcontext()
             decimal.setcontext(newctx)
             try:
                 yield newctx
             finally:
                 # Always restore the original context
                 decimal.setcontext(oldctx)

        Sample usage:

         from decimal import localcontext, ExtendedContext

         def sin(x):
             with localcontext() as ctx:
                 ctx.prec += 2
                 # Rest of sin calculation algorithm
                 # uses a precision 2 greater than normal
             return +s # Convert result to normal precision

         def sin(x):
             with localcontext(ExtendedContext):
                 # Rest of sin calculation algorithm
                 # uses the Extended Context from the
                 # General Decimal Arithmetic Specification
             return +s # Convert result to normal context

     10. A generic "object-closing" context manager:

         class closing(object):
             def __init__(self, obj):
                 self.obj = obj
             def __enter__(self):
                 return self.obj
             def __exit__(self, *exc_info):
                 try:
                     close_it = self.obj.close
                 except AttributeError:
                     pass
                 else:
                     close_it()

         This can be used to deterministically close anything with a
         close method, be it file, generator, or something else. It
         can even be used when the object isn't guaranteed to require
         closing (e.g., a function that accepts an arbitrary
         iterable):

          # emulate opening():
          with closing(open("argument.txt")) as contradiction:
              for line in contradiction:
                  print line

          # deterministically finalize an iterator:
          with closing(iter(data_source)) as data:
              for datum in data:
                  process(datum)

         (Python 2.5's contextlib module contains a version
          of this context manager) 
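
          The contextlib version can be exercised directly, here with
          an in-memory file-like object:

```python
from contextlib import closing
import io

buf = io.StringIO("line1\nline2\n")
with closing(buf) as f:
    data = f.read()

assert buf.closed                  # close() ran on exit
assert data == "line1\nline2\n"
```

          Note that the stdlib version calls close() unconditionally,
          whereas the sketch above silently tolerates objects without
          a close method.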

     11. PEP 319 gives a use case for also having a released()
         context to temporarily release a previously acquired lock;
         this can be written very similarly to the locked context
         manager above by swapping the acquire() and release() calls.

          class released:
              def __init__(self, lock):
                  self.lock = lock
              def __enter__(self):
                  self.lock.release()
              def __exit__(self, type, value, tb):
                  self.lock.acquire()

         Sample usage:

         with my_lock:
             # Operations with the lock held
             with released(my_lock):
                 # Operations without the lock
                 # e.g. blocking I/O
             # Lock is held again here
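
          The released() manager above can be exercised with a
          standard threading.Lock; a minimal runnable check:

```python
import threading

class released:
    """Temporarily release an already-held lock (sketch from the text)."""
    def __init__(self, lock):
        self.lock = lock
    def __enter__(self):
        self.lock.release()
    def __exit__(self, exc_type, exc_value, tb):
        self.lock.acquire()

my_lock = threading.Lock()
with my_lock:
    assert my_lock.locked()          # held here
    with released(my_lock):
        assert not my_lock.locked()  # temporarily free
    assert my_lock.locked()          # held again
```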

     12. A "nested" context manager that automatically nests the
         supplied contexts from left-to-right to avoid excessive
         indentation:

         @contextmanager
         def nested(*contexts):
             exits = []
             vars = []
             try:
                 try:
                     for context in contexts:
                         mgr = context.__context__()
                         exit = mgr.__exit__
                         enter = mgr.__enter__
                         vars.append(enter())
                         exits.append(exit)
                     yield vars
                 except:
                     exc = sys.exc_info()
                 else:
                     exc = (None, None, None)
             finally:
                 while exits:
                     exit = exits.pop()
                     try:
                         exit(*exc)
                     except:
                         exc = sys.exc_info()
                     else:
                         exc = (None, None, None)
                 if exc != (None, None, None):
                     # sys.exc_info() may have been
                     # changed by one of the exit methods
                     # so provide explicit exception info
                     raise exc[0], exc[1], exc[2]

         Sample usage:

         with nested(a, b, c) as (x, y, z):
             # Perform operation

         Is equivalent to:

          with a as x:
              with b as y:
                  with c as z:
                      # Perform operation

         (Python 2.5's contextlib module contains a version
          of this context manager) 
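
          (The nested() helper was later deprecated; in modern Python
          the same left-to-right nesting is spelled with
          contextlib.ExitStack, as in this sketch:)

```python
from contextlib import ExitStack, contextmanager

events = []

@contextmanager
def tag(name):
    # A trivial manager that records enter/exit order
    events.append('enter ' + name)
    try:
        yield name
    finally:
        events.append('exit ' + name)

with ExitStack() as stack:
    x, y, z = (stack.enter_context(tag(n)) for n in 'abc')

assert (x, y, z) == ('a', 'b', 'c')
# Exits run innermost-first, just like literally nested with-statements:
assert events == ['enter a', 'enter b', 'enter c',
                  'exit c', 'exit b', 'exit a']
```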

Reference Implementation

    This PEP was first accepted by Guido at his EuroPython
    keynote, 27 June 2005.  It was accepted again later, with the
    __context__ method added.  The PEP was implemented in Subversion
    for Python 2.5a1.  The __context__() method will be removed in
    Python 2.5a3.


Acknowledgements

    Many people contributed to the ideas and concepts in this PEP,
    including all those mentioned in the acknowledgements for PEP 340
    and PEP 346.

    Additional thanks goes to (in no meaningful order): Paul Moore,
    Phillip J. Eby, Greg Ewing, Jason Orendorff, Michael Hudson,
    Raymond Hettinger, Walter Dörwald, Aahz, Georg Brandl, Terry Reedy,
    A.M. Kuchling, Brett Cannon, and all those that participated in the
    discussions on python-dev.


References

    [1] Raymond Chen's article on hidden flow control
    http://blogs.msdn.com/oldnewthing/archive/2005/01/06/347666.aspx

    [2] Guido suggests some generator changes that ended up in PEP 342
    http://mail.python.org/pipermail/python-dev/2005-May/053885.html

    [3] Wiki discussion of PEP 343
    http://wiki.python.org/moin/WithStatement

    [4] Early draft of some documentation for the with statement
    http://mail.python.org/pipermail/python-dev/2005-July/054658.html

    [5] Proposal to add the __with__ method
    http://mail.python.org/pipermail/python-dev/2005-October/056947.html

    [6] Proposal to use the PEP 342 enhanced generator API directly
    http://mail.python.org/pipermail/python-dev/2005-October/056969.html

    [7] Guido lets me (Nick Coghlan) talk him into a bad idea ;)
    http://mail.python.org/pipermail/python-dev/2005-October/057018.html

    [8] Guido raises some exception handling questions
    http://mail.python.org/pipermail/python-dev/2005-June/054064.html

    [9] Guido answers some questions about the __context__ method
    http://mail.python.org/pipermail/python-dev/2005-October/057520.html

    [10] Guido answers more questions about the __context__ method
    http://mail.python.org/pipermail/python-dev/2005-October/057535.html

    [11] Guido says AttributeError is fine for missing special methods
    http://mail.python.org/pipermail/python-dev/2005-October/057625.html

    [12] Original PEP 342 implementation patch
    http://sourceforge.net/tracker/index.php?func=detail&aid=1223381&group_id=5470&atid=305470

    [13] Guido restores the ability to suppress exceptions
    http://mail.python.org/pipermail/python-dev/2006-February/061909.html

    [14] A simple question kickstarts a thorough review of PEP 343
    http://mail.python.org/pipermail/python-dev/2006-April/063859.html

    [15] Guido kills the __context__() method
    http://mail.python.org/pipermail/python-dev/2006-April/064632.html

    [16] Proposal to use 'context guard' instead of 'context manager'
    http://mail.python.org/pipermail/python-dev/2006-May/064676.html

Copyright

    This document has been placed in the public domain.


..

pep-0344 Exception Chaining and Embedded Tracebacks

PEP: 344
Title: Exception Chaining and Embedded Tracebacks
Version: $Revision$
Last-Modified: $Date$
Author: Ka-Ping Yee
Status: Superseded
Type: Standards Track
Content-Type: text/plain
Created: 12-May-2005
Python-Version: 2.5
Post-History: 

Numbering Note

    This PEP has been renumbered to PEP 3134.  The text below is the
    last version submitted under the old number.


Abstract

    This PEP proposes three standard attributes on exception instances:
    the '__context__' attribute for implicitly chained exceptions, the
    '__cause__' attribute for explicitly chained exceptions, and the
    '__traceback__' attribute for the traceback.  A new "raise ... from"
    statement sets the '__cause__' attribute.


Motivation

    During the handling of one exception (exception A), it is possible
    that another exception (exception B) may occur.  In today's Python
    (version 2.4), if this happens, exception B is propagated outward
    and exception A is lost.  In order to debug the problem, it is
    useful to know about both exceptions.  The '__context__' attribute
    retains this information automatically.

    Sometimes it can be useful for an exception handler to intentionally
    re-raise an exception, either to provide extra information or to
    translate an exception to another type.  The '__cause__' attribute
    provides an explicit way to record the direct cause of an exception.

    In today's Python implementation, exceptions are composed of three
    parts: the type, the value, and the traceback.  The 'sys' module
    exposes the current exception in three parallel variables (exc_type,
    exc_value, and exc_traceback), the sys.exc_info() function returns a
    tuple of these three parts, and the 'raise' statement has a
    three-argument form accepting these three parts.  Manipulating
    exceptions often requires passing these three things in parallel,
    which can be tedious and error-prone.  Additionally, the 'except'
    statement can only provide access to the value, not the traceback.
    Adding the '__traceback__' attribute to exception values makes all
    the exception information accessible from a single place.


History

    Raymond Hettinger [1] raised the issue of masked exceptions on
    Python-Dev in January 2003 and proposed a PyErr_FormatAppend()
    function that C modules could use to augment the currently active
    exception with more information.  Brett Cannon [2] brought up
    chained exceptions again in June 2003, prompting a long discussion.

    Greg Ewing [3] identified the case of an exception occurring in a
    'finally' block during unwinding triggered by an original exception,
    as distinct from the case of an exception occurring in an 'except'
    block that is handling the original exception.

    Greg Ewing [4] and Guido van Rossum [5], and probably others, have
    previously mentioned adding a traceback attribute to Exception
    instances.  This is noted in PEP 3000.

    This PEP was motivated by yet another recent Python-Dev reposting
    of the same ideas [6] [7].


Rationale

    The Python-Dev discussions revealed interest in exception chaining
    for two quite different purposes.  To handle the unexpected raising
    of a secondary exception, the exception must be retained implicitly.
    To support intentional translation of an exception, there must be a
    way to chain exceptions explicitly.  This PEP addresses both.

    Several attribute names for chained exceptions have been suggested
    on Python-Dev [2], including 'cause', 'antecedent', 'reason',
    'original', 'chain', 'chainedexc', 'exc_chain', 'excprev',
    'previous', and 'precursor'.  For an explicitly chained exception,
    this PEP suggests '__cause__' because of its specific meaning.  For
    an implicitly chained exception, this PEP proposes the name
    '__context__' because the intended meaning is more specific than
    temporal precedence but less specific than causation: an exception
    occurs in the context of handling another exception.
    
    This PEP suggests names with leading and trailing double-underscores
    for these three attributes because they are set by the Python VM.
    Only in very special cases should they be set by normal assignment.

    This PEP handles exceptions that occur during 'except' blocks and
    'finally' blocks in the same way.  Reading the traceback makes it
    clear where the exceptions occurred, so additional mechanisms for
    distinguishing the two cases would only add unnecessary complexity.

    This PEP proposes that the outermost exception object (the one
    exposed for matching by 'except' clauses) be the most recently
    raised exception for compatibility with current behaviour.

    This PEP proposes that tracebacks display the outermost exception
    last, because this would be consistent with the chronological order
    of tracebacks (from oldest to most recent frame) and because the
    actual thrown exception is easier to find on the last line.

    To keep things simpler, the C API calls for setting an exception
    will not automatically set the exception's '__context__'.  Guido
    van Rossum has expressed concerns with making such changes [8].

    As for other languages, Java and Ruby both discard the original
    exception when another exception occurs in a 'catch'/'rescue' or
    'finally'/'ensure' clause.  Perl 5 lacks built-in structured
    exception handling.  For Perl 6, RFC number 88 [9] proposes an exception
    mechanism that implicitly retains chained exceptions in an array
    named @@.  In that RFC, the most recently raised exception is
    exposed for matching, as in this PEP; also, arbitrary expressions
    (possibly involving @@) can be evaluated for exception matching.

    Exceptions in C# contain a read-only 'InnerException' property that
    may point to another exception.  Its documentation [10] says that
    "When an exception X is thrown as a direct result of a previous
    exception Y, the InnerException property of X should contain a
    reference to Y."  This property is not set by the VM automatically;
    rather, all exception constructors take an optional 'innerException'
    argument to set it explicitly.  The '__cause__' attribute fulfills
    the same purpose as InnerException, but this PEP proposes a new form
    of 'raise' rather than extending the constructors of all exceptions.
    C# also provides a GetBaseException method that jumps directly to
    the end of the InnerException chain; this PEP proposes no analog.

    The reason all three of these attributes are presented together in
    one proposal is that the '__traceback__' attribute provides
    convenient access to the traceback on chained exceptions.


Implicit Exception Chaining

    Here is an example to illustrate the '__context__' attribute.

        def compute(a, b):
            try:
                a/b
            except Exception, exc:
                log(exc)

        def log(exc):
            file = open('logfile.txt')  # oops, forgot the 'w'
            print >>file, exc
            file.close()

    Calling compute(0, 0) causes a ZeroDivisionError.  The compute()
    function catches this exception and calls log(exc), but the log()
    function also raises an exception when it tries to write to a
    file that wasn't opened for writing.

    In today's Python, the caller of compute() gets thrown an IOError.
    The ZeroDivisionError is lost.  With the proposed change, the
    instance of IOError has an additional '__context__' attribute that
    retains the ZeroDivisionError.
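
    This proposal was eventually adopted (as PEP 3134) in Python 3,
    where the implicit chaining can be observed directly; a small
    self-contained check (raising a KeyError in place of the IOError
    from the example above):

```python
try:
    try:
        1 / 0                             # original exception
    except Exception:
        raise KeyError('logging failed')  # secondary exception
except KeyError as exc:
    caught = exc

# The secondary exception retains the original as its context:
assert isinstance(caught.__context__, ZeroDivisionError)
assert caught.__cause__ is None   # implicit, not explicit, chaining
```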

    The following more elaborate example demonstrates the handling of a
    mixture of 'finally' and 'except' clauses:

        def main(filename):
            file = open(filename)       # oops, forgot the 'w'
            try:
                try:
                    compute()
                except Exception, exc:
                    log(file, exc)
            finally:
                file.clos()             # oops, misspelled 'close'
        
        def compute():
            1/0
        
        def log(file, exc):
            try:
                print >>file, exc       # oops, file is not writable
            except:
                display(exc)
        
        def display(exc):
            print ex                    # oops, misspelled 'exc'

    Calling main() with the name of an existing file will trigger four
    exceptions.  The ultimate result will be an AttributeError due to
    the misspelling of 'clos', whose __context__ points to a NameError
    due to the misspelling of 'ex', whose __context__ points to an
    IOError due to the file being read-only, whose __context__ points to
    a ZeroDivisionError, whose __context__ attribute is None.

    The proposed semantics are as follows:

    1.  Each thread has an exception context initially set to None.
    
    2.  Whenever an exception is raised, if the exception instance does
        not already have a '__context__' attribute, the interpreter sets
        it equal to the thread's exception context.

    3.  Immediately after an exception is raised, the thread's exception
        context is set to the exception.

    4.  Whenever the interpreter exits an 'except' block by reaching the
        end or executing a 'return', 'yield', 'continue', or 'break'
        statement, the thread's exception context is set to None.


Explicit Exception Chaining

    The '__cause__' attribute on exception objects is always initialized
    to None.  It is set by a new form of the 'raise' statement:

        raise EXCEPTION from CAUSE

    which is equivalent to:

        exc = EXCEPTION
        exc.__cause__ = CAUSE
        raise exc
    
    In the following example, a database provides implementations for a
    few different kinds of storage, with file storage as one kind.  The
    database designer wants errors to propagate as DatabaseError objects
    so that the client doesn't have to be aware of the storage-specific
    details, but doesn't want to lose the underlying error information.

        class DatabaseError(StandardError):
            pass

        class FileDatabase(Database):
            def __init__(self, filename):
                try:
                    self.file = open(filename)
                except IOError, exc:
                    raise DatabaseError('failed to open') from exc

    If the call to open() raises an exception, the problem will be
    reported as a DatabaseError, with a __cause__ attribute that reveals
    the IOError as the original cause.
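
    In Python 3, which adopted this statement form, the behaviour can
    be checked directly (using OSError and a simulated failure in
    place of the old IOError and a real open()):

```python
class DatabaseError(Exception):
    pass

def open_db():
    try:
        raise OSError('disk error')   # stand-in for a failing open()
    except OSError as exc:
        # Explicitly chain the storage-level error onto the
        # application-level error:
        raise DatabaseError('failed to open') from exc

try:
    open_db()
except DatabaseError as err:
    caught = err

assert isinstance(caught.__cause__, OSError)
```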


Traceback Attribute

    The following example illustrates the '__traceback__' attribute.

        def do_logged(file, work):
            try:
                work()
            except Exception, exc:
                write_exception(file, exc)
                raise exc

        from traceback import format_tb

        def write_exception(file, exc):
            ...
            type = exc.__class__
            message = str(exc)
            lines = format_tb(exc.__traceback__)
            file.write(... type ... message ... lines ...)
            ...

    In today's Python, the do_logged() function would have to extract
    the traceback from sys.exc_traceback or sys.exc_info()[2] and pass
    both the value and the traceback to write_exception().  With the
    proposed change, write_exception() simply gets one argument and
    obtains the exception using the '__traceback__' attribute.

    The proposed semantics are as follows:

    1.  Whenever an exception is caught, if the exception instance does
        not already have a '__traceback__' attribute, the interpreter
        sets it to the newly caught traceback.
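
    Python 3 implements exactly this attribute, so a function like
    write_exception() needs only the exception object; a sketch
    (summarize() is an illustrative name):

```python
from traceback import format_tb

def summarize(exc):
    # All three traditional parts are reachable from the instance alone
    return exc.__class__.__name__, str(exc), format_tb(exc.__traceback__)

try:
    raise ValueError('boom')
except ValueError as exc:
    name, message, tb_lines = summarize(exc)

assert name == 'ValueError'
assert message == 'boom'
assert tb_lines          # non-empty formatted traceback
```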


Enhanced Reporting

    The default exception handler will be modified to report chained
    exceptions.  The chain of exceptions is traversed by following the
    '__cause__' and '__context__' attributes, with '__cause__' taking
    priority.  In keeping with the chronological order of tracebacks,
    the most recently raised exception is displayed last; that is, the
    display begins with the description of the innermost exception and
    backs up the chain to the outermost exception.  The tracebacks are
    formatted as usual, with one of the lines:

        The above exception was the direct cause of the following exception:

    or

        During handling of the above exception, another exception occurred:

    between tracebacks, depending whether they are linked by __cause__
    or __context__ respectively.  Here is a sketch of the procedure:
    
        def print_chain(exc):
            if exc.__cause__:
                print_chain(exc.__cause__)
                print '\nThe above exception was the direct cause...'
            elif exc.__context__:
                print_chain(exc.__context__)
                print '\nDuring handling of the above exception, ...'
            print_exc(exc)

    In the 'traceback' module, the format_exception, print_exception,
    print_exc, and print_last functions will be updated to accept an
    optional 'chain' argument, True by default.  When this argument is
    True, these functions will format or display the entire chain of
    exceptions as just described.  When it is False, these functions
    will format or display only the outermost exception.
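
    The 'chain' argument exists in today's traceback module and
    behaves as described:

```python
import traceback

try:
    try:
        1 / 0
    except ZeroDivisionError:
        raise RuntimeError('wrapper')
except RuntimeError as exc:
    with_chain = traceback.format_exception(
        type(exc), exc, exc.__traceback__, chain=True)
    without = traceback.format_exception(
        type(exc), exc, exc.__traceback__, chain=False)

# The chained form includes the connecting sentence; the unchained
# form shows only the outermost exception:
assert any('During handling' in line for line in with_chain)
assert not any('During handling' in line for line in without)
```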

    The 'cgitb' module should also be updated to display the entire
    chain of exceptions.


C API

    The PyErr_Set* calls for setting exceptions will not set the
    '__context__' attribute on exceptions.  PyErr_NormalizeException
    will always set the 'traceback' attribute to its 'tb' argument and
    the '__context__' and '__cause__' attributes to None.

    A new API function, PyErr_SetContext(context), will help C
    programmers provide chained exception information.  This function
    will first normalize the current exception so it is an instance,
    then set its '__context__' attribute.  A similar API function,
    PyErr_SetCause(cause), will set the '__cause__' attribute.


Compatibility

    Chained exceptions expose the type of the most recent exception, so
    they will still match the same 'except' clauses as they do now.

    The proposed changes should not break any code unless it sets or
    uses attributes named '__context__', '__cause__', or '__traceback__'
    on exception instances.  As of 2005-05-12, the Python standard
    library contains no mention of such attributes.


Open Issue: Extra Information

    Walter Dörwald [11] expressed a desire to attach extra information
    to an exception during its upward propagation without changing its
    type.  This could be a useful feature, but it is not addressed by
    this PEP.  It could conceivably be addressed by a separate PEP
    establishing conventions for other informational attributes on
    exceptions.


Open Issue: Suppressing Context

    As written, this PEP makes it impossible to suppress '__context__',
    since setting exc.__context__ to None in an 'except' or 'finally'
    clause will only result in it being set again when exc is raised.
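
    (This issue was later resolved for Python 3.3 by PEP 409 and
    PEP 415, which added 'raise ... from None' and a
    '__suppress_context__' flag; a small demonstration:)

```python
# 'from None' marks the context as suppressed rather than erasing it:
try:
    try:
        1 / 0
    except ZeroDivisionError:
        raise KeyError('clean error') from None
except KeyError as exc:
    caught = exc

assert caught.__suppress_context__                        # hidden in display
assert isinstance(caught.__context__, ZeroDivisionError)  # still recorded
```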


Open Issue: Limiting Exception Types

    To improve encapsulation, library implementors may want to wrap all
    implementation-level exceptions with an application-level exception.
    One could try to wrap exceptions by writing this:

        try:
            ... implementation may raise an exception ...
        except:
            import sys
            raise ApplicationError from sys.exc_value

    or this:

        try:
            ... implementation may raise an exception ...
        except Exception, exc:
            raise ApplicationError from exc

    but both are somewhat flawed.  It would be nice to be able to name
    the current exception in a catch-all 'except' clause, but that isn't
    addressed here.  Such a feature would allow something like this:

        try:
            ... implementation may raise an exception ...
        except *, exc:
            raise ApplicationError from exc


Open Issue: yield

    The exception context is lost when a 'yield' statement is executed;
    resuming the frame after the 'yield' does not restore the context.
    Addressing this problem is out of the scope of this PEP; it is not a
    new problem, as demonstrated by the following example:

        >>> def gen():
        ...     try:
        ...         1/0
        ...     except:
        ...         yield 3
        ...         raise
        ...
        >>> g = gen()
        >>> g.next()
        3
        >>> g.next()
        TypeError: exceptions must be classes, instances, or strings
        (deprecated), not NoneType


Open Issue: Garbage Collection

    The strongest objection to this proposal has been that it creates
    cycles between exceptions and stack frames [12].  Collection of
    cyclic garbage (and therefore resource release) can be greatly
    delayed.

        >>> try:
        >>>   1/0
        >>> except Exception, err:
        >>>   pass

    will introduce a cycle from err -> traceback -> stack frame -> err,
    keeping all locals in the same scope alive until the next GC happens.

    Today, these locals would go out of scope.  There is lots of code
    which assumes that "local" resources -- particularly open files -- will
    be closed quickly.  If closure has to wait for the next GC, a program
    (which runs fine today) may run out of file handles.

    Making the __traceback__ attribute a weak reference would avoid the
    problems with cyclic garbage.  Unfortunately, it would make saving
    the Exception for later (as unittest does) more awkward, and it would
    not allow as much cleanup of the sys module.

    A possible alternate solution, suggested by Adam Olsen, would be to
    instead turn the reference from the stack frame to the 'err' variable
    into a weak reference when the variable goes out of scope [13].

  

Possible Future Compatible Changes

    These changes are consistent with the appearance of exceptions as
    a single object rather than a triple at the interpreter level.

    - If PEP 340 or PEP 343 is accepted, replace the three (type, value,
      traceback) arguments to __exit__ with a single exception argument.

    - Deprecate sys.exc_type, sys.exc_value, sys.exc_traceback, and
      sys.exc_info() in favour of a single member, sys.exception.

    - Deprecate sys.last_type, sys.last_value, and sys.last_traceback
      in favour of a single member, sys.last_exception.

    - Deprecate the three-argument form of the 'raise' statement in
      favour of the one-argument form.

    - Upgrade cgitb.html() to accept a single value as its first
      argument as an alternative to a (type, value, traceback) tuple.


Possible Future Incompatible Changes

    These changes might be worth considering for Python 3000.

    - Remove sys.exc_type, sys.exc_value, sys.exc_traceback, and
      sys.exc_info().

    - Remove sys.last_type, sys.last_value, and sys.last_traceback.

    - Replace the three-argument sys.excepthook with a one-argument
      API, and change the 'cgitb' module to match.

    - Remove the three-argument form of the 'raise' statement.

    - Upgrade traceback.print_exception to accept an 'exception'
      argument instead of the type, value, and traceback arguments.


Acknowledgements

    Brett Cannon, Greg Ewing, Guido van Rossum, Jeremy Hylton, Phillip
    J. Eby, Raymond Hettinger, Walter Dörwald, and others.


References

    [1] Raymond Hettinger, "Idea for avoiding exception masking"
        http://mail.python.org/pipermail/python-dev/2003-January/032492.html

    [2] Brett Cannon explains chained exceptions
        http://mail.python.org/pipermail/python-dev/2003-June/036063.html

    [3] Greg Ewing points out masking caused by exceptions during finally
        http://mail.python.org/pipermail/python-dev/2003-June/036290.html

    [4] Greg Ewing suggests storing the traceback in the exception object
        http://mail.python.org/pipermail/python-dev/2003-June/036092.html

    [5] Guido van Rossum mentions exceptions having a traceback attribute
        http://mail.python.org/pipermail/python-dev/2005-April/053060.html

    [6] Ka-Ping Yee, "Tidier Exceptions"
        http://mail.python.org/pipermail/python-dev/2005-May/053671.html

    [7] Ka-Ping Yee, "Chained Exceptions"
        http://mail.python.org/pipermail/python-dev/2005-May/053672.html

    [8] Guido van Rossum discusses automatic chaining in PyErr_Set*
        http://mail.python.org/pipermail/python-dev/2003-June/036180.html

    [9] Tony Olensky, "Omnibus Structured Exception/Error Handling Mechanism"
        http://dev.perl.org/perl6/rfc/88.html
     
   [10] MSDN .NET Framework Library, "Exception.InnerException Property"
        http://msdn.microsoft.com/library/en-us/cpref/html/frlrfsystemexceptionclassinnerexceptiontopic.asp

   [11] Walter Dörwald suggests wrapping exceptions to add details
        http://mail.python.org/pipermail/python-dev/2003-June/036148.html

   [12] Guido van Rossum restates the objection to cyclic trash
        http://mail.python.org/pipermail/python-3000/2007-January/005322.html

   [13] Adam Olsen suggests using a weakref from stack frame to exception
        http://mail.python.org/pipermail/python-3000/2007-January/005363.html


Copyright

    This document has been placed in the public domain.


pep-0345 Metadata for Python Software Packages 1.2

PEP:345
Title:Metadata for Python Software Packages 1.2
Version:$Revision$
Last-Modified:$Date$
Author:Richard Jones <richard at python.org>
Discussions-To:Distutils SIG
Status:Accepted
Type:Standards Track
Content-Type:text/x-rst
Created:28-Apr-2005
Python-Version:2.5
Post-History:

Abstract

This PEP describes a mechanism for adding metadata to Python distributions. It includes specifics of the field names and their semantics and usage.

This document specifies version 1.2 of the metadata format. Version 1.0 is specified in PEP 241. Version 1.1 is specified in PEP 314.

Version 1.2 of the metadata format adds a number of optional fields designed to make third-party packaging of Python Software easier. These fields are "Requires-Python", "Requires-External", "Requires-Dist", "Provides-Dist", and "Obsoletes-Dist". This version also changes the "Platform" field. Three new fields were also added: "Maintainer", "Maintainer-email" and "Project-URL".

Last, this new version also adds environment markers.

Fields

This section specifies the names and semantics of each of the supported metadata fields.

Fields marked with "(Multiple use)" may be specified multiple times in a single PKG-INFO file. Other fields may only occur once in a PKG-INFO file. Fields marked with "(optional)" are not required to appear in a valid PKG-INFO file; all other fields must be present.

Metadata-Version

Version of the file format; "1.2" is the only legal value.

Example:

Metadata-Version: 1.2

Name

The name of the distribution.

Example:

Name: BeagleVote

Version

A string containing the distribution's version number. This field must be in the format specified in PEP 386.

Example:

Version: 1.0a2

Platform (multiple use)

A Platform specification describing an operating system supported by the distribution which is not listed in the "Operating System" Trove classifiers. See "Classifier" below.

Examples:

Platform: ObscureUnix
Platform: RareDOS

Supported-Platform (multiple use)

Binary distributions containing a PKG-INFO file will use the Supported-Platform field in their metadata to specify the OS and CPU for which the binary distribution was compiled. The semantics of the Supported-Platform field are not specified in this PEP.

Example:

Supported-Platform: RedHat 7.2
Supported-Platform: i386-win32-2791

Summary

A one-line summary of what the distribution does.

Example:

Summary: A module for collecting votes from beagles.

Description (optional)

A longer description of the distribution that can run to several paragraphs. Software that deals with metadata should not assume any maximum size for this field, though people shouldn't include their instruction manual as the description.

The contents of this field can be written using reStructuredText markup [1]. For programs that work with the metadata, supporting markup is optional; programs can also display the contents of the field as-is. This means that authors should be conservative in the markup they use.

To support empty lines and lines with indentation with respect to the RFC 822 format, any CRLF character has to be suffixed by 7 spaces followed by a pipe ("|") char. As a result, the Description field is encoded into a folded field that can be interpreted by an RFC 822 parser [2].

Example:

Description: This project provides powerful math functions
        |For example, you can use `sum()` to sum numbers:
        |
        |Example::
        |
        |    >>> sum(1, 2)
        |    3
        |

This encoding implies that any occurrences of a CRLF followed by 7 spaces and a pipe char have to be replaced by a single CRLF when the field is unfolded using an RFC 822 reader.
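The folding and unfolding rules above can be sketched in a few lines. This is an illustration, not part of the PEP; it treats CRLF as "\n" for brevity, and the helper names are invented:

```python
# Marker inserted after each line break when folding:
# 7 spaces followed by a pipe char, per the rule above.
FOLD_MARKER = "\n" + " " * 7 + "|"

def fold_description(text):
    """Encode a multi-line Description value for an RFC 822 style header."""
    return text.replace("\n", FOLD_MARKER)

def unfold_description(folded):
    """Reverse the encoding: each CRLF + 7 spaces + '|' becomes a plain CRLF."""
    return folded.replace(FOLD_MARKER, "\n")

original = "This project provides powerful math functions\nFor example, sum()"
assert unfold_description(fold_description(original)) == original
```

Round-tripping a value through fold and unfold returns the original text, including empty lines, which is the point of the encoding.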

Keywords (optional)

A list of additional keywords to be used to assist searching for the distribution in a larger catalog.

Example:

Keywords: dog puppy voting election

Home-page (optional)

A string containing the URL for the distribution's home page.

Example:

Home-page: http://www.example.com/~cschultz/bvote/

Download-URL

A string containing the URL from which this version of the distribution can be downloaded. (This means that the URL can't be something like ".../BeagleVote-latest.tgz", but instead must be ".../BeagleVote-0.45.tgz".)

Author (optional)

A string containing the author's name at a minimum; additional contact information may be provided.

Example:

Author: C. Schultz, Universal Features Syndicate,
        Los Angeles, CA <cschultz@peanuts.example.com>

Author-email (optional)

A string containing the author's e-mail address. It can contain a name and e-mail address in the legal forms for an RFC-822 From: header.

Example:

Author-email: "C. Schultz" <cschultz@example.com>

Maintainer (optional)

A string containing the maintainer's name at a minimum; additional contact information may be provided.

Note that this field is intended for use when a project is being maintained by someone other than the original author: it should be omitted if it is identical to Author.

Example:

Maintainer: C. Schultz, Universal Features Syndicate,
        Los Angeles, CA <cschultz@peanuts.example.com>

Maintainer-email (optional)

A string containing the maintainer's e-mail address. It can contain a name and e-mail address in the legal forms for an RFC-822 From: header.

Note that this field is intended for use when a project is being maintained by someone other than the original author: it should be omitted if it is identical to Author-email.

Example:

Maintainer-email: "C. Schultz" <cschultz@example.com>

License (optional)

Text indicating the license covering the distribution where the license is not a selection from the "License" Trove classifiers. See "Classifier" below. This field may also be used to specify a particular version of a license which is named via the Classifier field, or to indicate a variation or exception to such a license.

Examples:

License: This software may only be obtained by sending the
        author a postcard, and then the user promises not
        to redistribute it.

License: GPL version 3, excluding DRM provisions

Classifier (multiple use)

Each entry is a string giving a single classification value for the distribution. Classifiers are described in PEP 301 [3].

Examples:

Classifier: Development Status :: 4 - Beta
Classifier: Environment :: Console (Text Based)

Requires-Dist (multiple use)

Each entry contains a string naming some other distutils project required by this distribution.

The format of a requirement string is identical to that of a distutils project name (e.g., as found in the Name: field), optionally followed by a version declaration within parentheses.

The distutils project names should correspond to names as found on the Python Package Index [4].

Version declarations must follow the rules described in Version Specifiers.

Examples:

Requires-Dist: pkginfo
Requires-Dist: PasteDeploy
Requires-Dist: zope.interface (>3.5.0)

Provides-Dist (multiple use)

Each entry contains a string naming a Distutils project which is contained within this distribution. This field must include the project identified in the Name field, followed by the version: Name (Version).

A distribution may provide additional names, e.g. to indicate that multiple projects have been bundled together. For instance, source distributions of the ZODB project have historically included the transaction project, which is now available as a separate distribution. Installing such a source distribution satisfies requirements for both ZODB and transaction.

A distribution may also provide a "virtual" project name, which does not correspond to any separately-distributed project: such a name might be used to indicate an abstract capability which could be supplied by one of multiple projects. E.g., multiple projects might supply RDBMS bindings for use by a given ORM: each project might declare that it provides ORM-bindings, allowing other projects to depend only on having at most one of them installed.

A version declaration may be supplied and must follow the rules described in Version Specifiers. The distribution's version number will be implied if none is specified.

Examples:

Provides-Dist: OtherProject
Provides-Dist: AnotherProject (3.4)
Provides-Dist: virtual_package

Obsoletes-Dist (multiple use)

Each entry contains a string describing a distutils project's distribution which this distribution renders obsolete, meaning that the two projects should not be installed at the same time.

Version declarations can be supplied. Version numbers must be in the format specified in Version Specifiers.

The most common use of this field will be in case a project name changes, e.g. Gorgon 2.3 gets subsumed into Torqued Python 1.0. When you install Torqued Python, the Gorgon distribution should be removed.

Examples:

Obsoletes-Dist: Gorgon
Obsoletes-Dist: OtherProject (<3.0)

Requires-Python

This field specifies the Python version(s) that the distribution is guaranteed to be compatible with.

Version numbers must be in the format specified in Version Specifiers.

Examples:

Requires-Python: 2.5
Requires-Python: >2.1
Requires-Python: >=2.3.4
Requires-Python: >=2.5,<2.7

Requires-External (multiple use)

Each entry contains a string describing some dependency in the system that the distribution is to be used with. This field is intended to serve as a hint to downstream project maintainers, and has no semantics which are meaningful to the distutils distribution.

The format of a requirement string is a name of an external dependency, optionally followed by a version declaration within parentheses.

Because they refer to non-Python software releases, version numbers for this field are not required to conform to the format specified in PEP 386: they should correspond to the version scheme used by the external dependency.

Notice that there is no particular rule governing the strings to be used.

Examples:

Requires-External: C
Requires-External: libpng (>=1.5)

Project-URL (multiple-use)

A string containing a browsable URL for the project and a label for it, separated by a comma.

Example:

Project-URL: Bug Tracker, http://bitbucket.org/tarek/distribute/issues/

The label is free text limited to 32 characters.

Version Specifiers

Version specifiers are a series of conditional operators and version numbers, separated by commas. Conditional operators must be one of "<", ">", "<=", ">=", "==" and "!=".

Any number of conditional operators can be specified, e.g. the string ">1.0, !=1.3.4, <2.0" is a legal version declaration. The comma (",") is equivalent to the and operator.

Each version number must be in the format specified in PEP 386.

When a version is provided, it always includes all versions that start with the same value. For example, the "2.5" version of Python includes versions like "2.5.2" or "2.5.3". Pre- and post-releases are excluded in that case, so versions like "2.5a1" are not included when "2.5" is used. If only the first version of the range is wanted, it has to be given explicitly; in our example, that would be "2.5.0".

Notice that some projects might omit the ".0" suffix for the first release of the "2.5.x" series:

  • 2.5
  • 2.5.1
  • 2.5.2
  • etc.

In that case, "2.5.0" has to be used explicitly to avoid any confusion with the "2.5" notation, which represents the full range. It is recommended practice to use version numbers of the same length throughout a series to avoid this problem entirely.

Some Examples:

  • Requires-Dist: zope.interface (3.1): any version that starts with 3.1, excluding post or pre-releases.
  • Requires-Dist: zope.interface (3.1.0): any version that starts with 3.1.0, excluding post or pre-releases. Since that particular project doesn't use more than 3 digits, it also means "only the 3.1.0 release".
  • Requires-Python: 3: Any Python 3 version, no matter which one, excluding post or pre-releases.
  • Requires-Python: >=2.6,<3: Any version of Python 2.6 or 2.7, including post releases of 2.6, pre and post releases of 2.7. It excludes pre releases of Python 3.
  • Requires-Python: 2.6.2: Equivalent to ">=2.6.2,<2.6.3". So this includes only Python 2.6.2. Of course, if Python were numbered with 4 digits, it would have included all versions of the 2.6.2 series.
  • Requires-Python: 2.5.0: Equivalent to ">=2.5.0,<2.5.1".
  • Requires-Dist: zope.interface (3.1,!=3.1.3): any version that starts with 3.1, excluding post or pre-releases of 3.1 and excluding any version that starts with "3.1.3". For this particular project, this means: "any version of the 3.1 series but not 3.1.3". This is equivalent to: ">=3.1,!=3.1.3,<3.2".
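The matching behaviour described above can be sketched as a small checker. This is illustrative only (the function name is invented): it handles plain dotted numeric versions, treats a bare version as a prefix match, and simplifies "!=" to an exact comparison rather than the PEP's prefix exclusion; a real implementation must follow PEP 386.

```python
def satisfies(version, spec):
    """Check a plain dotted numeric version against a comma-separated
    specifier such as ">=2.6,<3" or a bare prefix such as "2.5".
    Sketch only: no pre/post releases, and "!=" is an exact
    comparison rather than the PEP's prefix exclusion."""
    def parse(v):
        return tuple(int(part) for part in v.split("."))

    v = parse(version)
    for clause in (c.strip() for c in spec.split(",")):
        for op in (">=", "<=", "==", "!=", ">", "<"):
            if clause.startswith(op):
                target = parse(clause[len(op):].strip())
                ok = {">=": v >= target, "<=": v <= target,
                      "==": v == target, "!=": v != target,
                      ">": v > target, "<": v < target}[op]
                break
        else:
            # Bare version: prefix match, so "2.5" includes "2.5.2";
            # pre-releases like "2.5a1" are out of scope for this sketch.
            target = parse(clause)
            ok = v[:len(target)] == target
        if not ok:
            return False
    return True
```

With this sketch, satisfies("2.5.2", "2.5") holds, while satisfies("3.0", ">=2.6,<3") does not, matching the comma-as-and semantics described above.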

Environment markers

An environment marker is a marker that can be added at the end of a field after a semi-colon (";"), to add a condition about the execution environment.

Here are some examples of fields using such markers:

Requires-Dist: pywin32 (>1.0); sys.platform == 'win32'
Obsoletes-Dist: pywin31; sys.platform == 'win32'
Requires-Dist: foo (1,!=1.3); platform.machine == 'i386'
Requires-Dist: bar; python_version == '2.4' or python_version == '2.5'
Requires-External: libxslt; 'linux' in sys.platform

The micro-language behind this is the simplest possible: it compares only strings, with the == and in operators (and their opposites), and with the ability to combine expressions. It is also easy for people who don't know Python to understand.

The pseudo-grammar is

EXPR [in|==|!=|not in] EXPR [or|and] ...

where EXPR belongs to any of those:

  • python_version = '%s.%s' % (sys.version_info[0], sys.version_info[1])
  • python_full_version = sys.version.split()[0]
  • os.name = os.name
  • sys.platform = sys.platform
  • platform.version = platform.version()
  • platform.machine = platform.machine()
  • platform.python_implementation = platform.python_implementation()
  • a free string, like '2.4', or 'win32'

Notice that in is restricted to strings, meaning that it is not possible to use other sequences like tuples or lists on the right side.

The fields that benefit from this marker are:

  • Requires-Python
  • Requires-External
  • Requires-Dist
  • Provides-Dist
  • Obsoletes-Dist
  • Classifier
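One way to picture the evaluation of such markers is the sketch below. It is not part of the PEP and the function name is invented; it leans on Python's own eval() over a restricted namespace for brevity, whereas a real implementation should parse the mini-language itself rather than trust eval():

```python
import os as _os
import platform as _platform
import sys as _sys
from types import SimpleNamespace

def evaluate_marker(marker):
    """Evaluate an environment marker such as "sys.platform == 'win32'".
    Sketch only: builds the names listed in the pseudo-grammar above,
    then evaluates the expression with builtins disabled."""
    env = {
        "python_version": "%s.%s" % (_sys.version_info[0], _sys.version_info[1]),
        "python_full_version": _sys.version.split()[0],
        "os": SimpleNamespace(name=_os.name),
        "sys": SimpleNamespace(platform=_sys.platform),
        "platform": SimpleNamespace(
            version=_platform.version(),
            machine=_platform.machine(),
            python_implementation=_platform.python_implementation(),
        ),
    }
    return bool(eval(marker, {"__builtins__": {}}, env))
```

For example, evaluate_marker("sys.platform == 'win32'") is True only on Windows, so a Requires-Dist carrying that marker applies only there.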

Summary of Differences From PEP 314

  • Metadata-Version is now 1.2.
  • Added the environment markers.
  • Changed fields:
    • Platform
    • Author
  • Added fields:
    • Maintainer
    • Maintainer-email
    • Requires-Python
    • Requires-External
    • Requires-Dist
    • Provides-Dist
    • Obsoletes-Dist
    • Project-URL
  • Deprecated fields:
    • Requires (in favor of Requires-Dist)
    • Provides (in favor of Provides-Dist)
    • Obsoletes (in favor of Obsoletes-Dist)

References

This document specifies version 1.2 of the metadata format. Version 1.0 is specified in PEP 241. Version 1.1 is specified in PEP 314.

[1]reStructuredText markup: http://docutils.sourceforge.net/
[2]RFC 822 Long Header Fields: http://www.freesoft.org/CIE/RFC/822/7.htm
[3]PEP 301, Package Index and Metadata for Distutils: http://www.python.org/dev/peps/pep-0301/
[4]http://pypi.python.org/pypi/

Acknowledgements

Fred Drake, Anthony Baxter and Matthias Klose have all contributed to the ideas presented in this PEP.

Tres Seaver, Jim Fulton, Marc-André Lemburg, Martin von Löwis, Tarek Ziadé, David Lyon and other people at the Distutils-SIG have contributed to the new updated version.

pep-0346 User Defined ("with") Statements

PEP:346
Title:User Defined ("with") Statements
Version:$Revision$
Last-Modified:$Date$
Author:Nick Coghlan <ncoghlan at gmail.com>
Status:Withdrawn
Type:Standards Track
Content-Type:text/x-rst
Created:6-May-2005
Python-Version:2.5
Post-History:

Abstract

This PEP is a combination of PEP 310's "Reliable Acquisition/Release Pairs" with the "Anonymous Block Statements" of Guido's PEP 340. This PEP aims to take the good parts of PEP 340, blend them with parts of PEP 310 and rearrange the lot into an elegant whole. It borrows from various other PEPs in order to paint a complete picture, and is intended to stand on its own.

Author's Note

During the discussion of PEP 340, I maintained drafts of this PEP as PEP 3XX on my own website (since I didn't have CVS access to update a submitted PEP fast enough to track the activity on python-dev).

Since the first draft of this PEP, Guido wrote PEP 343 as a simplified version of PEP 340. PEP 343 (at the time of writing) uses the exact same semantics for the new statements as this PEP, but uses a slightly different mechanism to allow generators to be used to write statement templates. However, Guido has indicated that he intends to accept a new PEP being written by Raymond Hettinger that will integrate PEP 288 and PEP 325, and will permit a generator decorator like the one described in this PEP to be used to write statement templates for PEP 343. The other difference was the choice of keyword ('with' versus 'do') and Guido has stated he will organise a vote on that in the context of PEP 343.

Accordingly, the version of this PEP submitted for archiving on python.org is to be WITHDRAWN immediately after submission. PEP 343 and the combined generator enhancement PEP will cover the important ideas.

Introduction

This PEP proposes that Python's ability to reliably manage resources be enhanced by the introduction of a new with statement that allows factoring out of arbitrary try/finally and some try/except/else boilerplate. The new construct is called a 'user defined statement', and the associated class definitions are called 'statement templates'.

The above is the main point of the PEP. However, if that was all it said, then PEP 310 would be sufficient and this PEP would be essentially redundant. Instead, this PEP recommends additional enhancements that make it natural to write these statement templates using appropriately decorated generators. A side effect of those enhancements is that it becomes important to appropriately deal with the management of resources inside generators.

This is quite similar to PEP 343, but the exceptions that occur are re-raised inside the generator's frame, and the issue of generator finalisation needs to be addressed as a result. The template generator decorator suggested by this PEP also creates reusable templates, rather than the single-use templates of PEP 340.

In comparison to PEP 340, this PEP eliminates the ability to suppress exceptions, and makes the user defined statement a non-looping construct. The other main difference is the use of a decorator to turn generators into statement templates, and the incorporation of ideas for addressing iterator finalisation.

If all that seems like an ambitious operation. . . well, Guido was the one to set the bar that high when he wrote PEP 340 :)

Relationship with other PEPs

This PEP competes directly with PEP 310 [1], PEP 340 [2] and PEP 343 [3], as those PEPs all describe alternative mechanisms for handling deterministic resource management.

It does not compete with PEP 342 [4] which splits off PEP 340's enhancements related to passing data into iterators. The associated changes to the for loop semantics would be combined with the iterator finalisation changes suggested in this PEP. User defined statements would not be affected.

Neither does this PEP compete with the generator enhancements described in PEP 288 [5]. While this PEP proposes the ability to inject exceptions into generator frames, it is an internal implementation detail, and does not require making that ability publicly available to Python code. PEP 288 is, in part, about making that implementation detail easily accessible.

This PEP would, however, make the generator resource release support described in PEP 325 [6] redundant - iterators which require finalisation should provide an appropriate implementation of the statement template protocol.

User defined statements

To steal the motivating example from PEP 310, correct handling of a synchronisation lock currently looks like this:

the_lock.acquire()
try:
    # Code here executes with the lock held
finally:
    the_lock.release()

Like PEP 310, this PEP proposes that such code be able to be written as:

with the_lock:
    # Code here executes with the lock held

These user defined statements are primarily designed to allow easy factoring of try blocks that are not easily converted to functions. This is most commonly the case when the exception handling pattern is consistent, but the body of the try block changes. With a user-defined statement, it is straightforward to factor out the exception handling into a statement template, with the body of the try clause provided inline in the user code.

The term 'user defined statement' reflects the fact that the meaning of a with statement is governed primarily by the statement template used, and programmers are free to create their own statement templates, just as they are free to create their own iterators for use in for loops.

Usage syntax for user defined statements

The proposed syntax is simple:

with EXPR1 [as VAR1]:
    BLOCK1

Semantics for user defined statements

the_stmt = EXPR1
stmt_enter = getattr(the_stmt, "__enter__", None)
stmt_exit = getattr(the_stmt, "__exit__", None)
if stmt_enter is None or stmt_exit is None:
    raise TypeError("Statement template required")

VAR1 = stmt_enter() # Omit 'VAR1 =' if no 'as' clause
exc = (None, None, None)
try:
    try:
        BLOCK1
    except:
        exc = sys.exc_info()
        raise
finally:
    stmt_exit(*exc)

Other than VAR1, none of the local variables shown above will be visible to user code. Like the iteration variable in a for loop, VAR1 is visible in both BLOCK1 and code following the user defined statement.

Note that the statement template can only react to exceptions, it cannot suppress them. See Rejected Options for an explanation as to why.

Statement template protocol: __enter__

The __enter__() method takes no arguments, and if it raises an exception, BLOCK1 is never executed. If this happens, the __exit__() method is not called. The value returned by this method is assigned to VAR1 if the as clause is used. Objects with no other value to return should generally return self rather than None to permit in-place creation in the with statement.

Statement templates should use this method to set up the conditions that are to exist during execution of the statement (e.g. acquisition of a synchronisation lock).

Statement templates which are not always usable (e.g. closed file objects) should raise a RuntimeError if an attempt is made to call __enter__() when the template is not in a valid state.

Statement template protocol: __exit__

The __exit__() method accepts three arguments which correspond to the three "arguments" to the raise statement: type, value, and traceback. All arguments are always supplied, and will be set to None if no exception occurred. This method will be called exactly once by the with statement machinery if the __enter__() method completes successfully.

Statement templates perform their exception handling in this method. If the first argument is None, it indicates non-exceptional completion of BLOCK1 - execution either reached the end of block, or early completion was forced using a return, break or continue statement. Otherwise, the three arguments reflect the exception that terminated BLOCK1.

Any exceptions raised by the __exit__() method are propagated to the scope containing the with statement. If the user code in BLOCK1 also raised an exception, that exception would be lost, and replaced by the one raised by the __exit__() method.
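This protocol is, in essence, the context manager protocol that Python eventually shipped via PEP 343 (where a true return value from __exit__() does suppress the exception; under this PEP the return value is ignored). A minimal sketch, using a hypothetical logging_template class, shows both methods in action:

```python
class logging_template(object):
    """A minimal statement template following the protocol above
    (hypothetical example, not from the PEP): __enter__ sets up and
    returns a value, __exit__ always runs and can inspect, but not
    suppress, any exception raised in the block."""

    def __init__(self):
        self.events = []

    def __enter__(self):
        self.events.append("enter")
        return self  # becomes VAR1 when an 'as' clause is used

    def __exit__(self, exc_type, value, traceback):
        # exc_type is None on non-exceptional completion (including
        # early exit via return, break or continue).
        self.events.append("exit:%s" % (exc_type.__name__ if exc_type else "ok"))

t = logging_template()
with t:
    pass
# t.events is now ["enter", "exit:ok"]
```

If the block raises, __exit__ still runs and records the exception type, and the exception then propagates to the caller, matching the semantics expansion given earlier.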

Factoring out arbitrary exception handling

Consider the following exception handling arrangement:

SETUP_BLOCK
try:
    try:
        TRY_BLOCK
    except exc_type1, exc:
        EXCEPT_BLOCK1
    except exc_type2, exc:
        EXCEPT_BLOCK2
    except:
        EXCEPT_BLOCK3
    else:
        ELSE_BLOCK
finally:
    FINALLY_BLOCK

It can be roughly translated to a statement template as follows:

class my_template(object):

    def __init__(self, *args):
        # Any required arguments (e.g. a file name)
        # get stored in member variables
        # The various BLOCK's will need updating to reflect
        # that.

    def __enter__(self):
        SETUP_BLOCK

    def __exit__(self, exc_type, value, traceback):
        try:
            try:
                if exc_type is not None:
                    raise exc_type, value, traceback
            except exc_type1, exc:
                EXCEPT_BLOCK1
            except exc_type2, exc:
                EXCEPT_BLOCK2
            except:
                EXCEPT_BLOCK3
            else:
                ELSE_BLOCK
        finally:
            FINALLY_BLOCK

Which can then be used as:

with my_template(*args):
    TRY_BLOCK

However, there are two important semantic differences between this code and the original try statement.

Firstly, in the original try statement, if a break, return or continue statement is encountered in TRY_BLOCK, only FINALLY_BLOCK will be executed as the statement completes. With the statement template, ELSE_BLOCK will also execute, as these statements are treated like any other non-exceptional block termination. For use cases where it matters, this is likely to be a good thing (see transaction in the Examples), as this hole where neither the except nor the else clause gets executed is easy to forget when writing exception handlers.

Secondly, the statement template will not suppress any exceptions. If, for example, the original code suppressed the exc_type1 and exc_type2 exceptions, then this would still need to be done inline in the user code:

try:
    with my_template(*args):
        TRY_BLOCK
except (exc_type1, exc_type2):
    pass

However, even in these cases where the suppression of exceptions needs to be made explicit, the amount of boilerplate repeated at the calling site is significantly reduced (See Rejected Options for further discussion of this behaviour).

In general, not all of the clauses will be needed. For resource handling (like files or synchronisation locks), it is possible to simply execute the code that would have been part of FINALLY_BLOCK in the __exit__() method. This can be seen in the following implementation that makes synchronisation locks into statement templates as mentioned at the beginning of this section:

# New methods of synchronisation lock objects

def __enter__(self):
    self.acquire()
    return self

def __exit__(self, *exc_info):
    self.release()

Generators

With their ability to suspend execution, and return control to the calling frame, generators are natural candidates for writing statement templates. Adding user defined statements to the language does not require the generator changes described in this section, thus making this PEP an obvious candidate for a phased implementation (with statements in phase 1, generator integration in phase 2). The suggested generator updates allow arbitrary exception handling to be factored out like this:

@statement_template
def my_template(*arguments):
    SETUP_BLOCK
    try:
        try:
            yield
        except exc_type1, exc:
            EXCEPT_BLOCK1
        except exc_type2, exc:
            EXCEPT_BLOCK2
        except:
            EXCEPT_BLOCK3
        else:
            ELSE_BLOCK
    finally:
        FINALLY_BLOCK

Notice that, unlike the class based version, none of the blocks need to be modified, as shared values are local variables of the generator's internal frame, including the arguments passed in by the invoking code. The semantic differences noted earlier (all non-exceptional block termination triggers the else clause, and the template is unable to suppress exceptions) still apply.

Default value for yield

When creating a statement template with a generator, the yield statement will often be used solely to return control to the body of the user defined statement, rather than to return a useful value.

Accordingly, if this PEP is accepted, yield, like return, will supply a default value of None (i.e. yield and yield None will become equivalent statements).

This same change is being suggested in PEP 342. Obviously, it would only need to be implemented once if both PEPs were accepted :)

Template generator decorator: statement_template

As with PEP 343, a new decorator is suggested that wraps a generator in an object with the appropriate statement template semantics. Unlike PEP 343, the templates suggested here are reusable, as the generator is instantiated anew in each call to __enter__(). Additionally, any exceptions that occur in BLOCK1 are re-raised in the generator's internal frame:

class template_generator_wrapper(object):

    def __init__(self, func, func_args, func_kwds):
         self.func = func
         self.args = func_args
         self.kwds = func_kwds
         self.gen = None

    def __enter__(self):
        if self.gen is not None:
            raise RuntimeError("Enter called without exit!")
        self.gen = self.func(*self.args, **self.kwds)
        try:
            return self.gen.next()
        except StopIteration:
            raise RuntimeError("Generator didn't yield")

    def __exit__(self, *exc_info):
        if self.gen is None:
            raise RuntimeError("Exit called without enter!")
        try:
            try:
                if exc_info[0] is not None:
                    self.gen._inject_exception(*exc_info)
                else:
                    self.gen.next()
            except StopIteration:
                pass
            else:
                raise RuntimeError("Generator didn't stop")
        finally:
            self.gen = None

def statement_template(func):
    def factory(*args, **kwds):
        return template_generator_wrapper(func, args, kwds)
    return factory

Template generator wrapper: __enter__() method

The template generator wrapper has an __enter__() method that creates a new instance of the contained generator, and then invokes next() once. It will raise a RuntimeError if the last generator instance has not been cleaned up, or if the generator terminates instead of yielding a value.

Template generator wrapper: __exit__() method

The template generator wrapper has an __exit__() method that simply invokes next() on the generator if no exception is passed in. If an exception is passed in, it is re-raised in the contained generator at the point of the last yield statement.

In either case, the generator wrapper will raise a RuntimeError if the internal frame does not terminate as a result of the operation. The __exit__() method will always clean up the reference to the used generator instance, permitting __enter__() to be called again.

A StopIteration raised by the body of the user defined statement may be inadvertently suppressed inside the __exit__() method, but this is unimportant, as the originally raised exception still propagates correctly.

Injecting exceptions into generators

To implement the __exit__() method of the template generator wrapper, it is necessary to inject exceptions into the internal frame of the generator. This is new implementation level behaviour that has no current Python equivalent.

The injection mechanism (referred to as _inject_exception in this PEP) raises an exception in the generator's frame with the specified type, value and traceback information. This means that the exception looks like the original if it is allowed to propagate.

For the purposes of this PEP, there is no need to make this capability available outside the Python implementation code.

Generator finalisation

To support resource management in template generators, this PEP will eliminate the restriction on yield statements inside the try block of a try/finally statement. Accordingly, generators which require the use of a file or some such object can ensure the object is managed correctly through the use of try/finally or with statements.

This restriction will likely need to be lifted globally - it would be difficult to restrict it so that it was only permitted inside generators used to define statement templates. Accordingly, this PEP includes suggestions designed to ensure generators which are not used as statement templates are still finalised appropriately.

Generator finalisation: TerminateIteration exception

A new exception is proposed:

class TerminateIteration(Exception): pass

The new exception is injected into a generator in order to request finalisation. It should not be suppressed by well-behaved code.

Generator finalisation: __del__() method

To ensure a generator is finalised eventually (within the limits of Python's garbage collection), generators will acquire a __del__() method with the following semantics:

def __del__(self):
    try:
        self._inject_exception(TerminateIteration, None, None)
    except TerminateIteration:
        pass

Deterministic generator finalisation

There is a simple way to provide deterministic finalisation of generators - give them appropriate __enter__() and __exit__() methods:

def __enter__(self):
    return self

def __exit__(self, *exc_info):
    try:
        self._inject_exception(TerminateIteration, None, None)
    except TerminateIteration:
        pass

Then any generator can be finalised promptly by wrapping the relevant for loop inside a with statement:

with all_lines(filenames) as lines:
    for line in lines:
        print line

(See the Examples for the definition of all_lines, and the reason it requires prompt finalisation)

Compare the above example to the usage of file objects:

with open(filename) as f:
    for line in f:
        print line

Generators as user defined statement templates

When used to implement a user defined statement, a generator should yield only once on a given control path. The result of that yield will then be provided as the result of the generator's __enter__() method. Having a single yield on each control path ensures that the internal frame will terminate when the generator's __exit__() method is called. Multiple yield statements on a single control path will result in a RuntimeError being raised by the __exit__() method when the internal frame fails to terminate correctly. Such an error indicates a bug in the statement template.

To respond to exceptions, or to clean up resources, it is sufficient to wrap the yield statement in an appropriately constructed try statement. If execution resumes after the yield without an exception, the generator knows that the body of the with statement completed without incident.

Examples

  1. A template for ensuring that a lock, acquired at the start of a block, is released when the block is left:

    # New methods on synchronisation locks
        def __enter__(self):
            self.acquire()
            return self
    
        def __exit__(self, *exc_info):
            self.release()
    

    Used as follows:

    with myLock:
        # Code here executes with myLock held.  The lock is
        # guaranteed to be released when the block is left (even
        # if via return or by an uncaught exception).
    
  2. A template for opening a file that ensures the file is closed when the block is left:

    # New methods on file objects
        def __enter__(self):
            if self.closed:
                raise RuntimeError, "Cannot reopen closed file handle"
            return self
    
        def __exit__(self, *args):
            self.close()
    

    Used as follows:

    with open("/etc/passwd") as f:
        for line in f:
            print line.rstrip()
    
  3. A template for committing or rolling back a database transaction:

    @statement_template
    def transaction(db):
        try:
            yield
        except:
            db.rollback()
        else:
            db.commit()
    

    Used as follows:

    with transaction(the_db):
        make_table(the_db)
        add_data(the_db)
        # Getting to here automatically triggers a commit
        # Any exception automatically triggers a rollback
    
  4. It is possible to nest blocks and combine templates:

    @statement_template
    def lock_opening(lock, filename, mode="r"):
        with lock:
            with open(filename, mode) as f:
                yield f
    

    Used as follows:

    with lock_opening(myLock, "/etc/passwd") as f:
        for line in f:
            print line.rstrip()
    
  5. Redirect stdout temporarily:

    @statement_template
    def redirected_stdout(new_stdout):
        save_stdout = sys.stdout
        try:
            sys.stdout = new_stdout
            yield
        finally:
            sys.stdout = save_stdout
    

    Used as follows:

    with open(filename, "w") as f:
        with redirected_stdout(f):
            print "Hello world"
    
  6. A variant on open() that also returns an error condition:

    @statement_template
    def open_w_error(filename, mode="r"):
        try:
            f = open(filename, mode)
        except IOError, err:
            yield None, err
        else:
            try:
                yield f, None
            finally:
                f.close()
    

    Used as follows:

    with open_w_error("/etc/passwd", "a") as f, err:
        if err:
            print "IOError:", err
        else:
            f.write("guido::0:0::/:/bin/sh\n")
    
  7. Find the first file with a specific header:

    for name in filenames:
        with open(name) as f:
        if f.read(2) == "\xfe\xb0":
                break
    
  8. Find the first item you can handle, holding a lock for the entire loop, or just for each iteration:

    with lock:
        for item in items:
            if handle(item):
                break
    
    for item in items:
        with lock:
            if handle(item):
                break
    
  9. Hold a lock while inside a generator, but release it when returning control to the outer scope:

    @statement_template
    def released(lock):
        lock.release()
        try:
            yield
        finally:
            lock.acquire()
    

    Used as follows:

    with lock:
        for item in items:
            with released(lock):
                yield item
    
  10. Read the lines from a collection of files (e.g. processing multiple configuration sources):

    def all_lines(filenames):
        for name in filenames:
            with open(name) as f:
                for line in f:
                    yield line
    

    Used as follows:

    with all_lines(filenames) as lines:
        for line in lines:
            update_config(line)
    
  11. Not all uses need to involve resource management:

    @statement_template
    def tag(*args, **kwds):
        name = cgi.escape(args[0])
        if kwds:
            kwd_pairs = ["%s=%s" % (cgi.escape(key), cgi.escape(value))
                         for key, value in kwds.items()]
            print '<%s %s>' % (name, " ".join(kwd_pairs))
        else:
            print '<%s>' % name
        yield
        print '</%s>' % name
    

    Used as follows:

    with tag('html'):
        with tag('head'):
           with tag('title'):
              print 'A web page'
        with tag('body'):
           for par in pars:
              with tag('p'):
                 print par
           with tag('a', href="http://www.python.org"):
               print "Not a dead parrot!"
    
  12. From PEP 343, another useful example would be an operation that blocks signals. The use could be like this:

    from signal import blocked_signals
    
    with blocked_signals():
        # code executed without worrying about signals
    

    An optional argument might be a list of signals to be blocked; by default all signals are blocked. The implementation is left as an exercise to the reader.

  13. Another use for this feature is for Decimal contexts:

    # New methods on decimal Context objects
    
    def __enter__(self):
        if self._old_context is not None:
            raise RuntimeError("Already suspending other Context")
        self._old_context = getcontext()
        setcontext(self)
    
    def __exit__(self, *args):
        setcontext(self._old_context)
        self._old_context = None
    

    Used as follows:

    with decimal.Context(precision=28):
       # Code here executes with the given context
       # The context always reverts after this statement
    

Open Issues

None, as this PEP has been withdrawn.

Rejected Options

Having the basic construct be a looping construct

The major issue with this idea, as illustrated by PEP 340's block statements, is that it causes problems with factoring try statements that are inside loops, and contain break and continue statements (as these statements would then apply to the block construct, instead of the original loop). As a key goal is to be able to factor out arbitrary exception handling (other than suppression) into statement templates, this is a definite problem.

There is also an understandability problem, as can be seen in the Examples. In the example showing acquisition of a lock either for an entire loop, or for each iteration of the loop, if the user defined statement was itself a loop, moving it from outside the for loop to inside the for loop would have major semantic implications, beyond those one would expect.

Finally, with a looping construct, there are significant problems with TOOWTDI, as it is frequently unclear whether a particular situation should be handled with a conventional for loop or the new looping construct. With the current PEP, there is no such problem - for loops continue to be used for iteration, and the new with statements are used to factor out exception handling.

Another issue, specifically with PEP 340's anonymous block statements, is that they make it quite difficult to write statement templates directly (i.e. not using a generator). This problem is addressed by the current proposal, as can be seen by the relative simplicity of the various class based implementations of statement templates in the Examples.

Allowing statement templates to suppress exceptions

Earlier versions of this PEP gave statement templates the ability to suppress exceptions. The BDFL expressed concern over the associated complexity, and I agreed after reading an article by Raymond Chen about the evils of hiding flow control inside macros in C code [7].

Removing the suppression ability eliminated a whole lot of complexity from both the explanation and implementation of user defined statements, further supporting it as the correct choice. Older versions of the PEP had to jump through some horrible hoops to avoid inadvertently suppressing exceptions in __exit__() methods - that issue does not exist with the current suggested semantics.

There was one example (auto_retry) that actually used the ability to suppress exceptions. This use case, while not quite as elegant, has significantly more obvious control flow when written out in full in the user code:

def attempts(num_tries):
    return reversed(xrange(num_tries))

for retry in attempts(3):
    try:
        make_attempt()
    except IOError:
        if not retry:
            raise

For what it's worth, the perverse could still write this as:

for attempt in auto_retry(3, IOError):
    try:
        with attempt:
            make_attempt()
    except FailedAttempt:
        pass

To protect the innocent, the code to actually support that is not included here.

Differentiating between non-exceptional exits

Earlier versions of this PEP allowed statement templates to distinguish between exiting the block normally, and exiting via a return, break or continue statement. The BDFL flirted with a similar idea in PEP 343 and its associated discussion. This added significant complexity to the description of the semantics, and it required each and every statement template to decide whether or not those statements should be treated like exceptions, or like a normal mechanism for exiting the block.

This template-by-template decision process raised great potential for confusion - consider if one database connector provided a transaction template that treated early exits like an exception, whereas a second connector treated them as normal block termination.

Accordingly, this PEP now uses the simplest solution - early exits appear identical to normal block termination as far as the statement template is concerned.
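A short sketch in modern Python syntax (the recorder class is illustrative, not from the PEP) shows that under these semantics a template genuinely cannot distinguish an early exit from normal termination:

```python
# Sketch: __exit__ receives (None, None, None) for a break exactly as it
# does for normal block termination, so the template cannot tell them apart.
class recorder(object):
    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.exit_args = (exc_type, exc_value, traceback)

r = recorder()
for item in [1, 2, 3]:
    with r:
        break  # early exit: looks like normal termination to the template

assert r.exit_args == (None, None, None)
```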

Not injecting raised exceptions into generators

PEP 343 suggests simply invoking next() unconditionally on generators used to define statement templates. This means the template generators end up looking rather unintuitive, and the retention of the ban against yielding inside try/finally means that Python's exception handling capabilities cannot be used to deal with management of multiple resources.

The alternative which this PEP advocates (injecting raised exceptions into the generator frame) means that multiple resources can be managed elegantly, as shown by lock_opening in the Examples.
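The injection mechanism advocated here corresponds to the generator throw() method that PEP 342 later added to the language. A hedged sketch in modern syntax (the names are illustrative): throw() raises the exception at the point of the yield, so a try statement wrapped around the yield can respond to it.

```python
# Sketch of exception injection via generator.throw() (PEP 342).
log = []

def template():
    try:
        yield "resource"
    except ValueError:
        log.append("rolled back")
        raise  # the statement machinery, not the template, decides suppression

gen = template()
assert next(gen) == "resource"      # run to the yield (the __enter__ step)
try:
    gen.throw(ValueError("boom"))   # inject into the frame (the __exit__ step)
except ValueError:
    log.append("reraised")

assert log == ["rolled back", "reraised"]
```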

Making all generators statement templates

Separating the template object from the generator itself makes it possible to have reusable generator templates. That is, the following code will work correctly if this PEP is accepted:

open_it = lock_opening(parrot_lock, "dead_parrot.txt")

with open_it as f:
    # use the file for a while

with open_it as f:
    # use the file again

The second benefit is that iterator generators and template generators are very different things - the decorator keeps that distinction clear, and prevents one being used where the other is required.

Finally, requiring the decorator allows the native methods of generator objects to be used to implement generator finalisation.
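For comparison, a class-based template exhibits the same reusability directly, because the template object holds its own construction arguments. This sketch uses today's with-statement protocol and an in-memory file; the class and names are illustrative:

```python
import io

# Sketch: a single template instance can be used in several with statements.
class opening(object):
    def __init__(self, make_file):
        self.make_file = make_file

    def __enter__(self):
        self.f = self.make_file()
        return self.f

    def __exit__(self, *exc_info):
        self.f.close()

open_it = opening(lambda: io.StringIO("dead parrot"))

with open_it as f:
    first = f.read()

with open_it as f:   # the same template instance, used again
    second = f.read()

assert first == second == "dead parrot"
```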

Using do as the keyword

do was an alternative keyword proposed during the PEP 340 discussion. It reads well with appropriately named functions, but it reads poorly when used with methods, or with objects that provide native statement template support.

When do was first suggested, the BDFL had rejected PEP 310's with keyword, based on a desire to use it for a Pascal/Delphi style with statement. Since then, the BDFL has retracted this objection, as he no longer intends to provide such a statement. This change of heart was apparently based on the C# developers' reasons for not providing the feature [8].

Not having a keyword

This is an interesting option, and can be made to read quite well. However, it's awkward to look up in the documentation for new users, and strikes some as being too magical. Accordingly, this PEP goes with a keyword based suggestion.

Enhancing try statements

This suggestion involves giving bare try statements a signature similar to that proposed for with statements.

I think that trying to write a with statement as an enhanced try statement makes as much sense as trying to write a for loop as an enhanced while loop. That is, while the semantics of the former can be explained as a particular way of using the latter, the former is not an instance of the latter. The additional semantics added around the more fundamental statement result in a new construct, and the two different statements shouldn't be confused.

This can be seen by the fact that the 'enhanced' try statement still needs to be explained in terms of a 'non-enhanced' try statement. If it's something different, it makes more sense to give it a different name.

Having the template protocol directly reflect try statements

One suggestion was to have separate methods in the protocol to cover different parts of the structure of a generalised try statement. Using the terms try, except, else and finally, we would have something like:

class my_template(object):

    def __init__(self, *args):
        # Any required arguments (e.g. a file name)
        # get stored in member variables
        # The various BLOCKs will need to be updated
        # to reflect that.

    def __try__(self):
        SETUP_BLOCK

    def __except__(self, exc, value, traceback):
        if isinstance(exc, exc_type1):
            EXCEPT_BLOCK1
        elif isinstance(exc, exc_type2):
            EXCEPT_BLOCK2
        else:
            EXCEPT_BLOCK3

    def __else__(self):
        ELSE_BLOCK

    def __finally__(self):
        FINALLY_BLOCK

Aside from preferring the addition of two method slots rather than four, I consider it significantly easier to be able to simply reproduce a slightly modified version of the original try statement code in the __exit__() method (as shown in Factoring out arbitrary exception handling), rather than have to split the functionality amongst several different methods (or figure out which method to use if not all clauses are used by the template).

To make this discussion less theoretical, here is the transaction example implemented using both the two method and the four method protocols instead of a generator. Both implementations guarantee a commit if a break, return or continue statement is encountered (as does the generator-based implementation in the Examples section):

class transaction_2method(object):

    def __init__(self, db):
        self.db = db

    def __enter__(self):
        pass

    def __exit__(self, exc_type, *exc_details):
        if exc_type is None:
            self.db.commit()
        else:
            self.db.rollback()

class transaction_4method(object):

    def __init__(self, db):
        self.db = db
        self.commit = False

    def __try__(self):
        self.commit = True

    def __except__(self, exc_type, exc_value, traceback):
        self.db.rollback()
        self.commit = False

    def __else__(self):
        pass

    def __finally__(self):
        if self.commit:
            self.db.commit()
            self.commit = False

There are two more minor points, relating to the specific method names in the suggestion. The name of the __try__() method is misleading, as SETUP_BLOCK executes before the try statement is entered, and the name of the __else__() method is unclear in isolation, as numerous other Python statements include an else clause.

Iterator finalisation (WITHDRAWN)

The ability to use user defined statements inside generators is likely to increase the need for deterministic finalisation of iterators, as resource management is pushed inside the generators, rather than being handled externally as is currently the case.

The PEP currently suggests handling this by making all generators statement templates, and using with statements to handle finalisation. However, earlier versions of this PEP suggested the following, more complex, solution, that allowed the author of a generator to flag the need for finalisation, and have for loops deal with it automatically. It is included here as a long, detailed rejected option.

Iterator protocol addition: __finish__

An optional new method for iterators is proposed, called __finish__(). It takes no arguments, and should not return anything.

The __finish__ method is expected to clean up all resources the iterator has open. Iterators with a __finish__() method are called 'finishable iterators' for the remainder of the PEP.

Best effort finalisation

A finishable iterator should ensure that it provides a __del__ method that also performs finalisation (e.g. by invoking the __finish__() method). This allows Python to still make a best effort at finalisation in the event that deterministic finalisation is not applied to the iterator.
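A minimal sketch of the best-effort rule, with a flag standing in for a real resource (the class and names are illustrative, and __next__ is the modern spelling of the next() method used elsewhere in this PEP):

```python
# Sketch: __del__ falls back to __finish__, so finalisation still happens
# (eventually) even when nothing finished the iterator deterministically.
class finishable(object):
    def __init__(self, items):
        self.items = iter(items)
        self.finished = False

    def __iter__(self):
        return self

    def __next__(self):
        return next(self.items)

    def __finish__(self):
        self.finished = True  # stands in for releasing a real resource

    def __del__(self):
        if not self.finished:
            self.__finish__()  # best effort, non-deterministic

itr = finishable("abc")
assert next(itr) == "a"
itr.__finish__()          # deterministic finalisation
assert itr.finished
```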

Deterministic finalisation

If the iterator used in a for loop has a __finish__() method, the enhanced for loop semantics will guarantee that that method will be executed, regardless of the means of exiting the loop. This is important for iterator generators that utilise user defined statements or the now permitted try/finally statements, or for new iterators that rely on timely finalisation to release allocated resources (e.g. releasing a thread or database connection back into a pool).

for loop syntax

No changes are suggested to for loop syntax. This is just to define the statement parts needed for the description of the semantics:

for VAR1 in EXPR1:
    BLOCK1
else:
    BLOCK2

Updated for loop semantics

When the target iterator does not have a __finish__() method, a for loop will execute as follows (i.e. no change from the status quo):

itr = iter(EXPR1)
exhausted = False
while True:
    try:
        VAR1 = itr.next()
    except StopIteration:
        exhausted = True
        break
    BLOCK1
if exhausted:
    BLOCK2

When the target iterator has a __finish__() method, a for loop will execute as follows:

itr = iter(EXPR1)
exhausted = False
try:
    while True:
        try:
            VAR1 = itr.next()
        except StopIteration:
            exhausted = True
            break
        BLOCK1
    if exhausted:
        BLOCK2
finally:
    itr.__finish__()

The implementation will need to take some care to avoid incurring the try/finally overhead when the iterator does not have a __finish__() method.
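The updated semantics can be sketched as a helper function in today's Python. The __finish__ protocol here is the hypothetical one this section proposes; the tracked class is illustrative:

```python
# Sketch: run a for loop body, guaranteeing __finish__ is called when the
# iterator provides it, and skipping the try/finally overhead otherwise.
def finishing_for(iterable, body):
    itr = iter(iterable)
    finish = getattr(itr, "__finish__", None)
    if finish is None:
        for item in itr:
            body(item)
    else:
        try:
            for item in itr:
                body(item)
        finally:
            finish()

class tracked(object):
    def __init__(self):
        self.data = iter([1, 2])
        self.finished = False
    def __iter__(self):
        return self
    def __next__(self):
        return next(self.data)
    def __finish__(self):
        self.finished = True

t = tracked()
seen = []
finishing_for(t, seen.append)
assert seen == [1, 2] and t.finished
```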

Generator iterator finalisation: __finish__() method

When enabled with the appropriate decorator, generators will have a __finish__() method that raises TerminateIteration in the internal frame:

def __finish__(self):
    try:
        self._inject_exception(TerminateIteration)
    except TerminateIteration:
        pass

A decorator (e.g. needs_finish()) is required to enable this feature, so that existing generators (which are not expecting finalisation) continue to work as expected.

Partial iteration of finishable iterators

Partial iteration of a finishable iterator is possible, although it requires some care to ensure the iterator is still finalised promptly (it was made finishable for a reason!). First, we need a class to enable partial iteration of a finishable iterator by hiding the iterator's __finish__() method from the for loop:

class partial_iter(object):

    def __init__(self, iterable):
        self.iter = iter(iterable)

    def __iter__(self):
        return self

    def next(self):
        return self.iter.next()

Secondly, an appropriate statement template is needed to ensure the iterator is finished eventually:

@statement_template
def finishing(iterable):
    itr = iter(iterable)
    itr_finish = getattr(itr, "__finish__", None)
    if itr_finish is None:
        yield itr
    else:
        try:
            yield partial_iter(itr)
        finally:
            itr_finish()

This can then be used as follows:

with finishing(finishable_itr) as itr:
    for header_item in itr:
        if end_of_header(header_item):
            break
        # process header item
    for body_item in itr:
        # process body item

Note that none of the above is needed for an iterator that is not finishable - without a __finish__() method, it will not be promptly finalised by the for loop, and hence inherently allows partial iteration. Allowing partial iteration of non-finishable iterators as the default behaviour is a key element in keeping this addition to the iterator protocol backwards compatible.

Acknowledgements

The acknowledgements section for PEP 340 applies, since this text grew out of the discussion of that PEP, but additional thanks go to Michael Hudson, Paul Moore and Guido van Rossum for writing PEP 310 and PEP 340 in the first place, and to (in no meaningful order) Fredrik Lundh, Phillip J. Eby, Steven Bethard, Josiah Carlson, Greg Ewing, Tim Delaney and Arnold deVos for prompting particular ideas that made their way into this text.

References

[1] Reliable Acquisition/Release Pairs (http://www.python.org/dev/peps/pep-0310/)
[2] Anonymous block statements (http://www.python.org/dev/peps/pep-0340/)
[3] Anonymous blocks, redux (http://www.python.org/dev/peps/pep-0343/)
[4] Enhanced Iterators (http://www.python.org/dev/peps/pep-0342/)
[5] Generator Attributes and Exceptions (http://www.python.org/dev/peps/pep-0288/)
[6] Resource-Release Support for Generators (http://www.python.org/dev/peps/pep-0325/)
[7] A rant against flow control macros (http://blogs.msdn.com/oldnewthing/archive/2005/01/06/347666.aspx)
[8] Why doesn't C# have a 'with' statement? (http://msdn.microsoft.com/vcsharp/programming/language/ask/withstatement/)

pep-0347 Migrating the Python CVS to Subversion

PEP:347
Title:Migrating the Python CVS to Subversion
Version:$Revision$
Last-Modified:$Date$
Author:Martin von Löwis <martin at v.loewis.de>
Discussions-To:<python-dev at python.org>
Status:Final
Type:Process
Content-Type:text/x-rst
Created:14-Jul-2004
Post-History:14-Jul-2004

Abstract

The Python source code is currently managed in a CVS repository on sourceforge.net. This PEP proposes to move it to a Subversion repository on svn.python.org.

Rationale

This change has two aspects: moving from CVS to Subversion, and moving from SourceForge to python.org. For each, a rationale will be given.

Moving to Subversion

CVS has a number of limitations that have been eliminated by Subversion. For the development of Python, the most notable improvements are:

  • the ability to rename files and directories, and to remove directories, while keeping the history of these files.
  • support for change sets (sets of correlated changes to multiple files) through global revision numbers. Change sets are transactional.
  • atomic, fast tagging: a cvs tag might take many minutes; a Subversion tag (svn cp) will complete quickly, and atomically. Likewise, branches are very efficient.
  • support for offline diffs, which is useful when creating patches.

Moving to python.org

SourceForge has kindly provided an important infrastructure for the past years. Unfortunately, the attention that SF received has also caused repeated overload situations in the past, to which the SF operators could not always respond in a timely manner. In particular, for CVS, they had to reduce the load on the primary CVS server by introducing a second, read-only CVS server for anonymous access. This server is regularly synchronized, but lags behind the read-write CVS repository between synchronizations. As a result, users without commit access can see recent changes to the repository only after a delay.

On python.org, it would be possible to make the repository accessible for anonymous access.

Migration Procedure

To move the Python CVS repository, the following steps need to be executed. The steps are elaborated upon in the following sections.

  1. Collect SSH keys for all current committers, along with usernames to appear in commit messages.
  2. At the beginning of the migration, announce that the repository on SourceForge is closed.
  3. 24 hours after the last commit, download the CVS repository.
  4. Convert the CVS repository into a Subversion repository.
  5. Publish the repository with write access for committers, and read-only anonymous access.
  6. Disable CVS access on SF.

Collect SSH keys

After some discussion, svn+ssh was selected as the best method for write access to the repository. Developers can continue to use their SSH keys, but they must be installed on python.org.

In order to avoid having to create a new Unix user for each developer, a single account should be used, with command= attributes in the authorized_keys files.

The lines in the authorized_keys file should read like this (wrapped for better readability):

command="/usr/bin/svnserve --root=/svnroot -t
--tunnel-user='<username>'",no-port-forwarding,
no-X11-forwarding,no-agent-forwarding,no-pty
ssh-dss <key> <comment>

For the usernames, developers' real names should be used instead of their SF account names, so that people can be more easily identified in log messages.

Administrator Access

Administrator access to the pythondev account should be granted to all current admins of the Python SF project. To distinguish between shell login and svnserve login, admins need to maintain two keys. Using OpenSSH, the following procedure can be used to create a second key:

cd .ssh
ssh-keygen -t dsa -f pythondev -C <user>@pythondev
vi config

In the config file, the following lines need to be added:

Host pythondev
  Hostname dinsdale.python.org
  User pythondev
  IdentityFile ~/.ssh/pythondev

Then, shell login becomes possible through "ssh pythondev".

Downloading the CVS Repository

The CVS repository can be downloaded from

http://cvs.sourceforge.net/cvstarballs/python-cvsroot.tar.bz2

Since this tarball is generated only once a day, some time must pass after the repository freeze before the tarball can be picked up. It should be verified that the last commit, as recorded on the python-commits mailing list, is indeed included in the tarball.

After the conversion, the converted CVS tarball should be kept forever on www.python.org/archive/python-cvsroot-<date>.tar.bz2

Converting the CVS Repository

The Python CVS repository contains two modules: distutils and python. The python module is further structured into dist and nondist, where dist only contains src (the python code proper). nondist contains various subdirectories.

These should be reorganized in the Subversion repository to get shorter URLs, following the <project>/{trunk,tags,branches} structure. A project will be created for each nondist directory, plus for src (called python), plus distutils. Reorganizing the repository is best done in the CVS tree, as shown below.

The fsfs backend should be used as the repository format (which requires Subversion 1.1). The fsfs backend has the advantage of being more backup-friendly, as it allows incremental repository backups, without requiring any dump commands to be run.

The conversion should be done using the cvs2svn utility, available e.g. in the cvs2svn Debian package. As cvs2svn does not currently support the project/trunk structure, each project needs to be converted separately. To get each conversion result into a separate directory in the target repository, svnadmin load must be used.

Subversion has a different view on binary-vs-text files than CVS. To correctly carry the CVS semantics forward, svn:eol-style should be set to native on all files that are not marked binary in the CVS.

In summary, the conversion script is:

#!/bin/sh
rm cvs2svn-*
rm -rf python py.new
tar xjf python-cvsroot.tar.bz2
rm -rf python/CVSROOT
svnadmin create --fs-type fsfs py.new
mv python/python python/orig
mv python/orig/dist/src python/python
mv python/orig/nondist/* python
# nondist/nondist is empty
rmdir python/nondist
rm -rf python/orig
for a in python/*
do
  b=`basename $a`
  cvs2svn -q --dump-only --encoding=latin1 --force-branch=cnri-16-start \
  --force-branch=descr-branch --force-branch=release152p1-patches \
  --force-tag=r16b1 $a
  svn mkdir -m"Conversion to SVN" file:///`pwd`/py.new/$b
  svnadmin load -q --parent-dir $b py.new < cvs2svn-dump
  rm cvs2svn-dump
done

Sample results of this conversion are available at

http://www.dcl.hpi.uni-potsdam.de/pysvn/

Publish the Repository

The repository should be published at http://svn.python.org/projects. Read-write access should be granted to all current SF committers through svn+ssh://pythondev@svn.python.org/; read-only anonymous access through WebDAV should also be granted.

As an option, websvn (available e.g. from the Debian websvn package) could be provided. Unfortunately, in the test installation, websvn breaks because it runs out of memory.

The current SF project admins should get write access to the authorized_keys2 file of the pythondev account.

Disable CVS

It appears that CVS cannot be disabled entirely. Only the user interface can be removed from the project page; the repository itself remains available. If desired, write access to the python and distutils modules can be disabled through a CVS commitinfo entry.

Discussion

Several alternatives had been suggested to the procedure above. The rejected alternatives are briefly discussed here:

  • create multiple repositories, one for python and one for distutils. This would have allowed even shorter URLs, but was rejected because a single repository supports moving code across projects.

  • Several people suggested creating the project/trunk structure through standard cvs2svn, followed by renames. This would have the disadvantage that old revisions use different path names than recent revisions; the suggested approach through dump files works without renames.

  • Several people also expressed concern about the administrative overhead that hosting the repository on python.org would cause to pydotorg admins. As a specific alternative, BerliOS has been suggested. The pydotorg admins themselves haven't objected to the additional workload; migrating the repository again if they get overworked is an option.

  • Different authentication strategies were discussed. The following alternatives to svn+ssh were suggested:

    • Subversion over WebDAV, using SSL and basic authentication, with pydotorg-generated passwords mailed to the user. People did not like that approach, since they would need to store the password on disk (because they can't remember it); this is a security risk.
    • Subversion over WebDAV, using SSL client certificates. This would work, but would require us to administer a certificate authority.
  • Instead of hosting this on python.org, people suggested hosting it elsewhere. One issue is whether this alternative should be free or commercial; several people suggested it should better be commercial, to reduce the load on the volunteers. In particular:

    • Greg Stein suggested http://www.wush.net/subversion.php. They offer 5 GB for $90/month, with 200 GB download/month. The data is on a RAID drive and fully backed up. Anonymous access and email commit notifications are supported. wush.net elaborated the following details:

      • The machine would be a Virtuozzo Virtual Private Server (VPS), hosted at PowerVPS.
      • The default repository URL would be http://python.wush.net/svn/projectname/, but anything else could be arranged
      • we would get SSH login to the machine, with sudo capabilities.
      • They have a Web interface for management of the various SVN repositories that we want to host, and to manage user accounts. While svn+ssh would be supported, the user interface does not yet support it.
      • For offsite mirroring/backup, they suggest using rsync instead of downloading repository tarballs.

      Bob Ippolito reported that they had used wush.net for a commercial project for about 6 months, after which time they left wush.net, because the service was down for three days, with nobody reachable, and no explanation when it came back.

pep-0348 Exception Reorganization for Python 3.0

PEP:348
Title:Exception Reorganization for Python 3.0
Version:$Revision$
Last-Modified:$Date$
Author:Brett Cannon <brett at python.org>
Status:Rejected
Type:Standards Track
Content-Type:text/x-rst
Created:28-Jul-2005
Post-History:

Note

This PEP has been rejected [20].

Abstract

Python, as of version 2.4, has 38 exceptions (including warnings) in the built-in namespace, in a rather shallow hierarchy. These classes have come about over the years without a chance to learn from experience. This PEP proposes reorganizing the hierarchy for Python 3.0, when backwards compatibility is not as much of an issue.

Along with this reorganization, adding a requirement that all objects passed to a raise statement must inherit from a specific superclass is proposed. This is to have guarantees about the basic interface of exceptions and to further enhance the natural hierarchy of exceptions.

Lastly, bare except clauses will be changed to be semantically equivalent to except Exception. Most people currently use bare except clauses for this purpose, and with the reorganized exception hierarchy this becomes a viable default.

Rationale For Wanting Change

Exceptions are a critical part of Python. While exceptions are traditionally used to signal errors in a program, they have also grown to be used for flow control for things such as iterators.

While their importance is great, there is a lack of structure to them. This stems from the fact that any object can be raised as an exception. Because of this you have no guarantee in terms of what kind of object will be raised, destroying any possible hierarchy raised objects might adhere to.

But exceptions do have a hierarchy, showing the severity of the exception. The hierarchy also groups related exceptions together to simplify catching them in except clauses. To allow people to be able to rely on this hierarchy, a common superclass that all raise objects must inherit from is being proposed. It also allows guarantees about the interface to raised objects to be made (see PEP 344 [2]). A discussion about all of this has occurred before on python-dev [4].

As bare except clauses stand now, they catch all exceptions. While this can be handy, it is rather overreaching for the common case. Thanks to having a required superclass, catching all exceptions is as easy as catching just one specific exception. This allows bare except clauses to be used for a more useful purpose. Once again, this has been discussed on python-dev [5].

Finally, slight changes to the exception hierarchy will make it much more reasonable in terms of structure. With minor rearranging, exceptions that should not typically be caught can be allowed to propagate to the top of the execution stack, terminating the interpreter as intended.

Philosophy of Reorganization

For the reorganization of the hierarchy, there was a general philosophy followed that developed from discussion of earlier drafts of this PEP [7], [8], [9], [10], [11], [12]. First and foremost was to not break anything that works. This meant that renaming exceptions was out of the question unless the name was deemed severely bad. This also meant no removal of exceptions unless they were viewed as truly misplaced. The introduction of new exceptions were only done in situations where there might be a use for catching a superclass of a category of exceptions. Lastly, existing exceptions would have their inheritance tree changed only if it was felt they were truly misplaced to begin with.

For all new exceptions, the proper suffix had to be chosen. For those that signal an error, "Error" is to be used. If the exception is a warning, then "Warning". "Exception" is to be used when none of the other suffixes are proper to use and no specific suffix is a better fit.

After that it came down to choosing which exceptions should and should not inherit from Exception. This was for the purpose of making bare except clauses more useful.

Lastly, the entire existing hierarchy had to inherit from the new exception meant to act as the required superclass for all exceptions to inherit from.

New Hierarchy

Note

Exceptions flagged with "stricter inheritance" will no longer inherit from a certain class. A "broader inheritance" flag means a class has been added to the exception's inheritance tree. All comparisons are against the Python 2.4 exception hierarchy.

+-- BaseException (new; broader inheritance for subclasses)
    +-- Exception
        +-- GeneratorExit (defined in PEP 342 [1])
        +-- StandardError
            +-- ArithmeticError
                +-- DivideByZeroError
                +-- FloatingPointError
                +-- OverflowError
            +-- AssertionError
            +-- AttributeError
            +-- EnvironmentError
                +-- IOError
                +-- EOFError
                +-- OSError
            +-- ImportError
            +-- LookupError
                +-- IndexError
                +-- KeyError
            +-- MemoryError
            +-- NameError
                +-- UnboundLocalError
            +-- NotImplementedError (stricter inheritance)
            +-- SyntaxError
                +-- IndentationError
                    +-- TabError
            +-- TypeError
            +-- RuntimeError
            +-- UnicodeError
                +-- UnicodeDecodeError
                +-- UnicodeEncodeError
                +-- UnicodeTranslateError
            +-- ValueError
            +-- ReferenceError
        +-- StopIteration
        +-- SystemError
        +-- Warning
            +-- DeprecationWarning
            +-- FutureWarning
            +-- PendingDeprecationWarning
            +-- RuntimeWarning
            +-- SyntaxWarning
            +-- UserWarning
        +-- WindowsError
    +-- KeyboardInterrupt (stricter inheritance)
    +-- SystemExit (stricter inheritance)

Differences Compared to Python 2.4

A more thorough explanation of terms is needed when discussing inheritance changes. Inheritance changes result in either broader or more restrictive inheritance. "Broader" is when a class's inheritance tree goes from cls, A to cls, B, A; "stricter" is the reverse.
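The distinction can be made concrete with a small sketch (the class names here are illustrative, not from the PEP):

```python
class A: pass
class B(A): pass

class Before(A): pass   # inheritance tree: Before, A
class After(B): pass    # inheritance tree: After, B, A -- "broader"

# Broader inheritance inserts B into the tree; every previous
# isinstance/issubclass relationship still holds, plus a new one.
assert issubclass(After, A) and issubclass(After, B)
assert issubclass(Before, A) and not issubclass(Before, B)
```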

BaseException

The superclass that all exceptions must inherit from. Its name was chosen to reflect that it is at the base of the exception hierarchy while being an exception itself. "Raisable" was considered as a name but was passed on because it did not properly reflect the fact that the class is itself an exception.

Direct inheritance of BaseException is not expected, and will be discouraged for the general case. Most user-defined exceptions should inherit from Exception instead. This allows catching Exception to continue to work in the common case of catching all exceptions that should be caught. Direct inheritance of BaseException should only be done in cases where an entirely new category of exception is desired.

But, for cases where all exceptions should be caught blindly, except BaseException will work.
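This part of the hierarchy is essentially what modern Python ended up with (via the later PEP 352), so the two catch-all idioms can be demonstrated directly; the `handle` helper below is illustrative, not from the PEP:

```python
def handle(exc):
    # Illustrative helper: raise *exc* and report which
    # catch-all clause caught it.
    try:
        raise exc
    except Exception:
        return "caught by except Exception"
    except BaseException:
        return "caught by except BaseException"

# SystemExit sits outside Exception, so only the blind
# catch-all sees it:
assert handle(ValueError("x")) == "caught by except Exception"
assert handle(SystemExit(0)) == "caught by except BaseException"
```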

KeyboardInterrupt and SystemExit

Both exceptions are no longer under Exception. This is to allow bare except clauses to act as a more viable default case by catching exceptions that inherit from Exception. With both KeyboardInterrupt and SystemExit acting as signals that the interpreter is expected to exit, catching them in the common case is the wrong semantics.

NotImplementedError

Inherits from Exception instead of from RuntimeError.

Originally inheriting from RuntimeError, NotImplementedError has no direct relation to that exception, which is meant for use in user code as a quick-and-dirty exception. Thus it now directly inherits from Exception.

Required Superclass for raise

By requiring all objects passed to a raise statement to inherit from a specific superclass, all exceptions are guaranteed to have certain attributes. If PEP 344 [2] is accepted, the attributes outlined there will be guaranteed to be on all exceptions raised. This should help facilitate debugging by making the querying of information from exceptions much easier.

The proposed hierarchy has BaseException as the required base class.

Implementation

Enforcement is straightforward. Modifying RAISE_VARARGS to do an inheritance check first before raising an exception should be enough. For the C API, all functions that set an exception will have the same inheritance check applied.
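Python 3 ultimately enforces exactly this kind of inheritance check, so the effect can be sketched directly (the class and helper names are illustrative):

```python
class NotAnException:    # deliberately does not inherit from BaseException
    pass

def attempt(obj):
    # Illustrative helper: try to raise *obj* and report the outcome.
    # TypeError is listed first so the check's rejection is visible.
    try:
        raise obj
    except TypeError:
        return "rejected by the inheritance check"
    except BaseException:
        return "raised normally"

assert attempt(NotAnException()) == "rejected by the inheritance check"
assert attempt(ValueError("ok")) == "raised normally"
```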

Bare except Clauses Catch Exception

In most existing Python 2.4 code, bare except clauses are too broad in the exceptions they catch. Typically only exceptions that signal an error are desired to be caught. This means that exceptions that are used to signify that the interpreter should exit should not be caught in the common case.

With KeyboardInterrupt and SystemExit moved to inherit from BaseException instead of Exception, changing bare except clauses to act as except Exception becomes a much more reasonable default. This change also will break very little code since these semantics are what most people want for bare except clauses.

The complete removal of bare except clauses has been argued for. The case has been made that they violate both Only One Way To Do It (OOWTDI) and Explicit Is Better Than Implicit (EIBTI) as listed in the Zen of Python [18]. But Practicality Beats Purity (PBP), also in the Zen of Python, trumps both of these in this case. The BDFL has stated that bare except clauses will work this way [17].

Implementation

The compiler will emit the bytecode for except Exception whenever a bare except clause is reached.

Transition Plan

Because of the complexity and clutter that would be required to add all features planned in this PEP, the transition plan is very simple. In Python 2.5, BaseException is added. In Python 3.0, all remaining features (the required superclass, the inheritance changes, and bare except clauses becoming the same as except Exception) will go into effect. Making all of this work in a backwards-compatible way in Python 2.5 would require very deep hacks in the exception machinery, which could be error-prone and lead to a slowdown in performance for little benefit.

To help with the transition, the documentation will be changed to reflect several programming guidelines:

  • When one wants to catch all exceptions, catch BaseException
  • To catch all exceptions that do not represent the termination of the interpreter, catch Exception explicitly
  • Explicitly catch KeyboardInterrupt and SystemExit; don't rely on inheritance from Exception to lead to the capture
  • Always catch NotImplementedError explicitly instead of relying on the inheritance from RuntimeError
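The guidelines above combine into a handler shape like the following sketch (`classify` is a hypothetical helper, not part of the PEP):

```python
def classify(exc):
    # Apply the documentation guidelines to an exception instance:
    # interpreter-exit signals propagate, NotImplementedError is
    # caught explicitly, and Exception covers ordinary errors.
    try:
        raise exc
    except (KeyboardInterrupt, SystemExit):
        return "let the interpreter terminate"
    except NotImplementedError:
        return "handle explicitly"
    except Exception:
        return "ordinary error handling"

assert classify(KeyboardInterrupt()) == "let the interpreter terminate"
assert classify(NotImplementedError()) == "handle explicitly"
assert classify(OSError("disk")) == "ordinary error handling"
```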

The documentation for the 'exceptions' module [6], tutorial [19], and PEP 290 [3] will all require updating.

Rejected Ideas

DeprecationWarning Inheriting From PendingDeprecationWarning

This was originally proposed because a DeprecationWarning can be viewed as a PendingDeprecationWarning that is being removed in the next version. But since enough people thought the inheritance could logically work the other way around, the idea was dropped.

AttributeError Inheriting From TypeError or NameError

Viewing attributes as part of the interface of a type caused the idea of inheriting from TypeError. But that partially defeats the thinking of duck typing and thus the idea was dropped.

Inheriting from NameError was suggested because objects can be viewed as having their own namespace where the attributes live and when an attribute is not found it is a namespace failure. This was also dropped as a possibility since not everyone shared this view.

Removal of EnvironmentError

Originally proposed based on the idea that EnvironmentError was an unneeded distinction, the BDFL overruled this idea [13].

Introduction of MacError and UnixError

Proposed to add symmetry to WindowsError, the BDFL said they won't be used enough [13]. The idea of then removing WindowsError was proposed and accepted as reasonable, thus completely negating the idea of adding these exceptions.

SystemError Subclassing SystemExit

Proposed because a SystemError is meant to lead to a system exit, the idea was removed since CriticalError indicates this better.

ControlFlowException Under Exception

It has been suggested that ControlFlowException should inherit from Exception. This idea has been rejected based on the thinking that control flow exceptions typically do not all need to be caught by a single except clause.

Rename NameError to NamespaceError

NameError is considered more succinct and leaves open no possible mistyping of the capitalization of "Namespace" [14].

Renaming RuntimeError or Introducing SimpleError

The thinking was that RuntimeError was in no way an obvious name for an exception meant to be used when a situation did not call for the creation of a new exception. The renaming was rejected on the basis that the exception is already used throughout the interpreter [15]. Rejection of SimpleError was founded on the thought that people should be free to use whatever exception they choose and not have one so blatantly suggested [16].

Renaming Existing Exceptions

Various renamings were suggested, but none garnered more than a +0 vote (e.g., renaming ReferenceError to WeakReferenceError). The thinking was that the existing names were fine and that no one had ever actively complained about them. To minimize backwards-compatibility issues and avoid causing existing Python programmers extra pain, the renamings were dropped.

Have EOFError Subclass IOError

The original thought was that since EOFError deals directly with I/O, it should subclass IOError. But since EOFError is used more as a signal that an event has occurred (the exhaustion of an I/O port), it should not subclass such a specific error exception.

Have MemoryError and SystemError Have a Common Superclass

Both classes deal with the interpreter, so why not have them have a common superclass? Because one of them means that the interpreter is in a state that it should not recover from while the other does not.

Common Superclass for PendingDeprecationWarning and DeprecationWarning

Grouping the deprecation warning exceptions together makes intuitive sense. But the idea does not hold up well when one considers how rarely either warning is used, let alone both at the same time.

Removing WindowsError

Originally proposed based on the idea that having such a platform-specific exception should not be in the built-in namespace. It turns out, though, enough code exists that uses the exception to warrant it staying.

Superclass for KeyboardInterrupt and SystemExit

Proposed to make catching exceptions that do not inherit from Exception easier, along with easing the transition to the new hierarchy, the idea was rejected by the BDFL [17]. The argument was that existing code did not show enough instances of the pair of exceptions being caught together to justify cluttering the built-in namespace.

Acknowledgements

Thanks to Robert Brewer, Josiah Carlson, Nick Coghlan, Timothy Delaney, Jack Diedrich, Fred L. Drake, Jr., Philip J. Eby, Greg Ewing, James Y. Knight, MA Lemburg, Guido van Rossum, Stephen J. Turnbull, Raymond Hettinger, and everyone else I missed for participating in the discussion.

References

[1]PEP 342 (Coroutines via Enhanced Generators) http://www.python.org/dev/peps/pep-0342/
[2](1, 2) PEP 344 (Exception Chaining and Embedded Tracebacks) http://www.python.org/dev/peps/pep-0344/
[3]PEP 290 (Code Migration and Modernization) http://www.python.org/dev/peps/pep-0290/
[4]python-dev Summary (An exception is an exception, unless it doesn't inherit from Exception) http://www.python.org/dev/summary/2004-08-01_2004-08-15.html#an-exception-is-an-exception-unless-it-doesn-t-inherit-from-exception
[5]python-dev email (PEP, take 2: Exception Reorganization for Python 3.0) http://mail.python.org/pipermail/python-dev/2005-August/055116.html
[6]exceptions module http://docs.python.org/library/exceptions.html
[7]python-dev thread (Pre-PEP: Exception Reorganization for Python 3.0) http://mail.python.org/pipermail/python-dev/2005-July/055020.html, http://mail.python.org/pipermail/python-dev/2005-August/055065.html
[8]python-dev thread (PEP, take 2: Exception Reorganization for Python 3.0) http://mail.python.org/pipermail/python-dev/2005-August/055103.html
[9]python-dev thread (Reorg PEP checked in) http://mail.python.org/pipermail/python-dev/2005-August/055138.html
[10]python-dev thread (Major revision of PEP 348 committed) http://mail.python.org/pipermail/python-dev/2005-August/055199.html
[11]python-dev thread (Exception Reorg PEP revised yet again) http://mail.python.org/pipermail/python-dev/2005-August/055292.html
[12]python-dev thread (PEP 348 (exception reorg) revised again) http://mail.python.org/pipermail/python-dev/2005-August/055412.html
[13](1, 2) python-dev email (Pre-PEP: Exception Reorganization for Python 3.0) http://mail.python.org/pipermail/python-dev/2005-July/055019.html
[14]python-dev email (PEP, take 2: Exception Reorganization for Python 3.0) http://mail.python.org/pipermail/python-dev/2005-August/055159.html
[15]python-dev email (Exception Reorg PEP checked in) http://mail.python.org/pipermail/python-dev/2005-August/055149.html
[16]python-dev email (Exception Reorg PEP checked in) http://mail.python.org/pipermail/python-dev/2005-August/055175.html
[17](1, 2) python-dev email (PEP 348 (exception reorg) revised again) http://mail.python.org/pipermail/python-dev/2005-August/055423.html
[18]PEP 20 (The Zen of Python) http://www.python.org/dev/peps/pep-0020/
[19]Python Tutorial http://docs.python.org/tutorial/
[20]python-dev email (Bare except clauses in PEP 348) http://mail.python.org/pipermail/python-dev/2005-August/055676.html

pep-0349 Allow str() to return unicode strings

PEP: 349
Title: Allow str() to return unicode strings
Version: $Revision$
Last-Modified: $Date$
Author: Neil Schemenauer <nas at arctrix.com>
Status: Deferred
Type: Standards Track
Content-Type: text/plain
Created: 02-Aug-2005
Python-Version: 2.5
Post-History: 06-Aug-2005

Abstract

    This PEP proposes to change the str() built-in function so that it
    can return unicode strings.  This change would make it easier to
    write code that works with either string type and would also make
    some existing code handle unicode strings.  The C function
    PyObject_Str() would remain unchanged and the function
    PyString_New() would be added instead.


Rationale

    Python has had a Unicode string type for some time now but use of
    it is not yet widespread.  There is a large amount of Python code
    that assumes that string data is represented as str instances.
    The long term plan for Python is to phase out the str type and use
    unicode for all string data.  Clearly, a smooth migration path
    must be provided.

    We need to upgrade existing libraries, written for str instances,
    to be made capable of operating in an all-unicode string world.
    We can't change to an all-unicode world until all essential
    libraries are made capable for it.  Upgrading the libraries in one
    shot does not seem feasible.  A more realistic strategy is to
    individually make the libraries capable of operating on unicode
    strings while preserving their current all-str environment
    behaviour.

    First, we need to be able to write code that can accept unicode
    instances without attempting to coerce them to str instances.  Let
    us label such code as Unicode-safe.  Unicode-safe libraries can be
    used in an all-unicode world.

    Second, we need to be able to write code that, when provided only
    str instances, will not create unicode results.  Let us label such
    code as str-stable.  Libraries that are str-stable can be used by
    libraries and applications that are not yet Unicode-safe.
    
    Sometimes it is simple to write code that is both str-stable and
    Unicode-safe.  For example, the following function just works:

        def appendx(s):
            return s + 'x'

    That's not too surprising since the unicode type is designed to
    make the task easier.  The principle is that when str and unicode
    instances meet, the result is a unicode instance.  One notable
    difficulty arises when code requires a string representation of an
    object; an operation traditionally accomplished by using the str()
    built-in function.
    
    Using the current str() function makes the code not Unicode-safe.
    Replacing a str() call with a unicode() call makes the code not
    str-stable.  Changing str() so that it could return unicode
    instances would solve this problem.  As a further benefit, some code
    that is currently not Unicode-safe because it uses str() would
    become Unicode-safe.


Specification

    A Python implementation of the str() built-in follows:

        def str(s):
            """Return a nice string representation of the object.  The
            return value is a str or unicode instance.
            """
            if type(s) is str or type(s) is unicode:
                return s
            r = s.__str__()
            if not isinstance(r, (str, unicode)):
                raise TypeError('__str__ returned non-string')
            return r
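Since the str/unicode split was later dissolved, the dispatch logic above can still be exercised with an analogous Python 3 sketch, in which the pair (bytes, str) stands in for Python 2's (str, unicode); `str_` and `string_types` are illustrative names, not part of the PEP:

```python
def str_(s, string_types=(bytes, str)):
    # Python 3 analog of the proposed str(): "string" instances pass
    # through untouched; anything else goes through __str__, whose
    # result is then type-checked.
    if type(s) in string_types:
        return s
    r = s.__str__()
    if not isinstance(r, string_types):
        raise TypeError('__str__ returned non-string')
    return r

assert str_(b'data') == b'data'   # passed through, not coerced
assert str_(42) == '42'           # falls back to __str__
```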
            
    The following function would be added to the C API and would be the
    equivalent of the str() built-in (ideally it would be called
    PyObject_Str, but changing that function could cause a massive
    number of compatibility problems):

        PyObject *PyString_New(PyObject *);

    A reference implementation is available on Sourceforge [1] as a
    patch.

                

Backwards Compatibility

    Some code may require that str() returns a str instance.  In the
    standard library, only one such case has been found so far.  The
    function email.header_decode() requires a str instance and the
    email.Header.decode_header() function tries to ensure this by
    calling str() on its argument.  The code was fixed by changing
    the line "header = str(header)" to:

        if isinstance(header, unicode):
            header = header.encode('ascii')

    Whether this is truly a bug is questionable since decode_header()
    really operates on byte strings, not character strings.  Code that
    passes it a unicode instance could itself be considered buggy.


Alternative Solutions

    A new built-in function could be added instead of changing str().
    Doing so would introduce virtually no backwards compatibility
    problems.  However, since the compatibility problems are expected
    to be rare, changing str() seems preferable to adding a new
    built-in.

    The basestring type could be changed to have the proposed behaviour,
    rather than changing str().  However, that would be confusing
    behaviour for an abstract base type.


References

    [1] http://www.python.org/sf/1266570


Copyright

    This document has been placed in the public domain.



pep-0350 Codetags

PEP:350
Title:Codetags
Version:$Revision$
Last-Modified:$Date$
Author:Micah Elliott <mde at tracos.org>
Status:Rejected
Type:Informational
Content-Type:text/x-rst
Created:27-Jun-2005
Post-History:10-Aug-2005, 26-Sep-2005

Rejection Notice

This PEP has been rejected. While the community may be interested, there is no desire to make the standard library conform to this standard.

Abstract

This informational PEP aims to provide guidelines for consistent use of codetags, which would enable the construction of standard utilities to take advantage of the codetag information, as well as making Python code more uniform across projects. Codetags also represent a very lightweight programming micro-paradigm and become useful for project management, documentation, change tracking, and project health monitoring. This is submitted as a PEP because its ideas are thought to be Pythonic, although the concepts are not unique to Python programming. Herein are the definition of codetags, the philosophy behind them, a motivation for standardized conventions, some examples, a specification, a toolset description, and possible objections to the Codetag project/paradigm.

This PEP is also living as a wiki [1] for people to add comments.

What Are Codetags?

Programmers widely use ad-hoc code comment markup conventions to serve as reminders of sections of code that need closer inspection or review. Examples of such markup include FIXME, TODO, XXX, and BUG, but there are many more in wide use in existing software. Such markup will henceforth be referred to as codetags. These codetags may show up in application code, unit tests, scripts, general documentation, or wherever suitable.

Codetags have been under discussion and in use (hundreds of codetags in the Python 2.4 sources) in many places (e.g., c2 [3]) for many years. See References for further historic and current information.

Philosophy

If you subscribe to most of these values, then codetags will likely be useful for you.

  1. As much information as possible should be contained inside the source code (application code or unit tests). This, along with the use of codetags, impedes duplication. Most documentation can be generated from that source code; e.g., by using help2man, man2html, docutils, epydoc/pydoc, ctdoc, etc.
  2. Information should be almost never duplicated -- it should be recorded in a single original format and all other locations should be automatically generated from the original, or simply be referenced. This is famously known as the Single Point Of Truth (SPOT) or Don't Repeat Yourself (DRY) rule.
  3. Documentation that gets into customers' hands should be auto-generated from single sources into all other output formats. People want documentation in many forms. It is thus important to have a documentation system that can generate all of these.
  4. The developers are the documentation team. They write the code and should know the code the best. There should not be a dedicated, disjoint documentation team for any non-huge project.
  5. Plain text (with non-invasive markup) is the best format for writing anything. All other formats are to be generated from the plain text.

Codetag design was influenced by the following goals:

  1. Comments should be short whenever possible.
  2. Codetag fields should be optional and of minimal length. Default values and custom fields can be set by individual code shops.
  3. Codetags should be minimalistic. The quicker it is to jot something down, the more likely it is to get jotted.
  4. The most common use of codetags will only have zero to two fields specified, and these should be the easiest to type and read.

Motivation

  • Various productivity tools can be built around codetags.

    See Tools.

  • Encourages consistency.

    Historically, a subset of these codetags has been used informally in the majority of source code in existence, whether in Python or in other languages. Tags have been used in an inconsistent manner with different spellings, semantics, format, and placement. For example, some programmers might include datestamps and/or user identifiers, limit to a single line or not, spell the codetag differently than others, etc.

  • Encourages adherence to SPOT/DRY principle.

    E.g., generating a roadmap dynamically from codetags instead of keeping TODOs in sync with separate roadmap document.

  • Easy to remember.

    All codetags must be concise, intuitive, and semantically non-overlapping with others. The format must also be simple.

  • Use not required/imposed.

    If you don't use codetags already, there's no obligation to start, and no risk of affecting code (but see Objections). A small subset can be adopted and the Tools will still be useful (a few codetags have probably already been adopted on an ad-hoc basis anyway). Also it is very easy to identify and remove (and possibly record) a codetag that is no longer deemed useful.

  • Gives a global view of code.

    Tools can be used to generate documentation and reports.

  • A logical location for capturing CRCs/Stories/Requirements.

    The XP community often does not electronically capture Stories, but codetags seem like a good place to locate them.

  • Extremely lightweight process.

    Creating tickets in a tracking system for every thought degrades development velocity. Even if a ticketing system is employed, codetags are useful for simply containing links to those tickets.

Examples

This shows a simple codetag as commonly found in sources everywhere (with the addition of a trailing <>):

# FIXME: Seems like this loop should be finite. <>
while True: ...

The following contrived example demonstrates a typical use of codetags. It uses some of the available fields to specify the assignees (a pair of programmers with initials MDE and CLE), the Date of expected completion (Week 14), and the Priority of the item (2):

# FIXME: Seems like this loop should be finite. <MDE,CLE d:14w p:2>
while True: ...

This codetag shows a bug with fields describing author, discovery (origination) date, due date, and priority:

# BUG: Crashes if run on Sundays.
# <MDE 2005-09-04 d:14w p:2>
if day == 'Sunday': ...

Here is a demonstration of how not to use codetags. This has many problems: 1) Codetags cannot share a line with code; 2) Missing colon after mnemonic; 3) A codetag referring to codetags is usually useless, and worse, it is not completable; 4) No need to have a bunch of fields for a trivial codetag; 5) Fields with unknown values (t:XXX) should not be used:

i = i + 1   # TODO Add some more codetags.
# <JRNewbie 2005-04-03 d:2005-09-03 t:XXX d:14w p:0 s:inprogress>

Specification

This describes the format: syntax, mnemonic names, fields, and semantics, and also the separate DONE File.

General Syntax

Each codetag should be inside a comment, and can be any number of lines. It should not share a line with code. It should match the indentation of surrounding code. The end of the codetag is marked by a pair of angle brackets <> containing optional fields, which must not be split onto multiple lines. It is preferred to have a codetag in # comments instead of string comments. There can be multiple fields per codetag, all of which are optional.

In short, a codetag consists of a mnemonic, a colon, commentary text, an opening angle bracket, an optional list of fields, and a closing angle bracket. E.g.,

# MNEMONIC: Some (maybe multi-line) commentary. <field field ...>
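A rough regular expression for this grammar, as an illustrative approximation rather than a tool defined by this PEP:

```python
import re

# Mnemonic, colon, commentary, then the <field ...> terminator.
# The character class covers uppercase mnemonics plus ??? and !!!.
CODETAG = re.compile(
    r'#\s*(?P<mnemonic>[A-Z?!]{3,}):\s*'
    r'(?P<comment>.*?)\s*'
    r'<(?P<fields>[^>]*)>\s*$'
)

m = CODETAG.search('# FIXME: Seems like this loop should be finite. <MDE d:14w p:2>')
assert m.group('mnemonic') == 'FIXME'
assert m.group('comment') == 'Seems like this loop should be finite.'
assert m.group('fields') == 'MDE d:14w p:2'
```

A codetag with no fields still matches, since `[^>]*` accepts an empty field list inside the trailing `<>`.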

Mnemonics

The codetags of interest are listed below, using the following format:

recommended mnemonic (& synonym list)
canonical name: semantics
TODO (MILESTONE, MLSTN, DONE, YAGNI, TBD, TOBEDONE)
To do: Informal tasks/features that are pending completion.
FIXME (XXX, DEBUG, BROKEN, REFACTOR, REFACT, RFCTR, OOPS, SMELL, NEEDSWORK, INSPECT)
Fix me: Areas of problematic or ugly code needing refactoring or cleanup.
BUG (BUGFIX)
Bugs: Reported defects tracked in bug database.
NOBUG (NOFIX, WONTFIX, DONTFIX, NEVERFIX, UNFIXABLE, CANTFIX)
Will Not Be Fixed: Problems that are well-known but will never be addressed due to design problems or domain limitations.
REQ (REQUIREMENT, STORY)
Requirements: Satisfactions of specific, formal requirements.
RFE (FEETCH, NYI, FR, FTRQ, FTR)
Requests For Enhancement: Roadmap items not yet implemented.
IDEA
Ideas: Possible RFE candidates, but less formal than RFE.
??? (QUESTION, QUEST, QSTN, WTF)
Questions: Misunderstood details.
!!! (ALERT)
Alerts: In need of immediate attention.
HACK (CLEVER, MAGIC)
Hacks: Temporary code to force inflexible functionality, or simply a test change, or workaround a known problem.
PORT (PORTABILITY, WKRD)
Portability: Workarounds specific to OS, Python version, etc.
CAVEAT (CAV, CAVT, WARNING, CAUTION)
Caveats: Implementation details/gotchas that stand out as non-intuitive.
NOTE (HELP)
Notes: Sections where a code reviewer found something that needs discussion or further investigation.
FAQ
Frequently Asked Questions: Interesting areas that require external explanation.
GLOSS (GLOSSARY)
Glossary: Definitions for project glossary.
SEE (REF, REFERENCE)
See: Pointers to other code, web link, etc.
TODOC (DOCDO, DODOC, NEEDSDOC, EXPLAIN, DOCUMENT)
Needs Documentation: Areas of code that still need to be documented.
CRED (CREDIT, THANKS)
Credits: Accreditations for external provision of enlightenment.
STAT (STATUS)
Status: File-level statistical indicator of maturity of this file.
RVD (REVIEWED, REVIEW)
Reviewed: File-level indicator that review was conducted.

File-level codetags might be better suited as properties in the revision control system, but might still be appropriately specified in a codetag.

Some of these are temporary (e.g., FIXME) while others are persistent (e.g., REQ). A mnemonic was chosen over a synonym using three criteria: descriptiveness, length (shorter is better), commonly used.

Choosing between FIXME and XXX is difficult. XXX seems to be more common, but much less descriptive. Furthermore, XXX is a useful placeholder in a piece of code having a value that is unknown. Thus FIXME is the preferred spelling. Sun says [4] that XXX and FIXME are slightly different, giving XXX higher severity. However, with decades of chaos on this topic, and too many millions of developers who won't be influenced by Sun, it is easy to rightly call them synonyms.

DONE is always a completed TODO item, but this should probably be indicated through the revision control system and/or a completion recording mechanism (see DONE File).

It may be a useful metric to count NOTE tags: a high count may indicate a design (or other) problem. But of course the majority of codetags indicate areas of code needing some attention.

An FAQ is probably more appropriately documented in a wiki where users can more easily view and contribute.

Fields

All fields are optional. The proposed standard fields are described in this section. Note that upper case field characters are intended to be replaced.

The Originator/Assignee and Origination Date/Week fields are the most common and don't usually require a prefix.

This lengthy list of fields is liable to scare people (the intended minimalists) away from adopting codetags, but keep in mind that these only exist to support programmers who either 1) like to keep BUG or RFE codetags in a complete form, or 2) are using codetags as their complete and only tracking system. In other words, many of these fields will be used very rarely. They are gathered largely from industry-wide conventions, and example sources include GCC Bugzilla [5] and Python's SourceForge [6] tracking systems.

AAA[,BBB]...
List of Originator or Assignee initials (the context determines which unless both should exist). It is also okay to use usernames such as MicahE instead of initials. Initials (in upper case) are the preferred form.
a:AAA[,BBB]...
List of Assignee initials. This is necessary only in (rare) cases where a codetag has both an assignee and an originator, and they are different. Otherwise the a: prefix is omitted, and context determines the intent. E.g., FIXME usually has an Assignee, and NOTE usually has an Originator, but if a FIXME was originated (and initialed) by a reviewer, then the assignee's initials would need an a: prefix.
YYYY[-MM[-DD]] or WW[.D]w
The Origination Date indicating when the comment was added, in ISO 8601 [2] format (digits and hyphens only). Or Origination Week, an alternative form for specifying an Origination Date. A day of the week can be optionally specified. The w suffix is necessary for distinguishing from a date.
d:YYYY[-MM[-DD]] or d:WW[.D]w
Due Date (d) target completion (estimate). Or Due Week (d), an alternative to specifying a Due Date.
p:N
Priority (p) level. Range (N) is from 0..3 with 3 being the highest. 0..3 are analogous to low, medium, high, and showstopper/critical. The Severity field could be factored into this single number, and doing so is recommended since having both is subject to varying interpretation. The range and order should be customizable. The existence of this field is important for any tool that itemizes codetags. Thus a (customizable) default value should be supported.
t:NNNN
Tracker (t) number corresponding to associated Ticket ID in separate tracking system.

The following fields are also available but expected to be less common.

c:AAAA
Category (c) indicating some specific area affected by this item.
s:AAAA
Status (s) indicating state of item. Examples are "unexplored", "understood", "inprogress", "fixed", "done", "closed". Note that when an item is completed it is probably better to remove the codetag and record it in a DONE File.
i:N
Development cycle Iteration (i). Useful for grouping codetags into completion target groups.
r:N
Development cycle Release (r). Useful for grouping codetags into completion target groups.

To summarize, the non-prefixed fields are initials and origination date, and the prefixed fields are: assignee (a), due (d), priority (p), tracker (t), category (c), status (s), iteration (i), and release (r).

It should be possible for groups to define or add their own fields, and these should have upper case prefixes to distinguish them from the standard set. Examples of custom fields are Operating System (O), Severity (S), Affected Version (A), Customer (C), etc.
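To illustrate how these fields lend themselves to machine processing, here is a minimal parser sketch in Python. The regular expression, function names, and the field-dispatch heuristics are illustrative assumptions, not part of this specification:

```python
import re

# Matches a codetag comment such as:
#   # FIXME: Handle the degenerate case. <MDE 2005-09-26 p:2 t:1042>
# capturing the mnemonic, the free-text comment, and the <...> field block.
CODETAG_RE = re.compile(
    r'#\s*(?P<mnemonic>[A-Z?!]{3,})\s*:?\s*'
    r'(?P<comment>.*?)\s*'
    r'<(?P<fields>[^>]*)>')

def parse_fields(field_text):
    """Split a <...> field block into prefixed and non-prefixed fields."""
    parsed = {}
    for token in field_text.split():
        if ':' in token:                  # prefixed field, e.g. p:2 or t:1042
            prefix, value = token.split(':', 1)
            parsed[prefix] = value
        elif token[0].isdigit():          # origination date (or week)
            parsed['date'] = token
        else:                             # originator/assignee initials
            parsed['initials'] = token.split(',')
    return parsed

def scan(line):
    """Return a dict describing the codetag on this line, or None."""
    match = CODETAG_RE.search(line)
    if match is None:
        return None
    info = parse_fields(match.group('fields'))
    info['mnemonic'] = match.group('mnemonic')
    info['comment'] = match.group('comment')
    return info
```

For example, scanning "# FIXME: Handle the degenerate case. <MDE 2005-09-26 p:2 t:1042>" recovers the mnemonic, the comment, the originator's initials, the origination date, the priority, and the tracker number.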

DONE File

Some codetags have an ability to be completed (e.g., FIXME, TODO, BUG). It is often important to retain completed items by recording them with a completion date stamp. Such completed items are best stored in a single location, global to a project (or maybe a package). The proposed format is most easily described by an example, say ~/src/fooproj/DONE:

# TODO: Recurse into subdirs only on blue
# moons. <MDE 2003-09-26>
[2005-09-26 Oops, I underestimated this one a bit.  Should have
used Warsaw's First Law!]

# FIXME: ...
...

You can see that the codetag is copied verbatim from the original source file. The date stamp is then entered on the following line with an optional post-mortem commentary. The entry is terminated by a blank line (\n\n).

It may sound burdensome to have to delete codetag lines every time one gets completed. But in practice it is quite easy to set up a Vim or Emacs mapping to auto-record a codetag deletion in this format (sans the commentary).
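Such an auto-record mapping could delegate to a helper like the following sketch (the function name and signature are hypothetical; the output follows the DONE File format above):

```python
import datetime

def record_done(codetag_lines, done_path, commentary=''):
    """Append a completed codetag to the project's DONE file.

    The codetag lines are copied verbatim, followed by a bracketed
    completion date stamp (with optional post-mortem commentary) and
    the blank-line terminator shown in the example above.
    """
    stamp = datetime.date.today().isoformat()
    with open(done_path, 'a') as done:
        for line in codetag_lines:
            done.write(line.rstrip('\n') + '\n')
        if commentary:
            done.write('[%s %s]\n\n' % (stamp, commentary))
        else:
            done.write('[%s]\n\n' % stamp)
```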

Tools

Currently, programmers (and sometimes analysts) typically use grep to generate a list of items corresponding to a single codetag. However, various hypothetical productivity tools could take advantage of a consistent codetag format. Some example tools follow.

Document Generator
Possible docs: glossary, roadmap, manpages
Codetag History
Track (with revision control system interface) when a BUG tag (or any codetag) originated/resolved in a code section
Code Statistics
A project Health-O-Meter
Codetag Lint
Notify of invalid use of codetags, and aid in porting to codetags
Story Manager/Browser
An electronic means to replace XP notecards. In MVC terms, the codetag is the Model, and the Story Manager could be a graphical Viewer/Controller for visual rearrangement, prioritization, assignment, and milestone management.
Any Text Editor
Used for changing, removing, adding, rearranging, recording codetags.

There are some tools already in existence that take advantage of a smaller set of pseudo-codetags (see References). There is also an example codetags implementation under way, known as the Codetag Project [7].
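A hypothetical ctlint-style check is easy to sketch once the format is fixed. This fragment flags codetags that are missing the required <> terminator (the mnemonic list is abbreviated for illustration):

```python
# A line is flagged when its comment contains a known mnemonic but no
# <...> field block; the spec requires every codetag to end with <>.
MNEMONICS = ('TODO', 'FIXME', 'BUG', 'NOTE', 'HACK', 'RFE', 'IDEA')

def lint(lines):
    """Yield (line_number, line) for codetags missing the <> terminator."""
    for number, line in enumerate(lines, 1):
        comment = line.partition('#')[2]
        if any(m in comment for m in MNEMONICS) and '<' not in comment:
            yield number, line.rstrip()
```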

Objections

Objection:Extreme Programming argues that such codetags should not ever exist in code since the code is the documentation.
Defense:Maybe you should put the codetags in the unit test files instead. Besides, it's tough to generate documentation from uncommented source code.

Objection:Too much existing code has not followed proposed guidelines.
Defense:[Simple] utilities (ctlint) could convert existing code.

Objection:Causes duplication with tracking system.
Defense:Not really, unless fields are abused. If an item exists in the tracker, a simple ticket number in the codetag tracker field is sufficient. Maybe a duplicated title would be acceptable. Furthermore, it's too burdensome to have a ticket filed for every item that pops into a developer's mind on-the-go. Additionally, the tracking system could possibly be obviated for simple or small projects that can reasonably fit the relevant data into a codetag.

Objection:Codetags are ugly and clutter code.
Defense:That is a good point. But I'd still rather have such info in a single place (the source code) than various other documents, likely getting duplicated or forgotten about. The completed codetags can be sent off to the DONE File, or to the bit bucket.

Objection:Codetags (and all comments) get out of date.
Defense:Not so much if other sources (externally visible documentation) depend on their being accurate.

Objection:Codetags tend to only rarely have estimated completion dates of any sort. OK, the fields are optional, but you want to suggest fields that actually will be widely used.
Defense:If an item is inestimable don't bother with specifying a date field. Using tools to display items with order and/or color by due date and/or priority, it is easier to make estimates. Having your roadmap be a dynamic reflection of your codetags makes you much more likely to keep the codetags accurate.

Objection:Named variables for the field parameters in the <> should be used instead of cryptic one-character prefixes. I.e., <MDE p:3> should rather be <author=MDE, priority=3>.
Defense:It is just too much typing/verbosity to spell out fields. I argue that p:3 i:2 is as readable as priority=3, iteration=2 and is much more likely to be typed and remembered (see bullet C in Philosophy). In this case practicality beats purity. There are not many fields to keep track of so one letter prefixes are suitable.

Objection:Synonyms should be deprecated since it is better to have a single way to spell something.
Defense:Many programmers prefer short mnemonic names, especially in comments. This is why short mnemonics were chosen as the primary names. However, others feel that an explicit spelling is less confusing and less prone to error. There will always be two camps on this subject. Thus synonyms (and complete, full spellings) should remain supported.

Objection:It is cruel to use [for mnemonics] opaque acronyms and abbreviations which drop vowels; it's hard to figure these things out. On that basis I hate: MLSTN RFCTR RFE FEETCH, NYI, FR, FTRQ, FTR WKRD RVDBY
Defense:Mnemonics are preferred since they are pretty easy to remember and take up less space. If programmers didn't like dropping vowels we would be able to fit very little code on a line. The space is important for those who write comments that often fit on a single line. But when using the full canonical spellings everywhere, it is much less likely that a codetag will fit on a line.

Objection:It takes too long to type the fields.
Defense:Then don't use (most or any of) them, especially if you're the only programmer. Terminating a codetag with <> is a small chore, and in doing so you enable the use of the proposed tools. Editor auto-completion of codetags is also useful: You can program your editor to stamp a template (e.g. # FIXME . <MDE {date}>) with just a keystroke or two.

Objection:WorkWeek is an obscure and uncommon time unit.
Defense:That's true, but it is a highly suitable unit of granularity for estimation/targeting purposes, and it is very compact. ISO 8601 [2] is widely understood, but only allows you to specify either a specific day (restrictive) or a month (broad).

Objection:I aesthetically dislike for the comment to be terminated with <> in the empty field case.
Defense:It is necessary to have a terminator since codetags may be followed by non-codetag comments. Or codetags could be limited to a single line, but that's prohibitive. I can't think of any single-character terminator that is appropriate and significantly better than <>. Maybe @ could be a terminator, but then most codetags will have an unnecessary @.

Objection:I can't use codetags when writing HTML, or less specifically, XML. Maybe @fields@ would be better than <fields> as the delimiters.
Defense:Maybe you're right, but <> looks nicer whenever applicable. XML/SGML could use @ while more common programming languages stick to <>.

pep-0351 The freeze protocol

PEP:351
Title:The freeze protocol
Version:$Revision$
Last-Modified:$Date$
Author:Barry Warsaw <barry at python.org>
Status:Rejected
Type:Standards Track
Content-Type:text/x-rst
Created:14-Apr-2005
Post-History:

Abstract

This PEP describes a simple protocol for requesting a frozen, immutable copy of a mutable object. It also defines a new built-in function which uses this protocol to provide an immutable copy on any cooperating object.

Rejection Notice

This PEP was rejected. For a rationale, see this thread on python-dev [1].

Rationale

Built-in objects such as dictionaries and sets accept only immutable objects as keys. This means that mutable objects like lists cannot be used as keys to a dictionary. However, a Python programmer can convert a list to a tuple; the two objects are similar, but the latter is immutable, and can be used as a dictionary key.

It is conceivable that third party objects also have similar mutable and immutable counterparts, and it would be useful to have a standard protocol for conversion of such objects.

sets.Set objects expose a "protocol for automatic conversion to immutable" so that you can create sets.Sets of sets.Sets. PEP 218 deliberately dropped this feature from built-in sets. This PEP argues that the feature is still useful and proposes a standard mechanism for its support.

Proposal

It is proposed that a new built-in function called freeze() be added.

If freeze() is passed an immutable object, as determined by hash() on that object not raising a TypeError, then the object is returned directly.

If freeze() is passed a mutable object (i.e. hash() of that object raises a TypeError), then freeze() will call that object's __freeze__() method to get an immutable copy. If the object does not have a __freeze__() method, then a TypeError is raised.

Sample implementations

Here is a Python implementation of the freeze() built-in:

def freeze(obj):
    try:
        hash(obj)
        return obj
    except TypeError:
        freezer = getattr(obj, '__freeze__', None)
        if freezer:
            return freezer()
        raise TypeError('object is not freezable')

Here are some code samples which show the intended semantics:

class xset(set):
    def __freeze__(self):
        return frozenset(self)

class xlist(list):
    def __freeze__(self):
        return tuple(self)

class imdict(dict):
    def __hash__(self):
        return id(self)

    def _immutable(self, *args, **kws):
        raise TypeError('object is immutable')

    __setitem__ = _immutable
    __delitem__ = _immutable
    clear       = _immutable
    update      = _immutable
    setdefault  = _immutable
    pop         = _immutable
    popitem     = _immutable

class xdict(dict):
    def __freeze__(self):
        return imdict(self)

>>> s = set([1, 2, 3])
>>> {s: 4}
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
TypeError: set objects are unhashable
>>> t = freeze(s)
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "/usr/tmp/python-lWCjBK.py", line 9, in freeze
TypeError: object is not freezable
>>> t = xset(s)
>>> u = freeze(t)
>>> {u: 4}
{frozenset([1, 2, 3]): 4}
>>> x = 'hello'
>>> freeze(x) is x
True
>>> d = xdict(a=7, b=8, c=9)
>>> hash(d)
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
TypeError: dict objects are unhashable
>>> hash(freeze(d))
-1210776116
>>> {d: 4}
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
TypeError: dict objects are unhashable
>>> {freeze(d): 4}
{{'a': 7, 'c': 9, 'b': 8}: 4}

Reference implementation

Patch 1335812 [2] provides the C implementation of this feature. It adds the freeze() built-in, along with implementations of the __freeze__() method for lists and sets. Dictionaries are not easily freezable in current Python, so an implementation of dict.__freeze__() is not provided yet.

Open issues

  • Should we define a similar protocol for thawing frozen objects?
  • Should dicts and sets automatically freeze their mutable keys?
  • Should we support "temporary freezing" (perhaps with a method called __congeal__()) a la __as_temporarily_immutable__() in sets.Set?
  • For backward compatibility with sets.Set, should we support __as_immutable__()? Or should __freeze__() just be renamed to __as_immutable__()?

pep-0352 Required Superclass for Exceptions

PEP:352
Title:Required Superclass for Exceptions
Version:$Revision$
Last-Modified:$Date$
Author:Brett Cannon, Guido van Rossum
Status:Final
Type:Standards Track
Content-Type:text/x-rst
Created:27-Oct-2005
Post-History:

Abstract

In Python 2.4 and before, any (classic) class can be raised as an exception. The plan for 2.5 was to allow new-style classes, but this makes the problem worse -- it would mean any class (or instance) can be raised! This is a problem as it prevents any guarantees from being made about the interface of exceptions. This PEP proposes introducing a new superclass that all raised objects must inherit from. Imposing the restriction will allow a standard interface for exceptions to exist that can be relied upon. It also leads to a known hierarchy for all exceptions to adhere to.

One might counter that requiring a specific base class for a particular interface is unPythonic. However, in the specific case of exceptions there's a good reason (which has generally been agreed to on python-dev): requiring hierarchy helps code that wants to catch exceptions by making it possible to catch all exceptions explicitly by writing except BaseException: instead of except *:. [2]

Introducing a new superclass for exceptions also gives us the chance to rearrange the exception hierarchy slightly for the better. As it currently stands, all exceptions in the built-in namespace inherit from Exception. This is a problem since this includes two exceptions (KeyboardInterrupt and SystemExit) that often need to be excepted from the application's exception handling: the default behavior of shutting the interpreter down without a traceback is usually more desirable than whatever the application might do (with the possible exception of applications that emulate Python's interactive command loop with >>> prompt). Changing it so that these two exceptions inherit from the common superclass instead of Exception will make it easy for people to write except clauses that are not overreaching and not catch exceptions that should propagate up.

This PEP is based on previous work done for PEP 348 [1].

Requiring a Common Superclass

This PEP proposes introducing a new exception named BaseException that is a new-style class and has a single attribute, args. Below is the code as the exception will work in Python 3.0 (how it will work in Python 2.x is covered in the Transition Plan section):

class BaseException(object):

    """Superclass representing the base of the exception hierarchy.

    Provides an 'args' attribute that contains all arguments passed
    to the constructor.  Suggested practice, though, is that only a
    single string argument be passed to the constructor.

    """

    def __init__(self, *args):
        self.args = args

    def __str__(self):
        if len(self.args) == 1:
            return str(self.args[0])
        else:
            return str(self.args)

    def __repr__(self):
        return "%s(*%s)" % (self.__class__.__name__, repr(self.args))

No restriction is placed upon what may be passed in for args, for backwards-compatibility reasons. In practice, though, only a single string argument should be used. This keeps the string representation of the exception a useful, human-readable message about the exception; this is why the __str__ method special-cases the length-1 args value. Programmatic information (e.g., an error code number) should be stored as a separate attribute in a subclass.

The raise statement will be changed to require that any object passed to it must inherit from BaseException. This will make sure that all exceptions fall within a single hierarchy that is anchored at BaseException [2]. This also guarantees a basic interface that is inherited from BaseException. The change to raise will be enforced starting in Python 3.0 (see the Transition Plan below).

With BaseException being the root of the exception hierarchy, Exception will now inherit from it.

Exception Hierarchy Changes

With the exception hierarchy now even more important since it has a basic root, a change to the existing hierarchy is called for. As it stands now, if one wants to catch all exceptions that signal an error and do not mean the interpreter should be allowed to exit, one must either list all but two exceptions explicitly in an except clause, or catch the two exceptions separately, re-raise them, and let all other exceptions fall through to a bare except clause:

except (KeyboardInterrupt, SystemExit):
    raise
except:
    ...

That is needlessly explicit. This PEP proposes moving KeyboardInterrupt and SystemExit to inherit directly from BaseException.

- BaseException
  |- KeyboardInterrupt
  |- SystemExit
  |- Exception
     |- (all other current built-in exceptions)

Doing this makes catching Exception more reasonable. It would catch only exceptions that signify errors. Exceptions that signal that the interpreter should exit will not be caught and thus be allowed to propagate up and allow the interpreter to terminate.

KeyboardInterrupt has been moved since users typically expect an application to exit when they press the interrupt key (usually Ctrl-C). If people have overly broad except clauses the expected behaviour does not occur.

SystemExit has been moved for similar reasons. Since the exception is raised when sys.exit() is called the interpreter should normally be allowed to terminate. Unfortunately overly broad except clauses can prevent the explicitly requested exit from occurring.

To make sure that people catch Exception most of the time, various parts of the documentation and tutorials will need to be updated to strongly suggest that Exception be what programmers want to use. Bare except clauses or catching BaseException directly should be discouraged based on the fact that KeyboardInterrupt and SystemExit almost always should be allowed to propagate up.
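A small illustrative snippet shows the intended effect of the rearranged hierarchy: an except Exception clause handles ordinary errors while letting the two interpreter-exit exceptions pass through (here they are caught by a second clause purely to report the outcome):

```python
def handle(exc):
    """Report how a typical 'except Exception' handler treats exc."""
    try:
        raise exc
    except Exception:        # ordinary errors stop here
        return 'handled'
    except BaseException:    # KeyboardInterrupt/SystemExit reach this clause
        return 'propagates'

# ValueError is an ordinary error; the two interpreter-exit
# exceptions escape the Exception clause.
assert handle(ValueError('bad data')) == 'handled'
assert handle(KeyboardInterrupt()) == 'propagates'
assert handle(SystemExit(0)) == 'propagates'
```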

Transition Plan

Since semantic changes to Python are being proposed, a transition plan is needed. The goal is to end up with the new semantics being used in Python 3.0 while providing a smooth transition for 2.x code. All deprecations mentioned in the plan will lead to the removal of the semantics starting in the version following the initial deprecation.

Here is BaseException as implemented in the 2.x series:

import warnings

class BaseException(object):

    """Superclass representing the base of the exception hierarchy.

    The __getitem__ method is provided for backwards-compatibility
    and will be deprecated at some point.  The 'message' attribute
    is also deprecated.

    """

    def __init__(self, *args):
        self.args = args

    def __str__(self):
        return str(self.args[0]
                   if len(self.args) == 1
                   else self.args)

    def __repr__(self):
        func_args = repr(self.args) if self.args else "()"
        return self.__class__.__name__ + func_args

    def __getitem__(self, index):
        """Index into arguments passed in during instantiation.

        Provided for backwards-compatibility and will be
        deprecated.

        """
        return self.args[index]

    def _get_message(self):
        """Method for 'message' property."""
        warnings.warn("the 'message' attribute has been deprecated "
                        "since Python 2.6")
        return self.args[0] if len(self.args) == 1 else ''

    message = property(_get_message,
                        doc="access the 'message' attribute; "
                            "deprecated and provided only for "
                            "backwards-compatibility")

Deprecation of features in Python 2.9 is optional. This is because it is not known at this time if Python 2.9 (which is slated to be the last version in the 2.x series) will actively deprecate features that will not be in 3.0. It is conceivable that no deprecation warnings will be used in 2.9 since there could be such a difference between 2.9 and 3.0 that it would make 2.9 too "noisy" in terms of warnings. Thus the proposed deprecation warnings for Python 2.9 will be revisited when development of that version begins, to determine if they are still desired.

  • Python 2.5 [done]
    • all standard exceptions become new-style classes [done]
    • introduce BaseException [done]
    • Exception, KeyboardInterrupt, and SystemExit inherit from BaseException [done]
    • deprecate raising string exceptions [done]
  • Python 2.6 [done]
    • deprecate catching string exceptions [done]
    • deprecate message attribute (see Retracted Ideas) [done]
  • Python 2.7 [done]
    • deprecate raising exceptions that do not inherit from BaseException
  • Python 3.0 [done]
    • drop everything that was deprecated above:
      • string exceptions (both raising and catching) [done]
      • all exceptions must inherit from BaseException [done]
      • drop __getitem__, message [done]

Retracted Ideas

A previous version of this PEP that was implemented in Python 2.5 included a 'message' attribute on BaseException. Its purpose was to begin a transition to BaseException accepting only a single argument. This was to tighten the interface and to force people to use attributes in subclasses to carry arbitrary information with an exception instead of cramming it all into args.

Unfortunately, while implementing the removal of the args attribute in Python 3.0 at the PyCon 2007 sprint [4], it was discovered that the transition was very painful, especially for C extension modules. It was decided that it would be better to deprecate the message attribute in Python 2.6 (and remove it in Python 2.7 and Python 3.0) and consider a more long-term transition strategy in Python 3.0 to remove multiple-argument support in BaseException in preference of accepting only a single argument. Thus the introduction of message and the original deprecation of args has been retracted.

References

[1]PEP 348 (Exception Reorganization for Python 3.0) http://www.python.org/dev/peps/pep-0348/
[2](1, 2) python-dev Summary for 2004-08-01 through 2004-08-15 http://www.python.org/dev/summary/2004-08-01_2004-08-15.html#an-exception-is-an-exception-unless-it-doesn-t-inherit-from-exception
[3]SF patch #1104669 (new-style exceptions) http://www.python.org/sf/1104669
[4]python-3000 email ("How far to go with cleaning up exceptions") http://mail.python.org/pipermail/python-3000/2007-March/005911.html

pep-0353 Using ssize_t as the index type

PEP:353
Title:Using ssize_t as the index type
Version:$Revision$
Last-Modified:$Date$
Author:Martin von Löwis <martin at v.loewis.de>
Status:Final
Type:Standards Track
Content-Type:text/x-rst
Created:18-Dec-2005
Post-History:

Abstract

In Python 2.4, indices of sequences are restricted to the C type int. On 64-bit machines, sequences therefore cannot use the full address space, and are restricted to 2**31 elements. This PEP proposes to change this, introducing a platform-specific index type Py_ssize_t. An implementation of the proposed change is in http://svn.python.org/projects/python/branches/ssize_t.

Rationale

64-bit machines are becoming more popular, and the size of main memory increases beyond 4GiB. On such machines, Python currently is limited, in that sequences (strings, unicode objects, tuples, lists, array.arrays, ...) cannot contain more than 2**31 elements.

Today, very few machines have enough memory to represent larger lists: as each pointer is 8 bytes (on a 64-bit machine), one needs 16GiB just to hold the pointers of such a list; with data in the list, the memory consumption grows even more. However, there are three container types for which users request improvements today:

  • strings (currently restricted to 2GiB)
  • mmap objects (likewise; plus the system typically won't keep the whole object in memory concurrently)
  • Numarray objects (from Numerical Python)

As the proposed change will cause incompatibilities on 64-bit machines, it should be carried out while such machines are not in wide use (IOW, as early as possible).
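The limit in question is visible from Python itself. In interpreters released after this PEP (Python 2.6 and later expose PY_SSIZE_T_MAX as sys.maxsize), a quick sketch confirms that the maximum index tracks the platform word size, at least on common flat-address-space machines where size_t and void* have the same width:

```python
import struct
import sys

# sys.maxsize is PY_SSIZE_T_MAX, the largest value a sequence index
# (Py_ssize_t) can take: 2**31 - 1 on 32-bit builds, 2**63 - 1 on 64-bit.
pointer_bits = struct.calcsize('P') * 8   # width of void* in bits
assert sys.maxsize == 2 ** (pointer_bits - 1) - 1
```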

Specification

A new type Py_ssize_t is introduced, which has the same size as the compiler's size_t type, but is signed. It will be a typedef for ssize_t where available.

The internal representation of the length fields of all container types is changed from int to Py_ssize_t, for all types included in the standard distribution. In particular, PyObject_VAR_HEAD is changed to use Py_ssize_t, affecting all extension modules that use that macro.

All occurrences of index and length parameters and results are changed to use Py_ssize_t, including the sequence slots in type objects, and the buffer interface.

New conversion functions PyInt_FromSsize_t and PyInt_AsSsize_t are introduced. PyInt_FromSsize_t will transparently return a long int object if the value exceeds LONG_MAX; PyInt_AsSsize_t will transparently process long int objects.

New function pointer typedefs ssizeargfunc, ssizessizeargfunc, ssizeobjargproc, ssizessizeobjargproc, and lenfunc are introduced. The buffer interface function types are now called readbufferproc, writebufferproc, segcountproc, and charbufferproc.

A new conversion code 'n' is introduced for PyArg_ParseTuple, Py_BuildValue, PyObject_CallFunction, and PyObject_CallMethod. This code operates on Py_ssize_t.

The conversion codes 's#' and 't#' will output Py_ssize_t if the macro PY_SSIZE_T_CLEAN is defined before Python.h is included, and continue to output int if that macro isn't defined.

At places where a conversion from size_t/Py_ssize_t to int is necessary, the strategy for conversion is chosen on a case-by-case basis (see next section).

To prevent loading extension modules that assume a 32-bit size type into an interpreter that has a 64-bit size type, Py_InitModule4 is renamed to Py_InitModule4_64.

Conversion guidelines

Module authors have the choice whether they support this PEP in their code or not; if they support it, they have the choice of different levels of compatibility.

If a module is not converted to support this PEP, it will continue to work unmodified on a 32-bit system. On a 64-bit system, compile-time errors and warnings might be issued, and the module might crash the interpreter if the warnings are ignored.

Conversion of a module can either attempt to continue using int indices, or use Py_ssize_t indices throughout.

If the module should continue to use int indices, care must be taken when calling functions that return Py_ssize_t or size_t, in particular, for functions that return the length of an object (this includes the strlen function and the sizeof operator). A good compiler will warn when a Py_ssize_t/size_t value is truncated into an int. In these cases, three strategies are available:

  • statically determine that the size can never exceed an int (e.g. when taking the sizeof a struct, or the strlen of a file pathname). In this case, write:

    some_int = Py_SAFE_DOWNCAST(some_value, Py_ssize_t, int);
    

    This will add an assertion in debug mode that the value really fits into an int, and just add a cast otherwise.

  • statically determine that the value shouldn't overflow an int unless there is a bug in the C code somewhere. Test whether the value is smaller than INT_MAX, and raise an InternalError if it isn't.

  • otherwise, check whether the value fits an int, and raise a ValueError if it doesn't.

The same care must be taken for tp_as_sequence slots. In addition, the signatures of these slots change, and the slots must be explicitly recast (e.g. from intargfunc to ssizeargfunc). Compatibility with previous Python versions can be achieved with the test:

#if PY_VERSION_HEX < 0x02050000 && !defined(PY_SSIZE_T_MIN)
typedef int Py_ssize_t;
#define PY_SSIZE_T_MAX INT_MAX
#define PY_SSIZE_T_MIN INT_MIN
#endif

and then using Py_ssize_t in the rest of the code. For the tp_as_sequence slots, additional typedefs might be necessary; alternatively, by replacing:

PyObject* foo_item(struct MyType* obj, int index)
{
  ...
}

with:

PyObject* foo_item(PyObject* _obj, Py_ssize_t index)
{
   struct MyType* obj = (struct MyType*)_obj;
   ...
}

it becomes possible to drop the cast entirely; the type of foo_item should then match the sq_item slot in all Python versions.

If the module should be extended to use Py_ssize_t indices, all usages of the type int should be reviewed, to see whether it should be changed to Py_ssize_t. The compiler will help in finding the spots, but a manual review is still necessary.

Particular care must be taken for PyArg_ParseTuple calls: they all need to be checked for s# and t# converters, and PY_SSIZE_T_CLEAN must be defined before including Python.h if the calls have been updated accordingly.

Fredrik Lundh has written a scanner [1] which checks the code of a C module for usage of APIs whose signature has changed.

Discussion

Why not size_t

An initial attempt to implement this feature tried to use size_t. It quickly turned out that this cannot work: Python uses negative indices in many places (to indicate counting from the end). Even in places where size_t would be usable, too many reformulations of code were necessary, e.g. in loops like:

for(index = length-1; index >= 0; index--)

This loop will never terminate if index is changed from int to size_t.
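
The wrap-around that keeps such a loop from terminating can be demonstrated from Python with the standard ctypes module (an illustrative sketch, not part of the PEP's implementation):

```python
import ctypes

# The loop above relies on index going negative to terminate.  An unsigned
# size_t cannot go negative: decrementing zero wraps around to SIZE_MAX,
# so the condition "index >= 0" is always true.
index = ctypes.c_size_t(0)
index.value -= 1          # 0 - 1 wraps instead of becoming -1
assert index.value > 0    # index is now SIZE_MAX, not -1
```

With a signed Py_ssize_t the decrement produces -1 and the loop exits as intended.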

Why not Py_intptr_t

Conceptually, Py_intptr_t and Py_ssize_t are different things: Py_intptr_t needs to be the same size as void*, and Py_ssize_t the same size as size_t. These could differ, e.g. on machines where pointers have segment and offset. On current flat-address space machines, there is no difference, so for all practical purposes, Py_intptr_t would have worked as well.
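
The claim about flat-address-space machines can be checked from Python with ctypes (a sketch assuming a common flat-address platform, where the assertion holds):

```python
import ctypes

# On flat-address-space machines a pointer and a size_t have the same
# width, which is why Py_intptr_t would have worked as well in practice.
assert ctypes.sizeof(ctypes.c_void_p) == ctypes.sizeof(ctypes.c_size_t)
```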

Doesn't this break much code?

With the changes proposed, code breakage is fairly minimal. On a 32-bit system, no code will break, as Py_ssize_t is just a typedef for int.

On a 64-bit system, the compiler will warn in many places. If these warnings are ignored, the code will continue to work as long as the container sizes don't exceed 2**31, i.e. it will work nearly as well as it does currently. There are two exceptions: first, if the extension module implements the sequence protocol, it must be updated, or the calling conventions will be wrong. Second, wherever Py_ssize_t is output through a pointer (rather than a return value), the code must be updated; this applies most notably to codecs and slice objects.

If the conversion of the code is made, the same code can continue to work on earlier Python releases.

Doesn't this consume too much memory?

One might think that using Py_ssize_t in all tuples, strings, lists, etc. is a waste of space. This is not true, though: on a 32-bit machine, there is no change. On a 64-bit machine, the size of many containers doesn't change, e.g.

  • in lists and tuples, a pointer immediately follows the ob_size member. This means that the compiler currently inserts 4 padding bytes; with the change, these padding bytes become part of the size.
  • in strings, the ob_shash field follows ob_size. This field is of type long, which is a 64-bit type on most 64-bit systems (except Win64), so the compiler inserts padding before it as well.
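
The padding argument above can be checked from Python with ctypes. The sketch below models the two header layouts (the field names are illustrative); on a typical 64-bit platform with 8-byte pointers, both structs occupy 16 bytes:

```python
import ctypes

# An int ob_size followed by a pointer (the pre-change layout) versus a
# ssize_t ob_size followed by a pointer.  On a 64-bit platform the pointer
# is 8-byte aligned, so the compiler pads the int version, and widening
# ob_size merely turns the padding bytes into usable size bits.
class HeadWithInt(ctypes.Structure):
    _fields_ = [("ob_size", ctypes.c_int),
                ("ob_item", ctypes.c_void_p)]

class HeadWithSsize(ctypes.Structure):
    _fields_ = [("ob_size", ctypes.c_ssize_t),
                ("ob_item", ctypes.c_void_p)]
```

On such a platform ctypes.sizeof reports the same size for both layouts.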

Open Issues

  • Marc-Andre Lemburg commented that complete backwards compatibility with existing source code should be preserved. In particular, functions that have Py_ssize_t* output arguments should continue to run correctly even if the callers pass int*.

    It is not clear what strategy could be used to implement that requirement.

pep-0354 Enumerations in Python

PEP:354
Title:Enumerations in Python
Version:$Revision$
Last-Modified:$Date$
Author:Ben Finney <ben+python at benfinney.id.au>
Status:Superseded
Type:Standards Track
Content-Type:text/x-rst
Created:20-Dec-2005
Python-Version:2.6
Post-History:20-Dec-2005
Superseded-By:435

Rejection Notice

This PEP has been rejected. This doesn't slot nicely into any of the existing modules (like collections), and the Python standard library eschews having lots of individual data structures in their own modules. Also, the PEP has generated no widespread interest. For those who need enumerations, there are cookbook recipes and PyPI packages that meet these needs.

Note: this PEP was superseded by PEP 435, which was accepted in May 2013.

Abstract

This PEP specifies an enumeration data type for Python.

An enumeration is an exclusive set of symbolic names bound to arbitrary unique values. Values within an enumeration can be iterated and compared, but the values have no inherent relationship to values outside the enumeration.

Motivation

The properties of an enumeration are useful for defining an immutable, related set of constant values that have a defined sequence but no inherent semantic meaning. Classic examples are days of the week (Sunday through Saturday) and school assessment grades ('A' through 'D', and 'F'). Other examples include error status values and states within a defined process.

It is possible to simply define a sequence of values of some other basic type, such as int or str, to represent discrete arbitrary values. However, an enumeration ensures that such values are distinct from any others, and that operations without meaning ("Wednesday times two") are not defined for these values.

Specification

An enumerated type is created from a sequence of arguments to the type's constructor:

>>> Weekdays = enum('sun', 'mon', 'tue', 'wed', 'thu', 'fri', 'sat')
>>> Grades = enum('A', 'B', 'C', 'D', 'F')

Enumerations with no values are meaningless. The exception EnumEmptyError is raised if the constructor is called with no value arguments.

The values are bound to attributes of the new enumeration object:

>>> today = Weekdays.mon

The values can be compared:

>>> if today == Weekdays.fri:
...     print "Get ready for the weekend"

Values within an enumeration cannot be meaningfully compared except with values from the same enumeration. The comparison operation functions return NotImplemented [1] when a value from an enumeration is compared against any value not from the same enumeration or of a different type:

>>> gym_night = Weekdays.wed
>>> gym_night.__cmp__(Weekdays.mon)
1
>>> gym_night.__cmp__(Weekdays.wed)
0
>>> gym_night.__cmp__(Weekdays.fri)
-1
>>> gym_night.__cmp__(23)
NotImplemented
>>> gym_night.__cmp__("wed")
NotImplemented
>>> gym_night.__cmp__(Grades.B)
NotImplemented

This allows the operation to succeed, evaluating to a boolean value:

>>> gym_night = Weekdays.wed
>>> gym_night < Weekdays.mon
False
>>> gym_night < Weekdays.wed
False
>>> gym_night < Weekdays.fri
True
>>> gym_night < 23
False
>>> gym_night > 23
True
>>> gym_night > "wed"
True
>>> gym_night > Grades.B
True

Coercing a value from an enumeration to a str results in the string that was specified for that value when constructing the enumeration:

>>> gym_night = Weekdays.wed
>>> str(gym_night)
'wed'

The sequence index of each value from an enumeration is exported as an integer via that value's index attribute:

>>> gym_night = Weekdays.wed
>>> gym_night.index
3

An enumeration can be iterated, returning its values in the sequence they were specified when the enumeration was created:

>>> print [str(day) for day in Weekdays]
['sun', 'mon', 'tue', 'wed', 'thu', 'fri', 'sat']

Values from an enumeration are hashable, and can be used as dict keys:

>>> plans = {}
>>> plans[Weekdays.sat] = "Feed the horse"

The normal usage of enumerations is to provide a set of possible values for a data type, which can then be used to map to other information about the values:

>>> for report_grade in Grades:
...     report_students[report_grade] = \
...         [s for s in students if s.grade == report_grade]
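
The behaviour specified above can be sketched in a few lines. This is an illustrative sketch, not the PEP's reference implementation; it uses rich comparisons (__eq__/__lt__) in place of the historical __cmp__ protocol shown in the examples, and in current Python, ordering a value against an unrelated type raises TypeError instead of falling back to a default ordering:

```python
import functools

class EnumEmptyError(ValueError):
    """Raised when an enumeration is created with no values."""

@functools.total_ordering
class _EnumValue:
    def __init__(self, enumtype, index, key):
        self.enumtype = enumtype   # the owning enumeration
        self.index = index         # sequence position, exported as specified
        self.key = key             # the string the value was created from

    def __str__(self):
        return self.key

    def __hash__(self):            # values are usable as dict keys
        return hash((id(self.enumtype), self.index))

    def __eq__(self, other):
        if isinstance(other, _EnumValue) and other.enumtype is self.enumtype:
            return self.index == other.index
        return NotImplemented      # unrelated values compare unequal

    def __lt__(self, other):
        if isinstance(other, _EnumValue) and other.enumtype is self.enumtype:
            return self.index < other.index
        return NotImplemented

class enum:
    def __init__(self, *keys):
        if not keys:
            raise EnumEmptyError("enumerations need at least one value")
        self._values = tuple(_EnumValue(self, i, k)
                             for i, k in enumerate(keys))
        for value in self._values:
            setattr(self, value.key, value)

    def __iter__(self):            # iteration preserves creation order
        return iter(self._values)
```

With this sketch, str(Weekdays.wed) is 'wed', Weekdays.wed.index is 3, and values from different enumerations compare unequal.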

Rationale -- Other designs considered

All in one class

Some implementations have the enumeration and its values all as attributes of a single object or class.

This PEP specifies a design where the enumeration is a container, and the values are simple comparables. It was felt that attempting to place all the properties of enumeration within a single class complicates the design without apparent benefit.

Metaclass for creating enumeration classes

The enumerations specified in this PEP are instances of an enum type. Some alternative designs implement each enumeration as its own class, and a metaclass to define common properties of all enumerations.

One motivation for having a class (rather than an instance) for each enumeration is to allow subclasses of enumerations, extending and altering an existing enumeration. A class, though, implies that instances of that class will be created; it is difficult to imagine what it means to have separate instances of a "days of the week" class, where each instance contains all days. This usually leads to having each class follow the Singleton pattern, further complicating the design.

In contrast, this PEP specifies enumerations that are not expected to be extended or modified. It is, of course, possible to create a new enumeration from the string values of an existing one, or even subclass the enum type if desired.

Hiding attributes of enumerated values

A previous design had the enumerated values hiding as much as possible about their implementation, to the point of not exporting the string key and sequence index.

The design in this PEP acknowledges that programs will often find it convenient to know the enumerated value's enumeration type, sequence index, and string key specified for the value. These are exported by the enumerated value as attributes.

Implementation

This design is based partly on a recipe [2] from the Python Cookbook.

The PyPI package enum [3] provides a Python implementation of the data types described in this PEP.

References and Footnotes

[1]The NotImplemented return value from comparison operations signals the Python interpreter to attempt alternative comparisons or other fallbacks. <http://docs.python.org/reference/datamodel.html#the-standard-type-hierarchy>
[2]"First Class Enums in Python", Zoran Isailovski, Python Cookbook recipe 413486 <http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/413486>
[3]Python Package Index, package enum <http://cheeseshop.python.org/pypi/enum/>

pep-0355 Path - Object oriented filesystem paths

PEP: 355
Title: Path - Object oriented filesystem paths
Version: $Revision$
Last-Modified: $Date$
Author: Björn Lindqvist <bjourne at gmail.com>
Status: Rejected
Type: Standards Track
Content-Type: text/plain
Created: 24-Jan-2006
Python-Version: 2.5
Post-History: 

Rejection Notice

    This PEP has been rejected (in this form).  The proposed path class
    is the ultimate kitchen sink; but the notion that it's better to
    implement *all* functionality that uses a path as a method on a single
    class is an anti-pattern.  (E.g.why not open()?  Or execfile()?)
    Subclassing from str is a particularly bad idea; many string
    operations make no sense when applied to a path.  This PEP has
    lingered, and while the discussion flares up from time to time,
    it's time to put this PEP out of its misery.  A less far-fetched
    proposal might be more palatable.


Abstract

    This PEP describes a new class, Path, to be added to the os
    module, for handling paths in an object oriented fashion.  The
    "weak" deprecation of various related functions is also discussed
    and recommended.


Background

    The ideas expressed in this PEP are not recent, but have been
    debated in the Python community for many years.  Many have felt
    that the API for manipulating file paths as offered in the os.path
    module is inadequate.  The first proposal for a Path object was
    raised by Just van Rossum on python-dev in 2001 [2].  In 2003,
    Jason Orendorff released version 1.0 of the "path module" which
    was the first public implementation that used objects to represent
    paths [3].

    The path module quickly became very popular and numerous attempts
    were made to get the path module included in the Python standard
    library; [4], [5], [6], [7].

    This PEP summarizes the ideas and suggestions people have
    expressed about the path module and proposes that a modified
    version should be included in the standard library.


Motivation

    Dealing with filesystem paths is a common task in any programming
    language, and very common in a high-level language like Python.
    Good support for this task is needed, because:

    - Almost every program uses paths to access files.  It makes sense
      that a task, that is so often performed, should be as intuitive
      and as easy to perform as possible.

    - It makes Python an even better replacement language for
      over-complicated shell scripts.

    Currently, Python has a large number of different functions
    scattered over half a dozen modules for handling paths.  This
    makes it hard for newbies and experienced developers to choose
    the right method.

    The Path class provides the following enhancements over the
    current common practice:

    - One "unified" object provides all functionality from previous
      functions.

    - Subclassability - the Path object can be extended to support
      paths other than filesystem paths.  The programmer does not need
      to learn a new API, but can reuse his or her knowledge of Path
      to deal with the extended class.

    - With all related functionality in one place, the right approach
      is easier to learn as one does not have to hunt through many
      different modules for the right functions.

    - Python is an object oriented language.  Just as files,
      datetimes and sockets are objects, so are paths; they are not
      merely strings to be passed to functions.  Path objects are an
      inherently Pythonic idea.

    - Path takes advantage of properties.  Properties make for more
      readable code:

      if imgpath.ext == '.jpg':
          jpegdecode(imgpath)

      is better than:

      if os.path.splitext(imgpath)[1] == '.jpg':
          jpegdecode(imgpath)


Rationale

    The following points summarize the design:

    - Path extends from string, therefore all code which expects
      string pathnames need not be modified and no existing code will
      break.

    - A Path object can be created either by using the classmethod
      Path.cwd, by instantiating the class with a string representing
      a path or by using the default constructor which is equivalent
      to Path(".").

    - Path provides common pathname manipulation, pattern expansion,
      pattern matching and other high-level file operations including
      copying.  Basically Path provides everything path-related except
      the manipulation of file contents, for which file objects are
      better suited.

    - Platform incompatibilities are dealt with by not instantiating
      system specific methods.


Specification

    This class defines the following public interface (docstrings have
    been extracted from the reference implementation, and shortened
    for brevity; see the reference implementation for more detail):

    class Path(str):

        # Special Python methods:
        def __new__(cls, *args) => Path
            """
            Creates a new path object concatenating the *args.  *args
            may only contain Path objects or strings.  If *args is
            empty, Path(os.curdir) is created.
            """
        def __repr__(self): ...
        def __add__(self, more): ...
        def __radd__(self, other): ...

        # Alternative constructor.
        def cwd(cls): ...

        # Operations on path strings:
        def abspath(self) => Path
            """Returns the absolute path of self as a new Path object."""
        def normcase(self): ...
        def normpath(self): ...
        def realpath(self): ...
        def expanduser(self): ...
        def expandvars(self): ...
        def basename(self): ...
        def expand(self): ...
        def splitpath(self) => (Path, str)
            """p.splitpath() -> Return (p.parent, p.name)."""
        def stripext(self) => Path
            """p.stripext() -> Remove one file extension from the path."""
        def splitunc(self): ...  [1]
        def splitall(self): ...
        def relpath(self): ...
        def relpathto(self, dest): ...

        # Properties about the path:
        parent => Path
            """This Path's parent directory as a new path object."""
        name => str
            """The name of this file or directory without the full path."""
        ext => str
            """
            The file extension or an empty string if Path refers to a
            file without an extension or a directory.
            """
        drive => str
            """
            The drive specifier.  Always empty on systems that don't
            use drive specifiers.
            """
        namebase => str
            """
            The same as path.name, but with one file extension
            stripped off.
            """
        uncshare[1]

        # Operations that return lists of paths:
        def listdir(self, pattern = None): ...
        def dirs(self, pattern = None): ...
        def files(self, pattern = None): ...
        def walk(self, pattern = None): ...
        def walkdirs(self, pattern = None): ...
        def walkfiles(self, pattern = None): ...
        def match(self, pattern) => bool
            """Returns True if self.name matches the given pattern."""

        def matchcase(self, pattern) => bool
            """
            Like match() but is guaranteed to be case sensitive even
            on platforms with case insensitive filesystems.
            """
        def glob(self, pattern):

        # Methods for retrieving information about the filesystem
        # path:
        def exists(self): ...
        def isabs(self): ...
        def isdir(self): ...
        def isfile(self): ...
        def islink(self): ...
        def ismount(self): ...
        def samefile(self, other): ...  [1]
        def atime(self): ...
            """Last access time of the file."""
        def mtime(self): ...
            """Last-modified time of the file."""
        def ctime(self): ...
            """
            Return the system's ctime which, on some systems (like
            Unix) is the time of the last change, and, on others (like
            Windows), is the creation time for path.
            """
        def size(self): ...
        def access(self, mode): ...  [1]
        def stat(self): ...
        def lstat(self): ...
        def statvfs(self): ...  [1]
        def pathconf(self, name): ...  [1]

        # Methods for manipulating information about the filesystem
        # path.
        def utime(self, times) => None
        def chmod(self, mode) => None
        def chown(self, uid, gid) => None [1]
        def rename(self, new) => None
        def renames(self, new) => None

        # Create/delete operations on directories
        def mkdir(self, mode = 0777): ...
        def makedirs(self, mode = 0777): ...
        def rmdir(self): ...
        def removedirs(self): ...

        # Modifying operations on files
        def touch(self): ...
        def remove(self): ...
        def unlink(self): ...

        # Modifying operations on links
        def link(self, newpath): ...
        def symlink(self, newlink): ...
        def readlink(self): ...
        def readlinkabs(self): ...

        # High-level functions from shutil
        def copyfile(self, dst): ...
        def copymode(self, dst): ...
        def copystat(self, dst): ...
        def copy(self, dst): ...
        def copy2(self, dst): ...
        def copytree(self, dst, symlinks = True): ...
        def move(self, dst): ...
        def rmtree(self, ignore_errors = False, onerror = None): ...

        # Special stuff from os
        def chroot(self): ...  [1]
        def startfile(self): ...  [1]
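
    The core of the subclass-from-str design can be sketched briefly.
    This is an illustrative sketch, not the reference implementation;
    it implements only __new__ and three of the properties above, and
    note that, following os.path.splitext, ext here keeps its leading
    dot (".so" rather than "so"):

```python
import os

# Minimal sketch: a Path that *is* a string, so it can be passed
# unchanged to open() and to existing os functions.
class Path(str):
    def __new__(cls, *args):
        # Concatenate the arguments with os.path.join; with no
        # arguments, default to Path(os.curdir).
        return str.__new__(cls, os.path.join(*args) if args else os.curdir)

    @property
    def parent(self):
        """This Path's parent directory as a new Path object."""
        return Path(os.path.dirname(self))

    @property
    def name(self):
        """The name of this file or directory without the full path."""
        return os.path.basename(self)

    @property
    def ext(self):
        """The file extension, or an empty string if there is none."""
        return os.path.splitext(self)[1]
```

    Because Path inherits from str, Path("/lib", "libc.so") compares
    equal to the plain string "/lib/libc.so", which is exactly the
    compatibility property the Rationale section relies on.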


Replacing older functions with the Path class

    In this section, "a ==> b" means that b can be used as a
    replacement for a.

    In the following examples, we assume that the Path class is
    imported with "from path import Path".

    1. Replacing os.path.join
    --------------------------

    os.path.join(os.getcwd(), "foobar")
    ==>
    Path(Path.cwd(), "foobar")

    os.path.join("foo", "bar", "baz")
    ==>
    Path("foo", "bar", "baz")


    2. Replacing os.path.splitext
    ------------------------------

    fname = "Python2.4.tar.gz"
    os.path.splitext(fname)[1]
    ==>
    fname = Path("Python2.4.tar.gz")
    fname.ext

    Or if you want both parts:

    fname = "Python2.4.tar.gz"
    base, ext = os.path.splitext(fname)
    ==>
    fname = Path("Python2.4.tar.gz")
    base, ext = fname.namebase, fname.ext


    3. Replacing glob.glob
    -----------------------

    lib_dir = "/lib"
    libs = glob.glob(os.path.join(lib_dir, "*.so"))
    ==>
    lib_dir = Path("/lib")
    libs = lib_dir.files("*.so")


Deprecations

    Introducing this module to the standard library introduces a need
    for the "weak" deprecation of a number of existing modules and
    functions.  These modules and functions are so widely used that
    they cannot be truly deprecated, as in generating
    DeprecationWarning.  Here "weak deprecation" means notes in the
    documentation only.

    The table below lists the existing functionality that should be
    deprecated.

        Path method/property    Deprecates function
        --------------------    -------------------
        normcase()              os.path.normcase()
        normpath()              os.path.normpath()
        realpath()              os.path.realpath()
        expanduser()            os.path.expanduser()
        expandvars()            os.path.expandvars()
        parent                  os.path.dirname()
        name                    os.path.basename()
        splitpath()             os.path.split()
        drive                   os.path.splitdrive()
        ext                     os.path.splitext()
        splitunc()              os.path.splitunc()
        __new__()               os.path.join(), os.curdir
        listdir()               os.listdir() [fnmatch.filter()]
        match()                 fnmatch.fnmatch()
        matchcase()             fnmatch.fnmatchcase()
        glob()                  glob.glob()
        exists()                os.path.exists()
        isabs()                 os.path.isabs()
        isdir()                 os.path.isdir()
        isfile()                os.path.isfile()
        islink()                os.path.islink()
        ismount()               os.path.ismount()
        samefile()              os.path.samefile()
        atime()                 os.path.getatime()
        ctime()                 os.path.getctime()
        mtime()                 os.path.getmtime()
        size()                  os.path.getsize()
        cwd()                   os.getcwd()
        access()                os.access()
        stat()                  os.stat()
        lstat()                 os.lstat()
        statvfs()               os.statvfs()
        pathconf()              os.pathconf()
        utime()                 os.utime()
        chmod()                 os.chmod()
        chown()                 os.chown()
        rename()                os.rename()
        renames()               os.renames()
        mkdir()                 os.mkdir()
        makedirs()              os.makedirs()
        rmdir()                 os.rmdir()
        removedirs()            os.removedirs()
        remove()                os.remove()
        unlink()                os.unlink()
        link()                  os.link()
        symlink()               os.symlink()
        readlink()              os.readlink()
        chroot()                os.chroot()
        startfile()             os.startfile()
        copyfile()              shutil.copyfile()
        copymode()              shutil.copymode()
        copystat()              shutil.copystat()
        copy()                  shutil.copy()
        copy2()                 shutil.copy2()
        copytree()              shutil.copytree()
        move()                  shutil.move()
        rmtree()                shutil.rmtree()

    The Path class deprecates the whole of os.path, shutil, fnmatch
    and glob.  A big chunk of os is also deprecated.


Closed Issues

    A number of contentious issues have been resolved since this PEP
    first appeared on python-dev:

    * The __div__() method was removed.  Overloading the / (division)
      operator may be "too much magic" and make path concatenation
      appear to be division.  The method can always be re-added later
      if the BDFL so desires.  In its place, __new__() got an *args
      argument that accepts both Path and string objects.  The *args
      are concatenated with os.path.join() which is used to construct
      the Path object.  These changes obsoleted the problematic
      joinpath() method which was removed.

    * The methods and the properties getatime()/atime,
      getctime()/ctime, getmtime()/mtime and getsize()/size duplicated
      each other.  These methods and properties have been merged to
      atime(), ctime(), mtime() and size().  The reason they are not
      properties instead, is because there is a possibility that they
      may change unexpectedly.  The following example is not
      guaranteed to always pass the assertion:

        p = Path("foobar")
        s = p.size()
        assert p.size() == s


Open Issues

    Some functionality of Jason Orendorff's path module has been
    omitted:

    * Function for opening a path - better handled by the builtin
      open().

    * Functions for reading and writing whole files - better handled
      by file objects' own read() and write() methods.

    * A chdir() function may be a worthy inclusion.

    * A deprecation schedule needs to be set up.  How much
      functionality should Path implement?  How much of existing
      functionality should it deprecate and when?

    * The name obviously has to be either "path" or "Path," but where
      should it live?  In its own module or in os?

    * Due to Path subclassing either str or unicode, the following
      non-magic, public methods are available on Path objects:

        capitalize(), center(), count(), decode(), encode(),
        endswith(), expandtabs(), find(), index(), isalnum(),
        isalpha(), isdigit(), islower(), isspace(), istitle(),
        isupper(), join(), ljust(), lower(), lstrip(), replace(),
        rfind(), rindex(), rjust(), rsplit(), rstrip(), split(),
        splitlines(), startswith(), strip(), swapcase(), title(),
        translate(), upper(), zfill()

      On python-dev it has been argued whether this inheritance is
      sane or not.  Most participants in the debate said that most
      string methods don't make sense in the context of filesystem
      paths -- they are just dead weight.  The other position, also
      argued on python-dev, is that inheriting from string is very
      convenient because it allows code to "just work" with Path
      objects without having to be adapted for them.

      One of the problems is that at the Python level, there is no way
      to make an object "string-like enough," so that it can be passed
      to the builtin function open() (and other builtins expecting a
      string or buffer), unless the object inherits from either str or
      unicode.  Therefore, to not inherit from string requires changes
      in CPython's core.

    The functions and modules that this new module is trying to
    replace (os.path, shutil, fnmatch, glob and parts of os) are
    expected to be available in future Python versions for a long
    time, to preserve backwards compatibility.


Reference Implementation

    Currently, the Path class is implemented as a thin wrapper around
    the standard library modules fnmatch, glob, os, os.path and
    shutil.  The intention of this PEP is to move functionality from
    the aforementioned modules to Path while they are being
    deprecated.

    For more detail and an implementation see:

        http://wiki.python.org/moin/PathModule


Examples

    In this section, "a ==> b" means that b can be used as a
    replacement for a.

    1. Make all python files in the a directory executable
    ------------------------------------------------------

        DIR = '/usr/home/guido/bin'
        for f in os.listdir(DIR):
            if f.endswith('.py'):
                path = os.path.join(DIR, f)
                os.chmod(path, 0755)
        ==>
        for f in Path('/usr/home/guido/bin').files("*.py"):
            f.chmod(0755)

    2. Delete emacs backup files
    ----------------------------

        def delete_backups(arg, dirname, names):
            for name in names:
                if name.endswith('~'):
                    os.remove(os.path.join(dirname, name))
        os.path.walk(os.environ['HOME'], delete_backups, None)
        ==>
        d = Path(os.environ['HOME'])
        for f in d.walkfiles('*~'):
            f.remove()

    3. Finding the relative path to a file
    --------------------------------------

        b = Path('/users/peter/')
        a = Path('/users/peter/synergy/tiki.txt')
        a.relpathto(b)

    4. Splitting a path into directory and filename
    -----------------------------------------------

        os.path.split("/path/to/foo/bar.txt")
        ==>
        Path("/path/to/foo/bar.txt").splitpath()

    5. List all Python scripts in the current directory tree
    --------------------------------------------------------

        list(Path().walkfiles("*.py"))


References and Footnotes

    [1] Method is not guaranteed to be available on all platforms.

    [2] "(idea) subclassable string: path object?", van Rossum, 2001
        http://mail.python.org/pipermail/python-dev/2001-August/016663.html

    [3] "path module v1.0 released", Orendorff, 2003
        http://mail.python.org/pipermail/python-announce-list/2003-January/001984.html

    [4] "Some RFE for review", Birkenfeld, 2005
        http://mail.python.org/pipermail/python-dev/2005-June/054438.html

    [5] "path module", Orendorff, 2003
        http://mail.python.org/pipermail/python-list/2003-July/174289.html

    [6] "PRE-PEP: new Path class", Roth, 2004
        http://mail.python.org/pipermail/python-list/2004-January/201672.html

    [7] http://wiki.python.org/moin/PathClass


Copyright

    This document has been placed in the public domain.


pep-0356 Python 2.5 Release Schedule

PEP: 356
Title: Python 2.5 Release Schedule
Version: $Revision$
Last-Modified: $Date$
Author: Neal Norwitz, Guido van Rossum, Anthony Baxter
Status: Final
Type: Informational
Created: 07-Feb-2006
Python-Version: 2.5
Post-History: 

Abstract

    This document describes the development and release schedule for
    Python 2.5.  The schedule primarily concerns itself with PEP-sized
    items.  Small features may be added up to and including the first
    beta release.  Bugs may be fixed until the final release.

    There will be at least two alpha releases, two beta releases, and
    one release candidate.  The release date is planned for 
    12 September 2006.


Release Manager

    Anthony Baxter has volunteered to be Release Manager.

    Martin von Loewis is building the Windows installers,
    Ronald Oussoren is building the Mac installers,
    Fred Drake the doc packages and
    Sean Reifschneider the RPMs.


Release Schedule

    alpha 1: April 5, 2006 [completed]
    alpha 2: April 27, 2006 [completed]
    beta 1:  June 20, 2006 [completed]
    beta 2:  July 11, 2006 [completed]
    beta 3:  August 3, 2006 [completed]
    rc 1:    August 17, 2006 [completed]
    rc 2:    September 12, 2006 [completed]
    final:   September 19, 2006 [completed]


Completed features for 2.5

    PEP 308: Conditional Expressions
    PEP 309: Partial Function Application
    PEP 314: Metadata for Python Software Packages v1.1
    PEP 328: Absolute/Relative Imports
    PEP 338: Executing Modules as Scripts
    PEP 341: Unified try-except/try-finally to try-except-finally
    PEP 342: Coroutines via Enhanced Generators
    PEP 343: The "with" Statement
	(still need updates in Doc/ref and for the contextlib module)
    PEP 352: Required Superclass for Exceptions
    PEP 353: Using ssize_t as the index type
    PEP 357: Allowing Any Object to be Used for Slicing

    - ASCII became the default coding

    - AST-based compiler
      - Access to C AST from Python through new _ast module

    - any()/all() builtin truth functions

    New standard library modules

      - cProfile -- suitable for profiling long running applications
        with minimal overhead

      - ctypes -- optional component of the windows installer

      - ElementTree and cElementTree -- by Fredrik Lundh

      - hashlib -- adds support for SHA-224, -256, -384, and -512
        (replaces old md5 and sha modules)

      - msilib -- for creating MSI files and bdist_msi in distutils.

      - pysqlite

      - uuid

      - wsgiref

    Other notable features

      - Added support for reading shadow passwords (http://python.org/sf/579435)

      - Added support for the Unicode 4.1 UCD

      - Added PEP 302 zipfile/__loader__ support to the following modules:
        warnings, linecache, inspect, traceback, site, and doctest

      - Added pybench Python benchmark suite -- by Marc-Andre Lemburg

      - Add write support for mailboxes from the code in sandbox/mailbox.
        (Owner: A.M. Kuchling.  It would still be good if another person
        would take a look at the new code.)

      - Support for building "fat" Mac binaries (Intel and PPC)

      - Add new icons for Windows with the new Python logo?

      - New utilities in functools to help write wrapper functions that
        support naive introspection (e.g. having f.__name__ return
        the original function name).

      - Upgrade pyexpat to use expat 2.0.

      - Python core now compiles cleanly with g++

Possible features for 2.5

    Each feature below should be implemented prior to beta 1, or it
    will require BDFL approval for inclusion in 2.5.

    - Modules under consideration for inclusion:

    - Add new icons for MacOS and Unix with the new Python logo?
      (Owner: ???)
      MacOS: http://hcs.harvard.edu/~jrus/python/prettified-py-icons.png

    - Check the various bits of code in Demo/ all still work, update or 
      remove the ones that don't.
      (Owner: Anthony)

    - All modules in Modules/ should be updated to be ssize_t clean.
      (Owner: Neal)


Deferred until 2.6:

    - bdist_deb in distutils package
      http://mail.python.org/pipermail/python-dev/2006-February/060926.html

    - bdist_egg in distutils package

    - pure python pgen module
      (Owner: Guido)

    - Remove the fpectl module?

    - Make everything in Modules/ build cleanly with g++


Open issues

    - Bugs that need resolving before release, ie, they block release:

        None

    - Bugs deferred until 2.5.1 (or later)
        http://python.org/sf/1544279 - Socket module is not thread-safe
        http://python.org/sf/1541420 - tools and demo missing from windows
        http://python.org/sf/1542451 - crash with continue in nested try/finally
        http://python.org/sf/1475523 - gettext.py bug (owner: Martin v. Loewis)
        http://python.org/sf/1467929 - %-formatting and dicts
        http://python.org/sf/1446043 - unicode() does not raise LookupError

    - The PEP 302 changes to (at least) pkgutil, runpy and pydoc must be
      documented.

    - test_zipfile64 takes too long and too much disk space for 
      most of the buildbots.  How should this be handled?
      It is currently disabled.

    - should C modules listed in "Undocumented modules" be removed too?
      "timing" (listed as obsolete), "cl" (listed as possibly not up-to-date),
      and "sv" (listed as obsolete hardware specific).

Copyright

    This document has been placed in the public domain.



pep-0357 Allowing Any Object to be Used for Slicing

PEP: 357
Title: Allowing Any Object to be Used for Slicing
Version: $Revision$
Last-Modified: $Date$
Author: Travis Oliphant <oliphant at ee.byu.edu>
Status: Final
Type: Standards Track
Created: 09-Feb-2006
Python-Version: 2.5
Post-History: 

Abstract

    This PEP proposes adding an nb_index slot in PyNumberMethods and an
    __index__ special method so that arbitrary objects can be used
    whenever integers are explicitly needed in Python, such as in slice
    syntax (from which the slot gets its name).

Rationale

    Currently integers and long integers play a special role in
    slicing in that they are the only objects allowed in slice
    syntax. In other words, if X is an object implementing the
    sequence protocol, then X[obj1:obj2] is only valid if obj1 and
    obj2 are both integers or long integers.  There is no way for obj1
    and obj2 to tell Python that they could be reasonably used as
    indexes into a sequence.  This is an unnecessary limitation.

    In NumPy, for example, there are 8 different integer scalars
    corresponding to unsigned and signed integers of 8, 16, 32, and 64
    bits.  These type-objects could reasonably be used as integers in
    many places where Python expects true integers but cannot inherit from 
    the Python integer type because of incompatible memory layouts.  
    There should be some way to be able to tell Python that an object can 
    behave like an integer.

    It is not possible to use the nb_int (and __int__ special method)
    for this purpose because that method is used to *coerce* objects
    to integers.  It would be inappropriate to allow every object that
    can be coerced to an integer to be used as an integer everywhere
    Python expects a true integer.  For example, if __int__ were used
    to convert an object to an integer in slicing, then float objects
    would be allowed in slicing and x[3.2:5.8] would not raise an error
    as it should.

Proposal

    Add an nb_index slot to PyNumberMethods, and a corresponding
    __index__ special method.  Objects could define a function to
    place in the nb_index slot that returns a Python integer
    (either an int or a long). This integer can 
    then be appropriately converted to a Py_ssize_t value whenever 
    Python needs one such as in PySequence_GetSlice, 
    PySequence_SetSlice, and PySequence_DelSlice.

Specification:

    1) The nb_index slot will have the following signature  

       PyObject *index_func (PyObject *self)

       The returned object must be a Python IntType or 
       Python LongType. NULL should be returned on
       error with an appropriate error set. 

    2) The __index__ special method will have the signature

       def __index__(self):
           return obj
       
       where obj must be either an int or a long.

    3) Three new abstract C-API functions will be added:

       a) The first checks to see if the object supports the index
          slot and if it is filled in. 

          int PyIndex_Check(obj)

          This will return true if the object defines the nb_index
          slot.  

       b) The second is a simple wrapper around the nb_index call that
          raises PyExc_TypeError if the call is not available or if it
          doesn't return an int or long.  Because the
          PyIndex_Check is performed inside the PyNumber_Index call
          you can call it directly and manage any error rather than
          check for compatibility first.

          PyObject *PyNumber_Index (PyObject *obj)

       c) The third call helps deal with the common situation of
          actually needing a Py_ssize_t value from the object to use for
          indexing or other needs.

          Py_ssize_t PyNumber_AsSsize_t(PyObject *obj, PyObject *exc)

          The function calls the nb_index slot of obj if it is
          available and then converts the returned Python integer into
          a Py_ssize_t value.  If this goes well, then the value is
          returned.  The second argument allows control over what
          happens if the integer returned from nb_index cannot fit
          into a Py_ssize_t value.

          If exc is NULL, then the returned value will be clipped to
          PY_SSIZE_T_MAX or PY_SSIZE_T_MIN depending on whether the
          nb_index slot of obj returned a positive or negative
          integer.  If exc is non-NULL, then it is the error object
          that will be set to replace the PyExc_OverflowError that was
          raised when the Python integer or long was converted to Py_ssize_t.

    4) A new operator.index(obj) function will be added that calls
       the equivalent of obj.__index__() and raises TypeError if obj
       does not implement the special method.
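
    The protocol can be sketched from Python code.  The following is a
    minimal illustration (the Percentage class is a hypothetical
    example, not part of the reference patch; it runs on any Python
    that implements this PEP, i.e. 2.5 and later):

```python
import operator

class Percentage(object):
    """A toy numeric type that is not an int but can act as an index.
    (Hypothetical example, not from the PEP's reference patch.)"""
    def __init__(self, value):
        self.value = value          # e.g. 50, meaning 50%
    def __index__(self):
        # Must return a true integer; anything else is a TypeError.
        return self.value // 10

seq = list(range(10))
print(seq[Percentage(50):])            # slicing calls __index__, so seq[5:]
print(operator.index(Percentage(50)))  # the new operator.index -> 5
```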
       

Implementation Plan

    1) Add the nb_index slot in object.h and modify typeobject.c to 
       create the __index__ method

    2) Change the ISINT macro in ceval.c to ISINDEX and alter it to 
       accommodate objects with the index slot defined.

    3) Change the _PyEval_SliceIndex function to accommodate objects
       with the index slot defined.

    4) Change all builtin objects (e.g. lists) that use the as_mapping 
       slots for subscript access and use a special-check for integers to 
       check for the slot as well.

    5) Add the nb_index slot to integers and long_integers 
       (which just return themselves)

    6) Add PyNumber_Index C-API to return an integer from any 
       Python Object that has the nb_index slot.  

    7) Add the operator.index(x) function.

    8) Alter arrayobject.c and mmapmodule.c to use the new C-API for their
       sub-scripting and other needs. 

    9) Add unit-tests


Discussion Questions

    Speed: 

    Implementation should not slow down Python because integers and long
    integers used as indexes will complete in the same number of
    instructions.  The only change will be that what used to generate
    an error will now be acceptable.

    Why not use nb_int which is already there?

    The nb_int method is used for coercion and so means something
    fundamentally different than what is requested here.  This PEP
    proposes a method by which an object that *can* already be
    thought of as an integer communicates that information to Python
    when Python needs an integer.  The clearest example of why using
    nb_int would be a bad idea is that float objects already define
    the nb_int method, but float objects *should not* be used as
    indexes in a sequence.
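
    The distinction is observable from Python code in current
    interpreters (a short sketch of the behavior as implemented in
    Python 2.5 and later):

```python
import operator

x = 3.2
print(int(x))             # nb_int coerces: truncates to 3
try:
    operator.index(x)     # nb_index is absent on floats
except TypeError as e:
    print("float is not an index:", e)
```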

    Why the name __index__?

    Some questions were raised regarding the name __index__ when other
    interpretations of the slot are possible.  For example, the slot
    can be used any time Python requires an integer internally (such
    as in "mystring" * 3).  The name was suggested by Guido because
    slicing syntax is the biggest reason for having such a slot and 
    in the end no better name emerged. See the discussion thread:
    http://mail.python.org/pipermail/python-dev/2006-February/thread.html#60594 
    for examples of names that were suggested such as "__discrete__" and 
    "__ordinal__".

    Why return PyObject * from nb_index?  

    Initially, Py_ssize_t was selected as the return type for the
    nb_index slot.  However, this led to an inability to track and
    distinguish overflow and underflow errors without ugly and brittle
    hacks.  As the nb_index slot is used in at least three different
    ways in the Python core (to get an integer, to get a slice
    end-point, and to get a sequence index), quite a bit of
    flexibility is needed to handle all these cases.
    For example, the initial implementation that returned Py_ssize_t for
    nb_index led to the discovery that on a 32-bit machine with >=2GB of RAM 
    s = 'x' * (2**100) works but len(s) was clipped at 2147483647.
    Several fixes were suggested but eventually it was decided that
    nb_index needed to return a Python Object similar to the nb_int
    and nb_long slots in order to handle overflow correctly.

    Why can't __index__ return any object with the nb_index method?

    This would allow infinite recursion in many different ways that are not
    easy to check for.  This restriction is similar to the requirement that 
    __nonzero__ return an int or a bool. 

Reference Implementation

    Submitted as patch 1436368 to SourceForge.  

Copyright

    This document is placed in the public domain.

pep-0358 The "bytes" Object

PEP: 358
Title: The "bytes" Object
Version: $Revision$
Last-Modified: $Date$
Author: Neil Schemenauer <nas at arctrix.com>, Guido van Rossum <guido at python.org>
Status: Final
Type: Standards Track
Content-Type: text/plain
Created: 15-Feb-2006
Python-Version: 2.6, 3.0
Post-History: 

Update

    This PEP has partially been superseded by PEP 3137.


Abstract

    This PEP outlines the introduction of a raw bytes sequence type.
    Adding the bytes type is one step in the transition to Unicode
    based str objects which will be introduced in Python 3.0.

    The PEP describes how the bytes type should work in Python 2.6, as
    well as how it should work in Python 3.0.  (Occasionally there are
    differences because in Python 2.6, we have two string types, str
    and unicode, while in Python 3.0 we will only have one string
    type, whose name will be str but whose semantics will be like the
    2.6 unicode type.)


Motivation

    Python's current string objects are overloaded.  They serve to hold
    both sequences of characters and sequences of bytes.  This
    overloading of purpose leads to confusion and bugs.  In future
    versions of Python, string objects will be used for holding
    character data.  The bytes object will fulfil the role of a byte
    container.  Eventually the unicode type will be renamed to str
    and the old str type will be removed.


Specification

    A bytes object stores a mutable sequence of integers that are in
    the range 0 to 255.  Unlike string objects, indexing a bytes
    object returns an integer.  Assigning or comparing an object that
    is not an integer to an element causes a TypeError exception.
    Assigning an element to a value outside the range 0 to 255 causes
    a ValueError exception.  The .__len__() method of bytes returns
    the number of integers stored in the sequence (i.e. the number of
    bytes).
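
    In today's Python, the bytearray type implements essentially these
    semantics, so the rules above can be checked directly (a sketch
    using bytearray as a stand-in for the proposed bytes object):

```python
b = bytearray([10, 20, 30])
print(b[0])        # indexing returns an integer: 10
b[0] = 255         # in range 0..255: allowed
try:
    b[0] = 256     # out of range
except ValueError:
    print("out of range -> ValueError")
try:
    b[0] = "a"     # not an integer
except TypeError:
    print("non-integer -> TypeError")
print(len(b))      # number of bytes stored: 3
```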

    The constructor of the bytes object has the following signature:

        bytes([initializer[, encoding]])

    If no arguments are provided then a bytes object containing zero
    elements is created and returned.  The initializer argument can be
    a string (in 2.6, either str or unicode), an iterable of integers,
    or a single integer.  The pseudo-code for the constructor
    (optimized for clear semantics, not for speed) is:

        def bytes(initializer=0, encoding=None):
            if isinstance(initializer, int): # In 2.6, int -> (int, long)
                initializer = [0]*initializer
            elif isinstance(initializer, basestring):
                if isinstance(initializer, unicode): # In 3.0, "if True"
                    if encoding is None:
                        # In 3.0, raise TypeError("explicit encoding required")
                        encoding = sys.getdefaultencoding()
                    initializer = initializer.encode(encoding)
                initializer = [ord(c) for c in initializer]
            else:
                if encoding is not None:
                    raise TypeError("no encoding allowed for this initializer")
                tmp = []
                for c in initializer:
                    if not isinstance(c, int):
                        raise TypeError("initializer must be iterable of ints")
                    if not 0 <= c < 256:
                        raise ValueError("initializer element out of range")
                    tmp.append(c)
                initializer = tmp
            new = <new bytes object of length len(initializer)>
            for i, c in enumerate(initializer):
                new[i] = c
            return new
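
    The initializer forms behave as the pseudo-code describes, and
    bytearray in current Python accepts the same forms (the encoding
    handling for strings differs slightly, so only the non-string
    cases are shown here):

```python
print(bytearray(3))             # int initializer: three zero bytes
print(bytearray([65, 66, 67]))  # iterable of ints
try:
    bytearray([300])            # element out of range 0..255
except ValueError:
    print("initializer element out of range")
```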

    The .__repr__() method returns a string that can be evaluated to
    generate a new bytes object containing a bytes literal:

        >>> bytes([10, 20, 30])
        b'\n\x14\x1e'

    The object has a .decode() method equivalent to the .decode()
    method of the str object.  The object has a classmethod .fromhex()
    that takes a string of characters from the set [0-9a-fA-F ] and
    returns a bytes object (similar to binascii.unhexlify).  For
    example:

        >>> bytes.fromhex('5c5350ff')
        b'\\SP\xff'
        >>> bytes.fromhex('5c 53 50 ff')
        b'\\SP\xff'

    The object has a .hex() method that does the reverse conversion
    (similar to binascii.hexlify):

        >>> bytes([92, 83, 80, 255]).hex()
        '5c5350ff'
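
    Both directions were eventually adopted (fromhex appeared on
    bytearray in Python 2.6, and .hex() in Python 3.5), so the round
    trip can be verified on current Python:

```python
data = bytes.fromhex('5c 53 50 ff')   # whitespace between bytes is allowed
print(data)          # b'\\SP\xff'
print(data.hex())    # '5c5350ff'
# hex() and fromhex() are inverses of each other
assert bytes.fromhex(data.hex()) == data
```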

    The bytes object has some methods similar to list methods, and
    others similar to str methods.  Here is a complete list of
    methods, with their approximate signatures:

        .__add__(bytes) -> bytes
        .__contains__(int | bytes) -> bool
        .__delitem__(int | slice) -> None
        .__delslice__(int, int) -> None
        .__eq__(bytes) -> bool
        .__ge__(bytes) -> bool
        .__getitem__(int | slice) -> int | bytes
        .__getslice__(int, int) -> bytes
        .__gt__(bytes) -> bool
        .__iadd__(bytes) -> bytes
        .__imul__(int) -> bytes
        .__iter__() -> iterator
        .__le__(bytes) -> bool
        .__len__() -> int
        .__lt__(bytes) -> bool
        .__mul__(int) -> bytes
        .__ne__(bytes) -> bool
        .__reduce__(...) -> ...
        .__reduce_ex__(...) -> ...
        .__repr__() -> str
        .__reversed__() -> bytes
        .__rmul__(int) -> bytes
        .__setitem__(int | slice, int | iterable[int]) -> None
        .__setslice__(int, int, iterable[int]) -> None
        .append(int) -> None
        .count(int) -> int
        .decode(str) -> str | unicode # in 3.0, only str
        .endswith(bytes) -> bool
        .extend(iterable[int]) -> None
        .find(bytes) -> int
        .index(bytes | int) -> int
        .insert(int, int) -> None
        .join(iterable[bytes]) -> bytes
        .partition(bytes) -> (bytes, bytes, bytes)
        .pop([int]) -> int
        .remove(int) -> None
        .replace(bytes, bytes) -> bytes
        .reverse() -> None
        .rfind(bytes) -> int
        .rindex(bytes | int) -> int
        .rpartition(bytes) -> (bytes, bytes, bytes)
        .rsplit(bytes) -> list[bytes]
        .split(bytes) -> list[bytes]
        .startswith(bytes) -> bool
        .translate(bytes, [bytes]) -> bytes

    Note the conspicuous absence of .isupper(), .upper(), and friends.
    (But see "Open Issues" below.)  There is no .__hash__() because
    the object is mutable.  There is no use case for a .sort() method.

    The bytes type also supports the buffer interface, supporting
    reading and writing binary (but not character) data.


Out of Scope Issues

    * Python 3k will have a much different I/O subsystem.  Deciding
      how that I/O subsystem will work and interact with the bytes
      object is out of the scope of this PEP.  The expectation however
      is that binary I/O will read and write bytes, while text I/O
      will read strings.  Since the bytes type supports the buffer
      interface, the existing binary I/O operations in Python 2.6 will
      support bytes objects.

    * It has been suggested that a special method named .__bytes__()
      be added to the language to allow objects to be converted into
      byte arrays.  This decision is out of scope.

    * A bytes literal of the form b"..." is also proposed.  This is
      the subject of PEP 3112.


Open Issues

    * The .decode() method is redundant since a bytes object b can
      also be decoded by calling unicode(b, <encoding>) (in 2.6) or
      str(b, <encoding>) (in 3.0).  Do we need encode/decode methods
      at all?  In a sense the spelling using a constructor is cleaner.

    * Need to specify the methods still more carefully.

    * Pickling and marshalling support need to be specified.

    * Should all those list methods really be implemented?

    * A case could be made for supporting .ljust(), .rjust(),
      .center() with a mandatory second argument.

    * A case could be made for supporting .split() with a mandatory
      argument.

    * A case could even be made for supporting .islower(), .isupper(),
      .isspace(), .isalpha(), .isalnum(), .isdigit() and the
      corresponding conversions (.lower() etc.), using the ASCII
      definitions for letters, digits and whitespace.  If this is
      accepted, the cases for .ljust(), .rjust(), .center() and
      .split() become much stronger, and they should have default
      arguments as well, using an ASCII space or all ASCII whitespace
      (for .split()).


Frequently Asked Questions

    Q: Why have the optional encoding argument when the encode method of
       Unicode objects does the same thing?

    A: In the current version of Python, the encode method returns a str
       object and we cannot change that without breaking code.  The
       construct bytes(s.encode(...)) is expensive because it has to
       copy the byte sequence multiple times.  Also, Python generally
       provides two ways of converting an object of type A into an
       object of type B: ask an A instance to convert itself to a B, or
       ask the type B to create a new instance from an A. Depending on
       what A and B are, both APIs make sense; sometimes reasons of
       decoupling require that A can't know about B, in which case you
       have to use the latter approach; sometimes B can't know about A,
       in which case you have to use the former.


    Q: Why does bytes ignore the encoding argument if the initializer is
       a str?  (This only applies to 2.6.)

    A: There is no sane meaning that the encoding can have in that case.
       str objects *are* byte arrays and they know nothing about the
       encoding of character data they contain.  We need to assume that
       the programmer has provided a str object that already uses the
       desired encoding. If you need something other than a pure copy of
       the bytes then you need to first decode the string.  For example:

           bytes(s.decode(encoding1), encoding2)


    Q: Why not have the encoding argument default to Latin-1 (or some
       other encoding that covers the entire byte range) rather than
       ASCII?

    A: The system default encoding for Python is ASCII.  It seems least
       confusing to use that default.  Also, in Py3k, using Latin-1 as
       the default might not be what users expect.  For example, they
       might prefer a Unicode encoding.  Any default will not always
       work as expected.  At least ASCII will complain loudly if you try
       to encode non-ASCII data.


Copyright

    This document has been placed in the public domain.



pep-0359 The "make" Statement

PEP:359
Title:The "make" Statement
Version:$Revision$
Last-Modified:$Date$
Author:Steven Bethard <steven.bethard at gmail.com>
Status:Withdrawn
Type:Standards Track
Content-Type:text/x-rst
Created:05-Apr-2006
Python-Version:2.6
Post-History:05-Apr-2006, 06-Apr-2006, 13-Apr-2006

Abstract

This PEP proposes a generalization of the class-declaration syntax, the make statement. The proposed syntax and semantics parallel the syntax for class definition, and so:

make <callable> <name> <tuple>:
    <block>

is translated into the assignment:

<name> = <callable>("<name>", <tuple>, <namespace>)

where <namespace> is the dict created by executing <block>. This is mostly syntactic sugar for:

class <name> <tuple>:
    __metaclass__ = <callable>
    <block>

and is intended to help more clearly express the intent of the statement when something other than a class is being created. Of course, other syntax for such a statement is possible, but it is hoped that by keeping a strong parallel to the class statement, an understanding of how classes and metaclasses work will translate into an understanding of how the make-statement works as well.

The PEP is based on a suggestion [1] from Michele Simionato on the python-dev list.

Withdrawal Notice

This PEP was withdrawn at Guido's request [2]. Guido didn't like it, and in particular didn't like how the property use-case puts the instance methods of a property at a different level than other instance methods and requires fixed names for the property functions.

Motivation

Class statements provide two nice facilities to Python:

  1. They execute a block of statements and provide the resulting bindings as a dict to the metaclass.
  2. They encourage DRY (don't repeat yourself) by allowing the class being created to know the name it is being assigned.

Thus in a simple class statement like:

class C(object):
    x = 1
    def foo(self):
        return 'bar'

the metaclass (type) gets called with something like:

C = type('C', (object,), {'x':1, 'foo':<function foo at ...>})

The class statement is just syntactic sugar for the above assignment statement, but clearly a very useful sort of syntactic sugar. It avoids not only the repetition of C, but also simplifies the creation of the dict by allowing it to be expressed as a series of statements.

Historically, type instances (a.k.a. class objects) have been the only objects blessed with this sort of syntactic support. The make statement aims to extend this support to other sorts of objects where such syntax would also be useful.

Example: simple namespaces

Let's say I have some attributes in a module that I access like:

mod.thematic_roletype
mod.opinion_roletype

mod.text_format
mod.html_format

and since "Namespaces are one honking great idea", I'd like to be able to access these attributes instead as:

mod.roletypes.thematic
mod.roletypes.opinion

mod.format.text
mod.format.html

I currently have two main options:

  1. Turn the module into a package, turn roletypes and format into submodules, and move the attributes to the submodules.
  2. Create roletypes and format classes, and move the attributes to the classes.

The former is a fair chunk of refactoring work, and produces two tiny modules without much content. The latter keeps the attributes local to the module, but creates classes when there is no intention of ever creating instances of those classes.

In situations like this, it would be nice to simply be able to declare a "namespace" to hold the few attributes. With the new make statement, I could introduce my new namespaces with something like:

make namespace roletypes:
    thematic = ...
    opinion = ...

make namespace format:
    text = ...
    html = ...

and keep my attributes local to the module without making classes that are never intended to be instantiated. One definition of namespace that would make this work is:

class namespace(object):
    def __init__(self, name, args, kwargs):
        self.__dict__.update(kwargs)

Given this definition, at the end of the make-statements above, roletypes and format would be namespace instances.
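
Since the make statement is only sugar for a call, the same effect can be had today by invoking the callable directly. A runnable sketch using the namespace class above (the attribute values are placeholders):

```python
class namespace(object):
    def __init__(self, name, args, kwargs):
        self.__dict__.update(kwargs)

# Equivalent of:
#     make namespace roletypes:
#         thematic = 1
#         opinion = 2
roletypes = namespace('roletypes', (), {'thematic': 1, 'opinion': 2})
print(roletypes.thematic)   # 1
print(roletypes.opinion)    # 2
```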

Example: GUI objects

In GUI toolkits, objects like frames and panels are often associated with attributes and functions. With the make-statement, code that looks something like:

root = Tkinter.Tk()
frame = Tkinter.Frame(root)
frame.pack()
def say_hi():
    print "hi there, everyone!"
hi_there = Tkinter.Button(frame, text="Hello", command=say_hi)
hi_there.pack(side=Tkinter.LEFT)
root.mainloop()

could be rewritten to group the Button's function with its declaration:

root = Tkinter.Tk()
frame = Tkinter.Frame(root)
frame.pack()
make Tkinter.Button hi_there(frame):
    text = "Hello"
    def command():
        print "hi there, everyone!"
hi_there.pack(side=Tkinter.LEFT)
root.mainloop()

Example: custom descriptors

Since descriptors are used to customize access to an attribute, it's often useful to know the name of that attribute. Current Python doesn't give an easy way to find this name and so a lot of custom descriptors, like Ian Bicking's setonce descriptor [3], have to hack around this somehow. With the make-statement, you could create a setonce attribute like:

class A(object):
    ...
    make setonce x:
        "A's x attribute"
    ...

where the setonce descriptor would be defined like:

class setonce(object):

    def __init__(self, name, args, kwargs):
        self._name = '_setonce_attr_%s' % name
        self.__doc__ = kwargs.pop('__doc__', None)

    def __get__(self, obj, type=None):
        if obj is None:
            return self
        return getattr(obj, self._name)

    def __set__(self, obj, value):
        try:
            getattr(obj, self._name)
        except AttributeError:
            setattr(obj, self._name, value)
        else:
            raise AttributeError("Attribute already set")

    def set(self, obj, value):
        setattr(obj, self._name, value)

    def __delete__(self, obj):
        delattr(obj, self._name)

Note that unlike the original implementation, the private attribute name is stable since it uses the name of the descriptor, and therefore instances of class A are pickleable.
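
The descriptor works under current Python as well; what the make statement would supply automatically is the name argument and the block dict, which are passed explicitly here (a sketch, with the `set`/`__delete__` methods omitted for brevity):

```python
class setonce(object):
    def __init__(self, name, args, kwargs):
        self._name = '_setonce_attr_%s' % name
        self.__doc__ = kwargs.pop('__doc__', None)

    def __get__(self, obj, type=None):
        if obj is None:
            return self
        return getattr(obj, self._name)

    def __set__(self, obj, value):
        try:
            getattr(obj, self._name)
        except AttributeError:
            setattr(obj, self._name, value)   # first assignment
        else:
            raise AttributeError("Attribute already set")

class A(object):
    # Equivalent of: make setonce x: "A's x attribute"
    x = setonce('x', (), {'__doc__': "A's x attribute"})

a = A()
a.x = 1          # first assignment succeeds
try:
    a.x = 2      # second assignment is rejected
except AttributeError:
    print("already set")
print(a.x)       # 1
```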

Example: property namespaces

Python's property type takes three function arguments and a docstring argument which, though relevant only to the property, must be declared before it and then passed as arguments to the property call, e.g.:

class C(object):
    ...
    def get_x(self):
        ...
    def set_x(self):
        ...
    x = property(get_x, set_x, "the x of the frobulation")

This issue has been brought up before, and Guido [4] and others [5] have briefly mused over alternate property syntaxes to make declaring properties easier. With the make-statement, the following syntax could be supported:

class C(object):
    ...
    make block_property x:
        '''The x of the frobulation'''
        def fget(self):
            ...
        def fset(self):
            ...

with the following definition of block_property:

def block_property(name, args, block_dict):
    fget = block_dict.pop('fget', None)
    fset = block_dict.pop('fset', None)
    fdel = block_dict.pop('fdel', None)
    doc = block_dict.pop('__doc__', None)
    assert not block_dict
    return property(fget, fset, fdel, doc)
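
Without the make statement, the same function can be driven by hand, which also shows the dict the block would have produced (a runnable sketch; the getter body is a placeholder):

```python
def block_property(name, args, block_dict):
    fget = block_dict.pop('fget', None)
    fset = block_dict.pop('fset', None)
    fdel = block_dict.pop('fdel', None)
    doc = block_dict.pop('__doc__', None)
    assert not block_dict
    return property(fget, fset, fdel, doc)

class C(object):
    # Equivalent of the "make block_property x:" example above
    x = block_property('x', (), {
        '__doc__': 'The x of the frobulation',
        'fget': lambda self: 42,   # placeholder getter
    })

print(C().x)         # 42
print(C.x.__doc__)   # 'The x of the frobulation'
```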

Example: interfaces

Guido [6] and others have occasionally suggested introducing interfaces into python. Most suggestions have offered syntax along the lines of:

interface IFoo:
    """Foo blah blah"""

    def fumble(name, count):
        """docstring"""

but since there is currently no way in Python to declare an interface in this manner, most implementations of Python interfaces use class objects instead, e.g. Zope's:

class IFoo(Interface):
    """Foo blah blah"""

    def fumble(name, count):
        """docstring"""

With the new make-statement, these interfaces could instead be declared as:

make Interface IFoo:
    """Foo blah blah"""

    def fumble(name, count):
        """docstring"""

which makes the intent (that this is an interface, not a class) much clearer.

Specification

Python will translate a make-statement:

make <callable> <name> <tuple>:
    <block>

into the assignment:

<name> = <callable>("<name>", <tuple>, <namespace>)

where <namespace> is the dict created by executing <block>. The <tuple> expression is optional; if not present, an empty tuple will be assumed.

A patch is available implementing these semantics [7].

The make-statement introduces a new keyword, make. Thus in Python 2.6, the make-statement will have to be enabled using from __future__ import make_statement.

Open Issues

Keyword

Does the make keyword break too much code? Originally, the make statement used the keyword create (a suggestion due to Nick Coghlan). However, investigations into the standard library [8] and Zope+Plone code [9] revealed that create would break a lot more code, so make was adopted as the keyword instead. However, there are still a few instances where make would break code. Is there a better keyword for the statement?

Some possible keywords and their counts in the standard library (plus some installed packages):

  • make - 2 (both in tests)
  • create - 19 (including existing function in imaplib)
  • build - 83 (including existing class in distutils.command.build)
  • construct - 0
  • produce - 0

The make-statement as an alternate constructor

Currently, there are not many functions which have the signature (name, args, kwargs). That means that something like:

make dict params:
    x = 1
    y = 2

is currently impossible because the dict constructor has a different signature. Does this sort of thing need to be supported? One suggestion, by Carl Banks, would be to add a __make__ magic method that if found would be called instead of __call__. For types, the __make__ method would be identical to __call__ and thus unnecessary, but dicts could support the make-statement by defining a __make__ method on the dict type that looks something like:

def __make__(cls, name, args, kwargs):
    return cls(**kwargs)

Of course, rather than adding another magic method, the dict type could just grow a classmethod something like dict.fromblock that could be used like:

make dict.fromblock params:
    x = 1
    y = 2

So the question is, will many types want to use the make-statement as an alternate constructor? And if so, does that alternate constructor need to have the same name as the original constructor?
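A rough sketch of the classmethod approach, using a hypothetical dict subclass (the built-in dict type itself cannot be extended from Python code, so MakableDict and its fromblock method are illustrative names only):

```python
class MakableDict(dict):
    # Hypothetical alternate constructor matching the make-statement's
    # (name, args, namespace) calling convention; the name and args are
    # simply ignored for a plain mapping.
    @classmethod
    def fromblock(cls, name, args, namespace):
        return cls(namespace)

# "make MakableDict.fromblock params: x = 1; y = 2" would amount to:
params = MakableDict.fromblock("params", (), {"x": 1, "y": 2})
```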

Customizing the dict in which the block is executed

Should users of the make-statement be able to determine in which dict object the code is executed? This would allow the make-statement to be used in situations where a normal dict object would not suffice, e.g. if order and repeated names must be allowed. Allowing this sort of customization could allow XML to be written without repeating element names, and with nesting of make-statements corresponding to nesting of XML elements:

make Element html:
    make Element body:
        text('before first h1')
        make Element h1:
            attrib(style='first')
            text('first h1')
            tail('after first h1')
        make Element h1:
            attrib(style='second')
            text('second h1')
            tail('after second h1')

If the make-statement tried to get the dict in which to execute its block by calling the callable's __make_dict__ method, the following code would allow the make-statement to be used as above:

import xml.etree.ElementTree  # imported so xml.etree.ElementTree resolves below
from xml import etree

class Element(object):

    class __make_dict__(dict):

        def __init__(self, *args, **kwargs):
            self._super = super(Element.__make_dict__, self)
            self._super.__init__(*args, **kwargs)
            self.elements = []
            self.text = None
            self.tail = None
            self.attrib = {}

        def __getitem__(self, name):
            try:
                return self._super.__getitem__(name)
            except KeyError:
                if name in ['attrib', 'text', 'tail']:
                    return getattr(self, 'set_%s' % name)
                else:
                    return globals()[name]

        def __setitem__(self, name, value):
            self._super.__setitem__(name, value)
            self.elements.append(value)

        def set_attrib(self, **kwargs):
            self.attrib = kwargs

        def set_text(self, text):
            self.text = text

        def set_tail(self, text):
            self.tail = text

    def __new__(cls, name, args, edict):
        get_element = etree.ElementTree.Element
        result = get_element(name, attrib=edict.attrib)
        result.text = edict.text
        result.tail = edict.tail
        for element in edict.elements:
            result.append(element)
        return result

Note, however, that the code to support this is somewhat fragile -- it has to magically populate the namespace with attrib, text and tail, and it assumes that every name binding inside the make statement body is creating an Element. As it stands, this code would break with the introduction of a simple for-loop to any one of the make-statement bodies, because the for-loop would bind a name to a non-Element object. This could be worked around by adding some sort of isinstance check or attribute examination, but this still results in a somewhat fragile solution.

It has also been pointed out that the with-statement can provide equivalent nesting with a much more explicit syntax:

with Element('html') as html:
    with Element('body') as body:
        body.text = 'before first h1'
        with Element('h1', style='first') as h1:
            h1.text = 'first h1'
            h1.tail = 'after first h1'
        with Element('h1', style='second') as h1:
            h1.text = 'second h1'
            h1.tail = 'after second h1'

And if the repetition of the element names here is too much of a DRY violation, it is also possible to eliminate all as-clauses except for the first by adding a few methods to Element. [10]

So are there real use-cases for executing the block in a dict of a different type? And if so, should the make-statement be extended to support them?

Optional Extensions

Remove the make keyword

It might be possible to remove the make keyword so that such statements would begin with the callable being called, e.g.:

namespace ns:
    badger = 42
    def spam():
        ...

interface C(...):
    ...

However, almost all other Python statements begin with a keyword, and removing the keyword would make it harder to look up this construct in the documentation. Additionally, this would add some complexity in the grammar and so far I (Steven Bethard) have not been able to implement the feature without the keyword.

Removing __metaclass__ in Python 3000

As a side-effect of its generality, the make-statement mostly eliminates the need for the __metaclass__ attribute in class objects. Thus in Python 3000, instead of:

class <name> <bases-tuple>:
    __metaclass__ = <metaclass>
    <block>

metaclasses could be supported by using the metaclass as the callable in a make-statement:

make <metaclass> <name> <bases-tuple>:
    <block>

Removing the __metaclass__ hook would simplify the BUILD_CLASS opcode a bit.

Removing class statements in Python 3000

In the most extreme application of make-statements, the class statement itself could be deprecated in favor of make type statements.

pep-0360 Externally Maintained Packages

PEP:360
Title:Externally Maintained Packages
Version:$Revision$
Last-Modified:$Date$
Author:Brett Cannon <brett at python.org>
Status:Final
Type:Process
Content-Type:text/x-rst
Created:30-May-2006
Post-History:

Warning

No new modules are to be added to this PEP. It has been deemed dangerous to codify external maintenance of any code checked into Python's code repository. Code contributors should expect Python's development methodology to be used for any and all code checked into Python's code repository.

Abstract

There are many great pieces of Python software developed outside of the Python standard library (a.k.a., the "stdlib"). Sometimes it makes sense to incorporate these externally maintained packages into the stdlib in order to fill a gap in the tools provided by Python.

But by having the packages maintained externally it means Python's developers do not have direct control over the packages' evolution and maintenance. Some package developers prefer to have bug reports and patches go through them first instead of being directly applied to Python's repository.

This PEP is meant to record details of packages in the stdlib that are maintained outside of Python's repository. Specifically, it is meant to keep track of any specific maintenance needs for each package. It should be mentioned that changes needed in order to fix bugs and keep the code running on all of Python's supported platforms will be done directly in Python's repository without worrying about going through the contact developer. This is so that Python itself is not held up by a single bug and allows the whole process to scale as needed.

It also is meant to allow people to know which version of a package is released with which version of Python.

Externally Maintained Packages

The section title is the name of the package as it is known outside of the Python standard library. The "standard library name" is what the package is named within Python. The "contact person" is the Python developer in charge of maintaining the package. The "synchronisation history" lists what external version of the package was included in each version of Python (if different from the previous Python release).

ElementTree

Web site:http://effbot.org/zone/element-index.htm
Standard library name:
 xml.etree
Contact person:Fredrik Lundh

Fredrik has ceded ElementTree maintenance to the core Python development team [1].

Expat XML parser

Web site:http://www.libexpat.org/
Standard library name:
 N/A (this refers to the parser itself, and not the Python bindings)
Contact person:None

Optik

Web site:http://optik.sourceforge.net/
Standard library name:
 optparse
Contact person:Greg Ward

External development seems to have ceased. For new applications, optparse itself has been largely superseded by argparse.

wsgiref

Web site:None
Standard library name:
 wsgiref
Contact Person:Phillip J. Eby

This module is maintained in the standard library, but significant bug reports and patches should pass through the Web-SIG mailing list [2] for discussion.

pep-0361 Python 2.6 and 3.0 Release Schedule

PEP: 361
Title: Python 2.6 and 3.0 Release Schedule
Version: $Revision$
Last-Modified: $Date$
Author: Neal Norwitz, Barry Warsaw
Status: Final
Type: Informational
Created: 29-June-2006
Python-Version: 2.6 and 3.0
Post-History: 17-Mar-2008

Abstract

    This document describes the development and release schedule for
    Python 2.6 and 3.0.  The schedule primarily concerns itself with
    PEP-sized items.  Small features may be added up to and including
    the first beta release.  Bugs may be fixed until the final
    release.

    There will be at least two alpha releases, two beta releases, and
    one release candidate.  The releases are planned for October 2008.

    Python 2.6 is not only the next advancement in the Python 2
    series, it is also a transitional release, helping developers
    begin to prepare their code for Python 3.0.  As such, many
    features are being backported from Python 3.0 to 2.6.  Thus, it
    makes sense to release both versions at the same time.  The
    precedent for this was set with the Python 1.6 and 2.0 releases.

    Until rc, we will be releasing Python 2.6 and 3.0 in lockstep, on
    a monthly release cycle.  The releases will happen on the first
    Wednesday of every month through the beta testing cycle.  Because
    Python 2.6 is ready sooner, and because we have outside deadlines
    we'd like to meet, we've decided to split the rc releases.  Thus
    Python 2.6 final is currently planned to come out two weeks before
    Python 3.0 final.


Release Manager and Crew

    2.6/3.0 Release Manager: Barry Warsaw
    Windows installers: Martin v. Loewis
    Mac installers: Ronald Oussoren
    Documentation: Georg Brandl
    RPMs: Sean Reifschneider


Release Lifespan

    Python 3.0 is no longer being maintained for any purpose.

    Python 2.6.9 is the final security-only source-only maintenance
    release of the Python 2.6 series.  With its release on October 29,
    2013, all official support for Python 2.6 has ended.  Python 2.6
    is no longer being maintained for any purpose.
    

Release Schedule

    Feb 29 2008: Python 2.6a1 and 3.0a3 are released
    Apr 02 2008: Python 2.6a2 and 3.0a4 are released
    May 08 2008: Python 2.6a3 and 3.0a5 are released
    Jun 18 2008: Python 2.6b1 and 3.0b1 are released
    Jul 17 2008: Python 2.6b2 and 3.0b2 are released
    Aug 20 2008: Python 2.6b3 and 3.0b3 are released
    Sep 12 2008: Python 2.6rc1 is released
    Sep 17 2008: Python 2.6rc2 and 3.0rc1 released
    Oct 01 2008: Python 2.6 final released
    Nov 06 2008: Python 3.0rc2 released
    Nov 21 2008: Python 3.0rc3 released
    Dec 03 2008: Python 3.0 final released
    Dec 04 2008: Python 2.6.1 final released
    Apr 14 2009: Python 2.6.2 final released
    Oct 02 2009: Python 2.6.3 final released
    Oct 25 2009: Python 2.6.4 final released
    Mar 19 2010: Python 2.6.5 final released
    Aug 24 2010: Python 2.6.6 final released
    Jun 03 2011: Python 2.6.7 final released (security-only)
    Apr 10 2012: Python 2.6.8 final released (security-only)
    Oct 29 2013: Python 2.6.9 final released (security-only)


Completed features for 3.0

    See PEP 3000 [#pep3000] and PEP 3100 [#pep3100] for details on the
    Python 3.0 project.


Completed features for 2.6

    PEPs:

        - 352: Raising a string exception now triggers a TypeError.
             Attempting to catch a string exception raises DeprecationWarning.
             BaseException.message has been deprecated. [#pep352]
        - 358: The "bytes" Object [#pep358]
        - 366: Main module explicit relative imports [#pep366]
        - 370: Per user site-packages directory [#pep370]
        - 3112: Bytes literals in Python 3000 [#pep3112]
        - 3127: Integer Literal Support and Syntax [#pep3127]
        - 371: Addition of the multiprocessing package [#pep371]

    New modules in the standard library:

        - json
        - new enhanced turtle module
        - ast

    Deprecated modules and functions in the standard library:

        - buildtools
        - cfmfile
        - commands.getstatus()
        - macostools.touched()
        - md5
        - MimeWriter
        - mimify
        - popen2, os.popen[234]()
        - posixfile
        - sets
        - sha

    Modules removed from the standard library:

        - gopherlib
        - rgbimg
        - macfs

    Warnings for features removed in Py3k:

        - builtins: apply, callable, coerce, dict.has_key, execfile,
          reduce, reload
        - backticks and <>
        - float args to xrange
        - coerce and all its friends
        - comparing by default comparison
        - {}.has_key()
        - file.xreadlines
        - softspace removal for print() function
        - removal of modules because of PEP 4/3100/3108

    Other major features:

        - with/as will be keywords
        - a __dir__() special method to control dir() was added [1]
        - AtheOS support stopped.
        - warnings module implemented in C
        - compile() takes an AST and can convert to byte code


Possible features for 2.6

    New features *should* be implemented prior to alpha2, particularly
    any C modifications or behavioral changes.  New features *must* be
    implemented prior to beta1 or will require Release Manager approval.

    The following PEPs are being worked on for inclusion in 2.6: None.

    Each non-trivial feature listed here that is not a PEP must be
    discussed on python-dev.  Other enhancements include:

        - distutils replacement (requires a PEP)

    New modules in the standard library:

        - winerror
          http://python.org/sf/1505257
          (Patch rejected, module should be written in C)

        - setuptools
          BDFL pronouncement for inclusion in 2.5:
          http://mail.python.org/pipermail/python-dev/2006-April/063964.html

          PJE's withdrawal from 2.5 for inclusion in 2.6:
          http://mail.python.org/pipermail/python-dev/2006-April/064145.html

    Modules to gain a DeprecationWarning (as specified for Python 2.6
    or through negligence):

        - rfc822
        - mimetools
        - multifile
        - compiler package (or a Py3K warning instead?)

    - Convert Parser/*.c to use the C warnings module rather than printf

    - Add warnings for Py3k features removed:
      * __getslice__/__setslice__/__delslice__
      * float args to PyArgs_ParseTuple
      * __cmp__?
      * other comparison changes?
      * int division?
      * All PendingDeprecationWarnings (e.g. exceptions)
      * using zip() result as a list
      * the exec statement (use function syntax)
      * function attributes that start with func_* (should use __*__)
      * the L suffix for long literals
      * renaming of __nonzero__ to __bool__
      * multiple inheritance with classic classes? (MRO might change)
      * properties and classic classes? (instance attrs shadow property)

    - use __bool__ method if available and there's no __nonzero__

    - Check the various bits of code in Demo/ and Tools/ all still work,
      update or remove the ones that don't.

    - All modules in Modules/ should be updated to be ssize_t clean.

    - All of Python (including Modules/) should compile cleanly with g++

    - Start removing deprecated features and generally moving towards Py3k

    - Replace all old style tests (operate on import) with unittest or doctest

    - Add tests for all untested modules

    - Document undocumented modules/features

    - bdist_deb in distutils package
      http://mail.python.org/pipermail/python-dev/2006-February/060926.html

    - bdist_egg in distutils package

    - pure python pgen module
      (Owner: Guido)
      Deferral to 2.6:
      http://mail.python.org/pipermail/python-dev/2006-April/064528.html

    - Remove the fpectl module?


Deferred until 2.7

    None


Open issues

    How should import warnings be handled?
    http://mail.python.org/pipermail/python-dev/2006-June/066345.html
    http://python.org/sf/1515609
    http://python.org/sf/1515361

References

.. [1] Adding a __dir__() magic method

   http://mail.python.org/pipermail/python-dev/2006-July/067139.html

.. [#pep358] PEP 358 (The "bytes" Object)

   http://www.python.org/dev/peps/pep-0358

.. [#pep366] PEP 366 (Main module explicit relative imports)

   http://www.python.org/dev/peps/pep-0366

.. [#pep367] PEP 367 (New Super)

   http://www.python.org/dev/peps/pep-0367

.. [#pep371] PEP 371 (Addition of the multiprocessing package)

   http://www.python.org/dev/peps/pep-0371

.. [#pep3000] PEP 3000 (Python 3000)

   http://www.python.org/dev/peps/pep-3000

.. [#pep3100] PEP 3100 (Miscellaneous Python 3.0 Plans)

   http://www.python.org/dev/peps/pep-3100

.. [#pep3112] PEP 3112 (Bytes literals in Python 3000)

   http://www.python.org/dev/peps/pep-3112

.. [#pep3127] PEP 3127 (Integer Literal Support and Syntax)

   http://www.python.org/dev/peps/pep-3127

.. _Google calendar:

   http://www.google.com/calendar/ical/b6v58qvojllt0i6ql654r1vh00%40group.calendar.google.com/public/basic.ics


Copyright

    This document has been placed in the public domain.



pep-0362 Function Signature Object

PEP:362
Title:Function Signature Object
Version:$Revision$
Last-Modified:$Date$
Author:Brett Cannon <brett at python.org>, Jiwon Seo <seojiwon at gmail.com>, Yury Selivanov <yselivanov at sprymix.com>, Larry Hastings <larry at hastings.org>
Status:Final
Type:Standards Track
Content-Type:text/x-rst
Created:21-Aug-2006
Python-Version:3.3
Post-History:04-Jun-2012
Resolution:http://mail.python.org/pipermail/python-dev/2012-June/120682.html

Abstract

Python has always supported powerful introspection capabilities, including introspecting functions and methods (for the rest of this PEP, "function" refers to both functions and methods). By examining a function object you can fully reconstruct the function's signature. Unfortunately this information is stored in an inconvenient manner, and is spread across a half-dozen deeply nested attributes.

This PEP proposes a new representation for function signatures. The new representation contains all necessary information about a function and its parameters, and makes introspection easy and straightforward.

However, this object does not replace the existing function metadata, which is used by Python itself to execute those functions. The new metadata object is intended solely to make function introspection easier for Python programmers.

Signature Object

A Signature object represents the call signature of a function and its return annotation. For each parameter accepted by the function it stores a Parameter object in its parameters collection.

A Signature object has the following public attributes and methods:

  • return_annotation : object

    The "return" annotation for the function. If the function has no "return" annotation, this attribute is set to Signature.empty.

  • parameters : OrderedDict

    An ordered mapping of parameters' names to the corresponding Parameter objects.

  • bind(*args, **kwargs) -> BoundArguments

    Creates a mapping from positional and keyword arguments to parameters. Raises a TypeError if the passed arguments do not match the signature.

  • bind_partial(*args, **kwargs) -> BoundArguments

    Works the same way as bind(), but allows the omission of some required arguments (mimics functools.partial behavior.) Raises a TypeError if the passed arguments do not match the signature.

  • replace(parameters=<optional>, *, return_annotation=<optional>) -> Signature

    Creates a new Signature instance based on the instance replace was invoked on. It is possible to pass different parameters and/or return_annotation to override the corresponding properties of the base signature. To remove return_annotation from the copied Signature, pass in Signature.empty.

    Note that the '=<optional>' notation means that the argument is optional. This notation applies to the rest of this PEP.

Signature objects are immutable. Use Signature.replace() to make a modified copy:

>>> def foo() -> None:
...     pass
>>> sig = signature(foo)

>>> new_sig = sig.replace(return_annotation="new return annotation")
>>> new_sig is not sig
True
>>> new_sig.return_annotation != sig.return_annotation
True
>>> new_sig.parameters == sig.parameters
True

>>> new_sig = new_sig.replace(return_annotation=new_sig.empty)
>>> new_sig.return_annotation is Signature.empty
True

There are two ways to instantiate a Signature class:

  • Signature(parameters=<optional>, *, return_annotation=Signature.empty)

    Default Signature constructor. Accepts an optional sequence of Parameter objects, and an optional return_annotation. Parameters sequence is validated to check that there are no parameters with duplicate names, and that the parameters are in the right order, i.e. positional-only first, then positional-or-keyword, etc.

  • Signature.from_function(function)

    Returns a Signature object reflecting the signature of the function passed in.

It's possible to test Signatures for equality. Two signatures are equal when their parameters are equal, their positional and positional-only parameters appear in the same order, and they have equal return annotations.
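For example, using the inspect module's signature() function, equality hinges on parameter names as well as kinds and defaults:

```python
from inspect import signature

def f(a, b=1): ...
def g(a, b=1): ...
def h(x, y=1): ...

# Same parameter names, kinds and defaults: the signatures compare equal,
# even though they come from distinct function objects.
assert signature(f) == signature(g)

# Different parameter names: unequal, even though argument binding
# behaves identically for the two functions.
assert signature(f) != signature(h)
```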

Changes to the Signature object, or to any of its data members, do not affect the function itself.

Signature also implements __str__:

>>> str(Signature.from_function(lambda *args: None))
'(*args)'

>>> str(Signature())
'()'

Parameter Object

Python's expressive syntax means functions can accept many different kinds of parameters with many subtle semantic differences. We propose a rich Parameter object designed to represent any possible function parameter.

A Parameter object has the following public attributes and methods:

  • name : str

    The name of the parameter as a string. Must be a valid python identifier name (with the exception of POSITIONAL_ONLY parameters, which can have it set to None.)

  • default : object

    The default value for the parameter. If the parameter has no default value, this attribute is set to Parameter.empty.

  • annotation : object

    The annotation for the parameter. If the parameter has no annotation, this attribute is set to Parameter.empty.

  • kind

    Describes how argument values are bound to the parameter. Possible values:

    • Parameter.POSITIONAL_ONLY - value must be supplied as a positional argument.

      Python has no explicit syntax for defining positional-only parameters, but many built-in and extension module functions (especially those that accept only one or two parameters) accept them.

    • Parameter.POSITIONAL_OR_KEYWORD - value may be supplied as either a keyword or positional argument (this is the standard binding behaviour for functions implemented in Python.)

    • Parameter.KEYWORD_ONLY - value must be supplied as a keyword argument. Keyword only parameters are those which appear after a "*" or "*args" entry in a Python function definition.

    • Parameter.VAR_POSITIONAL - a tuple of positional arguments that aren't bound to any other parameter. This corresponds to a "*args" parameter in a Python function definition.

    • Parameter.VAR_KEYWORD - a dict of keyword arguments that aren't bound to any other parameter. This corresponds to a "**kwargs" parameter in a Python function definition.

    Always use the Parameter.* constants for setting and checking the value of the kind attribute.

  • replace(*, name=<optional>, kind=<optional>, default=<optional>, annotation=<optional>) -> Parameter

    Creates a new Parameter instance based on the instance replace was invoked on. To override a Parameter attribute, pass the corresponding argument. To remove an attribute from a Parameter, pass Parameter.empty.
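The kind of each parameter can be observed directly. A small sketch using the modern inspect module (in current CPython the kind values are enum members, so a .name attribute is available):

```python
import inspect

def f(a, *args, b, **kwargs):
    pass

# Collect (name, kind) pairs in declaration order.
kinds = [(name, p.kind.name)
         for name, p in inspect.signature(f).parameters.items()]
# a is positional-or-keyword, args is var-positional,
# b is keyword-only, kwargs is var-keyword.
```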

Parameter constructor:

  • Parameter(name, kind, *, annotation=Parameter.empty, default=Parameter.empty)

    Instantiates a Parameter object. name and kind are required, while annotation and default are optional.

Two parameters are equal when they have equal names, kinds, defaults, and annotations.

Parameter objects are immutable. Instead of modifying a Parameter object, you can use Parameter.replace() to create a modified copy like so:

>>> param = Parameter('foo', Parameter.KEYWORD_ONLY, default=42)
>>> str(param)
'foo=42'

>>> str(param.replace())
'foo=42'

>>> str(param.replace(default=Parameter.empty, annotation='spam'))
"foo:'spam'"

BoundArguments Object

Result of a Signature.bind call. Holds the mapping of arguments to the function's parameters.

Has the following public attributes:

  • arguments : OrderedDict

    An ordered, mutable mapping of parameters' names to arguments' values. Contains only explicitly bound arguments. Arguments for which bind() relied on a default value are skipped.

  • args : tuple

    Tuple of positional argument values. Dynamically computed from the 'arguments' attribute.

  • kwargs : dict

    Dict of keyword argument values. Dynamically computed from the 'arguments' attribute.

The arguments attribute should be used in conjunction with Signature.parameters for any arguments processing purposes.
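For instance, a default that bind() falls back on never appears in arguments; it has to be looked up via Signature.parameters when needed (a small sketch using inspect):

```python
import inspect

def g(a, b=2):
    pass

# Bind only the first argument; 'b' falls back on its default.
ba = inspect.signature(g).bind(1)

# Only the explicitly bound argument is recorded in 'arguments'.
bound = dict(ba.arguments)
```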

args and kwargs properties can be used to invoke functions:

def test(a, *, b):
    ...

sig = signature(test)
ba = sig.bind(10, b=20)
test(*ba.args, **ba.kwargs)

Arguments which could be passed as part of either *args or **kwargs will be included only in the BoundArguments.args attribute. Consider the following example:

def test(a=1, b=2, c=3):
    pass

sig = signature(test)
ba = sig.bind(a=10, c=13)

>>> ba.args
(10,)

>>> ba.kwargs
{'c': 13}

Implementation

The implementation adds a new function signature() to the inspect module. The function is the preferred way of getting a Signature for a callable object.

The function implements the following algorithm:

  • If the object is not callable - raise a TypeError

  • If the object has a __signature__ attribute and if it is not None - return it

  • If it has a __wrapped__ attribute, return signature(object.__wrapped__)

  • If the object is an instance of FunctionType, construct and return a new Signature for it

  • If the object is a bound method, construct and return a new Signature object, with its first parameter (usually self or cls) removed. (classmethod and staticmethod are supported too. Since both are descriptors, the former returns a bound method, and the latter returns its wrapped function.)

  • If the object is an instance of functools.partial, construct a new Signature from its partial.func attribute, and account for already bound partial.args and partial.kwargs

  • If the object is a class or metaclass:

    • If the object's type has a __call__ method defined in its MRO, return a Signature for it
    • If the object has a __new__ method defined in its MRO, return a Signature object for it
    • If the object has a __init__ method defined in its MRO, return a Signature object for it
  • Return signature(object.__call__)
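For example, for an ordinary class the lookup above ends at __init__, with the implicit first parameter removed:

```python
import inspect

class C:
    def __init__(self, a, b=1):
        pass

# With no custom metaclass __call__ and no custom __new__, the
# signature is derived from __init__, minus self.
sig = str(inspect.signature(C))
```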

Note that the Signature object is created in a lazy manner, and is not automatically cached. However, the user can manually cache a Signature by storing it in the __signature__ attribute.
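Manual caching is as simple as assigning to __signature__ (a sketch using inspect; per the lookup algorithm above, a non-None __signature__ is returned directly):

```python
import inspect

def f(x, y=0):
    pass

# Compute once, cache explicitly on the function object.
f.__signature__ = inspect.signature(f)

# Subsequent lookups return the cached Signature object itself.
cached = inspect.signature(f)
```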

An implementation for Python 3.3 can be found at [1]. The python issue tracking the patch is [2].

Design Considerations

No implicit caching of Signature objects

The first PEP design had a provision for implicit caching of Signature objects in the inspect.signature() function. However, this has the following downsides:

  • If the Signature object is cached, then any changes to the function it describes will not be reflected in it. However, if caching is needed, it can always be done manually and explicitly
  • It is better to reserve the __signature__ attribute for cases where there is a need to explicitly set a Signature object that differs from the actual one

Some functions may not be introspectable

Some functions may not be introspectable in certain implementations of Python. For example, in CPython, built-in functions defined in C provide no metadata about their arguments. Adding support for them is out of scope for this PEP.

Signature and Parameter equivalence

We assume that parameter names have semantic significance--two signatures are equal only when their corresponding parameters are equal and have the exact same names. Users who want looser equivalence tests, perhaps ignoring names of VAR_KEYWORD or VAR_POSITIONAL parameters, will need to implement those themselves.

Examples

Visualizing Callable Objects' Signature

Let's define some classes and functions:

from inspect import signature
from functools import partial, wraps


class FooMeta(type):
    def __new__(mcls, name, bases, dct, *, bar:bool=False):
        return super().__new__(mcls, name, bases, dct)

    def __init__(cls, name, bases, dct, **kwargs):
        return super().__init__(name, bases, dct)


class Foo(metaclass=FooMeta):
    def __init__(self, spam:int=42):
        self.spam = spam

    def __call__(self, a, b, *, c) -> tuple:
        return a, b, c

    @classmethod
    def spam(cls, a):
        return a


def shared_vars(*shared_args):
    """Decorator factory that defines shared variables that are
       passed to every invocation of the function"""

    def decorator(f):
        @wraps(f)
        def wrapper(*args, **kwargs):
            full_args = shared_args + args
            return f(*full_args, **kwargs)

        # Override signature
        sig = signature(f)
        sig = sig.replace(tuple(sig.parameters.values())[1:])
        wrapper.__signature__ = sig

        return wrapper
    return decorator


@shared_vars({})
def example(_state, a, b, c):
    return _state, a, b, c


def format_signature(obj):
    return str(signature(obj))

Now, in the python REPL:

>>> format_signature(FooMeta)
'(name, bases, dct, *, bar:bool=False)'

>>> format_signature(Foo)
'(spam:int=42)'

>>> format_signature(Foo.__call__)
'(self, a, b, *, c) -> tuple'

>>> format_signature(Foo().__call__)
'(a, b, *, c) -> tuple'

>>> format_signature(Foo.spam)
'(a)'

>>> format_signature(partial(Foo().__call__, 1, c=3))
'(b, *, c=3) -> tuple'

>>> format_signature(partial(partial(Foo().__call__, 1, c=3), 2, c=20))
'(*, c=20) -> tuple'

>>> format_signature(example)
'(a, b, c)'

>>> format_signature(partial(example, 1, 2))
'(c)'

>>> format_signature(partial(partial(example, 1, b=2), c=3))
'(b=2, c=3)'

Annotation Checker

import inspect
import functools

def checktypes(func):
    '''Decorator to verify arguments and return types

    Example:

        >>> @checktypes
        ... def test(a:int, b:str) -> int:
        ...     return int(a * b)

        >>> test(10, '1')
        1111111111

        >>> test(10, 1)
        Traceback (most recent call last):
          ...
        ValueError: test: wrong type of 'b' argument, 'str' expected, got 'int'
    '''

    sig = inspect.signature(func)

    types = {}
    for param in sig.parameters.values():
        # Iterate through function's parameters and build the list of
        # arguments types
        type_ = param.annotation
        if type_ is param.empty or not inspect.isclass(type_):
            # Missing annotation or not a type, skip it
            continue

        types[param.name] = type_

        # If the argument has a type specified, let's check that its
        # default value (if present) conforms with the type.
        if param.default is not param.empty and not isinstance(param.default, type_):
            raise ValueError("{func}: wrong type of a default value for {arg!r}". \
                             format(func=func.__qualname__, arg=param.name))

    def check_type(sig, arg_name, arg_type, arg_value):
        # Internal function that encapsulates arguments type checking
        if not isinstance(arg_value, arg_type):
            raise ValueError("{func}: wrong type of {arg!r} argument, " \
                             "{exp!r} expected, got {got!r}". \
                             format(func=func.__qualname__, arg=arg_name,
                                    exp=arg_type.__name__, got=type(arg_value).__name__))

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        # Let's bind the arguments
        ba = sig.bind(*args, **kwargs)
        for arg_name, arg in ba.arguments.items():
            # And iterate through the bound arguments
            try:
                type_ = types[arg_name]
            except KeyError:
                continue
            else:
                # OK, we have a type for the argument, lets get the corresponding
                # parameter description from the signature object
                param = sig.parameters[arg_name]
                if param.kind == param.VAR_POSITIONAL:
                    # If this parameter is a variable-argument parameter,
                    # then we need to check each of its values
                    for value in arg:
                        check_type(sig, arg_name, type_, value)
                elif param.kind == param.VAR_KEYWORD:
                    # If this parameter is a variable-keyword-argument parameter:
                    for subname, value in arg.items():
                        check_type(sig, arg_name + ':' + subname, type_, value)
                else:
                    # And, finally, if this parameter is a regular one:
                    check_type(sig, arg_name, type_, arg)

        result = func(*ba.args, **ba.kwargs)

        # The last bit - let's check that the result is correct
        return_type = sig.return_annotation
        if (return_type is not sig.empty and
                isinstance(return_type, type) and
                not isinstance(result, return_type)):

            raise ValueError('{func}: wrong return type, {exp} expected, got {got}'. \
                             format(func=func.__qualname__, exp=return_type.__name__,
                                    got=type(result).__name__))
        return result

    return wrapper
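The wrapper above leans on Signature.bind to map the incoming *args/**kwargs onto parameter names before checking them. A minimal, self-contained illustration of that machinery, using a throwaway greet function:

```python
import inspect

def greet(name: str, times: int = 1) -> str:
    # Throwaway function used only to illustrate Signature.bind.
    return name * times

sig = inspect.signature(greet)
ba = sig.bind("hi", 3)   # the same call the wrapper performs on *args/**kwargs
ba.apply_defaults()
# ba.arguments maps parameter names to the bound values
assert dict(ba.arguments) == {"name": "hi", "times": 3}
# Annotations are exposed per-Parameter, which is what the checker iterates over
assert sig.parameters["name"].annotation is str
assert sig.return_annotation is str
```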

Acceptance

PEP 362 was accepted by Guido on Friday, June 22, 2012 [3]. The reference implementation was committed to trunk later that day.

pep-0363 Syntax For Dynamic Attribute Access

PEP: 363
Title: Syntax For Dynamic Attribute Access
Version: $Revision$
Last-Modified: $Date$
Author: Ben North <ben at redfrontdoor.org>
Status: Rejected
Type: Standards Track
Content-Type: text/plain
Created: 29-Jan-2007
Post-History: 12-Feb-2007

Abstract

    Dynamic attribute access is currently possible using the "getattr"
    and "setattr" builtins.  The present PEP suggests a new syntax to
    make such access easier, allowing the coder for example to write

        x.('foo_%d' % n) += 1

        z = y.('foo_%d' % n).('bar_%s' % s)

    instead of

        attr_name = 'foo_%d' % n
        setattr(x, attr_name, getattr(x, attr_name) + 1)

        z = getattr(getattr(y, 'foo_%d' % n), 'bar_%s' % s)


Rationale

    Dictionary access and indexing both have a friendly invocation
    syntax: instead of x.__getitem__(12) the coder can write x[12].
    This also allows the use of subscripted elements in an augmented
    assignment, as in "x[12] += 1".  The present proposal brings this
    ease-of-use to dynamic attribute access too.

    Attribute access is currently possible in two ways:

    * When the attribute name is known at code-writing time, the
      ".NAME" trailer can be used, as in

          x.foo = 42
          y.bar += 100

    * When the attribute name is computed dynamically at run-time, the
      "getattr" and "setattr" builtins must be used:

          x = getattr(y, 'foo_%d' % n)
          setattr(z, 'bar_%s' % s, 99)

      The "getattr" builtin also allows the coder to specify a default
      value to be returned in the event that the object does not have
      an attribute of the given name:

          x = getattr(y, 'foo_%d' % n, 0)
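This PEP was ultimately rejected, so the builtins above remain the way to perform dynamic attribute access; a runnable recap of all three operations, using a throwaway class C:

```python
class C:
    pass

obj = C()
n = 3
setattr(obj, 'foo_%d' % n, 42)             # dynamic set
assert getattr(obj, 'foo_%d' % n) == 42    # dynamic get
assert getattr(obj, 'bar', 0) == 0         # get with a default value
delattr(obj, 'foo_%d' % n)                 # dynamic delete
assert not hasattr(obj, 'foo_3')
```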

    This PEP describes a new syntax for dynamic attribute access ---
    "x.(expr)" --- with examples given in the Abstract above.

    (The new syntax could also allow the provision of a default value in
    the "get" case, as in:

        x = y.('foo_%d' % n, None)

    This 2-argument form of dynamic attribute access would not be
    permitted as the target of an (augmented or normal) assignment.  The
    "Discussion" section below includes opinions specifically on the
    2-argument extension.)

    Finally, the new syntax can be used with the "del" statement, as in

        del x.(attr_name)


Impact On Existing Code

    The proposed new syntax is not currently valid, so no existing
    well-formed programs have their meaning altered by this proposal.

    Across all "*.py" files in the 2.5 distribution, there are around
    600 uses of "getattr", "setattr" or "delattr".  They break down as
    follows (figures have some room for error because they were
    arrived at by partially-manual inspection):

        c.300 uses of plain "getattr(x, attr_name)", which could be
              replaced with the new syntax;

        c.150 uses of the 3-argument form, i.e., with the default
              value; these could be replaced with the 2-argument form
              of the new syntax (the cases break down into c.125 cases
              where the attribute name is a literal string, and c.25
              where it's only known at run-time);

        c.5   uses of the 2-argument form with a literal string
              attribute name, which I think could be replaced with the
              standard "x.attribute" syntax;

        c.120 uses of setattr, of which 15 use getattr to find the
              new value; all could be replaced with the new syntax,
              the 15 where getattr is also involved would show a
              particular increase in clarity;

        c.5   uses which would have to stay as "getattr" because they
              are calls of a variable named "getattr" whose default
              value is the builtin "getattr";

        c.5   uses of the 2-argument form, inside a try/except block
              which catches AttributeError and uses a default value
              instead; these could use 2-argument form of the new
              syntax;

        c.10  uses of "delattr", which could use the new syntax.

    As examples, the line

        setattr(self, attr, change_root(self.root, getattr(self, attr)))

    from Lib/distutils/command/install.py could be rewritten

        self.(attr) = change_root(self.root, self.(attr))

    and the line

        setattr(self, method_name, getattr(self.metadata, method_name))

    from Lib/distutils/dist.py could be rewritten

        self.(method_name) = self.metadata.(method_name)


Performance Impact

    Initial pystone measurements are inconclusive, but suggest there may
    be a performance penalty of around 1% in the pystones score with the
    patched version.  One suggestion is that this is because the longer
    main loop in ceval.c hurts the cache behaviour, but this has not
    been confirmed.

    On the other hand, measurements suggest a speed-up of around 40--45%
    for dynamic attribute access.


Error Cases

    Only strings are permitted as attribute names, so for instance the
    following error is produced:

        >>> x.(99) = 8
        Traceback (most recent call last):
          File "<stdin>", line 1, in <module>
        TypeError: attribute name must be string, not 'int'

    This is handled by the existing PyObject_GetAttr function.


Draft Implementation

    A draft implementation adds a new alternative to the "trailer"
    clause in Grammar/Grammar; a new AST type, "DynamicAttribute" in
    Python.asdl, with accompanying changes to symtable.c, ast.c, and
    compile.c, and three new opcodes (load/store/del) with
    accompanying changes to opcode.h and ceval.c.  The patch consists
    of c.180 additional lines in the core code, and c.100 additional
    lines of tests.  It is available as sourceforge patch #1657573 [1].


Mailing Lists Discussion

    Initial posting of this PEP in draft form was to python-ideas on
    20070209 [2], and the response was generally positive.  The PEP was
    then posted to python-dev on 20070212 [3], and an interesting
    discussion ensued.  A brief summary:

    Initially, there was reasonable (but not unanimous) support for the
    idea, although the precise choice of syntax had a more mixed
    reception.  Several people thought the "." would be too easily
    overlooked, with the result that the syntax could be confused with a
    method/function call.  A few alternative syntaxes were suggested:

        obj.(foo)
        obj.[foo]
        obj.{foo}
        obj{foo}
        obj.*foo
        obj->foo
        obj<-foo
        obj@[foo]
        obj.[[foo]]

    with "obj.[foo]" emerging as the preferred one.  In this initial
    discussion, the two-argument form was universally disliked, so it
    was to be taken out of the PEP.

    Discussion then took a step back to whether this particular feature
    provided enough benefit to justify new syntax.  As well as requiring
    coders to become familiar with the new syntax, there would also be
    the problem of backward compatibility --- code using the new syntax
    would not run on older versions of Python.

    Instead of new syntax, a new "wrapper class" was proposed, with the
    following specification / conceptual implementation suggested by
    Martin von Loewis:

        class attrs:
           def __init__(self, obj):
             self.obj = obj
           def __getitem__(self, name):
             return getattr(self.obj, name)
           def __setitem__(self, name, value):
             return setattr(self.obj, name, value)
           def __delitem__(self, name):
             return delattr(self.obj, name)
           def __contains__(self, name):
             return hasattr(self.obj, name)

    This was considered a cleaner and more elegant solution to the
    original problem.  (Another suggestion was a mixin class providing
    dictionary-style access to an object's attributes.)
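A sketch of the wrapper in use, assuming __delitem__ and __contains__ delegate to self.obj consistently:

```python
class attrs:
    # Conceptual wrapper: dictionary-style access to an object's attributes.
    def __init__(self, obj):
        self.obj = obj
    def __getitem__(self, name):
        return getattr(self.obj, name)
    def __setitem__(self, name, value):
        return setattr(self.obj, name, value)
    def __delitem__(self, name):
        return delattr(self.obj, name)
    def __contains__(self, name):
        return hasattr(self.obj, name)

class Target:
    pass

t = Target()
a = attrs(t)
a['foo_3'] = 99           # setattr under the hood
assert a['foo_3'] == 99   # getattr under the hood
assert 'foo_3' in a
del a['foo_3']
assert 'foo_3' not in a
```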

    The decision was made that the present PEP did not meet the burden
    of proof for the introduction of new syntax, a view which had been
    put forward by some from the beginning of the discussion.  The
    wrapper class idea was left open as a possibility for a future PEP.


References

    [1] Sourceforge patch #1657573
        http://sourceforge.net/tracker/index.php?func=detail&aid=1657573&group_id=5470&atid=305470

    [2] http://mail.python.org/pipermail/python-ideas/2007-February/000210.html
        and following posts

    [3] http://mail.python.org/pipermail/python-dev/2007-February/070939.html
        and following posts


Copyright

    This document has been placed in the public domain.


pep-0364 Transitioning to the Py3K Standard Library

PEP:364
Title:Transitioning to the Py3K Standard Library
Version:$Revision$
Last-Modified:$Date$
Author:Barry Warsaw <barry at python.org>
Status:Withdrawn
Type:Standards Track
Content-Type:text/x-rst
Created:01-Mar-2007
Python-Version:2.6
Post-History:

Abstract

PEP 3108 describes the reorganization of the Python standard library for the Python 3.0 release [1]. This PEP describes a mechanism for transitioning from the Python 2.x standard library to the Python 3.0 standard library. This transition will allow and encourage Python programmers to use the new Python 3.0 library names starting with Python 2.6, while maintaining the old names for backward compatibility. In this way, a Python programmer will be able to write forward compatible code without sacrificing interoperability with existing Python programs.

Rationale

PEP 3108 presents a rationale for Python standard library (stdlib) reorganization. The reader is encouraged to consult that PEP for details about why and how the library will be reorganized. Should PEP 3108 be accepted in part or in whole, then it is advantageous to allow Python programmers to begin the transition to the new stdlib module names in Python 2.x, so that they can write forward compatible code starting with Python 2.6.

Note that PEP 3108 proposes to remove some "silly old stuff", i.e. modules that are no longer useful or necessary. The PEP you are reading does not address this because there are no forward compatibility issues for modules that are to be removed, except to stop using such modules.

This PEP concerns only the mechanism by which mappings from old stdlib names to new stdlib names are maintained. Please consult PEP 3108 for all specific module renaming proposals. Specifically see the section titled Modules to Rename for guidelines on the old name to new name mappings. The few examples in this PEP are given for illustrative purposes only and should not be used for specific renaming recommendations.

Supported Renamings

There are at least 4 use cases explicitly supported by this PEP:

  • Simple top-level package name renamings, such as StringIO to stringio;
  • Sub-package renamings where the package name may or may not be renamed, such as email.MIMEText to email.mime.text;
  • Extension module renaming, such as cStringIO to cstringio;
  • Third party renaming of any of the above.

Two use cases supported by this PEP include renaming simple top-level modules, such as StringIO, as well as modules within packages, such as email.MIMEText.

In the former case, PEP 3108 currently recommends StringIO be renamed to stringio, following PEP 8 recommendations [2].

In the latter case, the email 4.0 package distributed with Python 2.5 already renamed email.MIMEText to email.mime.text, although it did so in a one-off, uniquely hackish way inside the email package. The mechanism described in this PEP is general enough to handle all module renamings, obviating the need for the Python 2.5 hack (except for backward compatibility with earlier Python versions).

An additional use case is to support the renaming of C extension modules. As long as the new name for the C module is importable, it can be remapped to the new name. E.g. cStringIO renamed to cstringio.

Third party package renaming is also supported, via several public interfaces accessible by any Python module.

Remappings are not performed recursively.

.mv files

Remapping files are called .mv files; the suffix was chosen to be evocative of the Unix mv(1) command. An .mv file is a simple line-oriented text file. All blank lines and lines that start with a # are ignored. All other lines must contain two whitespace separated fields. The first field is the old module name, and the second field is the new module name. Both module names must be specified using their full dotted-path names. Here is an example .mv file from Python 2.6:

# Map the various string i/o libraries to their new names
StringIO    stringio
cStringIO   cstringio

.mv files can appear anywhere in the file system, and there is a programmatic interface provided to parse them, and register the remappings inside them. By default, when Python starts up, all the .mv files in the oldlib package are read, and their remappings are automatically registered. This is where all the module remappings should be specified for top-level Python 2.x standard library modules.
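The format described above is simple enough that a parser fits in a few lines. A hypothetical sketch (parse_mv is not part of the proposed API):

```python
def parse_mv(text):
    # Parse .mv file content: skip blank lines and '#' comments, then
    # split each remaining line into its two whitespace-separated fields.
    mappings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith('#'):
            continue
        old_name, new_name = line.split()
        mappings[old_name] = new_name
    return mappings

sample = """\
# Map the various string i/o libraries to their new names
StringIO    stringio
cStringIO   cstringio
"""
assert parse_mv(sample) == {'StringIO': 'stringio', 'cStringIO': 'cstringio'}
```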

Implementation Specification

This section provides the full specification for how module renamings in Python 2.x are implemented. The central mechanism relies on various import hooks as described in PEP 302 [3]. Specifically sys.path_importer_cache, sys.path, and sys.meta_path are all employed to provide the necessary functionality.

When Python's import machinery is initialized, the oldlib package is imported. Inside oldlib there is a class called OldStdlibLoader. This class implements the PEP 302 interface and is automatically instantiated, with zero arguments. The constructor reads all the .mv files from the oldlib package directory, automatically registering all the remappings found in those .mv files. This is how the Python 2.x standard library is remapped.

The OldStdlibLoader class should not be instantiated by other Python modules. Instead, you can access the global OldStdlibLoader instance via the sys.stdlib_remapper instance. Use this instance if you want programmatic access to the remapping machinery.

One important implementation detail: as required by the PEP 302 API, a magic string is added to sys.path and to module __path__ attributes in order to hook in the remapping loader. This magic string is currently <oldlib>, and some changes were necessary to Python's site.py file so that all sys.path entries starting with < are treated as special. Specifically, no attempt is made to make them absolute file names (since they aren't file names at all).

In order for the remapping import hooks to work, the module or package must be physically located under its new name. This is because the import hooks catch only modules that are not already imported, and cannot be imported by Python's built-in import rules. Thus, if a module has been moved, say from Lib/StringIO.py to Lib/stringio.py, and the former's .pyc file has been removed, then without the remapper, this would fail:

import StringIO

Instead, with the remapper, this failing import will be caught, the old name will be looked up in the registered remappings, and in this case, the new name stringio will be found. The remapper then attempts to import the new name, and if that succeeds, it binds the resulting module into sys.modules, under both the old and new names. Thus, the above import will result in entries in sys.modules for 'StringIO' and 'stringio', and both will point to the exact same module object.

Note that no way to disable the remapping machinery is proposed, short of moving all the .mv files away or programmatically removing them in some custom start up code. In Python 3.0, the remappings will be eliminated, leaving only the "new" names.
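The end state described above (one module object reachable under both names) is easy to see with a plain sys.modules alias; this only illustrates the effect, not the actual import hook:

```python
import sys
import io

# Alias an existing module under a second (hypothetical old) name.
sys.modules['OldIO'] = io

import OldIO  # satisfied straight from sys.modules, no file lookup

assert OldIO is io
assert sys.modules['OldIO'] is sys.modules['io']
```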

Programmatic Interface

Several methods are added to the sys.stdlib_remapper object, which third party packages can use to register their own remappings. Note however that in all cases, there is one and only one mapping from an old name to a new name. If two .mv files contain different mappings for an old name, or if a programmatic call is made with an old name that is already remapped, the previous mapping is lost. This will not affect any already imported modules.

The following methods are available on the sys.stdlib_remapper object:

  • read_mv_file(filename) -- Read the given file and register all remappings found in the file.
  • read_directory_mv_files(dirname, suffix='.mv') -- List the given directory, reading all files in that directory that have the matching suffix (.mv by default). For each parsed file, register all the remappings found in that file.
  • set_mapping(oldname, newname) -- Register a new mapping from an old module name to a new module name. Both must be the full dotted-path name to the module. newname may be None in which case any existing mapping for oldname will be removed (it is not an error if there is no existing mapping).
  • get_mapping(oldname, default=None) -- Return any registered newname for the given oldname. If there is no registered remapping, default is returned.
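A minimal sketch of the set_mapping/get_mapping semantics described above (last registration wins; newname of None removes). This is purely illustrative, since sys.stdlib_remapper never shipped:

```python
class Remapper:
    # Illustrative stand-in for the proposed sys.stdlib_remapper object.
    def __init__(self):
        self._map = {}

    def set_mapping(self, oldname, newname):
        # newname=None removes any existing mapping; a missing mapping
        # is not an error.
        if newname is None:
            self._map.pop(oldname, None)
        else:
            self._map[oldname] = newname

    def get_mapping(self, oldname, default=None):
        return self._map.get(oldname, default)

r = Remapper()
r.set_mapping('StringIO', 'stringio')
r.set_mapping('StringIO', 'stringio2')          # previous mapping is lost
assert r.get_mapping('StringIO') == 'stringio2'
r.set_mapping('StringIO', None)                 # remove the mapping
assert r.get_mapping('StringIO', 'absent') == 'absent'
```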

Open Issues

  • Should there be a command line switch and/or environment variable to disable all remappings?

  • Should remappings occur recursively?

  • Should we automatically parse package directories for .mv files when the package's __init__.py is loaded? This would allow packages to easily include .mv files for their own remappings. Compare what the email package currently has to do if we place its .mv file in the email package instead of in the oldlib package:

    # Expose old names
    import os, sys
    sys.stdlib_remapper.read_directory_mv_files(os.path.dirname(__file__))
    

    I think we should automatically read a package's directory for any .mv files it might contain.

Reference Implementation

A reference implementation, in the form of a patch against the current (as of this writing) state of the Python 2.6 svn trunk, is available as SourceForge patch #1675334 [4]. Note that this patch includes a rename of cStringIO to cstringio, but this is primarily for illustrative and unit testing purposes. Should the patch be accepted, we might want to split this change off into other PEP 3108 changes.

References

[1]PEP 3108, Standard Library Reorganization, Cannon (http://www.python.org/dev/peps/pep-3108)
[2]PEP 8, Style Guide for Python Code, GvR, Warsaw (http://www.python.org/dev/peps/pep-0008)
[3]PEP 302, New Import Hooks, JvR, Moore (http://www.python.org/dev/peps/pep-0302)
[4]Reference implementation (http://bugs.python.org/issue1675334)

pep-0365 Adding the pkg_resources module

PEP:365
Title:Adding the pkg_resources module
Version:$Revision$
Last-Modified:$Date$
Author:Phillip J. Eby <pje at telecommunity.com>
Status:Rejected
Type:Standards Track
Content-Type:text/x-rst
Created:30-Apr-2007
Post-History:30-Apr-2007

Abstract

This PEP proposes adding an enhanced version of the pkg_resources module to the standard library.

pkg_resources is a module used to find and manage Python package/version dependencies and access bundled files and resources, including those inside of zipped .egg files. Currently, pkg_resources is only available through installing the entire setuptools distribution, but it does not depend on any other part of setuptools; in effect, it comprises the entire runtime support library for Python Eggs, and is independently useful.

In addition, with one feature addition, this module could support easy bootstrap installation of several Python package management tools, including setuptools, workingenv, and zc.buildout.

Proposal

Rather than proposing to include setuptools in the standard library, this PEP proposes only that pkg_resources be added to the standard library for Python 2.6 and 3.0. pkg_resources is considerably more stable than the rest of setuptools, with virtually no new features being added in the last 12 months.

However, this PEP also proposes that a new feature be added to pkg_resources, before being added to the stdlib. Specifically, it should be possible to do something like:

python -m pkg_resources SomePackage==1.2

to request downloading and installation of SomePackage from PyPI. This feature would not be a replacement for easy_install; instead, it would rely on SomePackage having pure-Python .egg files listed for download via the PyPI XML-RPC API, and the eggs would be placed in the $PYTHON_EGG_CACHE directory, where they would not be importable by default. (And no scripts would be installed.) However, if the downloaded egg contains installation bootstrap code, it will be given a chance to run.

These restrictions would allow the code to be extremely simple, yet still powerful enough to support users downloading package management tools such as setuptools, workingenv and zc.buildout, simply by supplying the tool's name on the command line.
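The command-line form shown above takes a "Name==version" requirement; a hypothetical helper for splitting such a spec (parse_spec is not part of pkg_resources):

```python
def parse_spec(spec):
    # Hypothetical helper: split a "Name==1.2" requirement into its parts.
    # Without '==' the version is unconstrained, reported here as None.
    name, sep, version = spec.partition('==')
    return (name, version) if sep else (name, None)

assert parse_spec('SomePackage==1.2') == ('SomePackage', '1.2')
assert parse_spec('SomePackage') == ('SomePackage', None)
```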

Rationale

Many users have requested that setuptools be included in the standard library, to save users needing to go through the awkward process of bootstrapping it. However, most of the bootstrapping complexity comes from the fact that setuptools-installed code cannot use the pkg_resources runtime module unless setuptools is already installed. Thus, installing setuptools requires (in a sense) that setuptools already be installed.

Other Python package management tools, such as workingenv and zc.buildout, have similar bootstrapping issues, since they both make use of setuptools, but also want to provide users with something approaching a "one-step install". The complexity of creating bootstrap utilities for these and any other such tools that arise in future, is greatly reduced if pkg_resources is already present, and is also able to download pre-packaged eggs from PyPI.

(It would also mean that setuptools would not need to be installed in order to simply use eggs, as opposed to building them.)

Finally, in addition to providing access to eggs built via setuptools or other packaging tools, it should be noted that since Python 2.5, the distutils has installed package metadata (aka PKG-INFO) files that can be read by pkg_resources to identify what distributions are already on sys.path. In environments where Python packages are installed using system package tools (like RPM), the pkg_resources module provides an API for detecting what versions of what packages are installed, even if those packages were installed via the distutils instead of setuptools.

Implementation and Documentation

The pkg_resources implementation is maintained in the Python SVN repository under /sandbox/trunk/setuptools/; see pkg_resources.py and pkg_resources.txt. Documentation for the egg format(s) supported by pkg_resources can be found in doc/formats.txt. HTML versions of these documents are available at:

(These HTML versions are for setuptools 0.6; they may not reflect all of the changes found in the Subversion trunk's .txt versions.)

pep-0366 Main module explicit relative imports

PEP:366
Title:Main module explicit relative imports
Version:$Revision$
Last-Modified:$Date$
Author:Nick Coghlan <ncoghlan at gmail.com>
Status:Final
Type:Standards Track
Content-Type:text/x-rst
Created:1-May-2007
Python-Version:2.6, 3.0
Post-History:1-May-2007, 4-Jul-2007, 7-Jul-2007, 23-Nov-2007

Abstract

This PEP proposes a backwards compatible mechanism that permits the use of explicit relative imports from executable modules within packages. Such imports currently fail due to an awkward interaction between PEP 328 and PEP 338.

By adding a new module level attribute, this PEP allows relative imports to work automatically if the module is executed using the -m switch. A small amount of boilerplate in the module itself will allow the relative imports to work when the file is executed by name.

Guido accepted the PEP in November 2007 [5].

Proposed Change

The major proposed change is the introduction of a new module level attribute, __package__. When it is present, relative imports will be based on this attribute rather than the module __name__ attribute.

As with the current __name__ attribute, setting __package__ will be the responsibility of the PEP 302 loader used to import a module. Loaders which use imp.new_module() to create the module object will have the new attribute set automatically to None. When the import system encounters an explicit relative import in a module without __package__ set (or with it set to None), it will calculate and store the correct value (__name__.rpartition('.')[0] for normal modules and __name__ for package initialisation modules). If __package__ has already been set then the import system will use it in preference to recalculating the package name from the __name__ and __path__ attributes.
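The calculation described above can be sketched directly (is_package distinguishes package initialisation modules; the function name is illustrative, not part of the import machinery):

```python
def compute_package(name, is_package):
    # Package initialisation modules keep their own name as the package;
    # ordinary modules drop the final dotted component.
    return name if is_package else name.rpartition('.')[0]

assert compute_package('pkg.sub.mod', False) == 'pkg.sub'
assert compute_package('pkg.sub', True) == 'pkg.sub'
# A top-level module yields '', which disables relative imports entirely.
assert compute_package('toplevel', False) == ''
```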

The runpy module will explicitly set the new attribute, basing it off the name used to locate the module to be executed rather than the name used to set the module's __name__ attribute. This will allow relative imports to work correctly from main modules executed with the -m switch.

When the main module is specified by its filename, then the __package__ attribute will be set to None. To allow relative imports when the module is executed directly, boilerplate similar to the following would be needed before the first relative import statement:

if __name__ == "__main__" and __package__ is None:
    __package__ = "expected.package.name"

Note that this boilerplate is sufficient only if the top level package is already accessible via sys.path. Additional code that manipulates sys.path would be needed in order for direct execution to work without the top level package already being importable.

This approach also has the same disadvantage as the use of absolute imports of sibling modules - if the script is moved to a different package or subpackage, the boilerplate will need to be updated manually. It has the advantage that this change need only be made once per file, regardless of the number of relative imports.

Note that setting __package__ to the empty string explicitly is permitted, and has the effect of disabling all relative imports from that module (since the import machinery will consider it to be a top level module in that case). This means that tools like runpy do not need to provide special case handling for top level modules when setting __package__.

Rationale for Change

The current inability to use explicit relative imports from the main module is the subject of at least one open SF bug report (#1510172) [1], and has most likely been a factor in at least a few queries on comp.lang.python (such as Alan Isaac's question in [2]).

This PEP is intended to provide a solution which permits explicit relative imports from main modules, without incurring any significant costs during interpreter startup or normal module import.

The section in PEP 338 on relative imports and the main module provides further details and background on this problem.

Reference Implementation

Rev 47142 in SVN implemented an early variant of this proposal which stored the main module's real module name in the __module_name__ attribute. It was reverted due to the fact that 2.5 was already in beta by that time.

Patch 1487 [4] is the proposed implementation for this PEP.

Alternative Proposals

PEP 3122 proposed addressing this problem by changing the way the main module is identified. That's a significant compatibility cost to incur to fix something that is a pretty minor bug in the overall scheme of things, and the PEP was rejected [3].

The advantage of the proposal in this PEP is that its only impact on normal code is the small amount of time needed to set the extra attribute when importing a module. Relative imports themselves should be sped up fractionally, as the package name is cached in the module globals, rather than having to be worked out again for each relative import.

pep-0367 New Super

PEP:367
Title:New Super
Version:$Revision$
Last-Modified:$Date$
Author:Calvin Spealman <ironfroggy at gmail.com>, Tim Delaney <timothy.c.delaney at gmail.com>
Status:Superseded
Type:Standards Track
Content-Type:text/x-rst
Created:28-Apr-2007
Python-Version:2.6
Post-History:28-Apr-2007, 29-Apr-2007 (1), 29-Apr-2007 (2), 14-May-2007

Numbering Note

This PEP has been renumbered to PEP 3135. The text below is the last version submitted under the old number.

Abstract

This PEP proposes syntactic sugar for use of the super type to automatically construct instances of the super type binding to the class that a method was defined in, and the instance (or class object for classmethods) that the method is currently acting upon.

The premise of the new super usage suggested is as follows:

super.foo(1, 2)

to replace the old:

super(Foo, self).foo(1, 2)

and the current __builtin__.super be aliased to __builtin__.__super__ (with __builtin__.super to be removed in Python 3.0).

It is further proposed that assignment to super become a SyntaxError, similar to the behaviour of None.

Rationale

The current usage of super requires an explicit passing of both the class and instance it must operate from, requiring a breaking of the DRY (Don't Repeat Yourself) rule. This hinders any change in class name, and is often considered a wart by many.

Specification

Within the specification section, some special terminology will be used to distinguish similar and closely related concepts. "super type" will refer to the actual builtin type named "super". A "super instance" is simply an instance of the super type, which is associated with a class and possibly with an instance of that class.

Because the new super semantics are not backwards compatible with Python 2.5, the new semantics will require a __future__ import:

from __future__ import new_super

The current __builtin__.super will be aliased to __builtin__.__super__. This will occur regardless of whether the new super semantics are active. It is not possible to simply rename __builtin__.super, as that would affect modules that do not use the new super semantics. In Python 3.0 it is proposed that the name __builtin__.super will be removed.

In place of the old usage of super, calls to the next class in the MRO (method resolution order) can be made without explicitly creating a super instance (although doing so will still be supported via __super__). Every function will have an implicit local named super. This name behaves identically to a normal local, including use by inner functions via a cell, with the following exceptions:

  1. Assigning to the name super will raise a SyntaxError at compile time;
  2. Calling a static method or normal function that accesses the name super will raise a TypeError at runtime.

Every function that uses the name super, or has an inner function that uses the name super, will include a preamble that performs the equivalent of:

super = __builtin__.__super__(<class>, <instance>)

where <class> is the class that the method was defined in, and <instance> is the first parameter of the method (normally self for instance methods, and cls for class methods). For static methods and normal functions, <class> will be None, resulting in a TypeError being raised during the preamble.
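In modern, runnable terms the preamble behaves roughly like the following hand-written equivalent (Base and Child are illustrative names, not part of the PEP):

```python
class Base(object):
    def greet(self):
        return 'base'

class Child(Base):
    def greet(self):
        # equivalent of the implicit preamble: <class> is the defining
        # class (Child), <instance> is the first parameter (self)
        _super = super(Child, self)
        return 'child+' + _super.greet()
```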

Note: The relationship between super and __super__ is similar to that between import and __import__.

Much of this was discussed in the "Fixing super anyone?" thread on the python-dev mailing list [1].

Open Issues

Determining the class object to use

The exact mechanism for associating the method with the defining class is not specified in this PEP, and should be chosen for maximum performance. For CPython, it is suggested that the class instance be held in a C-level variable on the function object which is bound to one of NULL (not part of a class), Py_None (static method) or a class object (instance or class method).

Should super actually become a keyword?

With this proposal, super would become a keyword to the same extent that None is a keyword. Further restricting the super name might simplify the implementation, but some object to the actual keywordization of super. The simplest solution is often the correct one, and here the simplest solution may well be to avoid adding keywords to the language when they are not needed. Still, keywordization may help resolve other open issues.

Closed Issues

super used with __call__ attributes

It was considered that instantiating super instances the classic way might be a problem: calling such an instance would look up the __call__ attribute and thus try to perform an automatic super lookup to the next class in the MRO. However, this was found to be false, because calling an object only looks up the __call__ method directly on the object's type. The following example shows this in action.

class A(object):
    def __call__(self):
        return '__call__'
    def __getattribute__(self, attr):
        if attr == '__call__':
            return lambda: '__getattribute__'
a = A()
assert a() == '__call__'
assert a.__call__() == '__getattribute__'

In any case, with the renaming of __builtin__.super to __builtin__.__super__ this issue goes away entirely.

Reference Implementation

It is impossible to implement the above specification entirely in Python. This reference implementation differs from the specification as follows:

  1. New super semantics are implemented using bytecode hacking.
  2. Assignment to super is not a SyntaxError. Also see point #4.
  3. Classes must either use the metaclass autosuper_meta or inherit from the base class autosuper to acquire the new super semantics.
  4. super is not an implicit local variable. In particular, for inner functions to be able to use the super instance, there must be an assignment of the form super = super in the method.

The reference implementation assumes that it is being run on Python 2.5+.

#!/usr/bin/env python
#
# autosuper.py

from array import array
import dis
import new
import types
import __builtin__
__builtin__.__super__ = __builtin__.super
del __builtin__.super

# We need these for modifying bytecode
from opcode import opmap, HAVE_ARGUMENT, EXTENDED_ARG

LOAD_GLOBAL = opmap['LOAD_GLOBAL']
LOAD_NAME = opmap['LOAD_NAME']
LOAD_CONST = opmap['LOAD_CONST']
LOAD_FAST = opmap['LOAD_FAST']
LOAD_ATTR = opmap['LOAD_ATTR']
STORE_FAST = opmap['STORE_FAST']
LOAD_DEREF = opmap['LOAD_DEREF']
STORE_DEREF = opmap['STORE_DEREF']
CALL_FUNCTION = opmap['CALL_FUNCTION']
STORE_GLOBAL = opmap['STORE_GLOBAL']
DUP_TOP = opmap['DUP_TOP']
POP_TOP = opmap['POP_TOP']
NOP = opmap['NOP']
JUMP_FORWARD = opmap['JUMP_FORWARD']
ABSOLUTE_TARGET = dis.hasjabs

def _oparg(code, opcode_pos):
    return code[opcode_pos+1] + (code[opcode_pos+2] << 8)

def _bind_autosuper(func, cls):
    co = func.func_code
    name = func.func_name
    newcode = array('B', co.co_code)
    codelen = len(newcode)
    newconsts = list(co.co_consts)
    newvarnames = list(co.co_varnames)

    # Check if the global 'super' keyword is already present
    try:
        sn_pos = list(co.co_names).index('super')
    except ValueError:
        sn_pos = None

    # Check if the varname 'super' keyword is already present
    try:
        sv_pos = newvarnames.index('super')
    except ValueError:
        sv_pos = None

    # Check if the callvar 'super' keyword is already present
    try:
        sc_pos = list(co.co_cellvars).index('super')
    except ValueError:
        sc_pos = None

    # If 'super' isn't used anywhere in the function, we don't have anything to do
    if sn_pos is None and sv_pos is None and sc_pos is None:
        return func

    c_pos = None
    s_pos = None
    n_pos = None

    # Check if the 'cls_name' and 'super' objects are already in the constants
    for pos, o in enumerate(newconsts):
        if o is cls:
            c_pos = pos

        if o is __super__:
            s_pos = pos

        if o == name:
            n_pos = pos

    # Add in any missing objects to constants and varnames
    if c_pos is None:
        c_pos = len(newconsts)
        newconsts.append(cls)

    if n_pos is None:
        n_pos = len(newconsts)
        newconsts.append(name)

    if s_pos is None:
        s_pos = len(newconsts)
        newconsts.append(__super__)

    if sv_pos is None:
        sv_pos = len(newvarnames)
        newvarnames.append('super')

    # This goes at the start of the function. It is:
    #
    #   super = __super__(cls, self)
    #
    # If 'super' is a cell variable, we store to both the
    # local and cell variables (i.e. STORE_FAST and STORE_DEREF).
    #
    preamble = [
        LOAD_CONST, s_pos & 0xFF, s_pos >> 8,
        LOAD_CONST, c_pos & 0xFF, c_pos >> 8,
        LOAD_FAST, 0, 0,
        CALL_FUNCTION, 2, 0,
    ]

    if sc_pos is None:
        # 'super' is not a cell variable - we can just use the local variable
        preamble += [
            STORE_FAST, sv_pos & 0xFF, sv_pos >> 8,
        ]
    else:
        # If 'super' is a cell variable, we need to handle LOAD_DEREF.
        preamble += [
            DUP_TOP,
            STORE_FAST, sv_pos & 0xFF, sv_pos >> 8,
            STORE_DEREF, sc_pos & 0xFF, sc_pos >> 8,
        ]

    preamble = array('B', preamble)

    # Bytecode for loading the local 'super' variable.
    load_super = array('B', [
        LOAD_FAST, sv_pos & 0xFF, sv_pos >> 8,
    ])

    preamble_len = len(preamble)
    need_preamble = False
    i = 0

    while i < codelen:
        opcode = newcode[i]
        need_load = False
        remove_store = False

        if opcode == EXTENDED_ARG:
            raise TypeError("Cannot use 'super' in function with EXTENDED_ARG opcode")

        # If the opcode is an absolute target it needs to be adjusted
        # to take into account the preamble.
        elif opcode in ABSOLUTE_TARGET:
            oparg = _oparg(newcode, i) + preamble_len
            newcode[i+1] = oparg & 0xFF
            newcode[i+2] = oparg >> 8

        # If LOAD_GLOBAL(super) or LOAD_NAME(super) then we want to change it into
        # LOAD_FAST(super)
        elif (opcode == LOAD_GLOBAL or opcode == LOAD_NAME) and _oparg(newcode, i) == sn_pos:
            need_preamble = need_load = True

        # If LOAD_FAST(super) then we just need to add the preamble
        elif opcode == LOAD_FAST and _oparg(newcode, i) == sv_pos:
            need_preamble = need_load = True

        # If LOAD_DEREF(super) then we change it into LOAD_FAST(super) because
        # it's slightly faster.
        elif opcode == LOAD_DEREF and _oparg(newcode, i) == sc_pos:
            need_preamble = need_load = True

        if need_load:
            newcode[i:i+3] = load_super

        i += 1

        if opcode >= HAVE_ARGUMENT:
            i += 2

    # No changes needed - get out.
    if not need_preamble:
        return func

    # Our preamble will have 3 things on the stack
    co_stacksize = max(3, co.co_stacksize)

    # Conceptually, our preamble is on the `def` line.
    co_lnotab = array('B', co.co_lnotab)

    if co_lnotab:
        co_lnotab[0] += preamble_len

    co_lnotab = co_lnotab.tostring()

    # Our code consists of the preamble and the modified code.
    codestr = (preamble + newcode).tostring()

    codeobj = new.code(co.co_argcount, len(newvarnames), co_stacksize,
                       co.co_flags, codestr, tuple(newconsts), co.co_names,
                       tuple(newvarnames), co.co_filename, co.co_name,
                       co.co_firstlineno, co_lnotab, co.co_freevars,
                       co.co_cellvars)

    func.func_code = codeobj
    func.func_class = cls
    return func

class autosuper_meta(type):
    def __init__(cls, name, bases, clsdict):
        UnboundMethodType = types.UnboundMethodType

        for v in vars(cls):
            o = getattr(cls, v)
            if isinstance(o, UnboundMethodType):
                _bind_autosuper(o.im_func, cls)

class autosuper(object):
    __metaclass__ = autosuper_meta

if __name__ == '__main__':
    class A(autosuper):
        def f(self):
            return 'A'

    class B(A):
        def f(self):
            return 'B' + super.f()

    class C(A):
        def f(self):
            def inner():
                return 'C' + super.f()

            # Needed to put 'super' into a cell
            super = super
            return inner()

    class D(B, C):
        def f(self, arg=None):
            var = None
            return 'D' + super.f()

    assert D().f() == 'DBCA'

Disassembly of B.f and C.f reveals the different preambles used when super is simply a local variable compared to when it is used by an inner function.

>>> dis.dis(B.f)

214           0 LOAD_CONST               4 (<type 'super'>)
              3 LOAD_CONST               2 (<class '__main__.B'>)
              6 LOAD_FAST                0 (self)
              9 CALL_FUNCTION            2
             12 STORE_FAST               1 (super)

215          15 LOAD_CONST               1 ('B')
             18 LOAD_FAST                1 (super)
             21 LOAD_ATTR                1 (f)
             24 CALL_FUNCTION            0
             27 BINARY_ADD
             28 RETURN_VALUE
>>> dis.dis(C.f)

218           0 LOAD_CONST               4 (<type 'super'>)
              3 LOAD_CONST               2 (<class '__main__.C'>)
              6 LOAD_FAST                0 (self)
              9 CALL_FUNCTION            2
             12 DUP_TOP
             13 STORE_FAST               1 (super)
             16 STORE_DEREF              0 (super)

219          19 LOAD_CLOSURE             0 (super)
             22 LOAD_CONST               1 (<code object inner at 00C160A0, file "autosuper.py", line 219>)
             25 MAKE_CLOSURE             0
             28 STORE_FAST               2 (inner)

223          31 LOAD_FAST                1 (super)
             34 STORE_DEREF              0 (super)

224          37 LOAD_FAST                2 (inner)
             40 CALL_FUNCTION            0
             43 RETURN_VALUE

Note that in the final implementation, the preamble would not be part of the bytecode of the method, but would occur immediately following unpacking of parameters.

Alternative Proposals

No Changes

Although it's always attractive to keep things as they are, people have sought a change in super's usage for some time, for good reasons, all mentioned previously:

  • Decoupling from the class name (which might not even be bound to the right class anymore!)
  • Simpler looking, cleaner super calls would be better

Dynamic attribute on super type

This proposal adds a dynamic attribute lookup to the super type, which automatically determines the proper class and instance parameters. Each super attribute lookup identifies these parameters and performs the super lookup on the instance, just as the current super implementation does with the explicit invocation of a super instance upon a class and instance.

This proposal relies on sys._getframe(), which is not appropriate for anything except a prototype implementation.

super(__this_class__, self)

This is nearly an anti-proposal, as it basically relies on the acceptance of the __this_class__ PEP, which proposes a special name that would always be bound to the class within which it is used. If that is accepted, __this_class__ could simply be used instead of the class' name explicitly, solving the name binding issues [2].

self.__super__.foo(*args)

The __super__ attribute is mentioned in this PEP in several places, and could be a candidate for the complete solution, used explicitly instead of any direct super usage. However, double-underscore names are usually an internal detail and are best kept out of everyday code.

super(self, *args) or __super__(self, *args)

This solution only solves the problem of the type indication, does not handle differently named super methods, and is explicit about the name of the instance. It is less flexible because it cannot be applied to other method names in cases where that is needed. One use case it fails is where a base class has a factory classmethod and a subclass has two factory classmethods, both of which need to make proper super calls to the one in the base class.

super.foo(self, *args)

This variation actually eliminates the problems with locating the proper instance, and if any of the alternatives were pushed into the spotlight, I would want it to be this one.

super or super()

This proposal leaves no room for different names, signatures, or application to other classes or instances. A way to allow some similar use alongside the normal proposal would be preferable, encouraging good design of multiple inheritance trees and compatible methods.

super(*p, **kw)

There has been a proposal that directly calling super(*p, **kw) would be equivalent to calling the method on the super object with the same name as the method currently being executed, i.e. the following two methods would be equivalent:

def f(self, *p, **kw):
    super.f(*p, **kw)
def f(self, *p, **kw):
    super(*p, **kw)

There is strong sentiment for and against this, but implementation and style concerns are obvious. Guido has suggested that this should be excluded from this PEP on the principle of KISS (Keep It Simple Stupid).

History

29-Apr-2007 - Changed title from "Super As A Keyword" to "New Super"
  • Updated much of the language and added a terminology section for clarification in confusing places.
  • Added reference implementation and history sections.
06-May-2007 - Updated by Tim Delaney to reflect discussions on the python-3000
and python-dev mailing lists.

pep-0368 Standard image protocol and class

PEP:368
Title:Standard image protocol and class
Version:$Revision$
Last-Modified:$Date$
Author:Lino Mastrodomenico <l.mastrodomenico at gmail.com>
Status:Deferred
Type:Standards Track
Content-Type:text/x-rst
Created:28-Jun-2007
Python-Version:2.6, 3.0
Post-History:

Abstract

The current situation of image storage and manipulation in the Python world is extremely fragmented: almost every library that uses image objects has implemented its own image class, incompatible with everyone else's and often not very pythonic. A basic RGB image class exists in the standard library (Tkinter.PhotoImage), but is pretty much unusable, and unused, for anything except Tkinter programming.

This fragmentation not only takes up valuable space in developers' minds, but also makes the exchange of images between different libraries (needed in relatively common use cases) slower and more complex than it needs to be.

This PEP proposes to improve the situation by defining a simple and pythonic image protocol/interface that can hopefully be accepted and implemented by existing image classes inside and outside the standard library without breaking backward compatibility with their existing user bases. In practice this is a definition of how a minimal image-like object should look and act (in a similar way to the read() and write() methods in file-like objects).

The inclusion in the standard library of a class that provides basic image manipulation functionality and implements the new protocol is also proposed, together with a mixin class that helps adding support for the protocol to existing image classes.

PEP Deferral

Further exploration of the concepts covered in this PEP has been deferred for lack of a current champion interested in promoting the goals of the PEP and collecting and incorporating feedback, and with sufficient available time to do so effectively.

Rationale

A good way to have high quality modules ready for inclusion in the Python standard library is to simply wait for natural selection among competing external libraries to provide a clear winner with useful functionality and a big user base. Then the de-facto standard can be officially sanctioned by including it in the standard library.

Unfortunately this approach hasn't worked well for the creation of a dominant image class in the Python world: almost every third-party library that requires an image object creates its own class incompatible with the ones from other libraries. This is a real problem because it's entirely reasonable for a program to create and manipulate an image using, e.g., PIL (the Python Imaging Library) and then display it using wxPython or pygame. But these libraries have different and incompatible image classes, and the usual solution is to manually "export" an image from the source to a (width, height, bytes_string) tuple and "import" it creating a new instance in the target format. This approach works, but is both uglier and slower than it needs to be.

Another "solution" that has sometimes been used is the creation of specific adapters and/or converters from one class to another (e.g. PIL offers the ImageTk module for converting PIL images to a class compatible with the Tkinter one). But this approach doesn't scale well with the number of libraries involved and it's still annoying for the user: if I have a perfectly good image object, why should I convert it before passing it to the next method? Why can't it simply accept my image as-is?

The problem isn't by any stretch limited to the three mentioned libraries and probably has multiple causes, including two that are, in my opinion, very important to understand before solving it:

  • in today's computing world an image is a basic type not strictly tied to a specific domain. This is why there will never be a clear winner between the image classes from the three libraries mentioned above (PIL, wxPython and pygame): they cover different domains and don't really compete with each other;
  • the Python standard library has never provided a good image class that can be adopted or imitated by third-party modules. Tkinter.PhotoImage provides basic RGB functionality, but it's by far the slowest and ugliest of the bunch and it can be instantiated only after the Tkinter root window has been created.

This PEP tries to improve this situation in four ways:

  1. It defines a simple and pythonic image protocol/interface (both on the Python and the C side) that can hopefully be accepted and implemented by existing image classes inside and outside the standard library without breaking backward compatibility with their existing user bases.
  2. It proposes the inclusion in the standard library of three new classes:
    • ImageMixin provides almost everything necessary to implement the new protocol; its main purpose is to make it as simple as possible for existing libraries to support this interface, in some cases as simple as adding it to the list of base classes and making minor additions to the constructor.
    • Image is a subclass of ImageMixin and will add a constructor that can resize and/or convert an image between different pixel formats. This is intended to provide a fast and efficient default implementation of the new protocol.
    • ImageSize is a minor helper class. See below for details.
  3. Tkinter.PhotoImage will implement the new protocol (mostly through the ImageMixin class) and all the Tkinter methods that can receive an image will be modified to accept any object that implements the interface. As an aside, the author of this PEP will collaborate with the developers of the most common external libraries to achieve the same goal (supporting the protocol in their classes and accepting any class that implements it).
  4. New PyImage_* functions will be added to the CPython C API: they implement the C side of the protocol and accept as first parameter any object that supports it, even if it isn't an instance of the Image/ImageMixin classes.

The main effects for the end user will be a simplification of the interchange of images between different libraries (if everything goes well, any Python library will accept images from any other library) and the out-of-the-box availability of the new Image class. The new class is intended to cover simple but common use cases like cropping and/or resizing a photograph to the desired size and passing it an appropriate widget for displaying it on a window, or darkening a texture and passing it to a 3D library.

The Image class is not intended to replace or compete with PIL, Pythonmagick or NumPy, even if it provides a (very small) subset of the functionality of these three libraries. In particular PIL offers very rich image manipulation features with dozens of classes, filters, transformations and file formats. The inclusion of PIL (or something similar) in the standard library may, or may not, be a worthy goal but it's completely outside the scope of this PEP.

Specification

The imageop module is used as the default location for the new classes and objects because it has for a long time hosted functions that provided a somewhat similar functionality, but a new module may be created if preferred (e.g. a new "image" or "media" module; the latter may eventually include other multimedia classes).

MODES is a new module level constant: it is a set of the pixel formats supported by the Image class. Any image object that implements the new protocol is guaranteed to be formatted in one of these modes, but libraries that accept images are allowed to support only a subset of them.

These modes are in turn also available as module level constants (e.g. imageop.RGB).

The following table is a summary of the modes currently supported and their properties:

Name       Component names   Bits per component   Subsampling   Valid intervals
L          l (lowercase L)   8                    no            full range
L16        l                 16                   no            full range
L32        l                 32                   no            full range
LA         l, a              8                    no            full range
LA32       l, a              16                   no            full range
RGB        r, g, b           8                    no            full range
RGB48      r, g, b           16                   no            full range
RGBA       r, g, b, a        8                    no            full range
RGBA64     r, g, b, a        16                   no            full range
YV12       y, cr, cb         8                    1, 2, 2       16-235, 16-240, 16-240
JPEG_YV12  y, cr, cb         8                    1, 2, 2       full range
CMYK       c, m, y, k        8                    no            full range
CMYK64     c, m, y, k        16                   no            full range

When the name of a mode ends with a number, it represents the average number of bits per pixel. All the other modes simply use a byte per component per pixel.
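As a sanity check, the number in the mode name can be derived from the per-component bit depth and the subsampling factors given in the table above (avg_bits_per_pixel is a hypothetical helper, not part of the proposed API):

```python
def avg_bits_per_pixel(bits_per_component, subsampling):
    # each component contributes its bit depth divided by the number of
    # pixels that share one value (x_factor * y_factor)
    return sum(bits_per_component / (x * y) for x, y in subsampling)

rgb48 = avg_bits_per_pixel(16, [(1, 1)] * 3)            # three 16-bit planes
yv12 = avg_bits_per_pixel(8, [(1, 1), (2, 2), (2, 2)])  # 4:2:0 subsampling
```

This yields 48 for RGB48 and 12 for YV12, matching the mode names.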

No palette modes or modes with less than 8 bits per component are supported. Welcome to the 21st century.

Here's a quick description of the modes and the rationale for their inclusion; there are four groups of modes:

  1. grayscale (L* modes): they are heavily used in scientific computing (those people may also need a very high dynamic range and precision, hence L32, the only mode with 32 bits per component) and sometimes it can be useful to consider a single component of a color image as a grayscale image (this is used by the individual planes of the planar images, see YV12 below); the name of the component ('l', lowercase letter L) stands for luminance, the second optional component ('a') is the alpha value and represents the opacity of the pixels: alpha = 0 means full transparency, alpha = 255/65535 represents a fully opaque pixel;

  2. RGB* modes: the garden variety color images. The optional alpha component has the same meaning as in grayscale modes;

  3. YCbCr, a.k.a. YUV (*YV12 modes). These modes are planar (i.e. the values of all the pixels for each component are stored in a consecutive memory area, instead of the usual arrangement where all the components of a pixel reside in consecutive bytes) and use a 1, 2, 2 (a.k.a. 4:2:0) subsampling (i.e. each pixel has its own Y value, but the Cb and Cr components are shared between groups of 2x2 adjacent pixels) because this is the format that's by far the most common for YCbCr images. Please note that the V (Cr) plane is stored before the U (Cb) plane.

    YV12 is commonly used for MPEG2 (including DVDs), MPEG4 (both ASP/DivX and AVC/H.264) and Theora video frames. Valid values for Y are in range(16, 236), and valid values for Cb and Cr are in range(16, 241). JPEG_YV12 is similar to YV12, but the three components can have the full range of 256 values. It's the native format used by almost all JPEG/JFIF files and by MJPEG video frames. The "strangeness" of these two with respect to all the other supported modes derives from the fact that they are widely used that way by a lot of existing libraries and applications; this is also the reason why they are included (along with the fact that they can't be losslessly converted to RGB, because YCbCr is a bigger color space); the funny 4:2:0 planar arrangement of the pixel values is relatively easy to support because in most cases the three planes can be considered three separate grayscale images;

  4. CMYK* modes (cyan, magenta, yellow and black) are subtractive color modes, used for printing color images on dead trees. Professional designers love to pretend that they can't live without them, so here they are.

Python API

See the examples below.

In Python 2.x, all the new classes defined here are new-style classes.

Mode Objects

The mode objects offer a number of attributes and methods that can be used for implementing generic algorithms that work on different types of images:

components

The number of components per pixel (e.g. 4 for an RGBA image).

component_names

A tuple of strings; see the column "Component names" in the above table.

bits_per_component

8, 16 or 32; see "Bits per component" in the above table.

bytes_per_pixel

components * bits_per_component // 8, only available for non planar modes (see below).

planar

Boolean; True if the image components reside each in a separate plane. Currently this happens if and only if the mode uses subsampling.

subsampling

A tuple that for each component in the mode contains a tuple of two integers that represent the amount of downsampling in the horizontal and vertical direction, respectively. In practice it's ((1, 1), (2, 2), (2, 2)) for YV12 and JPEG_YV12 and ((1, 1),) * components for everything else.

x_divisor

max(x for x, y in subsampling); the width of an image that uses this mode must be divisible by this value.

y_divisor

max(y for x, y in subsampling); the height of an image that uses this mode must be divisible by this value.

intervals

A tuple that for each component in the mode contains a tuple of two integers: the minimum and maximum valid value for the component. Its value is ((16, 235), (16, 240), (16, 240)) for YV12 and ((0, 2 ** bits_per_component - 1),) * components for everything else.

get_length(iterable[integer]) -> int

The parameter must be an iterable that contains two integers: the width and height of an image; it returns the number of bytes needed to store an image of these dimensions with this mode.
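A plausible sketch of the computation, under the subsampling representation described above (the standalone function and its parameters are assumptions for illustration; in the PEP, get_length is a method of the mode object):

```python
def get_length(size, bits_per_component, subsampling):
    # sum the byte sizes of each (possibly subsampled) component plane
    width, height = size
    total_bits = sum(
        (width // sx) * (height // sy) * bits_per_component
        for sx, sy in subsampling
    )
    return total_bits // 8
```

For a 4x4 RGB image this gives 48 bytes; for a 4x4 YV12 image, 16 (Y) + 4 (Cr) + 4 (Cb) = 24 bytes.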

Implementation detail: the modes are instances of a subclass of str and have a value equal to their name (e.g. imageop.RGB == 'RGB'), except for L32, which has the value 'I'. This is only intended for backward compatibility with existing PIL users; new code that uses the image protocol proposed here should not rely on this detail.
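The str-subclass detail can be sketched like this (Mode and its constructor arguments are hypothetical, not the PEP's actual implementation):

```python
class Mode(str):
    """Behaves as its name string but carries extra mode attributes."""
    def __new__(cls, name, components, bits_per_component):
        self = super().__new__(cls, name)
        self.components = components
        self.bits_per_component = bits_per_component
        return self

RGB = Mode('RGB', 3, 8)
L32 = Mode('I', 1, 32)  # L32 compares equal to 'I' for PIL compatibility
```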

Image Protocol

Any object that supports the image protocol must provide the following methods and attributes:

mode

The format and the arrangement of the pixels in this image; it's one of the constants in the MODES set.

size

An instance of the ImageSize class; it's a named tuple of two integers: the width and the height of the image in pixels; both of them must be >= 1 and can also be accessed as the width and height attributes of size.
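As described, ImageSize behaves like a two-field named tuple; a minimal stand-in:

```python
from collections import namedtuple

# a named tuple gives both index access and .width/.height attributes
ImageSize = namedtuple('ImageSize', ('width', 'height'))

size = ImageSize(640, 480)
```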

buffer

A sequence of integers between 0 and 255; they are the actual bytes used for storing the image data (i.e. modifying their values affects the image pixels and vice versa); the data has a row-major/C-contiguous order without padding and without any special memory alignment, even when there are more than 8 bits per component. The only supported methods are __len__, __getitem__/__setitem__ (with both integers and slice indexes) and __iter__; on the C side it implements the buffer protocol.

This is a pretty low level interface to the image and the user is responsible for using the correct (native) byte order for modes with more than 8 bits per component and the correct value ranges for YV12 images. A buffer may or may not keep a reference to its image, but it's still safe (if useless) to use the buffer even after the corresponding image has been destroyed by the garbage collector (this will require changes to the image class of wxPython and possibly other libraries). Implementation detail: this can be an array('B'), a bytes() object or a specialized fixed-length type.
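For non-planar modes, the row-major, unpadded layout means a component's byte offset can be computed directly (buffer_index is a hypothetical helper, assuming 8 bits per component):

```python
def buffer_index(x, y, width, component, bytes_per_pixel):
    # offset of one component of the pixel at (x, y): rows are contiguous,
    # with no padding between them
    return (y * width + x) * bytes_per_pixel + component
```

E.g. the green component (index 1) of pixel (2, 1) in a 4-pixel-wide RGB image lives at byte (1*4 + 2)*3 + 1 == 19.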

info

A dict object that can contain arbitrary metadata associated with the image (e.g. DPI, gamma, ICC profile, exposure time...); the interpretation of this data is beyond the scope of this PEP and probably depends on the library used to create and/or to save the image; if a method of the image returns a new image, it can copy or adapt metadata from its own info attribute (the ImageMixin implementation always creates a new image with an empty info dictionary).
bits_per_component
bytes_per_pixel
component_names
components
intervals
planar
subsampling
Shortcuts for the corresponding mode.* attributes.

map(function[, function...]) -> None

For every pixel in the image, maps each component through the corresponding function. If only one function is passed, it is used repeatedly for each component. This method modifies the image in place and is usually very fast (most of the time the functions are called only a small number of times, possibly only once for simple functions without branches), but it imposes a number of restrictions on the function(s) passed:

  • it must accept a single integer argument and return a number (map will round the result to the nearest integer and clip it to range(0, 2 ** bits_per_component), if necessary);
  • it must not try to intercept any BaseException, Exception or any unknown subclass of Exception raised by any operation on the argument (implementations may try to optimize the speed by passing funny objects, so even a simple "if n == 10:" may raise an exception: simply ignore it, map will take care of it); catching any other exception is fine;
  • it should be side-effect free and its result should not depend on values (other than the argument) that may change during a single invocation of map.
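The rounding and clipping behaviour described above can be sketched in pure Python (a minimal illustration; map_components and _clip_component are hypothetical helper names, not part of the proposed API):

```python
# Sketch of the rounding and clipping that map() is specified to apply
# to each function result; names here are illustrative only.
def _clip_component(result, bits_per_component=8):
    # map() rounds to the nearest integer and clips the result to
    # range(0, 2 ** bits_per_component)
    value = int(round(result))
    top = 2 ** bits_per_component - 1
    return min(max(value, 0), top)

def map_components(components, func, bits_per_component=8):
    """Apply func to every component value, as image.map() would."""
    return tuple(_clip_component(func(c), bits_per_component)
                 for c in components)
```

For example, a brightness boost such as `lambda c: c * 1.5` is clipped at the top of the component range instead of overflowing.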
rotate90() -> image
rotate180() -> image
rotate270() -> image
Return a copy of the image rotated 90, 180 or 270 degrees counterclockwise around its center.

clip() -> None

Saturates invalid component values in YV12 images to the minimum or maximum allowed (see mode.intervals); for other image modes this method does nothing. It is very fast. Libraries that save/export YV12 images are encouraged to always call this method, since intermediate operations (e.g. the map method) may assign values outside the valid intervals to pixels.

split() -> tuple[image]

Returns a tuple of L, L16 or L32 images corresponding to the individual components in the image.

Planar images also support attributes with the same names defined in component_names: they contain grayscale (mode L) images that offer a view on the pixel values for the corresponding component; any change to the subimages is immediately reflected in the parent image and vice versa (their buffers refer to the same memory location).

Non-planar images offer the following additional methods:

pixels() -> iterator[pixel]

Returns an iterator that iterates over all the pixels in the image, starting from the top line and scanning each line from left to right. See below for a description of the pixel objects.

__iter__() -> iterator[line]

Returns an iterator that iterates over all the lines in the image, from top to bottom. See below for a description of the line objects.

__len__() -> int

Returns the number of lines in the image (size.height).

__getitem__(integer) -> line

Returns the line at the specified (y) position.

__getitem__(tuple[integer]) -> pixel

The parameter must be a tuple of two integers; they are interpreted respectively as x and y coordinates in the image (0, 0 is the top left corner) and a pixel object is returned.

__getitem__(slice | tuple[integer | slice]) -> image

The parameter must be a slice or a tuple that contains two slices or an integer and a slice; the selected area of the image is copied and a new image is returned; image[x:y:z] is equivalent to image[:, x:y:z].

__setitem__(tuple[integer], integer | iterable[integer]) -> None

Modifies the pixel at specified position; image[x, y] = integer is a shortcut for image[x, y] = (integer,) for images with a single component.

__setitem__(slice | tuple[integer | slice], image) -> None

Selects an area in the same way as the corresponding form of the __getitem__ method and assigns to it a copy of the pixels from the image in the second argument, which must have exactly the same mode as this image and the same size as the specified area; the alpha component, if present, is simply copied and doesn't affect the other components of the image (i.e. no alpha compositing is performed).
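As a minimal sketch of the subscript semantics above (coordinate order, pixel read/write, number of lines), a toy single-component image might look like the following; ToyImage is purely illustrative and omits the slice forms and the proposed buffer layout rules:

```python
# A toy single-component "image" illustrating image[x, y] access with
# (0, 0) the top left corner; not the proposed imageop implementation.
class ToyImage:
    def __init__(self, width, height):
        self.width, self.height = width, height
        self.buffer = bytearray(width * height)  # one byte per pixel

    def __len__(self):
        return self.height  # __len__ returns the number of lines

    def __getitem__(self, key):
        if isinstance(key, tuple) and all(isinstance(k, int) for k in key):
            x, y = key  # pixel access: image[x, y]
            return self.buffer[y * self.width + x]
        raise TypeError("slice forms omitted from this sketch")

    def __setitem__(self, key, value):
        x, y = key
        self.buffer[y * self.width + x] = value
```

Usage: `img = ToyImage(4, 3); img[2, 1] = 7` reads back as `img[2, 1] == 7`, and `len(img) == 3` (the height).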

The mode, size and buffer (including the address in memory of the buffer) never change after an image is created.

It is expected that, if PEP 3118 is accepted, all the image objects will support the new buffer protocol, however this is beyond the scope of this PEP.

Image and ImageMixin Classes

The ImageMixin class implements all the methods and attributes described above except mode, size, buffer and info. Image is a subclass of ImageMixin that adds support for these four attributes and offers the following constructor (please note that the constructor is not part of the image protocol):

__init__(mode, size, color, source)

mode must be one of the constants in the MODES set, size is a sequence of two integers (width and height of the new image); color is a sequence of integers, one for each component of the image, used to initialize all the pixels to the same value; source can be a sequence of integers of the appropriate size and format that is copied as-is in the buffer of the new image or an existing image; in Python 2.x source can also be an instance of str and is interpreted as a sequence of bytes. color and source are mutually exclusive and if they are both omitted the image is initialized to transparent black (all the bytes in the buffer have value 16 in the YV12 mode, 255 in the CMYK* modes and 0 for everything else). If source is present and is an image, mode and/or size can be omitted; if they are specified and are different from the source mode and/or size, the source image is converted.

The exact algorithms used for resizing and doing color space conversions may differ between Python versions and implementations, but they always give high quality results (e.g.: a cubic spline interpolation can be used for upsampling and an antialias filter can be used for downsampling images); any combination of mode conversion is supported, but the algorithm used for conversions to and from the CMYK* modes is pretty naïve: if you have the exact color profiles of your devices you may want to use a good color management tool such as LittleCMS. The new image has an empty info dict.

Line Objects

The line objects (returned, e.g., when iterating over an image) support the following attributes and methods:

mode

The mode of the image from where this line comes.

__iter__() -> iterator[pixel]

Returns an iterator that iterates over all the pixels in the line, from left to right. See below for a description of the pixel objects.

__len__() -> int

Returns the number of pixels in the line (the image width).

__getitem__(integer) -> pixel

Returns the pixel at the specified (x) position.

__getitem__(slice) -> image

The selected part of the line is copied and a new image is returned; the new image will always have height 1.

__setitem__(integer, integer | iterable[integer]) -> None

Modifies the pixel at the specified position; line[x] = integer is a shortcut for line[x] = (integer,) for images with a single component.

__setitem__(slice, image) -> None

Selects a part of the line and assigns to it a copy of the pixels from the image in the second argument, which must have height 1, a width equal to the length of the specified slice and the same mode as this line; the alpha component, if present, is simply copied and doesn't affect the other components of the image (i.e. no alpha compositing is performed).

Pixel Objects

The pixel objects (returned, e.g., when iterating over a line) support the following attributes and methods:

mode

The mode of the image from where this pixel comes.

value

A tuple of integers, one for each component. Any iterable of the correct length can be assigned to value (it will be automagically converted to a tuple), but you can't assign to it an integer, even if the mode has only a single component: use, e.g., pixel.l = 123 instead.

r, g, b, a, l, c, m, y, k

The integer values of each component; only those applicable for the current mode (in mode.component_names) will be available.
__iter__() -> iterator[int]
__len__() -> int
__getitem__(integer | slice) -> int | tuple[int]
__setitem__(integer | slice, integer | iterable[integer]) -> None
These four methods emulate a fixed length list of integers, one for each pixel component.

ImageSize Class

ImageSize is a named tuple, a class identical to tuple except that:

  • its constructor only accepts two integers, width and height; they are converted in the constructor using their __index__() methods, so all the ImageSize objects are guaranteed to contain only int (or possibly long, in Python 2.x) instances;
  • it has a width and a height property that are equivalent to the first and the second number in the tuple, respectively;
  • the string returned by its __repr__ method is 'imageop.ImageSize(width=%d, height=%d)' % (width, height).

ImageSize is not usually instantiated by end-users, but can be used when creating a new class that implements the image protocol, since the size attribute must be an ImageSize instance.
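A sketch of how such a class could be implemented (assuming operator.index() for the __index__() conversion; the real class would live in the imageop module):

```python
import operator

# Sketch of the ImageSize class described above: a tuple subclass whose
# constructor coerces its two arguments through __index__() and which
# exposes width/height properties.
class ImageSize(tuple):
    def __new__(cls, width, height):
        # operator.index() calls __index__(), so the tuple is
        # guaranteed to contain only real integers
        return tuple.__new__(cls, (operator.index(width),
                                   operator.index(height)))

    @property
    def width(self):
        return self[0]

    @property
    def height(self):
        return self[1]

    def __repr__(self):
        return 'imageop.ImageSize(width=%d, height=%d)' % self
```

Because it subclasses tuple, an ImageSize compares equal to the plain tuple (width, height) and unpacks the same way.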

C API

The available image modes are visible at the C level as PyImage_* constants of type PyObject * (e.g.: PyImage_RGB is imageop.RGB).

The following functions offer a C-friendly interface to mode and image objects (all the functions return NULL or -1 on failure):

int PyImageMode_Check(PyObject *obj)

Returns true if the object obj is a valid image mode.
int PyImageMode_GetComponents(PyObject *mode)
PyObject* PyImageMode_GetComponentNames(PyObject *mode)
int PyImageMode_GetBitsPerComponent(PyObject *mode)
int PyImageMode_GetBytesPerPixel(PyObject *mode)
int PyImageMode_GetPlanar(PyObject *mode)
PyObject* PyImageMode_GetSubsampling(PyObject *mode)
int PyImageMode_GetXDivisor(PyObject *mode)
int PyImageMode_GetYDivisor(PyObject *mode)
Py_ssize_t PyImageMode_GetLength(PyObject *mode, Py_ssize_t width, Py_ssize_t height)
These functions are equivalent to their corresponding Python attributes or methods.

int PyImage_Check(PyObject *obj)

Returns true if the object obj is an Image object or an instance of a subtype of the Image type; see also PyObject_CheckImage below.

int PyImage_CheckExact(PyObject *obj)

Returns true if the object obj is an Image object, but not an instance of a subtype of the Image type.
PyObject* PyImage_New(PyObject *mode, Py_ssize_t width, Py_ssize_t height)
Returns a new Image instance, initialized to transparent black (see Image.__init__ above for the details).
PyObject* PyImage_FromImage(PyObject *image, PyObject *mode, Py_ssize_t width, Py_ssize_t height)
Returns a new Image instance, initialized with the contents of the image object rescaled and converted to the specified mode, if necessary.
PyObject* PyImage_FromBuffer(PyObject *buffer, PyObject *mode, Py_ssize_t width, Py_ssize_t height)
Returns a new Image instance, initialized with the contents of the buffer object.

int PyObject_CheckImage(PyObject *obj)

Returns true if the object obj implements a sufficient subset of the image protocol to be accepted by the functions defined below, even if its class is not a subclass of ImageMixin and/or Image. Currently it simply checks for the existence and correctness of the attributes mode, size and buffer.
PyObject* PyImage_GetMode(PyObject *image)
Py_ssize_t PyImage_GetWidth(PyObject *image)
Py_ssize_t PyImage_GetHeight(PyObject *image)
int PyImage_Clip(PyObject *image)
PyObject* PyImage_Split(PyObject *image)
PyObject* PyImage_GetBuffer(PyObject *image)
int PyImage_AsBuffer(PyObject *image, const void **buffer, Py_ssize_t *buffer_len)
These functions are equivalent to their corresponding Python attributes or methods; the image memory can be accessed only with the GIL and a reference to the image or its buffer held, and extra care should be taken for modes with more than 8 bits per component: the data is stored in native byte order and it may not be aligned on 2 or 4 byte boundaries.

Examples

A few examples of common operations with the new Image class and protocol:

# create a new black RGB image of 6x9 pixels
rgb_image = imageop.Image(imageop.RGB, (6, 9))

# same as above, but initialize the image to bright red
rgb_image = imageop.Image(imageop.RGB, (6, 9), color=(255, 0, 0))

# convert the image to YCbCr
yuv_image = imageop.Image(imageop.JPEG_YV12, source=rgb_image)

# read the value of a pixel and split it into three ints
r, g, b = rgb_image[x, y]

# modify the magenta component of a pixel in a CMYK image
cmyk_image[x, y].m = 13

# modify the Y (luma) component of a pixel in a *YV12 image and
# its corresponding subsampled Cr (red chroma)
yuv_image.y[x, y] = 42
yuv_image.cr[x // 2, y // 2] = 54

# iterate over an image
for line in rgb_image:
    for pixel in line:
        # swap red and blue, and set green to 0
        pixel.value = pixel.b, 0, pixel.r

# find the maximum value of the red component in the image
max_red = max(pixel.r for pixel in rgb_image.pixels())

# count the number of colors in the image
num_of_colors = len(set(tuple(pixel) for pixel in image.pixels()))

# copy a block of 4x2 pixels near the upper right corner of an
# image and paste it into the lower left corner of the same image
image[:4, -2:] = image[-6:-2, 1:3]

# create a copy of the image, except that the new image can have a
# different (usually empty) info dict
new_image = image[:]

# create a mirrored copy of the image, with the left and right
# sides flipped
flipped_image = image[::-1, :]

# downsample an image to half its original size using a fast, low
# quality operation and a slower, high quality one:
low_quality_image = image[::2, ::2]
new_size = image.size.width // 2, image.size.height // 2
high_quality_image = imageop.Image(size=new_size, source=image)

# direct buffer access
rgb_image[0, 0] = r, g, b
assert tuple(rgb_image.buffer[:3]) == (r, g, b)

Backwards Compatibility

There are three areas touched by this PEP where backwards compatibility should be considered:

  • Python 2.6: new classes and objects are added to the imageop module without touching the existing module contents; new methods and attributes will be added to Tkinter.PhotoImage and its __getitem__ and __setitem__ methods will be modified to accept integers, tuples and slices (currently they only accept strings). All the changes provide a superset of the existing functionality, so no major compatibility issues are expected.
  • Python 3.0: the legacy contents of the imageop module will be deleted, according to PEP 3108; everything defined in this proposal will work like in Python 2.x with the exception of the usual 2.x/3.0 differences (e.g. support for long integers and for interpreting str instances as sequences of bytes will be dropped).
  • external libraries: the names and the semantics of the standard image methods and attributes are carefully chosen to allow some external libraries that manipulate images (including at least PIL, wxPython and pygame) to implement the new protocol in their image classes without breaking compatibility with existing code. The only blatant conflicts between the image protocol and NumPy arrays are the value of the size attribute and the coordinates order in the image[x, y] expression.

Reference Implementation

If this PEP is accepted, the author will provide a reference implementation of the new classes in pure Python (that can run in CPython, PyPy, Jython and IronPython) and a second one optimized for speed in Python and C, suitable for inclusion in the CPython standard library. The author will also submit the required Tkinter patches. All the code will be available in a version for Python 2.x and a version for Python 3.0 (it is expected that the two versions will be very similar and that the Python 3.0 one will probably be generated almost completely automatically).

Acknowledgments

The implementation of this PEP, if accepted, is sponsored by Google through the Google Summer of Code program.

pep-0369 Post import hooks

PEP:369
Title:Post import hooks
Version:$Revision$
Last-Modified:$Date$
Author:Christian Heimes <christian at python.org>
Status:Withdrawn
Type:Standards Track
Content-Type:text/x-rst
Created:02-Jan-2008
Python-Version:2.6, 3.0
Post-History:02-Dec-2012

Withdrawal Notice

This PEP has been withdrawn by its author, as much of the detailed design is no longer valid following the migration to importlib in Python 3.3.

Abstract

This PEP proposes enhancements for the import machinery to add post import hooks. It is intended primarily to support the wider use of abstract base classes that is expected in Python 3.0.

The PEP originally started as a combined PEP for lazy imports and post import hooks. After some discussion on the python-dev mailing list, the PEP was split into two separate PEPs. [1]

Rationale

Python has no API to hook into the import machinery and execute code after a module is successfully loaded. The import hooks of PEP 302 are about finding and loading modules; they were not designed to act as post import hooks.

Use cases

A use case for a post import hook is mentioned in Nick Coghlan's initial posting [2] about callbacks on module import. It was found during the development of Python 3.0 and its ABCs. We wanted to register classes like decimal.Decimal with an ABC without importing the module on every interpreter startup. Nick came up with this example:

@imp.when_imported('decimal')
def register(decimal):
    Inexact.register(decimal.Decimal)

The function register is registered as callback for the module named 'decimal'. When decimal is imported the function is called with the module object as argument.

While this particular example isn't necessary in practice, (as decimal.Decimal will inherit from the appropriate abstract Number base class in 2.6 and 3.0), it still illustrates the principle.

Existing implementations

PJE's peak.util.imports [3] implements post load hooks. My implementation shares a lot with his and is partly based on his ideas.

Post import hook implementation

Post import hooks are called after a module has been loaded. The hooks are callables which take one argument, the module instance. They are registered by the dotted name of the module, e.g. 'os' or 'os.path'.

The callables are stored in the dict sys.post_import_hooks, which maps names (as strings) to a list of callables or None.

States

No hook was registered

sys.post_import_hooks contains no entry for the module

A hook is registered and the module is not loaded yet

The import hook registry contains an entry sys.post_import_hooks["name"] = [hook1]

A module is successfully loaded

The import machinery checks if sys.post_import_hooks contains post import hooks for the newly loaded module. If hooks are found then the hooks are called in the order they were registered, with the module instance as first argument. The processing of the hooks is stopped when a hook raises an exception. At the end the entry for the module name is set to None, even when an error has occurred.

Additionally, the new __notified__ slot of the module object is set to True in order to prevent infinite recursion when the notification method is called inside a hook. For objects that don't subclass PyModule, a new attribute is added instead.

A module can't be loaded

The import hooks are neither called nor removed from the registry. It may be possible to load the module later.

Invariants

The import hook system guarantees certain invariants. XXX

Sample Python implementation

A Python implementation may look like:

  def notify(name):
      try:
          module = sys.modules[name]
      except KeyError:
          raise ImportError("Module %s has not been imported" % (name,))
      if module.__notified__:
          return
      try:
          module.__notified__ = True
          if '.' in name:
              notify(name[:name.rfind('.')])
          for callback in post_import_hooks[name]:
              callback(module)
      finally:
          post_import_hooks[name] = None

XXX

C API

New C API functions

PyObject* PyImport_GetPostImportHooks(void)
Returns the dict sys.post_import_hooks or NULL
PyObject* PyImport_NotifyLoadedByModule(PyObject *module)
Notify the post import system that a module was loaded. Returns a borrowed reference to the same module object, or NULL if an error has occurred. The function calls only the hooks for the module itself and not its parents. The function must be called with the import lock acquired.
PyObject* PyImport_NotifyLoadedByName(const char *name)
PyImport_NotifyLoadedByName("a.b.c") calls PyImport_NotifyLoadedByModule() for a, a.b and a.b.c, in that order. The modules are retrieved from sys.modules. If a module can't be retrieved, an exception is raised; otherwise a borrowed reference to the module is returned. The hook calls always start with the top-level parent module. The caller of PyImport_NotifyLoadedByName() must hold the import lock!
PyObject* PyImport_RegisterPostImportHook(PyObject *callable, PyObject *mod_name)
Register a new hook callable for the module mod_name
int PyModule_GetNotified(PyObject *module)
Returns the status of the __notified__ slot / attribute.
int PyModule_SetNotified(PyObject *module, int status)
Set the status of the __notified__ slot / attribute.

The PyImport_NotifyLoadedByModule() method is called inside import_submodule(). The import system makes sure that the import lock is acquired and the hooks for the parent modules are already called.

Python API

The import hook registry and two new API methods are exposed through the sys and imp modules.

sys.post_import_hooks

The dict contains the post import hooks:

{"name" : [hook1, hook2], ...}
imp.register_post_import_hook(hook: "callable", name: str)
Register a new hook hook for the module name
imp.notify_module_loaded(module: "module instance") -> module
Notify the system that a module has been loaded. The method is provided for compatibility with existing lazy / deferred import extensions.
module.__notified__
A slot of a module instance. XXX

The imp module also contains the when_imported function decorator, which is equivalent to:

def when_imported(name):
    def register(hook):
        register_post_import_hook(hook, name)
    return register
imp.when_imported(name) -> decorator function
For use as:

@when_imported(name)
def hook(module):
    pass
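The registry and decorator semantics described above can be sketched as a self-contained toy (the real proposal hooks the import machinery itself; here the module must be handed to notify_module_loaded() explicitly, and all names are local rather than living in sys and imp):

```python
import types

# Toy registry: maps dotted module names to a list of callables, or
# None once the module's hooks have been run (mirroring the States
# section of the PEP).
post_import_hooks = {}

def register_post_import_hook(hook, name):
    post_import_hooks.setdefault(name, []).append(hook)

def notify_module_loaded(module):
    # call the hooks in registration order, then mark the name as done
    for hook in (post_import_hooks.pop(module.__name__, None) or []):
        hook(module)
    post_import_hooks[module.__name__] = None
    return module

def when_imported(name):
    def register(hook):
        register_post_import_hook(hook, name)
        return hook
    return register

# Usage, mirroring the decimal/ABC example with a synthetic module:
seen = []

@when_imported('fakemod')
def record(module):
    seen.append(module.__name__)

notify_module_loaded(types.ModuleType('fakemod'))
```

After notification the registry entry is None, so re-notifying the same module calls no hooks again.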

Open issues

The when_imported decorator hasn't been written.

The code contains several XXX comments. They are mostly about error handling in edge cases.

Backwards Compatibility

The new features and API don't conflict with the old import system of Python and don't cause any backward compatibility issues for most software. However, systems like PEAK and Zope which implement their own lazy import magic need to follow some rules.

The post import hooks are carefully designed to cooperate with existing deferred and lazy import systems. The PEP author suggests replacing custom on-load hooks with the new hook API. Alternative lazy or deferred imports will still work, but their implementations must call the imp.notify_module_loaded function.

Reference Implementation

A reference implementation is already written and is available in the py3k-importhook branch. [4] It still requires some cleanups, documentation updates and additional unit tests.

Acknowledgments

Nick Coghlan, for proofreading and the initial discussion; Phillip J. Eby, for his implementation in PEAK and help with my own implementation.

pep-0370 Per user site-packages directory

PEP:370
Title:Per user site-packages directory
Version:$Revision$
Last-Modified:$Date$
Author:Christian Heimes <christian at python.org>
Status:Final
Type:Standards Track
Content-Type:text/x-rst
Created:11-Jan-2008
Python-Version:2.6, 3.0
Post-History:

Abstract

This PEP proposes a new per-user site-packages directory to allow users to install Python packages locally in their home directory.

Rationale

Current Python versions don't have a unified way to install packages into the home directory of a user (except for Mac Framework builds). Users are either forced to ask the system administrator to install or update a package for them or to use one of the many workarounds like Virtual Python [1], Working Env [2] or Virtual Env [3].

It's not the goal of the PEP to replace the tools or to implement isolated installations of Python. It only implements the most common use case of an additional site-packages directory for each user.

The feature can't be implemented using the environment variable PYTHONPATH. The env var just inserts a new directory at the beginning of sys.path but doesn't parse the pth files in the directory. A full-blown site-packages path is required for several applications and Python eggs.
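The difference can be seen with the stdlib function site.addsitedir(), which does process pth files; a sketch using a temporary directory (the file and directory names are arbitrary):

```python
import os
import site
import sys
import tempfile

# A directory containing a pth file: inserting it via PYTHONPATH would
# ignore demo.pth, while site.addsitedir() parses it and adds the
# directories it names.
d = tempfile.mkdtemp()
extra = os.path.join(d, 'extra')
os.mkdir(extra)
with open(os.path.join(d, 'demo.pth'), 'w') as f:
    f.write('extra\n')  # path relative to the pth file's directory

site.addsitedir(d)  # adds d itself and processes demo.pth
```

After the call both d and the directory named by the pth file are on sys.path, which plain PYTHONPATH insertion cannot achieve.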

Specification

site directory (site-packages)

A directory in sys.path. In contrast to ordinary directories, the pth files in it are processed, too.

user site directory

A site directory inside the user's home directory. A user site directory is specific to a Python version. The path contains the version number (major and minor only).

Unix (including Mac OS X)
~/.local/lib/python2.6/site-packages
Windows
%APPDATA%/Python/Python26/site-packages

user data directory

Usually the parent directory of the user site directory. It's meant for Python version-specific data like config files, docs, images and translations.

Unix (including Mac)
~/.local/lib/python2.6
Windows
%APPDATA%/Python/Python26

user base directory

It's located inside the user's home directory. The user site and user data directories are inside the base directory. On some systems the directory may be shared with 3rd party apps.

Unix (including Mac)
~/.local
Windows
%APPDATA%/Python

user script directory

A directory for binaries and scripts. [10] It's shared across Python versions and is the destination directory for scripts.

Unix (including Mac)
~/.local/bin
Windows
%APPDATA%/Python/Scripts
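The directory tables above can be summarized as a small helper (illustrative only; the proposed logic actually lives in site.py and distutils, and user_paths is a hypothetical name used for this sketch):

```python
import os
import sys

def user_paths(version=(2, 6)):
    """Return (base, data, site, scripts) dirs per the PEP's layout."""
    if sys.platform == 'win32':
        base = os.path.join(os.environ.get('APPDATA', ''), 'Python')
        data = os.path.join(base, 'Python%d%d' % version)
        scripts = os.path.join(base, 'Scripts')
    else:
        # Unix, including Mac OS X
        base = os.path.expanduser('~/.local')
        data = os.path.join(base, 'lib', 'python%d.%d' % version)
        scripts = os.path.join(base, 'bin')
    # the user site directory always sits inside the user data directory
    site_dir = os.path.join(data, 'site-packages')
    return base, data, site_dir, scripts
```

Note how only the user data and user site directories carry the Python version, while the base and script directories are shared across versions.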

Windows Notes

On Windows the Application Data directory (aka APPDATA) was chosen because it is the designated place for application data. Microsoft recommends that software doesn't write to USERPROFILE [5], and My Documents is not suited for application data either. [8] The code doesn't query the Win32 API; instead it uses the environment variable %APPDATA%.

The application data directory is part of the roaming profile. In networks with domain logins, the application data may be copied from and to a central server. This can slow down log-in and log-off. Users can keep the data on the server by e.g. setting PYTHONUSERBASE to the value "%HOMEDRIVE%%HOMEPATH%\Application Data". Users should consult their local administrator for more information. [13]

Unix Notes

On Unix ~/.local was chosen in favor of ~/.python because the directory is already used by several other programs, in analogy to /usr/local. [7] [11]

Mac OS X Notes

On Mac OS X Python uses ~/.local directory as well. [12] Framework builds of Python include ~/Library/Python/2.6/site-packages as an additional search path.

Implementation

The site module gets a new method adduserpackage() which adds the appropriate directory to the search path. The directory is not added if it doesn't exist when Python is started. However, the locations of the user site directory and user base directory are stored in an internal variable for distutils.

The user site directory is added before the system site directories but after Python's search paths and PYTHONPATH. This setup allows the user to install a different version of a package than the system administrator, but it prevents the user from accidentally overwriting a stdlib module. Stdlib modules can still be overwritten with PYTHONPATH.

For security reasons the user site directory is not added to sys.path when the effective user id or group id is not equal to the process uid / gid [9]. It's an additional barrier against code injection into suid apps. However Python suid scripts must always use the -E and -s option or users can sneak in their own code.

The user site directory can be suppressed with a new option -s or the environment variable PYTHONNOUSERSITE. The feature can be disabled globally by setting site.ENABLE_USER_SITE to the value False. It must be set by editing site.py. It can't be altered in sitecustomize.py or later.

The path to the user base directory can be overwritten with the environment variable PYTHONUSERBASE. The default location is used when PYTHONUSERBASE is not set or empty.

distutils.command.install (setup.py install) gets a new argument --user to install packages in the user site directory. The required directories are created on demand.

distutils.command.build_ext (setup.py build_ext) gets a new argument --user which adds the include/ and lib/ directories in the user base directory to the search paths for header files and libraries. It also adds the lib/ directory to rpath.

The site module gets two arguments --user-base and --user-site to print the path to the user base or user site directory to the standard output. The feature is intended for scripting, e.g. ./configure --prefix $(python2.5 -m site --user-base)

distutils.sysconfig will get methods to access the private variables of site. (not yet implemented)

The Windows updater needs to be updated, too. It should create a menu item which opens the user site directory in a new explorer window.

Reference Implementation

A reference implementation is available in the bug tracker. [4]

pep-0371 Addition of the multiprocessing package to the standard library

PEP: 371
Title: Addition of the multiprocessing package to the standard library
Version: $Revision$
Last-Modified: $Date$
Author: Jesse Noller <jnoller at gmail.com>, Richard Oudkerk <r.m.oudkerk at googlemail.com>
Status: Final
Type: Standards Track
Content-Type: text/plain
Created: 06-May-2008
Python-Version: 2.6 / 3.0
Post-History: 

Abstract

    This PEP proposes the inclusion of the pyProcessing [1] package
    into the Python standard library, renamed to "multiprocessing".

    The processing package mimics the standard library threading
    module functionality to provide a process-based approach to 
    threaded programming allowing end-users to dispatch multiple 
    tasks that effectively side-step the global interpreter lock.

    The package also provides server and client functionality
    (processing.Manager) to provide remote sharing and management of
    objects and tasks so that applications may not only leverage
    multiple cores on the local machine, but also distribute objects
    and tasks across a cluster of networked machines.

    While the distributed capabilities of the package are beneficial,
    the primary focus of this PEP is the core threading-like API and
    capabilities of the package.

Rationale

    The current CPython interpreter implements the Global Interpreter
    Lock (GIL) and barring work in Python 3000 or other versions
    currently planned [2], the GIL will remain as-is within the
    CPython interpreter for the foreseeable future.  While the GIL
    itself enables clean and easy to maintain C code for the
    interpreter and extensions base, it is frequently an issue for
    those Python programmers who are leveraging multi-core machines.

    The GIL itself prevents more than a single thread from running
    within the interpreter at any given point in time, effectively
    removing Python's ability to take advantage of multi-processor
    systems.

    The pyprocessing package offers a method to side-step the GIL
    allowing applications within CPython to take advantage of
    multi-core architectures without asking users to completely change
    their programming paradigm (i.e.: dropping threaded programming
    for another "concurrent" approach - Twisted, Actors, etc).

    The Processing package offers CPython a "known API" which mirrors
    albeit in a PEP 8 compliant manner, that of the threading API, 
    with known semantics and easy scalability.

    In the future, the package might not be as relevant should the
    CPython interpreter enable "true" threading, however for some
    applications, forking an OS process may sometimes be more
    desirable than using lightweight threads, especially on those
    platforms where process creation is fast and optimized.

    For example, a simple threaded application:

        from threading import Thread as worker

        def afunc(number):
            print number * 3

        t = worker(target=afunc, args=(4,))
        t.start()
        t.join()

    The pyprocessing package mirrored the API so well that with a
    simple change of the import to:

        from processing import process as worker

    The code would now execute through the processing.process class.
    Obviously, with the renaming of the API to PEP 8 compliance there
    would be additional, albeit minor, renaming required within user
    applications.

    This type of compatibility means that, with a minor (in most cases)
    change in code, users' applications will be able to leverage all
    cores and processors on a given machine for parallel execution.
    In many cases the pyprocessing package is even faster than the
    normal threading approach for I/O bound programs.  This, of
    course, takes into account that the pyprocessing package is
    implemented in optimized C code, while the threading module is not.

The "Distributed" Problem

    In the discussion on Python-Dev about the inclusion of this
    package [3] there was confusion about the intentions of this PEP,
    with some conflating it with an attempt to solve the "Distributed"
    problem - frequently comparing the functionality of this package
    with other solutions like MPI-based communication [4], CORBA, or
    other distributed object approaches [5].

    The "distributed" problem is large and varied.  Each programmer
    working within this domain has either very strong opinions about
    their favorite module/method or a highly customized problem for
    which no existing solution works.

    The acceptance of this package neither precludes nor discourages
    programmers working on the "distributed" problem from examining
    other solutions for their problem domain.  The intent of including
    this package is to provide entry-level capabilities for local
    concurrency and the basic support to spread that concurrency
    across a network of machines.  Although the two are not tightly
    coupled, the pyprocessing package could in fact be used in
    conjunction with any of the other solutions, including MPI.

    If necessary, it is possible to completely decouple the local
    concurrency abilities of the package from its
    network-capable/shared aspects.  Without serious concerns or
    cause, however, the author of this PEP does not recommend that
    approach.

Performance Comparison

    As we all know - there are "lies, damned lies, and benchmarks".
    These speed comparisons, while aimed at showcasing the performance
    of the pyprocessing package, are by no means comprehensive or
    applicable to all possible use cases or environments, especially
    on platforms where process forking is slow.

    All benchmarks were run using the following:
        * 4 Core Intel Xeon CPU @ 3.00GHz
        * 16 GB of RAM
        * Python 2.5.2 compiled on Gentoo Linux (kernel 2.6.18.6)
        * pyProcessing 0.52

    All of the code for this can be downloaded from:
        http://jessenoller.com/code/bench-src.tgz

    The basic method of execution for these benchmarks is in the
    run_benchmarks.py script, which is simply a wrapper to execute a
    target function through a single threaded (linear), multi-threaded
    (via threading), and multi-process (via pyprocessing) function for
    a static number of iterations with increasing numbers of execution
    loops and/or threads.

    The run_benchmarks.py script executes each function 100 times,
    picking the best run of that 100 iterations via the timeit module.
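
    The measurement strategy can be sketched as follows (a simplified
    stand-in for the run_benchmarks.py script, not the actual
    benchmark code):

```python
import timeit

def best_of(func, runs=100):
    # Time `func` once per run, `runs` times, and keep the fastest
    # result - the same best-of-N strategy the benchmark script uses
    # via the timeit module.
    return min(timeit.Timer(func).repeat(repeat=runs, number=1))

def empty_func():
    pass  # an empty body measures pure worker-dispatch overhead

print("best of 100: %.6f seconds" % best_of(empty_func))
```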

    First, to identify the overhead of spawning the workers, we
    execute a function which is simply a pass statement (empty):

        cmd: python run_benchmarks.py empty_func.py
        Importing empty_func
        Starting tests ...
        non_threaded (1 iters)  0.000001 seconds
        threaded (1 threads)    0.000796 seconds
        processes (1 procs)     0.000714 seconds

        non_threaded (2 iters)  0.000002 seconds
        threaded (2 threads)    0.001963 seconds
        processes (2 procs)     0.001466 seconds

        non_threaded (4 iters)  0.000002 seconds
        threaded (4 threads)    0.003986 seconds
        processes (4 procs)     0.002701 seconds

        non_threaded (8 iters)  0.000003 seconds
        threaded (8 threads)    0.007990 seconds
        processes (8 procs)     0.005512 seconds

    As you can see, spawning processes via the pyprocessing package
    is faster than building and then executing the threaded version
    of the code.

    The second test calculates 50000 Fibonacci numbers inside each
    worker (isolated, sharing nothing):

        cmd: python run_benchmarks.py fibonacci.py
        Importing fibonacci
        Starting tests ...
        non_threaded (1 iters)  0.195548 seconds
        threaded (1 threads)    0.197909 seconds
        processes (1 procs)     0.201175 seconds

        non_threaded (2 iters)  0.397540 seconds
        threaded (2 threads)    0.397637 seconds
        processes (2 procs)     0.204265 seconds

        non_threaded (4 iters)  0.795333 seconds
        threaded (4 threads)    0.797262 seconds
        processes (4 procs)     0.206990 seconds

        non_threaded (8 iters)  1.591680 seconds
        threaded (8 threads)    1.596824 seconds
        processes (8 procs)     0.417899 seconds

    The third test calculates the sum of all primes below 100000,
    again sharing nothing.

        cmd: run_benchmarks.py crunch_primes.py
        Importing crunch_primes
        Starting tests ...
        non_threaded (1 iters)  0.495157 seconds
        threaded (1 threads)    0.522320 seconds
        processes (1 procs)     0.523757 seconds

        non_threaded (2 iters)  1.052048 seconds
        threaded (2 threads)    1.154726 seconds
        processes (2 procs)     0.524603 seconds

        non_threaded (4 iters)  2.104733 seconds
        threaded (4 threads)    2.455215 seconds
        processes (4 procs)     0.530688 seconds

        non_threaded (8 iters)  4.217455 seconds
        threaded (8 threads)    5.109192 seconds
        processes (8 procs)     1.077939 seconds

    Tests two and three focus on pure numeric crunching to showcase
    how the current threading implementation hinders non-I/O-bound
    applications.  Obviously, these tests could be improved to use a
    queue for coordinating results and chunks of work, but that is
    not required to show the performance of the package and the core
    processing.process module.

    The next test is an I/O bound test.  This is normally where we see
    a steep improvement in the threading module approach versus a
    single-threaded approach.  In this case, each worker is opening a
    descriptor to lorem.txt, randomly seeking within it and writing
    lines to /dev/null:

        cmd: python run_benchmarks.py file_io.py
        Importing file_io
        Starting tests ...
        non_threaded (1 iters)  0.057750 seconds
        threaded (1 threads)    0.089992 seconds
        processes (1 procs)     0.090817 seconds

        non_threaded (2 iters)  0.180256 seconds
        threaded (2 threads)    0.329961 seconds
        processes (2 procs)     0.096683 seconds

        non_threaded (4 iters)  0.370841 seconds
        threaded (4 threads)    1.103678 seconds
        processes (4 procs)     0.101535 seconds

        non_threaded (8 iters)  0.749571 seconds
        threaded (8 threads)    2.437204 seconds
        processes (8 procs)     0.203438 seconds

    As you can see, pyprocessing is still faster on this I/O operation
    than using multiple threads; in fact, using multiple threads is
    slower than single-threaded execution itself.

    Finally, we will run a socket-based test to show network I/O
    performance.  This function fetches, 100 times, a URL from a
    server on the LAN that serves a simple Tomcat error page.  The
    network is otherwise quiet, and the connection is 10 gigabit:

        cmd: python run_benchmarks.py url_get.py
        Importing url_get
        Starting tests ...
        non_threaded (1 iters)  0.124774 seconds
        threaded (1 threads)    0.120478 seconds
        processes (1 procs)     0.121404 seconds

        non_threaded (2 iters)  0.239574 seconds
        threaded (2 threads)    0.146138 seconds
        processes (2 procs)     0.138366 seconds

        non_threaded (4 iters)  0.479159 seconds
        threaded (4 threads)    0.200985 seconds
        processes (4 procs)     0.188847 seconds

        non_threaded (8 iters)  0.960621 seconds
        threaded (8 threads)    0.659298 seconds
        processes (8 procs)     0.298625 seconds

    We finally see threaded performance surpass that of
    single-threaded execution, but the pyprocessing package is still
    faster as the number of workers increases.  If you stay with one
    or two threads/workers, then the timings for threads and
    pyprocessing are fairly close.

    One item of note, however, is that the pyprocessing package's
    Queue implementation carries an implicit overhead due to object
    serialization.
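
    The cost is easy to see in isolation: every object placed on a
    pyprocessing queue must be serialized on put() and deserialized
    on get(), whereas an in-process Queue merely hands over a
    reference.  A rough sketch of the per-item cost, approximated
    here with the pickle module directly:

```python
import pickle
import timeit

item = list(range(1000))

def round_trip():
    # Approximates what a cross-process queue does for each item:
    # serialize on put(), deserialize on get().
    pickle.loads(pickle.dumps(item))

cost = min(timeit.repeat(round_trip, repeat=5, number=1000))
print("pickle round-trip cost for 1000 items: %.4f seconds" % cost)
```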
    
    Alec Thomas provided a short example based on the
    run_benchmarks.py script to demonstrate this overhead versus the
    default Queue implementation:

        cmd: run_bench_queue.py 
        non_threaded (1 iters)  0.010546 seconds
        threaded (1 threads)    0.015164 seconds
        processes (1 procs)     0.066167 seconds

        non_threaded (2 iters)  0.020768 seconds
        threaded (2 threads)    0.041635 seconds
        processes (2 procs)     0.084270 seconds

        non_threaded (4 iters)  0.041718 seconds
        threaded (4 threads)    0.086394 seconds
        processes (4 procs)     0.144176 seconds

        non_threaded (8 iters)  0.083488 seconds
        threaded (8 threads)    0.184254 seconds
        processes (8 procs)     0.302999 seconds

    Additional benchmarks can be found in the pyprocessing package's
    source distribution's examples/ directory.  The examples will be
    included in the package's documentation.

Maintenance

    Richard M. Oudkerk, the author of the pyprocessing package, has
    agreed to maintain the package within Python SVN.  Jesse Noller
    has volunteered to help maintain, document, and test the package.

API Naming

    While the package's API is designed to closely mimic that of the
    threading and Queue modules as of Python 2.x, those modules are
    not PEP 8 compliant.  It has been decided that instead of adding
    the package "as is", thereby perpetuating the non-PEP-8-compliant
    naming, we will rename all APIs, classes, etc. to be fully PEP 8
    compliant.

    This change does affect the ease of drop-in replacement for those
    using the threading module, but the authors view that as an
    acceptable side effect, especially given that the threading
    module's own API will change.

    Issue 3042 in the tracker proposes that for Python 2.6 there will
    be two APIs for the threading module - the current one, and the
    PEP 8 compliant one.  Warnings about the upcoming removal of the
    original Java-style API will be issued when -3 is invoked.

    In Python 3000, the threading API will become PEP 8 compliant, which
    means that the multiprocessing module and the threading module will
    again have matching APIs.

Timing/Schedule

    Some concerns have been raised about the timing/lateness of this
    PEP for the 2.6 and 3.0 releases this year; however, both the
    authors and others feel that the functionality this package
    offers outweighs the risk of inclusion.

    However, taking into account the desire not to destabilize
    Python-core, some refactoring of pyprocessing's code "into"
    Python-core can be withheld until the next 2.x/3.x releases.  This
    means that the actual risk to Python-core is minimal, and largely
    constrained to the actual package itself.

Open Issues

    * Confirm no "default" remote connection capabilities; if needed,
      enable the remote security mechanisms by default for those
      classes which offer remote capabilities.

    * Some of the API (Queue methods qsize(), task_done() and join())
      either need to be added, or the reason for their exclusion needs
      to be identified and documented clearly.

Closed Issues

    * The PyGILState bug patch submitted in issue 1683 by roudkerk
      must be applied for the package unit tests to work.

    * Existing documentation has to be moved to ReST formatting.

    * Reliance on ctypes: The pyprocessing package's reliance on
      ctypes prevents the package from functioning on platforms where
      ctypes is not supported.  This is not a restriction of this
      package, but rather of ctypes.

    * DONE: Rename top-level package from "pyprocessing" to
      "multiprocessing".

    * DONE: Also note that the default behavior of process spawning
      does not make it compatible with use within IDLE as-is; this
      will be examined as a bug fix or "setExecutable" enhancement.

    * DONE: Add in a "multiprocessing.setExecutable()" method to
      override the package's default behavior of spawning processes
      using the current executable name rather than the Python
      interpreter.  Note that Mark Hammond has suggested a
      factory-style interface for this [7].

References

    [1] PyProcessing home page
        http://pyprocessing.berlios.de/

    [2] See Adam Olsen's "safe threading" project
        http://code.google.com/p/python-safethread/

    [3] See: Addition of "pyprocessing" module to standard lib.
        http://mail.python.org/pipermail/python-dev/2008-May/079417.html

    [4] http://mpi4py.scipy.org/

    [5] See "Cluster Computing"
        http://wiki.python.org/moin/ParallelProcessing

    [6] The original run_benchmark.py code was published in Python
        Magazine in December 2007: "Python Threads and the Global
        Interpreter Lock" by Jesse Noller.  It has been modified for
        this PEP.

    [7] http://groups.google.com/group/python-dev2/msg/54cf06d15cbcbc34

    [8] Additional Python-Dev discussion
        http://mail.python.org/pipermail/python-dev/2008-June/080011.html

Copyright

    This document has been placed in the public domain.



pep-0372 Adding an ordered dictionary to collections

PEP:372
Title:Adding an ordered dictionary to collections
Version:$Revision$
Last-Modified:$Date$
Author:Armin Ronacher <armin.ronacher at active-4.com>, Raymond Hettinger <python at rcn.com>
Status:Final
Type:Standards Track
Content-Type:text/x-rst
Created:15-Jun-2008
Python-Version:2.7, 3.1
Post-History:

Abstract

This PEP proposes an ordered dictionary as a new data structure for the collections module, called "OrderedDict" in this PEP. The proposed API incorporates the experiences gained from working with similar implementations that exist in various real-world applications and other programming languages.

Patch

A working Py3.1 patch including tests and documentation is at:

OrderedDict patch

The check-in was in revisions: 70101 and 70102

Rationale

In current Python versions, the widely used built-in dict type does not specify an order for the key/value pairs stored. This makes it hard to use dictionaries as data storage for some specific use cases.

Some dynamic programming languages like PHP and Ruby 1.9 guarantee a certain order on iteration. In those languages, and in existing Python ordered-dict implementations, the ordering of items is defined by the time of insertion of the key. New keys are appended at the end, but keys that are overwritten are not moved to the end.

The following example shows the behavior for simple assignments:

>>> d = OrderedDict()
>>> d['parrot'] = 'dead'
>>> d['penguin'] = 'exploded'
>>> d.items()
[('parrot', 'dead'), ('penguin', 'exploded')]

That the ordering is preserved makes an OrderedDict useful for a couple of situations:

  • XML/HTML processing libraries currently drop the ordering of attributes, use a list instead of a dict which makes filtering cumbersome, or implement their own ordered dictionary. This affects ElementTree, html5lib, Genshi and many more libraries.

  • There are many ordered dict implementations in various libraries and applications, most of them subtly incompatible with each other. Furthermore, subclassing dict is a non-trivial task and many implementations don't override all the methods properly which can lead to unexpected results.

    Additionally, many ordered dicts are implemented in an inefficient way, making many operations more complex than they have to be.

  • PEP 3115 allows metaclasses to change the mapping object used for the class body. An ordered dict could be used to create ordered member declarations similar to C structs. This could be useful, for example, for future ctypes releases as well as ORMs that define database tables as classes, like the one the Django framework ships. Django currently uses an ugly hack to restore the ordering of members in database models.

  • The RawConfigParser class accepts a dict_type argument that allows an application to set the type of dictionary used internally. The motivation for this addition was expressly to allow users to provide an ordered dictionary. [1]

  • Code ported from other programming languages such as PHP often depends on an ordered dict. Having an implementation of an ordering-preserving dictionary in the standard library could ease the transition and improve the compatibility of different libraries.

Ordered Dict API

The ordered dict API would be mostly compatible with dict and existing ordered dicts. Note: this PEP refers to the 2.7 and 3.0 dictionary API as described in the collections.Mapping abstract base class.

The constructor and update() both accept iterables of tuples as well as mappings like a dict does. Unlike a regular dictionary, the insertion order is preserved.

>>> d = OrderedDict([('a', 'b'), ('c', 'd')])
>>> d.update({'foo': 'bar'})
>>> d
collections.OrderedDict([('a', 'b'), ('c', 'd'), ('foo', 'bar')])

If ordered dicts are updated from regular dicts, the ordering of new keys is of course undefined.

All iteration methods as well as keys(), values() and items() return the values ordered by the time the key was first inserted:

>>> d['spam'] = 'eggs'
>>> d.keys()
['a', 'c', 'foo', 'spam']
>>> d.values()
['b', 'd', 'bar', 'eggs']
>>> d.items()
[('a', 'b'), ('c', 'd'), ('foo', 'bar'), ('spam', 'eggs')]

New methods not available on dict:

OrderedDict.__reversed__()
Supports reverse iteration by key.
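
For example, reverse iteration visits keys in the opposite of insertion order (shown as a small script rather than a doctest):

```python
from collections import OrderedDict

d = OrderedDict([('a', 1), ('b', 2), ('c', 3)])
# __reversed__ yields keys in reverse insertion order.
print(list(reversed(d)))  # ['c', 'b', 'a']
```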

Questions and Answers

What happens if an existing key is reassigned?

The key is not moved but assigned a new value in place. This is consistent with existing implementations.
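
A short illustration of this behavior:

```python
from collections import OrderedDict

d = OrderedDict([('a', 1), ('b', 2)])
d['a'] = 99  # reassigning an existing key keeps its original position
print(list(d.items()))  # [('a', 99), ('b', 2)]
```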

What happens if keys appear multiple times in the list passed to the constructor?

The same as for regular dicts -- the latter item overrides the former. This has the side-effect that the position of the first key is used because only the value is actually overwritten:

>>> OrderedDict([('a', 1), ('b', 2), ('a', 3)])
collections.OrderedDict([('a', 3), ('b', 2)])

This behavior is consistent with existing implementations in Python, the PHP array and the hashmap in Ruby 1.9.

Is the ordered dict a dict subclass? Why?

Yes. Like defaultdict, an ordered dictionary subclasses dict. Being a dict subclass makes some of the methods faster (like __getitem__ and __len__). More importantly, being a dict subclass lets ordered dictionaries be usable with tools like json that insist on having dict inputs by testing isinstance(d, dict).

Do any limitations arise from subclassing dict?

Yes. Since the API for dicts is different in Py2.x and Py3.x, the OrderedDict API must also be different. So, the Py2.7 version will need to override iterkeys, itervalues, and iteritems.

Does OrderedDict.popitem() return a particular key/value pair?

Yes. It pops off the most recently inserted new key and its corresponding value. This corresponds to the usual LIFO behavior exhibited by traditional push/pop pairs. It is semantically equivalent to k=list(od)[-1]; v=od[k]; del od[k]; return (k,v). The actual implementation is more efficient and pops directly from a sorted list of keys.
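
The LIFO behavior in action:

```python
from collections import OrderedDict

d = OrderedDict([('first', 1), ('second', 2), ('third', 3)])
# popitem() removes and returns the most recently inserted pair.
print(d.popitem())  # ('third', 3)
print(list(d))      # ['first', 'second']
```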

Does OrderedDict support indexing, slicing, and whatnot?

As a matter of fact, OrderedDict does not implement the Sequence interface. Rather, it is a MutableMapping that remembers the order of key insertion. The only sequence-like addition is support for reversed.

A further advantage of not allowing indexing is that it leaves open the possibility of a fast C implementation using linked lists.

Does OrderedDict support alternate sort orders such as alphabetical?

No. Those wanting different sort orders really need to be using another technique. The OrderedDict is all about recording insertion order. If any other order is of interest, then another structure (like an in-memory dbm) is likely a better fit.

How well does OrderedDict work with the json module, PyYAML, and ConfigParser?

For json, the good news is that json's encoder respects OrderedDict's iteration order:

>>> items = [('one', 1), ('two', 2), ('three',3), ('four',4), ('five',5)]
>>> json.dumps(OrderedDict(items))
'{"one": 1, "two": 2, "three": 3, "four": 4, "five": 5}'

In Py2.6, the object_hook for json decoders passes-in an already built dictionary so order is lost before the object hook sees it. This problem is being fixed for Python 2.7/3.1 by adding a new hook that preserves order (see http://bugs.python.org/issue5381 ). With the new hook, order can be preserved:

>>> jtext = '{"one": 1, "two": 2, "three": 3, "four": 4, "five": 5}'
>>> json.loads(jtext, object_pairs_hook=OrderedDict)
OrderedDict({'one': 1, 'two': 2, 'three': 3, 'four': 4, 'five': 5})

For PyYAML, a full round-trip is problem free:

>>> ytext = yaml.dump(OrderedDict(items))
>>> print ytext
!!python/object/apply:collections.OrderedDict
- - [one, 1]
  - [two, 2]
  - [three, 3]
  - [four, 4]
  - [five, 5]

>>> yaml.load(ytext)
OrderedDict({'one': 1, 'two': 2, 'three': 3, 'four': 4, 'five': 5})

For the ConfigParser module, round-tripping is also problem free. Custom dicts were added in Py2.6 specifically to support ordered dictionaries:

>>> config = ConfigParser(dict_type=OrderedDict)
>>> config.read('myconfig.ini')
>>> config.remove_option('Log', 'error')
>>> config.write(open('myconfig.ini', 'w'))

How does OrderedDict handle equality testing?

Comparing two ordered dictionaries implies that the test will be order-sensitive, so that list(od1.items()) == list(od2.items()).

When ordered dicts are compared with other Mappings, their order-insensitive comparison is used. This allows ordered dictionaries to be substituted anywhere regular dictionaries are used.
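
Both behaviors in a short example:

```python
from collections import OrderedDict

od1 = OrderedDict([('a', 1), ('b', 2)])
od2 = OrderedDict([('b', 2), ('a', 1)])

print(od1 == od2)        # False: both sides ordered, so order matters
print(od1 == dict(od2))  # True: plain dict, order-insensitive
```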

How will the __repr__ format maintain order during a repr/eval round-trip?

OrderedDict([('a', 1), ('b', 2)])

What are the trade-offs of the possible underlying data structures?

  • Keeping a sorted list of keys is fast for all operations except __delitem__() which becomes an O(n) exercise. This data structure leads to very simple code and little wasted space.
  • Keeping a separate dictionary to record insertion sequence numbers makes the code a little bit more complex. All of the basic operations are O(1), but the constant factor is increased for __setitem__() and __delitem__(), meaning that every use case has to pay for this speedup (since all buildup goes through __setitem__). Also, the first traversal incurs a one-time O(n log n) sorting cost. The storage costs are double those of the sorted-list-of-keys approach.
  • A version written in C could use a linked list. The code would be more complex than the other two approaches but it would conserve space and would keep the same big-oh performance as regular dictionaries. It is the fastest and most space efficient.
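
The first approach (a sorted list of keys) can be sketched in a few lines; this is illustrative only, not the proposed implementation:

```python
class SimpleOrderedDict(dict):
    """Toy ordered dict keeping an insertion-ordered list of keys.

    Illustrates the first trade-off above: all operations stay
    simple, but __delitem__ pays an O(n) list.remove().
    """

    def __init__(self):
        dict.__init__(self)
        self._keys = []

    def __setitem__(self, key, value):
        if key not in self:
            self._keys.append(key)  # new keys go to the end
        dict.__setitem__(self, key, value)

    def __delitem__(self, key):
        dict.__delitem__(self, key)
        self._keys.remove(key)  # the O(n) step

    def __iter__(self):
        return iter(self._keys)

    def items(self):
        return [(k, self[k]) for k in self._keys]

d = SimpleOrderedDict()
d['b'] = 1
d['a'] = 2
del d['b']
print(d.items())  # [('a', 2)]
```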

Reference Implementation

An implementation with tests and documentation is at:

OrderedDict patch

The proposed version has several merits:

  • Strict compliance with the MutableMapping API and no new methods so that the learning curve is near zero. It is simply a dictionary that remembers insertion order.
  • Generally good performance. The big-oh times are the same as regular dictionaries except that key deletion is O(n).

Other implementations of ordered dicts in various Python projects or standalone libraries that inspired the API proposed here are:

Future Directions

With the availability of an ordered dict in the standard library, other libraries may take advantage of that. For example, ElementTree could return odicts in the future that retain the attribute ordering of the source file.

pep-0373 Python 2.7 Release Schedule

PEP:373
Title:Python 2.7 Release Schedule
Version:$Revision$
Last-Modified:$Date$
Author:Benjamin Peterson <benjamin at python.org>
Status:Active
Type:Informational
Content-Type:text/x-rst
Created:3-Nov-2008
Python-Version:2.7

Abstract

This document describes the development and release schedule for Python 2.7. The schedule primarily concerns itself with PEP-sized items. Small features may be added up to and including the first beta release. Bugs may be fixed until the final release.

Update

The End Of Life date (EOL, sunset date) for Python 2.7 has been moved five years into the future, to 2020. This decision was made to clarify the status of Python 2.7 and relieve worries for those users who cannot yet migrate to Python 3. See also PEP 466.

This declaration does not guarantee that bugfix releases will be made on a regular basis, but it should enable volunteers who want to contribute bugfixes for Python 2.7 and it should satisfy vendors who still have to support Python 2 for years to come.

There will be no Python 2.8 (see PEP 404).

Release Manager and Crew

Position Name
2.7 Release Manager Benjamin Peterson
Windows installers Steve Dower
Mac installers Ned Deily

Maintenance releases

Being the last of the 2.x series, 2.7 will have an extended period of maintenance. The current plan is to support it for at least 10 years from the initial 2.7 release. This means there will be bugfix releases until 2020.

Planned future release dates:

  • 2.7.10rc1 2015-05-09
  • 2.7.10 2015-05-23
  • 2.7.11 December, 2015
  • beyond this date, releases as needed

Dates of previous maintenance releases:

  • 2.7.1 2010-11-27
  • 2.7.2 2011-07-21
  • 2.7.3rc1 2012-02-23
  • 2.7.3rc2 2012-03-15
  • 2.7.3 2012-04-09
  • 2.7.4rc1 2013-03-23
  • 2.7.4 2013-04-06
  • 2.7.5 2013-05-12
  • 2.7.6rc1 2013-10-26
  • 2.7.6 2013-11-10
  • 2.7.7rc1 2014-05-17
  • 2.7.7 2014-05-31
  • 2.7.8 2014-06-30
  • 2.7.9rc1 2014-11-26
  • 2.7.9 2014-12-10

2.7.0 Release Schedule

The release schedule for 2.7.0 was:

  • 2.7 alpha 1 2009-12-05
  • 2.7 alpha 2 2010-01-09
  • 2.7 alpha 3 2010-02-06
  • 2.7 alpha 4 2010-03-06
  • 2.7 beta 1 2010-04-03
  • 2.7 beta 2 2010-05-08
  • 2.7 rc1 2010-06-05
  • 2.7 rc2 2010-06-19
  • 2.7 final 2010-07-03

Possible features for 2.7

Nothing here. [Note that a moratorium on core language changes is in effect.]

References

None yet!

pep-0374 Choosing a distributed VCS for the Python project

PEP:374
Title:Choosing a distributed VCS for the Python project
Version:$Revision$
Last-Modified:$Date$
Author:Brett Cannon <brett at python.org>, Stephen J. Turnbull <stephen at xemacs.org>, Alexandre Vassalotti <alexandre at peadrop.com>, Barry Warsaw <barry at python.org>, Dirkjan Ochtman <dirkjan at ochtman.nl>
Status:Final
Type:Process
Content-Type:text/x-rst
Created:07-Nov-2008
Post-History:07-Nov-2008 22-Jan-2009

Rationale

Python has been using a centralized version control system (VCS; first CVS, now Subversion) for years to great effect. Having a master copy of the official version of Python provides people with a single place to always get the official Python source code. It has also allowed for the storage of the history of the language, mostly for help with development, but also for posterity. And of course the V in VCS is very helpful when developing.

But a centralized version control system has its drawbacks. First and foremost, in order to have the benefits of version control with Python in a seamless fashion, one must be a "core developer" (i.e. someone with commit privileges on the master copy of Python). People who are not core developers but who wish to work with Python's revision tree, e.g. anyone writing a patch for Python or creating a custom version, do not have direct tool support for revisions. This can be quite a limitation, since these non-core developers cannot easily do basic tasks such as reverting changes to a previously saved state, creating branches, publishing one's changes with full revision history, etc. For non-core developers, the last safe tree state is one the Python developers happen to set, and this prevents safe development. This second-class citizenship is a hindrance to people who wish to contribute to Python with a patch of any complexity and want a way to incrementally save their progress to make their development lives easier.

There is also the issue of having to be online to be able to commit one's work. Because centralized VCSs keep a central copy that stores all revisions, one must have Internet access in order for their revisions to be stored; no Net, no commit. This can be annoying if you happen to be traveling and lack any Internet. There is also the situation of someone wishing to contribute to Python but having a bad Internet connection where committing is time-consuming and expensive and it might work out better to do it in a single step.

Another drawback to a centralized VCS is that a common use case is for a developer to revise patches in response to review comments. This is more difficult with a centralized model because there's no place to contain intermediate work. It's either all checked in or none of it is checked in. In a centralized VCS, it's also very difficult to track changes to the trunk as they are committed while you're working on your feature or bug-fix branch. This increases the risk that such branches will grow stale and out-dated, or that merging them into the trunk will generate too many conflicts to be easily resolved.

Lastly, there is the issue of maintenance of Python. At any one time there is at least one major version of Python under development (at the time of this writing there are two). For each major version of Python under development there is at least the maintenance version of the last minor version and the in-development minor version (e.g. with 2.6 just released, that means that both 2.6 and 2.7 are being worked on). Once a release is done, a branch is created between the code bases where changes in one version do not (but could) belong in the other version. As of right now there is no natural support for this branch in time in central VCSs; you must use tools that simulate the branching. Tracking merges is similarly painful for developers, as revisions often need to be merged between four active branches (e.g. 2.6 maintenance, 3.0 maintenance, 2.7 development, 3.1 development). In this case, VCSs such as Subversion only handle this through arcane third party tools.

Distributed VCSs (DVCSs) solve all of these problems. While one can keep a master copy of a revision tree, anyone is free to copy that tree for their own use. This gives everyone the power to commit changes to their copy, online or offline. It also more naturally ties into the idea of branching in the history of a revision tree for maintenance and the development of new features bound for Python. DVCSs also provide a great many additional features that centralized VCSs don't or can't provide.

This PEP explores the possibility of changing Python's use of Subversion to any of the currently popular DVCSs, in order to gain the benefits outlined above. This PEP does not guarantee that a switch to a DVCS will occur at the conclusion of this PEP. It is quite possible that no clear winner will be found and that svn will continue to be used. If this happens, this PEP will be revisited and revised in the future as the state of DVCSs evolves.

Terminology

Agreeing on a common terminology is surprisingly difficult, primarily because each VCS uses these terms when describing subtly different tasks, objects, and concepts. Where possible, we try to provide a generic definition of the concepts, but you should consult the individual system's glossaries for details. Here are some basic references for terminology, from some of the standard web-based references on each VCS. You can also refer to glossaries for each DVCS:

branch
A line of development; a collection of revisions, ordered by time.
checkout/working copy/working tree
A tree of code the developer can edit, linked to a branch.
index
A "staging area" where a revision is built (unique to git).
repository
A collection of revisions, organized into branches.
clone
A complete copy of a branch or repository.
commit
To record a revision in a repository.
merge
Applying all the changes and history from one branch/repository to another.
pull
To update a checkout/clone from the original branch/repository, which can be remote or local.
push/publish
To copy a revision, and all revisions it depends on, from one repository to another.
cherry-pick
To merge one or more specific revisions from one branch to another, possibly in a different repository, possibly without their dependent revisions.
rebase
To "detach" a branch, and move it to a new branch point; move commits to the beginning of a branch instead of where they happened in time.

Typical Workflow

At the moment, the typical workflow for a Python core developer is:

  • Edit code in a checkout until it is stable enough to commit/push.
  • Commit to the master repository.

It is a rather simple workflow, but it has drawbacks. For one, because any work that involves the repository requires network access, commits/pushes tend not to be as atomic as they could be. There is also the drawback of there being no cheap way to create new checkouts beyond a recursive copy of the checkout directory.

A DVCS would lead to a workflow more like this:

  • Branch off of a local clone of the master repository.
  • Edit code, committing in atomic pieces.
  • Merge the branch into the mainline, and
  • Push all commits to the master repository.

While there are more possible steps, the workflow is much more independent of the master repository than is currently possible. By being able to commit locally at the speed of your disk, a core developer is able to make atomic commits much more frequently, minimizing commits that do multiple things to the code. Also, by using a branch, the changes are isolated (if desired) from other changes being made by other developers. Because branches are cheap, it is easy to create and maintain many smaller branches that each address one specific issue, e.g. one bug or one new feature. More sophisticated features of DVCSs allow the developer to more easily track long-running development branches as the official mainline progresses.
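The steps above can be sketched end to end with git, using a local bare repository as a stand-in for the master repository (the repository names and file contents here are hypothetical):

```shell
# Stand-in for the master repository (a local path, for illustration).
git init --bare master.git
git clone master.git work
cd work
git config user.name "Firstname Lastname"
git config user.email email.address@example.com
git commit --allow-empty -m "Initial commit"
git push origin HEAD
mainline=$(git symbolic-ref --short HEAD)  # name of the mainline branch
# Branch off the local clone for an isolated fix.
git checkout -b issue-0000
echo "The cake is a lie!" > README
git add README
git commit -m "Fix README in one atomic commit"
# Merge the branch into the mainline and push to the master repository.
git checkout "$mainline"
git merge issue-0000
git push origin "$mainline"
```

The same flow applies to bzr and hg with their equivalent branch, commit, merge, and push commands.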

Contenders

Name           Short Name  Version  2.x Trunk Mirror                     3.x Trunk Mirror
Bazaar [1]     bzr         1.12     http://code.python.org/python/trunk  http://code.python.org/python/3.0
Mercurial [2]  hg          1.2.0    http://code.python.org/hg/trunk/     http://code.python.org/hg/branches/py3k/
git [3]        N/A         1.6.1    git://code.python.org/python/trunk   git://code.python.org/python/branches/py3k

This PEP does not consider darcs, arch, or monotone. The main problem with these DVCSs is that they are simply not popular enough to bother supporting when they do not provide some very compelling features that the other DVCSs provide. Arch and darcs also have significant performance problems which seem unlikely to be addressed in the near future.

Interoperability

For those who have already decided which DVCSs they want to use, and are willing to maintain local mirrors themselves, all three DVCSs support interchange via the git "fast-import" changeset format. git does so natively, of course, and native support for Bazaar is under active development, and getting good early reviews as of mid-February 2009. Mercurial has idiosyncratic support for importing via its hg convert command, and third-party fast-import support [4] is available for exporting. Also, the Tailor [5] tool supports automatic maintenance of mirrors based on an official repository in any of the candidate formats with a local mirror in any format.

Usage Scenarios

Probably the best way to help decide on whether/which DVCS should replace Subversion is to see what it takes to perform some real-world usage scenarios that developers (core and non-core) have to work with. Each usage scenario outlines what it is, a bullet list of what the basic steps are (which can vary slightly per VCS), and how to perform the usage scenario in the various VCSs (including Subversion).

Each VCS had a single author in charge of writing implementations for each scenario (unless otherwise noted).

Name VCS
Brett svn
Barry bzr
Alexandre hg
Stephen git

Initial Setup

Some DVCSs have some perks if you do some initial setup upfront. This section covers what can be done before any of the usage scenarios are run in order to take better advantage of the tools.

All of the DVCSs support configuring your project identification. Unlike the centralized systems, they use your email address to identify your commits. (Access control is generally done by mechanisms external to the DVCS, such as ssh or console login). This identity may be associated with a full name.

All of the DVCSs will query the system to get some approximation to this information, but that may not be what you want. They also support setting this information on a per-user basis, and on a per-project basis. Convenience commands to set these attributes vary, but all allow direct editing of configuration files.

Some VCSs support end-of-line (EOL) conversions on checkout/checkin.

svn

None required, but it is recommended you follow the guidelines in the dev FAQ.

bzr

No setup is required, but for much quicker and space-efficient local branching, you should create a shared repository to hold all your Python branches. A shared repository is really just a parent directory containing a .bzr directory. When bzr commits a revision, it searches from the local directory on up the file system for a .bzr directory to hold the revision. By sharing revisions across multiple branches, you cut down on the amount of disk space used. Do this:

cd ~/projects
bzr init-repo python
cd python

Now, all your Python branches should be created inside of ~/projects/python.

There are also some settings you can put in your ~/.bazaar/bazaar.conf and ~/.bazaar/locations.conf files to set up defaults for interacting with Python code. None of them are required, although some are recommended. E.g. I would suggest gpg-signing all commits, but that might be too high a barrier for developers. Also, you can set up default push locations depending on where you want to push branches by default. If you have write access to the master branches, that push location could be code.python.org. Otherwise, it might be a free Bazaar code hosting service such as Launchpad. If Bazaar is chosen, we should decide what the policies and recommendations are.

At a minimum, I would set up your email address:

bzr whoami "Firstname Lastname <email.address@example.com>"

As with hg and git below, there are ways to set your email address (or really, just about any parameter) on a per-repository basis. You do this with settings in your $HOME/.bazaar/locations.conf file, which has an ini-style format, as do the configuration files of the other DVCSs. See the Bazaar documentation for details, which mostly aren't relevant for this discussion.

hg

Minimally, you should set your user name. To do so, create the file .hgrc in your home directory and add the following:

[ui]
username = Firstname Lastname <email.address@example.com>

If you are using Windows and your tools do not support Unix-style newlines, you can enable automatic newline translation by adding to your configuration:

[extensions]
win32text =

These options can also be set locally to a given repository by customizing <repo>/.hg/hgrc, instead of ~/.hgrc.

git

None needed. However, git supports a number of features that can smooth your work, with a little preparation. git supports setting defaults at the workspace, user, and system levels. The system level is out of scope of this PEP. The user configuration file is $HOME/.gitconfig on Unix-like systems, and the workspace configuration file is $REPOSITORY/.git/config.

You can use the git-config tool to set preferences for user.name and user.email either globally (for your system login account) or locally (to a given git working copy), or you can edit the configuration files (which have the same format as shown in the Mercurial section above):

# my full name doesn't change
# note "--global" flag means per user
# (system-wide configuration is set with "--system")
git config --global user.name 'Firstname Lastname'
# but use my Pythonic email address
cd /path/to/python/repository
git config user.email email.address@python.example.com

If you are using Windows, you probably want to set the core.autocrlf and core.safecrlf preferences to true using git-config:

# check out files with CRLF line endings rather than Unix-style LF only
git config --global core.autocrlf true
# scream if a transformation would be ambiguous
# (eg, a working file contains both naked LF and CRLF)
# and check them back in with the reverse transformation
git config --global core.safecrlf true

Although the repository will usually contain a .gitignore file specifying file names that rarely, if ever, should be registered in the VCS, you may have personal conventions (e.g., always editing log messages in a temporary file named ".msg") that you may wish to specify:

# tell git where my personal ignores are
git config --global core.excludesfile ~/.gitignore
# I use .msg for my long commit logs, and Emacs makes backups in
# files ending with ~
# these are globs, not regular expressions
echo '*~' >> ~/.gitignore
echo '.msg' >> ~/.gitignore

If you use multiple branches, as with the other VCSes, you can save a lot of space by putting all objects in a common object store. This also can save download time, if the origins of the branches were in different repositories, because objects are shared across branches in your repository even if they were not present in the upstream repositories. git is very space- and time-efficient and applies a number of optimizations automatically, so this configuration is optional. (Examples are omitted.)
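As a minimal illustration of the shared-object-store idea (the paths here are hypothetical), `git clone --reference` makes a new clone borrow objects from an existing local repository instead of copying them:

```shell
# Assume an existing local clone at ./trunk (hypothetical).
git init trunk
cd trunk
git config user.name "Firstname Lastname"
git config user.email email.address@example.com
git commit --allow-empty -m "Initial commit"
cd ..
# Make a second clone that borrows objects from the first repository.
git clone --reference trunk trunk trunk-py3k
# The new clone records the shared object store here:
cat trunk-py3k/.git/objects/info/alternates
```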

One-Off Checkout

As a non-core developer, I want to create and publish a one-off patch that fixes a bug, so that a core developer can review it for inclusion in the mainline.

  • Checkout/branch/clone trunk.
  • Edit some code.
  • Generate a patch (based on what is best supported by the VCS, e.g. branch history).
  • Receive reviewer comments and address the issues.
  • Generate a second patch for the core developer to commit.

svn

svn checkout http://svn.python.org/projects/python/trunk
cd trunk
# Edit some code.
echo "The cake is a lie!" > README
# Since svn lacks support for local commits, we fake it with patches.
svn diff >> commit-1.diff
svn diff >> patch-1.diff
# Upload the patch-1 to bugs.python.org.
# Receive reviewer comments.
# Edit some code.
echo "The cake is real!" > README
# Since svn lacks support for local commits, we fake it with patches.
svn diff >> commit-2.diff
svn diff >> patch-2.diff
# Upload patch-2 to bugs.python.org

bzr

bzr branch http://code.python.org/python/trunk
cd trunk
# Edit some code.
bzr commit -m 'Stuff I did'
bzr send -o bundle
# Upload bundle to bugs.python.org
# Receive reviewer comments
# Edit some code
bzr commit -m 'Respond to reviewer comments'
bzr send -o bundle
# Upload updated bundle to bugs.python.org

The bundle file is like a super-patch. It can be read by patch(1), but it contains additional metadata so that it can be fed to bzr merge to produce a fully usable branch, complete with history. See the Patch Review section below.

hg

hg clone http://code.python.org/hg/trunk
cd trunk
# Edit some code.
hg commit -m "Stuff I did"
hg outgoing -p > fixes.patch
# Upload patch to bugs.python.org
# Receive reviewer comments
# Edit some code
hg commit -m "Address reviewer comments."
hg outgoing -p > additional-fixes.patch
# Upload patch to bugs.python.org

While hg outgoing does not have the flag for it, most Mercurial commands support git's extended patch format through a --git option. This option can be set in one's .hgrc file so that all commands that generate a patch use the extended format.
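For example, the relevant .hgrc fragment looks like this (the option is named git and lives under the [diff] section of Mercurial's configuration):

```
[diff]
# Generate patches in git's extended format by default.
git = True
```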

git

The patches could be created with git diff master > stuff-i-did.patch, too, but git format-patch | git am knows some tricks (empty files, renames, etc.) that ordinary patch can't handle. git grabs "Stuff I did" out of the commit message to create the file name 0001-Stuff-I-did.patch. See Patch Review below for a description of the git-format-patch format.

# Get the mainline code.
git clone git://code.python.org/python/trunk
cd trunk
# Edit some code.
git commit -a -m 'Stuff I did.'
# Create patch for my changes (i.e, relative to master).
git format-patch master
git tag stuff-v1
# Upload 0001-Stuff-I-did.patch to bugs.python.org.
# Time passes ... receive reviewer comments.
# Edit more code.
git commit -a -m 'Address reviewer comments.'
# Make an add-on patch to apply on top of the original.
git format-patch stuff-v1
# Upload 0001-Address-reviewer-comments.patch to bugs.python.org.

Backing Out Changes

As a core developer, I want to undo a change that was not ready for inclusion in the mainline.

  • Back out the unwanted change.
  • Push patch to server.

svn

# Assume the change to revert is in revision 40
svn merge -c -40 .
# Resolve conflicts, if any.
svn commit -m "Reverted revision 40"

bzr

# Assume the change to revert is in revision 40
bzr merge -r 40..39
# Resolve conflicts, if any.
bzr commit -m "Reverted revision 40"

Note that if the change you want to revert is the last one that was made, you can just use bzr uncommit.

hg

# Assume the change to revert is in revision 9150dd9c6d30
hg backout --merge -r 9150dd9c6d30
# Resolve conflicts, if any.
hg commit -m "Reverted changeset 9150dd9c6d30"
hg push

Note, you can use "hg rollback" and "hg strip" to revert changes you committed in your local repository, but did not yet push to other repositories.

git

# Assume the change to revert is the grandfather of a revision tagged "newhotness".
git revert newhotness~2
# Resolve conflicts if any.  If there are no conflicts, the commit
# will be done automatically by "git revert", which prompts for a log.
git commit -m "Reverted changeset 9150dd9c6d30."
git push

Patch Review

As a core developer, I want to review patches submitted by other people, so that I can make sure that only approved changes are added to Python.

Core developers have to review patches as submitted by other people. This requires applying the patch, testing it, and then tossing away the changes. The assumption can be made that a core developer already has a checkout/branch/clone of the trunk.

  • Branch off of trunk.
  • Apply patch w/o any comments as generated by the patch submitter.
  • Push patch to server.
  • Delete now-useless branch.

svn

Subversion does not fit this development style very well, as there is no such thing as a "branch" as defined in this PEP. Instead, a developer either needs to create another checkout for testing a patch or create a branch on the server. Up to this point, core developers have not taken the "branch on the server" approach to dealing with individual patches. For this scenario the assumption will be that the developer creates a local checkout of the trunk to work with:

cp -r trunk issue0000
cd issue0000
patch -p0 < __patch__
# Review patch.
svn commit -m "Some patch."
cd ..
rm -r issue0000

Another option is to only have a single checkout running at any one time and use svn diff along with svn revert -R to store away independent changes you may have made.

bzr

bzr branch trunk issueNNNN
# Download `patch` bundle from Roundup
bzr merge patch
# Review patch
bzr commit -m'Patch NNN by So N. So' --fixes python:NNNN
bzr push bzr+ssh://me@code.python.org/trunk
rm -rf ../issueNNNN

Alternatively, since you're probably going to commit these changes to the trunk, you could just do a checkout. That would give you a local working tree while the branch (i.e. all revisions) would continue to live on the server. This is similar to the svn model and might allow you to review the patch more quickly. There's no need for the push in this case:

bzr checkout trunk issueNNNN
# Download `patch` bundle from Roundup
bzr merge patch
# Review patch
bzr commit -m'Patch NNNN by So N. So' --fixes python:NNNN
rm -rf ../issueNNNN

hg

hg clone trunk issue0000
cd issue0000
# If the patch was generated using hg export, the user name of the
# submitter is automatically recorded. Otherwise,
# use hg import --no-commit submitted.diff and commit with
# hg commit -u "Firstname Lastname <email.address@example.com>"
hg import submitted.diff
# Review patch.
hg push ssh://alexandre@code.python.org/hg/trunk/

git

We assume a patch created by git-format-patch. This is a Unix mbox file containing one or more patches, each formatted as an RFC 2822 message. git-am interprets each message as a commit as follows. The author of the patch is taken from the From: header, the date from the Date: header. The commit log is created by concatenating the content of the subject line, a blank line, and the message body up to the start of the patch:

cd trunk
# Create a branch in case we don't like the patch.
# This checkout takes zero time, since the workspace is left in
# the same state as the master branch.
git checkout -b patch-review
# Download patch from bugs.python.org to submitted.patch.
git am < submitted.patch
# Review and approve patch.
# Merge into master and push.
git checkout master
git merge patch-review
git push

Backport

As a core developer, I want to apply a patch to 2.6, 2.7, 3.0, and 3.1 so that I can fix a problem in all four versions.

Because both the cutting-edge and the latest release versions are always under development, Python currently has four branches being worked on simultaneously. That makes it important for a change to propagate easily through the various branches.

svn

Because of Python's use of svnmerge, changes start with the trunk (2.7) and then get merged to the release version of 2.6. To get the change into the 3.x series, the change is merged into 3.1, fixed up, and then merged into 3.0 (2.7 -> 2.6; 2.7 -> 3.1 -> 3.0).

This is in contrast to a port-forward strategy where the patch would have been added to 2.6 and then pulled forward into newer versions (2.6 -> 2.7 -> 3.0 -> 3.1).

# Assume patch applied to 2.7 in revision 0000.
cd release26-maint
svnmerge merge -r 0000
# Resolve merge conflicts and make sure patch works.
svn commit -F svnmerge-commit-message.txt  # revision 0001.
cd ../py3k
svnmerge merge -r 0000
# Same as for 2.6, except Misc/NEWS changes are reverted.
svn revert Misc/NEWS
svn commit -F svnmerge-commit-message.txt  # revision 0002.
cd ../release30-maint
svnmerge merge -r 0002
svn commit -F svnmerge-commit-message.txt  # revision 0003.

bzr

Bazaar is pretty straightforward here, since it supports cherry-picking revisions manually. In the example below, we could have given a revision id instead of a revision number, but that's usually not necessary. Martin Pool suggests "We'd generally recommend doing the fix first in the oldest supported branch, and then merging it forward to the later releases.":

# Assume patch applied to 2.7 in revision 0000
cd release26-maint
bzr merge ../trunk -c 0000
# Resolve conflicts and make sure patch works
bzr commit -m 'Back port patch NNNN'
bzr push bzr+ssh://me@code.python.org/trunk
cd ../py3k
bzr merge ../trunk -r 0000
# Same as for 2.6 except Misc/NEWS changes are reverted
bzr revert Misc/NEWS
bzr commit -m 'Forward port patch NNNN'
bzr push bzr+ssh://me@code.python.org/py3k

hg

Mercurial, like the other DVCSs, does not support the current workflow used by Python core developers to backport patches very well. Right now, bug fixes are first applied to the development mainline (i.e., trunk), then back-ported to the maintenance branches and forward-ported, as necessary, to the py3k branch. This workflow requires the ability to cherry-pick individual changes. Mercurial's transplant extension provides this ability. Here is an example of the scenario using this workflow:

cd release26-maint
# Assume patch applied to 2.7 in revision 0000
hg transplant -s ../trunk 0000
# Resolve conflicts, if any.
cd ../py3k
hg pull ../trunk
hg merge
hg revert Misc/NEWS
hg commit -m "Merged trunk"
hg push

In the above example, transplant acts much like the current svnmerge command. When transplant is invoked without the revision, the command launches an interactive loop useful for transplanting multiple changes. Another useful feature is the --filter option which can be used to modify changesets programmatically (e.g., it could be used for removing changes to Misc/NEWS automatically).

Alternatively to the traditional workflow, we could avoid transplanting changesets by committing bug fixes to the oldest supported release, then merge these fixes upward to the more recent branches.

cd release25-maint
hg import fix_some_bug.diff
# Review patch and run test suite. Revert if failure.
hg push
cd ../release26-maint
hg pull ../release25-maint
hg merge
# Resolve conflicts, if any. Then, review patch and run test suite.
hg commit -m "Merged patches from release25-maint."
hg push
cd ../trunk
hg pull ../release26-maint
hg merge
# Resolve conflicts, if any, then review.
hg commit -m "Merged patches from release26-maint."
hg push

Although this approach makes the history non-linear and slightly more difficult to follow, it encourages fixing bugs across all supported releases. Furthermore, it scales better when there are many changes to backport, because we do not need to seek out the specific revision IDs to merge.

git

In git I would have a workspace which contains all of the relevant master repository branches. git cherry-pick doesn't work across repositories; you need to have the branches in the same repository.

# Assume patch applied to 2.7 in revision release27~3 (4th patch back from tip).
cd integration
git checkout release26
git cherry-pick release27~3
# If there are conflicts, resolve them, and commit those changes.
# git commit -a -m "Resolve conflicts."
# Run test suite. If fixes are necessary, record as a separate commit.
# git commit -a -m "Fix code causing test failures."
git checkout master
git cherry-pick release27~3
# Do any conflict resolution and test failure fixups.
# Revert Misc/NEWS changes.
git checkout HEAD^ -- Misc/NEWS
git commit -m 'Revert cherry-picked Misc/NEWS changes.' Misc/NEWS
# Push both ports.
git push release26 master

If you are regularly merging (rather than cherry-picking) from a given branch, then you can block a given commit from being accidentally merged in the future by merging, then reverting it. This does not prevent a cherry-pick from pulling in the unwanted patch, and this technique requires blocking everything that you don't want merged. I'm not sure if this differs from svn on this point.

cd trunk
# Merge in the alpha tested code.
git merge experimental-branch
# We don't want the 3rd-to-last commit from the experimental-branch,
# and we don't want it to ever be merged.
# The notation "^N" means Nth parent of the current commit. Thus HEAD^2^1^1
# means the first parent of the first parent of the second parent of HEAD.
git revert HEAD^2^1^1
# Propagate the merge and the prohibition to the public repository.
git push

Coordinated Development of a New Feature

Sometimes core developers end up working on a major feature with several developers. As a core developer, I want to be able to publish feature branches to a common public location so that I can collaborate with other developers.

This requires creating a branch on a server that other developers can access. All of the DVCSs support creating new repositories on hosts where the developer is already able to commit, with appropriate configuration of the repository host. This is similar in concept to the existing sandbox in svn, although details of repository initialization may differ.

For non-core developers, there are various more-or-less public-access repository-hosting services. Bazaar has Launchpad [6], Mercurial has bitbucket.org [7], and git has GitHub [8]. All also have easy-to-use CGI interfaces for developers who maintain their own servers.

  • Branch trunk.
  • Pull from branch on the server.
  • Pull from trunk.
  • Push merge to trunk.

svn

# Create branch.
svn copy svn+ssh://pythondev@svn.python.org/python/trunk svn+ssh://pythondev@svn.python.org/python/branches/NewHotness
svn checkout svn+ssh://pythondev@svn.python.org/python/branches/NewHotness
cd NewHotness
svnmerge init
svn commit -m "Initialize svnmerge."
# Pull in changes from other developers.
svn update
# Pull in trunk and merge to the branch.
svnmerge merge
svn commit -F svnmerge-commit-message.txt

This scenario is incomplete, as the decision on which DVCS to go with was made before the work was finished.
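For comparison, here is a rough sketch of how the same scenario might look in git, using a local bare repository to stand in for the server-hosted feature branch (the NewHotness name is carried over from the svn example; the file contents are made up):

```shell
# Stand-in for a shared feature branch hosted on a server.
git init --bare NewHotness.git
git clone NewHotness.git NewHotness
cd NewHotness
git config user.name "Firstname Lastname"
git config user.email email.address@example.com
git commit --allow-empty -m "Start the NewHotness feature branch"
git push origin HEAD
branch=$(git symbolic-ref --short HEAD)
# Pull in changes pushed by other developers.
git pull origin "$branch"
# Work, commit locally, and publish for collaborators.
echo "new hotness" > feature.txt
git add feature.txt
git commit -m "Feature work"
git push origin "$branch"
```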

Separation of Issue Dependencies

Sometimes, while working on an issue, it becomes apparent that the problem being worked on is actually a compound issue of various smaller issues. Being able to take the current work and then begin working on a separate issue is very helpful to separate out issues into individual units of work instead of compounding them into a single, large unit.

  • Create a branch A (e.g. urllib has a bug).
  • Edit some code.
  • Create a new branch B that branch A depends on (e.g. the urllib bug exposes a socket bug).
  • Edit some code in branch B.
  • Commit branch B.
  • Edit some code in branch A.
  • Commit branch A.
  • Clean up.

svn

To make up for its lack of cheap branching, svn has a changelist option to associate a file with a single changelist. This is not as powerful as being able to associate at the commit level. There is also no way to express dependencies between changelists.

cp -r trunk issue0000
cd issue0000
# Edit some code.
echo "The cake is a lie!" > README
svn changelist A README
# Edit some other code.
echo "I own Python!" > LICENSE
svn changelist B LICENSE
svn ci -m "Tell it how it is." --changelist B
# Edit changelist A some more.
svn ci -m "Speak the truth." --changelist A
cd ..
rm -rf issue0000

bzr

Here's an approach that uses bzr shelf (now a standard part of bzr) to squirrel away some changes temporarily while you take a detour to fix the socket bugs.

bzr branch trunk bug-0000
cd bug-0000
# Edit some code. Dang, we need to fix the socket module.
bzr shelve --all
# Edit some code.
bzr commit -m "Socket module fixes"
# Detour over, now resume fixing urllib
bzr unshelve
# Edit some code

Another approach uses the loom plugin. Looms can greatly simplify working on dependent branches because they automatically take care of the stacking dependencies for you. Imagine looms as a stack of dependent branches (called "threads" in loom parlance), with easy ways to move up and down the stack of threads, merge changes up the stack to descendant threads, create diffs between threads, etc. Occasionally, you may need or want to export your loom threads into separate branches, either for review or commit. Higher threads incorporate all the changes in the lower threads, automatically.

bzr branch trunk bug-0000
cd bug-0000
bzr loomify --base trunk
bzr create-thread fix-urllib
# Edit some code. Dang, we need to fix the socket module first.
bzr commit -m "Checkpointing my work so far"
bzr down-thread
bzr create-thread fix-socket
# Edit some code
bzr commit -m "Socket module fixes"
bzr up-thread
# Manually resolve conflicts if necessary
bzr commit -m 'Merge in socket fixes'
# Edit me some more code
bzr commit -m "Now that socket is fixed, complete the urllib fixes"
bzr record done

For bonus points, let's say someone else fixes the socket module in exactly the same way you just did. Perhaps this person even grabbed your fix-socket thread and applied just that to the trunk. You'd like to be able to merge their changes into your loom and delete your now-redundant fix-socket thread.

bzr down-thread trunk
# Get all new revisions to the trunk. If you've done things
# correctly, this will succeed without conflict.
bzr pull
bzr up-thread
# See? The fix-socket thread is now identical to the trunk
bzr commit -m 'Merge in trunk changes'
bzr diff -r thread: | wc -l # returns 0
bzr combine-thread
bzr up-thread
# Resolve any conflicts
bzr commit -m 'Merge trunk'
# Now our top-thread has an up-to-date trunk and just the urllib fix.

hg

One approach is to use the shelve extension; this extension is not included with Mercurial, but it is easy to install. With shelve, you can select changes to put temporarily aside.

hg clone trunk issue0000
cd issue0000
# Edit some code (e.g. urllib).
hg shelve
# Select changes to put aside
# Edit some other code (e.g. socket).
hg commit
hg unshelve
# Complete initial fix.
hg commit
cd ../trunk
hg pull ../issue0000
hg merge
hg commit
rm -rf ../issue0000

There are several other ways to approach this scenario with Mercurial. Alexander Solovyov presented a few alternative approaches [9] on Mercurial's mailing list.

git

cd trunk
# Edit some code in urllib.
# Discover a bug in socket, want to fix that first.
# So save away our current work.
git stash
# Edit some code, commit some changes.
git commit -a -m "Completed fix of socket."
# Restore the in-progress work on urllib.
git stash apply
# Edit me some more code, commit some more fixes.
git commit -a -m "Complete urllib fixes."
# And push both patches to the public repository.
git push

Bonus points: suppose you took your time, and someone else fixes socket in the same way you just did, and landed that in the trunk. In that case, your push will fail because your branch is not up-to-date. If the fix was a one-liner, there's a very good chance that it's exactly the same, character for character. git would notice that, and you are done; git will silently merge them.

Suppose we're not so lucky:

# Update your branch.
git pull git://code.python.org/public/trunk master

# git has fetched all the necessary data, but reports that the
# merge failed.  We discover the nearly-duplicated patch.
# Neither our version of the master branch nor the workspace has
# been touched.  Revert our socket patch and pull again:
git revert HEAD^
git pull git://code.python.org/public/trunk master

Like Bazaar and Mercurial, git has extensions to manage stacks of patches. You can use the original Quilt by Andrew Morton, or there is StGit ("stacked git") which integrates patch-tracking for large sets of patches into the VCS in a way similar to Mercurial Queues or Bazaar looms.

Doing a Python Release

How does PEP 101 change when using a DVCS?

bzr

It will change, but not substantially so. When doing the maintenance branch, we'll just push to the new location instead of doing an svn cp. Tags are totally different, since in svn they are directory copies, but in bzr (and I'm guessing hg), they are just symbolic names for revisions on a particular branch. The release.py script will have to change to use bzr commands instead. It's possible that because DVCSs (in particular, bzr) do cherry picking and merging well enough, we'll be able to create the maintenance branches sooner. It would be a useful exercise to try to do a release off the bzr/hg mirrors.

hg

Clearly, details specific to Subversion in PEP 101 and in the release script will need to be updated. In particular, the release tagging and maintenance branch creation processes will have to be modified to use Mercurial's features; this will simplify and streamline certain aspects of the release process. For example, tagging and re-tagging a release will become a trivial operation since a tag, in Mercurial, is simply a symbolic name for a given revision.

git

It will change, but not substantially so. When doing the maintenance branch, we'll just git push to the new location instead of doing an svn cp. Tags are totally different, since in svn they are directory copies, but in git they are just symbolic names for revisions, as are branches. (The difference between a tag and a branch is that tags refer to a particular commit, and will never change unless you use git tag -f to force them to move. The checked-out branch, on the other hand, is automatically updated by git commit.) The release.py script will have to change to use git commands instead. With git I would create a (local) maintenance branch as soon as the release engineer is chosen. Then I'd "git pull" until I didn't like a patch, when it would be "git pull; git revert ugly-patch", until it started to look like the sensible thing is to fork off, and start doing "git cherry-pick" on the good patches.
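The tag-versus-branch behavior described above can be demonstrated in a throwaway repository (names and commits are hypothetical):

```shell
git init demo
cd demo
git config user.name "Firstname Lastname"
git config user.email email.address@example.com
git commit --allow-empty -m "Release candidate"
# A tag is a fixed name for this exact commit.
git tag v2.6rc1
git commit --allow-empty -m "Post-release fix"
# The branch head moved with the new commit; the tag did not.
git rev-parse v2.6rc1
git rev-parse HEAD
# Re-tagging requires an explicit force.
git tag -f v2.6rc1
```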

Platform/Tool Support

Operating Systems

DVCS  Windows                                  OS X                                           UNIX
bzr   yes (installer) w/ tortoise              yes (installer, fink or MacPorts)              yes (various package formats)
hg    yes (third-party installer) w/ tortoise  yes (third-party installer, fink or MacPorts)  yes (various package formats)
git   yes (third-party installer)              yes (third-party installer, fink or MacPorts)  yes (.deb or .rpm)

As the above table shows, all three DVCSs are available on all three major OS platforms. But it also shows that Bazaar is the only DVCS that directly supports Windows with a binary installer, while Mercurial and git require you to rely on a third party for binaries. Both bzr and hg have a tortoise version, while git does not.

Bazaar and Mercurial also have the benefit of being available in pure Python, with optional extensions available for performance.

CRLF -> LF Support

bzr
My understanding is that support for this is being worked on as I type, landing in a version RSN. I will try to dig up details.
hg
Supported via the win32text extension.
git
I can't say from personal experience, but it looks like there's pretty good support via the core.autocrlf and core.safecrlf configuration attributes.

Case-insensitive filesystem support

bzr
Should be OK. I share branches between Linux and OS X all the time. I've done case changes (e.g. bzr mv Mailman mailman) and as long as I did it on Linux (obviously), when I pulled in the changes on OS X everything was hunky dory.
hg
Mercurial uses a case safe repository mechanism and detects case folding collisions.
git
Since OS X preserves case, you can do case changes there too. git does not have a problem with renames in either direction. However, case-insensitive filesystem support is usually taken to mean complaining about collisions on case-sensitive file systems. git does not do that.

Tools

In terms of code review tools such as Review Board [10] and Rietveld [11], the former supports all three while the latter supports hg and git but not bzr. Bazaar does not yet have an online review board, but it has several ways to manage email based reviews and trunk merging. There's Bundle Buggy [12], Patch Queue Manager [13] (PQM), and Launchpad's code reviews.

All three have some web site online that provides basic hosting support for people who want to put a repository online. Bazaar has Launchpad, Mercurial has bitbucket.org, and git has GitHub. Google Code also has instructions on how to use git with the service, both to hold a repository and how to act as a read-only mirror.

All three also appear to be supported by Buildbot [14].

Usage On Top Of Subversion

DVCS svn support
bzr bzr-svn [15] (third-party)
hg multiple third-parties
git git-svn [16]

All three DVCSs have svn support, although git is the only one to come with that support out-of-the-box.

Server Support

DVCS Web page interface
bzr loggerhead [17]
hg hgweb [18]
git gitweb [19]

All three DVCSs support various hooks on the client and server side for e.g. pre/post-commit verifications.

Development

All three projects are under active development. Git seems to be on a monthly release schedule. Bazaar is on a time-based monthly release schedule. Mercurial is on a 4-month, timed release schedule.

Special Features

bzr

Martin Pool adds: "bzr has a stable Python scripting interface, with a distinction between public and private interfaces and a deprecation window for APIs that are changing. Some plugins are listed in https://edge.launchpad.net/bazaar and http://bazaar-vcs.org/Documentation".

hg

Alexander Solovyov comments:

Mercurial has an easy-to-use, extensive API, with hooks for the main events and the ability to extend commands. Also there is the mq (Mercurial Queues) extension, distributed with Mercurial, which simplifies working with patches.

git

git has a cvsserver mode, i.e., you can check out a tree from git using CVS. You can even commit to the tree, but features like merging are absent, and branches are handled as CVS modules, which is likely to shock a veteran CVS user.

Tests/Impressions

As I (Brett Cannon) am left with the task of making the final decision of which/any DVCS to go with, and not my co-authors, I felt it only fair to write down what tests I ran and my impressions as I evaluated the various tools, so as to be as transparent as possible.

Barrier to Entry

The amount of time and effort it takes to get a checkout of Python's repository is critical. If the difficulty or time is too great then a person wishing to contribute to Python may very well give up. That cannot be allowed to happen.

I measured the checking out of the 2.x trunk as if I was a non-core developer. Timings were done using the time command in zsh and space was calculated with du -c -h.

DVCS  San Francisco  Vancouver  Space
svn   1:04           2:59       139 M
bzr   10:45          16:04      276 M
hg    2:30           5:24       171 M
git   2:54           5:28       134 M

When comparing these numbers to svn, it is important to realize that it is not a 1:1 comparison. Svn does not pull down the entire revision history like all of the DVCSs do. That means svn can perform an initial checkout much faster than the DVCSs, purely because it has less information to download from the network.

Performance of basic information functionality

To see how the tools did for performing a command that required querying the history, the log for the README file was timed.

DVCS  Time
bzr   4.5 s
hg    1.1 s
git   1.5 s

One thing of note during this test was that with git it took longer than with the other two tools to figure out how to get the log without it using a pager. While the pager use is a nice touch in general, figuring out how to keep it from turning on automatically took some time (it turns out the main git command has a --no-pager flag to disable use of the pager).

Figuring out what command to use from built-in help

I ended up trying to find out what the command was to see what URL the repository was cloned from. To do this I used nothing more than the help provided by the tool itself or its man pages.

Bzr was the easiest: bzr info. Running bzr help didn't show what I wanted, but mentioned bzr help commands. That list had the command with a description that made sense.

Git was the second easiest. The command git help didn't show much and did not have a way of listing all commands. That is when I viewed the man page. Reading through the various commands I discovered git remote. The command itself spit out nothing more than origin. Trying git remote origin said it was an error and printed out the command usage. That is when I noticed git remote show. Running git remote show origin gave me the information I wanted.

For hg, I never found the information I wanted on my own. It turns out I wanted hg paths, but that was not obvious from the description of "show definition of symbolic path names" as printed by hg help (it should be noted that reporting this in the PEP led the Mercurial developers to clarify the wording to make the use of the hg paths command clearer).

Updating a checkout

To see how long it takes to update an outdated repository I timed both updating a repository 700 commits behind and 50 commits behind (three weeks stale and 1 week stale, respectively).

DVCS  700 commits  50 commits
bzr   39 s         7 s
hg    17 s         3 s
git   N/A          4 s

Note

Git lacks a value for the 700 commits scenario as it does not seem to allow checking out a repository at a specific revision.

Git deserves special mention for its output from git pull. It not only lists the delta change information for each file but also color-codes the information.

Decision

At PyCon 2009 the decision was made to go with Mercurial.

Why Mercurial over Subversion

While svn has served the development team well, it needs to be admitted that svn does not serve the needs of non-committers as well as a DVCS does. Because svn only provides its features, such as version control and branching, to people with commit privileges on the repository, it can be a hindrance for people who lack commit privileges. But DVCSs have no such limitation, as anyone can create a local branch of Python and perform their own local commits without the burden that comes with cloning the entire svn repository. Allowing anyone to have the same workflow as the core developers was the key reason to switch from svn to hg.

Orthogonal to the benefits of allowing anyone to easily commit locally to their own branches are fast, offline operations. Because hg stores all data locally, there is no need to send requests to a remote server; it can instead work off the local disk. This improves response times tremendously. It also allows offline usage when one lacks an Internet connection. But this benefit is minor and considered a side effect rather than a driving factor for switching off of Subversion.

Why Mercurial over other DVCSs

Git was not chosen for three key reasons (see the PyCon 2009 lightning talk where Brett Cannon lists these exact reasons; talk started at 3:45). First, git's Windows support is the weakest out of the three DVCSs being considered which is unacceptable as Python needs to support development on any platform it runs on. Since Python runs on Windows and some people do develop on the platform it needs solid support. And while git's support is improving, as of this moment it is the weakest by a large enough margin to warrant considering it a problem.

Second, and just as important as the first issue, is that the Python core developers liked git the least out of the three DVCS options. The following table shows the results of a survey of the core developers; by a large margin, git is the least favorite version control system.

DVCS  ++  equal  --  Uninformed
git   5   1      8   13
bzr   10  3      2   12
hg    15  1      1   10

Lastly, all things being equal (which they are not as shown by the previous two issues), it is preferable to use and support a tool written in Python and not one written in C and shell. We are pragmatic enough to not choose a tool simply because it is written in Python, but we do see the usefulness in promoting tools that do use it when it is reasonable to do so as it is in this case.

As for why Mercurial was chosen over Bazaar, it came down to popularity. As the core developer survey shows, hg was preferred over bzr. But the community also appears to prefer hg as was shown at PyCon after git's removal from consideration was announced. Many people came up to Brett and said in various ways that they wanted hg to be chosen. While no one said they did not want bzr chosen, no one said they did either.

Based on all of this information, Guido and Brett decided Mercurial was to be the next version control system for Python.

Transition Plan

PEP 385 outlines the transition from svn to hg.

pep-0375 Python 3.1 Release Schedule

PEP:375
Title:Python 3.1 Release Schedule
Version:$Revision$
Last-Modified:$Date$
Author:Benjamin Peterson <benjamin at python.org>
Status:Final
Type:Informational
Content-Type:text/x-rst
Created:8-Feb-2009
Python-Version:3.1

Abstract

This document describes the development and release schedule for Python 3.1. The schedule primarily concerns itself with PEP-sized items. Small features may be added up to and including the first beta release. Bugs may be fixed until the final release.

Release Manager and Crew

Position             Name
3.1 Release Manager  Benjamin Peterson
Windows installers   Martin v. Loewis
Mac installers       Ronald Oussoren

Release Schedule

  • 3.1a1 March 7, 2009
  • 3.1a2 April 4, 2009
  • 3.1b1 May 6, 2009
  • 3.1rc1 May 30, 2009
  • 3.1rc2 June 13, 2009
  • 3.1 final June 27, 2009

Maintenance Releases

3.1 is no longer maintained. 3.1 received security fixes until June 2012.

Previous maintenance releases are:

  • v3.1.1rc1 2009-08-13
  • v3.1.1 2009-08-16
  • v3.1.2rc1 2010-03-06
  • v3.1.2 2010-03-20
  • v3.1.3rc1 2010-11-13
  • v3.1.3 2010-11-27
  • v3.1.4rc1 2011-05-29
  • v3.1.4 2011-06-11
  • v3.1.5rc1 2012-02-23
  • v3.1.5rc2 2012-03-15
  • v3.1.5 2012-04-06

Features for 3.1

  • importlib
  • io in C
  • Update simplejson to the latest external version [1].
  • Ordered dictionary for collections [2].
  • auto-numbered replacement fields in str.format() strings [3]
  • Nested with-statements in one with statement

pep-0376 Database of Installed Python Distributions

PEP:376
Title:Database of Installed Python Distributions
Version:$Revision$
Last-Modified:$Date$
Author:Tarek ZiadĂŠ <tarek at ziade.org>
Status:Accepted
Type:Standards Track
Content-Type:text/x-rst
Created:22-Feb-2009
Python-Version:2.7, 3.2
Post-History:

Abstract

The goal of this PEP is to provide a standard infrastructure to manage project distributions installed on a system, so all tools that are installing or removing projects are interoperable.

To achieve this goal, the PEP proposes a new format to describe installed distributions on a system. It also describes a reference implementation for the standard library.

In the past an attempt was made to create an installation database (see PEP 262 [3]).

Combined with PEP 345, the current proposal supersedes PEP 262.

Rationale

There are two problems right now in the way distributions are installed in Python:

  • There are too many ways to do it and this makes interoperation difficult.
  • There is no API to get information on installed distributions.

How distributions are installed

Right now, when a distribution is installed in Python, every element can be installed in a different directory.

For instance, Distutils installs the pure Python code in the purelib directory, which is lib/python2.6/site-packages for unix-like systems and Mac OS X, or Lib\site-packages under Python's installation directory for Windows.

Additionally, the install_egg_info subcommand of the Distutils install command adds an .egg-info file for the project into the purelib directory.

For example, for the docutils distribution, which contains one package, an extra module, and executable scripts, three elements are installed in site-packages:

  • docutils: The docutils package.
  • roman.py: An extra module used by docutils.
  • docutils-0.5-py2.6.egg-info: A file containing the distribution metadata as described in PEP 314 [4]. This file corresponds to the file called PKG-INFO, built by the sdist command.

Some executable scripts, such as rst2html.py, are also added in the bin directory of the Python installation.

Another project called setuptools [5] has two other formats to install distributions, called EggFormats [8]:

  • a self-contained .egg directory, that contains all the distribution files and the distribution metadata in a file called PKG-INFO in a subdirectory called EGG-INFO. setuptools creates other files in that directory that can be considered as complementary metadata.
  • an .egg-info directory installed in site-packages, that contains the same files EGG-INFO has in the .egg format.

The first format is automatically used when you install a distribution that uses the setuptools.setup function in its setup.py file, instead of the distutils.core.setup one.

setuptools also adds a reference to the distribution into an easy-install.pth file.

Last, the setuptools project provides an executable script called easy_install [6] that installs all distributions, including distutils-based ones in self-contained .egg directories.

If you want to have standalone .egg-info directories for your distributions, i.e. the second setuptools format, you have to force it when you work with a setuptools-based distribution or with the easy_install script. You can force it by using the --single-version-externally-managed option or the --root option. This will make the setuptools project install the project like distutils does.

This option is used by:

  • the pip [7] installer
  • the Fedora packagers [11].
  • the Debian packagers [12].

Uninstall information

Distutils doesn't provide an uninstall command. If you want to uninstall a distribution, you have to be a power user and remove the various elements that were installed, then look over the .pth files to clean them if necessary.

And the process differs depending on the tools you have used to install the distribution, and on whether the distribution's setup.py uses Distutils or Setuptools.

Under some circumstances, you might not be able to know for sure that you have removed everything, or that you didn't break another distribution by removing a file that is shared among several distributions.

But there's a common behavior: when you install a distribution, files are copied onto your system, and it's possible to keep track of these files for later removal.

Moreover, the Pip project has gained an uninstall feature lately. It records all installed files, using the record option of the install command.

What this PEP proposes

To address those issues, this PEP proposes a few changes:

  • A new .dist-info structure using a directory, inspired by one of the formats of the EggFormats standard from setuptools.
  • New APIs in pkgutil to be able to query the information of installed distributions.
  • An uninstall function and an uninstall script in Distutils.

One .dist-info directory per installed distribution

This PEP proposes an installation format inspired by one of the options in the EggFormats standard, the one that uses a distinct directory located in the site-packages directory.

This distinct directory is named as follows:

name + '-' + version + '.dist-info'

This .dist-info directory can contain these files:

  • METADATA: contains metadata, as described in PEP 345, PEP 314 and PEP 241.

  • RECORD: records the list of installed files

  • INSTALLER: records the name of the tool used to install the project

  • REQUESTED: the presence of this file indicates that the project installation was explicitly requested (i.e., not installed as a dependency).

The METADATA, RECORD and INSTALLER files are mandatory, while REQUESTED may be missing.

This proposal will not impact Python itself because the metadata files are not used anywhere yet in the standard library besides Distutils.

It will impact the setuptools and pip projects but, given the fact that they already work with a directory that contains a PKG-INFO file, the change will have no deep consequences.

RECORD

A RECORD file is added inside the .dist-info directory at installation time when installing a source distribution using the install command. Notice that when installing a binary distribution created with the bdist command or a bdist-based command, the RECORD file will be installed as well, since these commands use the install command to create binary distributions.

The RECORD file holds the list of installed files. These correspond to the files listed by the record option of the install command, and will be generated by default. This allows the implementation of an uninstallation feature, as explained later in this PEP. The install command also provides an option to prevent the RECORD file from being written and this option should be used when creating system packages.

Third-party installation tools also should not overwrite or delete files that are not in a RECORD file without prompting or warning.

This RECORD file is inspired from PEP 262 FILES [3].

The RECORD file is a CSV file, composed of records, one line per installed file. The csv module is used to read the file, with these options:

  • field delimiter: ,
  • quoting char: "
  • line terminator: os.linesep (so \r\n or \n)
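As a sketch of these options in use (Python 3 spelling; the helper names write_record and read_record are ours, not part of the PEP):

```python
import csv
import os

def write_record(record_path, entries):
    # entries: iterable of (path, hash, size) triples, as described in
    # this PEP. newline='' lets the csv module emit its own line
    # terminator (os.linesep here, matching the options listed above);
    # commas inside a field are quoted automatically with '"'.
    with open(record_path, "w", newline="") as f:
        writer = csv.writer(f, delimiter=",", quotechar='"',
                            lineterminator=os.linesep)
        for row in entries:
            writer.writerow(row)

def read_record(record_path):
    # newline='' also gives universal-newline behavior on reading, so a
    # RECORD produced on another platform is read back cleanly.
    with open(record_path, newline="") as f:
        return [tuple(row) for row in csv.reader(f, delimiter=",", quotechar='"')]
```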

When a distribution is installed, files can be installed under:

  • the base location: path defined by the --install-lib option, which defaults to the site-packages directory.
  • the installation prefix: path defined by the --prefix option, which defaults to sys.prefix.
  • any other path on the system.

Each record is composed of three elements:

  • the file's path

    • a '/'-separated path, relative to the base location, if the file is under the base location.
    • a '/'-separated path, relative to the base location, if the file is under the installation prefix AND if the base location is a subpath of the installation prefix.
    • an absolute path, using the local platform separator
  • a hash of the file's contents. Notice that pyc and pyo generated files don't have any hash because they are automatically produced from py files. So checking the hash of the corresponding py file is enough to decide if the file and its associated pyc or pyo files have changed.

    The hash is either the empty string or the hash algorithm as named in hashlib.algorithms_guaranteed, followed by the equals character =, followed by the urlsafe-base64-nopad encoding of the digest (base64.urlsafe_b64encode(digest) with trailing = removed).

  • the file's size in bytes
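The hash encoding described above can be sketched like this (record_hash is an illustrative helper name, not part of the PEP):

```python
import base64
import hashlib

def record_hash(data, algorithm="sha256"):
    """Return '<algorithm>=<digest>', where <digest> is the
    urlsafe-base64-nopad encoding of the hash of data (bytes)."""
    if algorithm not in hashlib.algorithms_guaranteed:
        raise ValueError("not a guaranteed hashlib algorithm: %r" % algorithm)
    digest = hashlib.new(algorithm, data).digest()
    # Drop the trailing '=' padding, as the spec requires.
    encoded = base64.urlsafe_b64encode(digest).rstrip(b"=")
    return "%s=%s" % (algorithm, encoded.decode("ascii"))
```

A pyc or pyo entry would simply carry an empty hash field instead.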

The csv module is used to generate this file, so the field separator is ",". Any "," character found within a field is escaped automatically by csv.

When the file is read, the U option is used so the universal newline support (see PEP 278 [10]) is activated, avoiding any trouble reading a file produced on a platform that uses a different new line terminator.

Here's an example of a RECORD file (extract):

lib/python2.6/site-packages/docutils/__init__.py,md5=nWt-Dge1eug4iAgqLS_uWg,9544
lib/python2.6/site-packages/docutils/__init__.pyc,,
lib/python2.6/site-packages/docutils/core.py,md5=X90C_JLIcC78PL74iuhPnA,66188
lib/python2.6/site-packages/docutils/core.pyc,,
lib/python2.6/site-packages/roman.py,md5=7YhfNczihNjOY0FXlupwBg,234
lib/python2.6/site-packages/roman.pyc,,
/usr/local/bin/rst2html.py,md5=g22D3amDLJP-FhBzCi7EvA,234
/usr/local/bin/rst2html.pyc,,
lib/python2.6/site-packages/docutils-0.5.dist-info/METADATA,md5=ovJyUNzXdArGfmVyb0onyA,195
lib/python2.6/site-packages/docutils-0.5.dist-info/RECORD,,

Notice that the RECORD file can't contain a hash of itself; its own entry is just listed without one.

A project that installs a config.ini file in /etc/myapp will be added like this:

/etc/myapp/config.ini,md5=gLfd6IANquzGLhOkW4Mfgg,9544

For a Windows platform, the drive letter is added for the absolute paths, so a file that is copied into c:\etc\myapp\ will be:

c:\etc\myapp\config.ini,md5=gLfd6IANquzGLhOkW4Mfgg,9544

INSTALLER

The install command has a new option called installer. This option is the name of the tool used to invoke the installation. It's a normalized lower-case string matching [a-z0-9_-.].

$ python setup.py install --installer=pkg-system

It defaults to distutils if not provided.

When a distribution is installed, the INSTALLER file is generated in the .dist-info directory with this value, to keep track of who installed the distribution. The file is a single-line text file.

REQUESTED

Some install tools automatically detect unfulfilled dependencies and install them. In these cases, it is useful to track which distributions were installed purely as a dependency, so if their dependent distribution is later uninstalled, the user can be alerted of the orphaned dependency.

If a distribution is installed by direct user request (the usual case), a file REQUESTED is added to the .dist-info directory of the installed distribution. The REQUESTED file may be empty, or may contain a marker comment line beginning with the "#" character.

If an install tool installs a distribution automatically, as a dependency of another distribution, the REQUESTED file should not be created.

The install command of distutils by default creates the REQUESTED file. It accepts --requested and --no-requested options to explicitly specify whether the file is created.

If a distribution that was already installed on the system as a dependency is later installed by name, the distutils install command will create the REQUESTED file in the .dist-info directory of the existing installation.
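As a sketch of how an install tool might consume this marker (the helper names and the dependency mapping below are hypothetical; the PEP only defines the file's presence):

```python
import os

def is_requested(dist_info_dir):
    # Per the spec, only the file's presence matters; its content may be
    # empty or a '#' marker comment line.
    return os.path.exists(os.path.join(dist_info_dir, "REQUESTED"))

def orphaned(dists, depends_on):
    """Find distributions installed only as dependencies that no
    remaining distribution needs.

    dists: {name: path to its .dist-info directory}
    depends_on: {name: set of names it depends on} (hypothetical input;
    this PEP does not define how dependencies are discovered)
    """
    needed = set()
    for deps in depends_on.values():
        needed.update(deps)
    return {name for name, path in dists.items()
            if not is_requested(path) and name not in needed}
```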

Implementation details

New functions and classes in pkgutil

To use the .dist-info directory content, we need to add a set of APIs to the standard library. The best place to put these APIs is pkgutil.

Functions

The new functions added in the pkgutil module are:

  • distinfo_dirname(name, version) -> directory name

    name is converted to a standard distribution name by replacing any runs of non-alphanumeric characters with a single '-'.

    version is converted to a standard version string. Spaces become dots, and all other non-alphanumeric characters (except dots) become dashes, with runs of multiple dashes condensed to a single dash.

    Both attributes are then converted into their filename-escaped form, i.e. any '-' characters are replaced with '_' other than the one in 'dist-info' and the one separating the name from the version number.

  • get_distributions() -> iterator of Distribution instances.

    Provides an iterator that looks for .dist-info directories in sys.path and returns Distribution instances for each one of them.

  • get_distribution(name) -> Distribution or None.

    Scans all elements in sys.path and looks for all directories ending with .dist-info. Returns a Distribution corresponding to the .dist-info directory that contains a METADATA file that matches name for the name metadata. This function only returns the first result found, since no more than one value is expected. If the directory is not found, returns None.

  • obsoletes_distribution(name, version=None) -> iterator of Distribution instances.

    Iterates over all distributions to find which distributions obsolete name. If a version is provided, it will be used to filter the results.

  • provides_distribution(name, version=None) -> iterator of Distribution instances.

    Iterates over all distributions to find which distributions provide name. If a version is provided, it will be used to filter the results.

  • get_file_users(path) -> iterator of Distribution instances.

    Iterates over all distributions to find out which distributions use path. path can be a local absolute path or a relative '/'-separated path.

    A local absolute path is an absolute path in which occurrences of '/' have been replaced by the system separator given by os.sep.
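The name and version conversions described for distinfo_dirname can be sketched as follows (an illustrative implementation, not the reference one; it reproduces the examples given later in this PEP):

```python
import re

def distinfo_dirname(name, version):
    # name: runs of non-alphanumeric characters become a single '-',
    # which the filename-escaping step then turns into '_'.
    name = re.sub(r"[^A-Za-z0-9]+", "_", name)
    # version: spaces become dots ...
    version = version.replace(" ", ".")
    # ... other non-alphanumerics (except dots) become dashes, with
    # runs condensed to a single dash ...
    version = re.sub(r"[^A-Za-z0-9.]+", "-", version)
    # ... and the filename-escaping step replaces '-' with '_'.
    version = version.replace("-", "_")
    return "%s-%s.dist-info" % (name, version)
```

For example, distinfo_dirname('python-ldap', '2.5 a---5') gives 'python_ldap-2.5.a_5.dist-info'.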

Distribution class

A new class called Distribution is created with the path of the .dist-info directory provided to the constructor. It reads the metadata contained in METADATA when it is instantiated.

Distribution(path) -> instance

Creates a Distribution instance for the given path.

Distribution provides the following attributes:

  • name: The name of the distribution.
  • metadata: A DistributionMetadata instance loaded with the distribution's METADATA file.
  • requested: A boolean that indicates whether the REQUESTED metadata file is present (in other words, whether the distribution was installed by user request).

And following methods:

  • get_installed_files(local=False) -> iterator of (path, hash, size)

    Iterates over the RECORD entries and return a tuple (path, hash, size) for each line. If local is True, the path is transformed into a local absolute path. Otherwise the raw value from RECORD is returned.

    A local absolute path is an absolute path in which occurrences of '/' have been replaced by the system separator given by os.sep.

  • uses(path) -> Boolean

    Returns True if path is listed in RECORD. path can be a local absolute path or a relative '/'-separated path.

  • get_distinfo_file(path, binary=False) -> file object

    Returns a file instance for the file located under the .dist-info directory and pointed to by path.

    path has to be a '/'-separated path relative to the .dist-info directory or an absolute path.

    If path is an absolute path and doesn't start with the .dist-info directory path, a DistutilsError is raised.

    If binary is True, opens the file in read-only binary mode (rb), otherwise opens it in read-only mode (r).

  • get_distinfo_files(local=False) -> iterator of paths

    Iterates over the RECORD entries and returns paths for each line if the path is pointing to a file located in the .dist-info directory or one of its subdirectories.

    If local is True, each path is transformed into a local absolute path. Otherwise the raw value from RECORD is returned.

Notice that the API is organized in five classes that work with directories and Zip files (so it works with files included in Zip files, see PEP 273 for more details [9]). These classes are described in the documentation of the prototype implementation for interested readers [13].

Examples

Let's use some of the new APIs with our docutils example:

>>> from pkgutil import get_distribution, get_file_users, distinfo_dirname
>>> dist = get_distribution('docutils')
>>> dist.name
'docutils'
>>> dist.metadata.version
'0.5'

>>> distinfo_dirname('docutils', '0.5')
'docutils-0.5.dist-info'

>>> distinfo_dirname('python-ldap', '2.5')
'python_ldap-2.5.dist-info'

>>> distinfo_dirname('python-ldap', '2.5 a---5')
'python_ldap-2.5.a_5.dist-info'

>>> for path, hash, size in dist.get_installed_files():
...     print '%s %s %d' % (path, hash, size)
...
python2.6/site-packages/docutils/__init__.py,b690274f621402dda63bf11ba5373bf2,9544
python2.6/site-packages/docutils/core.py,9c4b84aff68aa55f2e9bf70481b94333,66188
python2.6/site-packages/roman.py,a4b84aff68aa55f2e9bf70481b943D3,234
/usr/local/bin/rst2html.py,a4b84aff68aa55f2e9bf70481b943D3,234
python2.6/site-packages/docutils-0.5.dist-info/METADATA,6fe57de576d749536082d8e205b77748,195
python2.6/site-packages/docutils-0.5.dist-info/RECORD

>>> dist.uses('docutils/core.py')
True

>>> dist.uses('/usr/local/bin/rst2html.py')
True

>>> dist.get_distinfo_file('METADATA')
<open file at ...>

>>> dist.requested
True

New functions in Distutils

Distutils already provides a very basic way to install a distribution, which is running the install command over the setup.py script of the distribution.

Distutils2 [3] will provide a very basic uninstall function, which is added in distutils2.util and takes the name of the distribution to uninstall as its argument. uninstall uses the APIs described earlier and removes all unique files, as long as their hash didn't change. Then it removes empty directories left behind.

uninstall returns a list of uninstalled files:

>>> from distutils2.util import uninstall
>>> uninstall('docutils')
['/opt/local/lib/python2.6/site-packages/docutils/core.py',
 ...
 '/opt/local/lib/python2.6/site-packages/docutils/__init__.py']

If the distribution is not found, a DistutilsUninstallError is raised.

Filtering

To make it a reference API for third-party projects that wish to control how uninstall works, a second callable argument can be used. It's called for each file that is removed. If the callable returns True, the file is removed. If it returns False, it's left alone.

Examples:

>>> def _remove_and_log(path):
...     logging.info('Removing %s' % path)
...     return True
...
>>> uninstall('docutils', _remove_and_log)

>>> def _dry_run(path):
...     logging.info('Removing %s (dry run)' % path)
...     return False
...
>>> uninstall('docutils', _dry_run)

Of course, a third-party tool can use lower-level pkgutil APIs to implement its own uninstall feature.

Installer marker

As explained earlier in this PEP, the install command adds an INSTALLER file in the .dist-info directory with the name of the installer.

To avoid removing distributions that were installed by another packaging system, the uninstall function takes an extra argument installer which defaults to distutils2.

When called, uninstall checks that the INSTALLER file matches this argument. If not, it raises a DistutilsUninstallError:

>>> uninstall('docutils')
Traceback (most recent call last):
...
DistutilsUninstallError: docutils was installed by 'cool-pkg-manager'

>>> uninstall('docutils', installer='cool-pkg-manager')

This allows a third-party application to use the uninstall function and strongly suggest that no other program remove a distribution it has previously installed. This is useful when a third-party program that relies on Distutils APIs performs extra steps on the system at installation time that it has to undo at uninstallation time.

Adding an Uninstall script

An uninstall script is added to Distutils2 and is used like this:

$ python -m distutils2.uninstall projectname

Note that the script does not check whether the removal of a distribution breaks another distribution, although, by using the uninstall function, it does make sure that none of the files it removes are used by any other distribution.

Also note that this uninstall script pays no attention to the REQUESTED metadata; that is provided only for use by external tools to provide more advanced dependency management.

Backward compatibility and roadmap

These changes don't introduce any compatibility problems since they will be implemented in:

  • pkgutil in new functions
  • distutils2

The plan is to include the functionality outlined in this PEP in pkgutil for Python 3.2, and in Distutils2.

Distutils2 will also contain a backport of the new pkgutil, and can be used with Python 2.4 onward.

Distributions installed using existing, pre-standardization formats do not have the necessary metadata available for the new API, and thus will be ignored. Third-party tools may of course continue to support previous formats in addition to the new format, in order to ease the transition.

Acknowledgements

Jim Fulton, Ian Bicking, Phillip Eby, Rafael Villar Burke, and many people at Pycon and Distutils-SIG.

pep-0377 Allow __enter__() methods to skip the statement body

PEP:377
Title:Allow __enter__() methods to skip the statement body
Version:$Revision$
Last-Modified:$Date$
Author:Nick Coghlan <ncoghlan at gmail.com>
Status:Rejected
Type:Standards Track
Content-Type:text/x-rst
Created:8-Mar-2009
Python-Version:2.7, 3.1
Post-History:8-Mar-2009

Abstract

This PEP proposes a backwards compatible mechanism that allows __enter__() methods to skip the body of the associated with statement. The lack of this ability currently means the contextlib.contextmanager decorator is unable to fulfil its specification of being able to turn arbitrary code into a context manager by moving it into a generator function with a yield in the appropriate location. One symptom of this is that contextlib.nested will currently raise RuntimeError in situations where writing out the corresponding nested with statements would not [1].

The proposed change is to introduce a new flow control exception SkipStatement, and skip the execution of the with statement body if __enter__() raises this exception.

PEP Rejection

This PEP was rejected by Guido [4] as it imposes too great an increase in complexity without a proportional increase in expressiveness and correctness. In the absence of compelling use cases that need the more complex semantics proposed by this PEP the existing behaviour is considered acceptable.

Proposed Change

The semantics of the with statement will be changed to include a new try/except/else block around the call to __enter__(). If SkipStatement is raised by the __enter__() method, then the main section of the with statement (now located in the else clause) will not be executed. To avoid leaving the names in any as clause unbound in this case, a new StatementSkipped singleton (similar to the existing NotImplemented singleton) will be assigned to all names that appear in the as clause.

The components of the with statement remain as described in PEP 343 [2]:

with EXPR as VAR:
    BLOCK

After the modification, the with statement semantics would be as follows:

mgr = (EXPR)
exit = mgr.__exit__  # Not calling it yet
try:
    value = mgr.__enter__()
except SkipStatement:
    VAR = StatementSkipped
    # Only if "as VAR" is present and
    # VAR is a single name
    # If VAR is a tuple of names, then StatementSkipped
    # will be assigned to each name in the tuple
else:
    exc = True
    try:
        try:
            VAR = value  # Only if "as VAR" is present
            BLOCK
        except:
            # The exceptional case is handled here
            exc = False
            if not exit(*sys.exc_info()):
                raise
            # The exception is swallowed if exit() returns true
    finally:
        # The normal and non-local-goto cases are handled here
        if exc:
            exit(None, None, None)

With the above change in place for the with statement semantics, contextlib.contextmanager() will then be modified to raise SkipStatement instead of RuntimeError when the underlying generator doesn't yield.

Rationale for Change

Currently, some apparently innocuous context managers may raise RuntimeError when executed. This occurs when the context manager's __enter__() method encounters a situation where the written out version of the code corresponding to the context manager would skip the code that is now the body of the with statement. Since the __enter__() method has no mechanism available to signal this to the interpreter, it is instead forced to raise an exception that not only skips the body of the with statement, but also jumps over all code until the nearest exception handler. This goes against one of the design goals of the with statement, which was to be able to factor out arbitrary common exception handling code into a single context manager by putting it into a generator function and replacing the variant part of the code with a yield statement.

Specifically, the following examples behave differently if cmB().__enter__() raises an exception which cmA().__exit__() then handles and suppresses:

with cmA():
  with cmB():
    do_stuff()
# This will resume here without executing "do_stuff()"

@contextlib.contextmanager
def combined():
  with cmA():
    with cmB():
      yield

with combined():
  do_stuff()
# This will raise a RuntimeError complaining that the context
# manager's underlying generator didn't yield

with contextlib.nested(cmA(), cmB()):
  do_stuff()
# This will raise the same RuntimeError as the contextmanager()
# example (unsurprising, given that the nested() implementation
# uses contextmanager())

# The following class based version shows that the issue isn't
# specific to contextlib.contextmanager() (it also shows how
# much simpler it is to write context managers as generators
# instead of as classes!)
class CM(object):
  def __init__(self):
    self.cmA = None
    self.cmB = None

  def __enter__(self):
    if self.cmA is not None:
      raise RuntimeError("Can't re-use this CM")
    self.cmA = cmA()
    self.cmA.__enter__()
    try:
      self.cmB = cmB()
      self.cmB.__enter__()
    except:
      self.cmA.__exit__(*sys.exc_info())
      # Can't suppress in __enter__(), so must raise
      raise

  def __exit__(self, *args):
    suppress = False
    try:
      if self.cmB is not None:
        suppress = self.cmB.__exit__(*args)
    except:
      suppress = self.cmA.__exit__(*sys.exc_info())
      if not suppress:
        # Exception has changed, so reraise explicitly
        raise
    else:
      if suppress:
        # cmB already suppressed the exception,
        # so don't pass it to cmA
        suppress = self.cmA.__exit__(None, None, None)
      else:
        suppress = self.cmA.__exit__(*args)
    return suppress
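The RuntimeError symptom can be reproduced today with two stand-in context managers (SuppressingCM and FailingCM are hypothetical names playing the roles of cmA and cmB above):

```python
import contextlib


class SuppressingCM(object):
    # Plays the role of cmA: swallows any exception on exit
    def __enter__(self):
        return self

    def __exit__(self, *exc_info):
        return True


class FailingCM(object):
    # Plays the role of cmB: fails on entry
    def __enter__(self):
        raise ValueError("entry failed")

    def __exit__(self, *exc_info):
        return False


@contextlib.contextmanager
def combined():
    with SuppressingCM():
        with FailingCM():
            yield


# FailingCM's error is suppressed inside the generator, so the
# generator returns without yielding, and contextmanager() raises
# RuntimeError instead of silently skipping the with statement body.
try:
    with combined():
        print("body")  # never reached
except RuntimeError as exc:
    print("RuntimeError:", exc)
```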

With the proposed semantic change in place, the contextlib based examples above would then "just work", but the class based version would need a small adjustment to take advantage of the new semantics:

class CM(object):
  def __init__(self):
    self.cmA = None
    self.cmB = None

  def __enter__(self):
    if self.cmA is not None:
      raise RuntimeError("Can't re-use this CM")
    self.cmA = cmA()
    self.cmA.__enter__()
    try:
      self.cmB = cmB()
      self.cmB.__enter__()
    except:
      if self.cmA.__exit__(*sys.exc_info()):
        # Suppress the exception, but don't run
        # the body of the with statement either
        raise SkipStatement
      raise

  def __exit__(self, *args):
    suppress = False
    try:
      if self.cmB is not None:
        suppress = self.cmB.__exit__(*args)
    except:
      suppress = self.cmA.__exit__(*sys.exc_info())
      if not suppress:
        # Exception has changed, so reraise explicitly
        raise
    else:
      if suppress:
        # cmB already suppressed the exception,
        # so don't pass it to cmA
        suppress = self.cmA.__exit__(None, None, None)
      else:
        suppress = self.cmA.__exit__(*args)
    return suppress

There is currently a tentative suggestion [3] to add import-style syntax to the with statement to allow multiple context managers to be included in a single with statement without needing to use contextlib.nested. In that case the compiler has the option of simply emitting multiple with statements at the AST level, thus allowing the semantics of actual nested with statements to be reproduced accurately. However, such a change would highlight rather than alleviate the problem the current PEP aims to address: it would not be possible to use contextlib.contextmanager to reliably factor out such with statements, as they would exhibit exactly the same semantic differences as are seen with the combined() context manager in the above example.

Performance Impact

Implementing the new semantics makes it necessary to store the references to the __enter__ and __exit__ methods in temporary variables instead of on the stack. This results in a slight regression in with statement speed relative to Python 2.6/3.1. However, implementing a custom SETUP_WITH opcode would negate any differences between the two approaches (as well as dramatically improving speed by eliminating more than a dozen unnecessary trips around the eval loop).

Reference Implementation

Patch attached to Issue 5251 [1]. That patch uses only existing opcodes (i.e. no SETUP_WITH).

Acknowledgements

James William Pye both raised the issue and suggested the basic outline of the solution described in this PEP.

References

[1] Issue 5251: contextlib.nested inconsistent with nested with statements (http://bugs.python.org/issue5251)
[2]PEP 343: The "with" Statement (http://www.python.org/dev/peps/pep-0343/)
[3]Import-style syntax to reduce indentation of nested with statements (http://mail.python.org/pipermail/python-ideas/2009-March/003188.html)
[4]Guido's rejection of the PEP (http://mail.python.org/pipermail/python-dev/2009-March/087263.html)

pep-0378 Format Specifier for Thousands Separator

PEP:378
Title:Format Specifier for Thousands Separator
Version:$Revision$
Last-Modified:$Date$
Author:Raymond Hettinger <python at rcn.com>
Status:Final
Type:Standards Track
Content-Type:text/x-rst
Created:12-Mar-2009
Python-Version:2.7 and 3.1
Post-History:12-Mar-2009

Motivation

Provide a simple, non-locale aware way to format a number with a thousands separator.

Adding thousands separators is one of the simplest ways to humanize a program's output, improving its professional appearance and readability.

In the finance world, output with thousands separators is the norm. Finance users and non-professional programmers find the locale approach to be frustrating, arcane and non-obvious.

The locale module presents two other challenges. First, it is a global setting and not suitable for multi-threaded apps that need to serve up requests in multiple locales. Second, the name of a relevant locale (such as "de_DE") can vary from platform to platform or may not be defined at all. The docs for the locale module describe these and many other challenges [1] in detail.

It is not the goal to replace the locale module, to perform internationalization tasks, or accommodate every possible convention. Such tasks are better suited to robust tools like Babel [2]. Instead, the goal is to make a common, everyday task easier for many users.

Main Proposal (from Nick Coghlan, originally called Proposal I)

A comma will be added to the format() specifier mini-language:

[[fill]align][sign][#][0][width][,][.precision][type]

The ',' option indicates that commas should be included in the output as a thousands separator. As with locales which do not use a period as the decimal point, locales which use a different convention for digit separation will need to use the locale module to obtain appropriate formatting.

The proposal works well with floats, ints, and decimals. It also allows easy substitution for other separators. For example:

format(n, "6,d").replace(",", "_")

This technique is completely general but it is awkward in the one case where the commas and periods need to be swapped:

format(n, "6,f").replace(",", "X").replace(".", ",").replace("X", ".")
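Since this PEP was accepted, the ',' option is part of the language (Python 2.7 and 3.1 onward), so both substitution tricks can be checked directly; a minimal sketch:

```python
# Underscore as separator by substitution
n = 1234567
assert format(n, ",d").replace(",", "_") == "1_234_567"

# Swapping commas and periods for European-style output
x = 1234567.891
swapped = (format(x, ",.2f")
           .replace(",", "X")
           .replace(".", ",")
           .replace("X", "."))
assert swapped == "1.234.567,89"
```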

The width argument means the total length including the commas and decimal point:

format(1234, "08,d")     -->    '0001,234'
format(1234.5, "08,.1f") -->    '01,234.5'

The ',' option is defined as shown above for types 'd', 'e', 'f', 'g', 'E', 'G', '%', 'F' and ''. To allow future extensions, it is undefined for other types: binary, octal, hex, character, etc.

This proposal has the virtue of being simpler than the alternative proposal but is much less flexible and meets the needs of fewer users right out of the box. It is expected that some other solution will arise for specifying alternative separators.

Research into what Other Languages Do

Scanning the web, I've found that thousands separators are usually one of COMMA, DOT, SPACE, APOSTROPHE or UNDERSCORE.

C-Sharp [4] provides both styles (picture formatting and type specifiers). The type specifier approach is locale aware. The picture formatting only offers a COMMA as a thousands separator:

String.Format("{0:n}", 12400)     ==>    "12,400"
String.Format("{0:0,0}", 12400)   ==>    "12,400"

Common Lisp [5] uses a COLON before the ~D decimal type specifier to emit a COMMA as a thousands separator. The general form of ~D is ~mincol,padchar,commachar,commaintervalD. The padchar defaults to SPACE. The commachar defaults to COMMA. The commainterval defaults to three.

(format nil "~:D" 229345007)   =>   "229,345,007"

Visual Basic and its brethren (like MS Excel [7]) use a completely different style and have ultra-flexible custom format specifiers like:

"_($* #,##0_)".

COBOL [8] uses picture clauses like:

PICTURE $***,**9.99CR

Java offers a DecimalFormat class [9] that uses picture patterns (one for positive numbers and an optional one for negatives) such as: "#,##0.00;(#,##0.00)". It allows arbitrary groupings including hundreds and ten-thousands and uneven groupings. The special pattern characters are non-localized (using a DOT for a decimal separator and a COMMA for a grouping separator). The user can supply an alternate set of symbols using the formatter's DecimalFormatSymbols object.

Alternative Proposal (from Eric Smith, originally called Proposal II)

Make both the thousands separator and decimal separator user specifiable but not locale aware. For simplicity, limit the choices to a COMMA, DOT, SPACE, APOSTROPHE or UNDERSCORE. The SPACE can be either U+0020 or U+00A0.

Whenever a separator is followed by a precision, it is a decimal separator and an optional separator preceding it is a thousands separator. When the precision is absent, a lone specifier means a thousands separator:

[[fill]align][sign][#][0][width][tsep][dsep precision][type]

Examples:

format(1234, "8.1f")     -->    '  1234.0'
format(1234, "8,1f")     -->    '  1234,0'
format(1234, "8.,1f")    -->    ' 1.234,0'
format(1234, "8 ,f")     -->    ' 1 234,0'
format(1234, "8d")       -->    '    1234'
format(1234, "8,d")      -->    '   1,234'
format(1234, "8_d")      -->    '   1_234'

This proposal meets most needs, but it comes at the expense of taking a bit more effort to parse. Not every possible convention is covered, but at least one of the options (spaces or underscores) should be readable, understandable, and useful to folks from many diverse backgrounds.

As shown in the examples, the width argument means the total length including the thousands separators and decimal separators.

No change is proposed for the locale module.

The thousands separator is defined as shown above for types 'd', 'e', 'f', 'g', '%', 'E', 'G' and 'F'. To allow future extensions, it is undefined for other types: binary, octal, hex, character, etc.

The drawback to this alternative proposal is the difficulty of mentally parsing whether a single separator is a thousands separator or decimal separator. Perhaps it is too arcane to link the decimal separator with the precision specifier.

Commentary

  • Some commenters do not like the idea of format strings at all and find them to be unreadable. Suggested alternatives include the COBOL style PICTURE approach or a convenience function with keyword arguments for every possible combination.
  • Some newsgroup respondents think there is no place for any scripts that are not internationalized and that it is a step backwards to provide a simple way to hardwire a particular choice (thus reducing incentive to use a locale sensitive approach).
  • Another thought is that embedding some particular convention in individual format strings makes it hard to change that convention later. No workable alternative was suggested but the general idea is to set the convention once and have it apply everywhere (others commented that locale already provides a way to do this).
  • There are some precedents for grouping digits in the fractional part of a floating point number, but this PEP does not venture into that territory. Only digits to the left of the decimal point are grouped. This does not preclude future extensions; it just focuses on a single, generally useful extension to the formatting language.
  • James Knight observed that Indian/Pakistani numbering systems group by hundreds. Ben Finney noted that Chinese group by ten-thousands. Eric Smith pointed out that these are already handled by the "n" specifier in the locale module (albeit only for integers). This PEP does not attempt to support all of those possibilities. It focuses on a single, relatively common grouping convention that offers a quick way to improve readability in many (though not all) contexts.

pep-0379 Adding an Assignment Expression

PEP: 379
Title: Adding an Assignment Expression
Version: $Revision$
Last-Modified: $Date$
Author: Jervis Whitley <jervisau at gmail.com>
Status: Withdrawn
Type: Standards Track
Content-Type: text/plain
Created: 14-Mar-2009
Python-Version: 2.7, 3.2
Post-History: 

Abstract

    This PEP adds a new assignment expression to the Python language
    to make it possible to assign the result of an expression in
    almost any place.  The new expression will allow the assignment of
    the result of an expression at first use (in a comparison for
    example).


Motivation and Summary

   Issue1714448 "if something as x:" [1] describes a feature to allow
   assignment of the result of an expression in an if statement to a
   name.  It supposed that the 'as' syntax could be borrowed for this
   purpose.  Many times it is not the expression itself that is
   interesting, rather one of the terms that make up the
   expression. To be clear, something like this:
   
       if (f_result() == [1, 2, 3]) as res:

   seems awfully limited, when this:

       if (f_result() as res) == [1, 2, 3]:

   is probably the desired result. 


Use Cases

    See the Examples section near the end.


Specification

    A new expression is proposed with the (nominal) syntax:

        EXPR -> VAR

    This single expression does the following:

    - Evaluate the value of EXPR, an arbitrary expression;
    - Assign the result to VAR, a single assignment target; and
    - Leave the result of EXPR on the Top of Stack (TOS)
    
    Here '->' or (RARROW) has been used to illustrate the concept that
    the result of EXPR is assigned to VAR.

    The translation of the proposed syntax is:

        VAR = (EXPR)
        (EXPR)

    The assignment target can be either an attribute, a subscript or
    name:

        f() -> name[0]      # where 'name' exists previously.

        f() -> name.attr    # again 'name' exists prior to this expression.

        f() -> name

    This expression should be available anywhere that an expression is
    currently accepted.

    All exceptions that are currently raised during invalid
    assignments will continue to be raised when using the assignment
    expression.  For example, a NameError will be raised in examples
    1 and 2 above if 'name' is not previously defined, and an
    IndexError if index 0 is out of range.


Examples from the Standard Library

    The following two examples were chosen after a brief search
    through the standard library, specifically both are from ast.py
    which happened to be open at the time of the search.

    Original:

        def walk(node):
            from collections import deque
            todo = deque([node])
            while todo:
                node = todo.popleft()
                todo.extend(iter_child_nodes(node))
                yield node

    Using assignment expression:

        def walk(node):
            from collections import deque
            todo = deque([node])
            while todo:
                todo.extend(iter_child_nodes(todo.popleft() -> node))
                yield node

    Original:

        def get_docstring(node, clean=True):
            if not isinstance(node, (FunctionDef, ClassDef, Module)):
                raise TypeError("%r can't have docstrings" 
                                    % node.__class__.__name__)
            if node.body and isinstance(node.body[0], Expr) and \
               isinstance(node.body[0].value, Str):
                if clean:
                    import inspect
                    return inspect.cleandoc(node.body[0].value.s)
                return node.body[0].value.s

    Using assignment expression:

        def get_docstring(node, clean=True):
            if not isinstance(node, (FunctionDef, ClassDef, Module)):
                raise TypeError("%r can't have docstrings" 
                                    % node.__class__.__name__)
            if node.body -> body and isinstance(body[0] -> elem, Expr) and \
               isinstance(elem.value -> value, Str):
                if clean:
                    import inspect
                    return inspect.cleandoc(value.s)
                return value.s


Examples

    The examples shown below highlight some of the desirable features
    of the assignment expression, and some of the possible corner
    cases.

    1. Assignment in an if statement for use later.

        def expensive():
            import time; time.sleep(1)
            return 'spam'

        if expensive() -> res in ('spam', 'eggs'):
            dosomething(res)

    2. Assignment in a while loop clause.

        while len(expensive() -> res) == 4:
            dosomething(res)

    3. Keep the iterator object from the for loop.

        for ch in expensive() -> res:
            sell_on_internet(res)

    4. Corner case.

        for ch -> please_dont in expensive():
            pass
        # who would want to do this? Not I. 
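As a historical footnote: this PEP was withdrawn, but Python 3.8 later gained assignment expressions through PEP 572, whose := operator covers the first two use cases above (binding the name before the comparison rather than mid-expression). A sketch:

```python
def expensive():
    # Stand-in for the expensive() helper in the examples above
    return 'spam'

# Use case 1: assignment in an if statement for use later
if (res := expensive()) in ('spam', 'eggs'):
    assert res == 'spam'

# Use case 2: assignment in a while loop clause
values = iter(['spam', 'eggs', 'no'])
while len(res := next(values)) == 4:
    pass
assert res == 'no'   # first value shorter than 4 stops the loop
```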


References

    [1] Issue1714448 "if something as x:", k0wax 
        http://bugs.python.org/issue1714448


Copyright

    This document has been placed in the public domain.



pep-0380 Syntax for Delegating to a Subgenerator

PEP:380
Title:Syntax for Delegating to a Subgenerator
Version:$Revision$
Last-Modified:$Date$
Author:Gregory Ewing <greg.ewing at canterbury.ac.nz>
Status:Final
Type:Standards Track
Content-Type:text/x-rst
Created:13-Feb-2009
Python-Version:3.3
Post-History:
Resolution:http://mail.python.org/pipermail/python-dev/2011-June/112010.html

Abstract

A syntax is proposed for a generator to delegate part of its operations to another generator. This allows a section of code containing 'yield' to be factored out and placed in another generator. Additionally, the subgenerator is allowed to return with a value, and the value is made available to the delegating generator.

The new syntax also opens up some opportunities for optimisation when one generator re-yields values produced by another.

PEP Acceptance

Guido officially accepted the PEP [1] on 26 June 2011.

Motivation

A Python generator is a form of coroutine, but has the limitation that it can only yield to its immediate caller. This means that a piece of code containing a yield cannot be factored out and put into a separate function in the same way as other code. Performing such a factoring causes the called function to itself become a generator, and it is necessary to explicitly iterate over this second generator and re-yield any values that it produces.

If yielding of values is the only concern, this can be performed without much difficulty using a loop such as

for v in g:
    yield v

However, if the subgenerator is to interact properly with the caller in the case of calls to send(), throw() and close(), things become considerably more difficult. As will be seen later, the necessary code is very complicated, and it is tricky to handle all the corner cases correctly.

A new syntax will be proposed to address this issue. In the simplest use cases, it will be equivalent to the above for-loop, but it will also handle the full range of generator behaviour, and allow generator code to be refactored in a simple and straightforward way.

Proposal

The following new expression syntax will be allowed in the body of a generator:

yield from <expr>

where <expr> is an expression evaluating to an iterable, from which an iterator is extracted. The iterator is run to exhaustion, during which time it yields and receives values directly to or from the caller of the generator containing the yield from expression (the "delegating generator").

Furthermore, when the iterator is another generator, the subgenerator is allowed to execute a return statement with a value, and that value becomes the value of the yield from expression.

The full semantics of the yield from expression can be described in terms of the generator protocol as follows:

  • Any values that the iterator yields are passed directly to the caller.
  • Any values sent to the delegating generator using send() are passed directly to the iterator. If the sent value is None, the iterator's __next__() method is called. If the sent value is not None, the iterator's send() method is called. If the call raises StopIteration, the delegating generator is resumed. Any other exception is propagated to the delegating generator.
  • Exceptions other than GeneratorExit thrown into the delegating generator are passed to the throw() method of the iterator. If the call raises StopIteration, the delegating generator is resumed. Any other exception is propagated to the delegating generator.
  • If a GeneratorExit exception is thrown into the delegating generator, or the close() method of the delegating generator is called, then the close() method of the iterator is called if it has one. If this call results in an exception, it is propagated to the delegating generator. Otherwise, GeneratorExit is raised in the delegating generator.
  • The value of the yield from expression is the first argument to the StopIteration exception raised by the iterator when it terminates.
  • return expr in a generator causes StopIteration(expr) to be raised upon exit from the generator.
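These rules can be exercised directly on any Python 3.3+ interpreter; a minimal sketch:

```python
def sub():
    # Subgenerator: yields two values, then returns one,
    # which becomes StopIteration('done') under the hood
    yield 1
    yield 2
    return 'done'

def delegating():
    # The return value of sub() becomes the value of the
    # yield from expression in the delegating generator
    result = yield from sub()
    yield result

assert list(delegating()) == [1, 2, 'done']
```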

Enhancements to StopIteration

For convenience, the StopIteration exception will be given a value attribute that holds its first argument, or None if there are no arguments.
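The new attribute behaves as follows (minimal sketch, runnable on Python 3.3+):

```python
# With no arguments, value is None
exc = StopIteration()
assert exc.value is None

# The first argument becomes the value
exc = StopIteration('result')
assert exc.value == 'result'

# Exhausting a plain iterator raises StopIteration with value None
try:
    next(iter([]))
except StopIteration as exc:
    assert exc.value is None
```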

Formal Semantics

Python 3 syntax is used in this section.

  1. The statement

    RESULT = yield from EXPR
    

is semantically equivalent to

_i = iter(EXPR)
try:
    _y = next(_i)
except StopIteration as _e:
    _r = _e.value
else:
    while 1:
        try:
            _s = yield _y
        except GeneratorExit as _e:
            try:
                _m = _i.close
            except AttributeError:
                pass
            else:
                _m()
            raise _e
        except BaseException as _e:
            _x = sys.exc_info()
            try:
                _m = _i.throw
            except AttributeError:
                raise _e
            else:
                try:
                    _y = _m(*_x)
                except StopIteration as _e:
                    _r = _e.value
                    break
        else:
            try:
                if _s is None:
                    _y = next(_i)
                else:
                    _y = _i.send(_s)
            except StopIteration as _e:
                _r = _e.value
                break
RESULT = _r
  2. In a generator, the statement

    return value
    

is semantically equivalent to

raise StopIteration(value)

except that, as currently, the exception cannot be caught by except clauses within the returning generator.

  3. The StopIteration exception behaves as though defined thusly:

    class StopIteration(Exception):
    
        def __init__(self, *args):
            if len(args) > 0:
                self.value = args[0]
            else:
                self.value = None
            Exception.__init__(self, *args)
    

Rationale

The Refactoring Principle

The rationale behind most of the semantics presented above stems from the desire to be able to refactor generator code. It should be possible to take a section of code containing one or more yield expressions, move it into a separate function (using the usual techniques to deal with references to variables in the surrounding scope, etc.), and call the new function using a yield from expression.

The behaviour of the resulting compound generator should be, as far as reasonably practicable, the same as the original unfactored generator in all situations, including calls to __next__(), send(), throw() and close().

The semantics in cases of subiterators other than generators have been chosen as a reasonable generalization of the generator case.

The proposed semantics have the following limitations with regard to refactoring:

  • A block of code that catches GeneratorExit without subsequently re-raising it cannot be factored out while retaining exactly the same behaviour.
  • Factored code may not behave the same way as unfactored code if a StopIteration exception is thrown into the delegating generator.

With use cases for these being rare to non-existent, it was not considered worth the extra complexity required to support them.

Finalization

There was some debate as to whether explicitly finalizing the delegating generator by calling its close() method while it is suspended at a yield from should also finalize the subiterator. An argument against doing so is that it would result in premature finalization of the subiterator if references to it exist elsewhere.

Consideration of non-refcounting Python implementations led to the decision that this explicit finalization should be performed, so that explicitly closing a factored generator has the same effect as doing so to an unfactored one in all Python implementations.

The assumption made is that, in the majority of use cases, the subiterator will not be shared. The rare case of a shared subiterator can be accommodated by means of a wrapper that blocks throw() and close() calls, or by using a means other than yield from to call the subiterator.

Generators as Threads

A motivation for generators being able to return values concerns the use of generators to implement lightweight threads. When using generators in that way, it is reasonable to want to spread the computation performed by the lightweight thread over many functions. One would like to be able to call a subgenerator as though it were an ordinary function, passing it parameters and receiving a returned value.

Using the proposed syntax, a statement such as

y = f(x)

where f is an ordinary function, can be transformed into a delegation call

y = yield from g(x)

where g is a generator. One can reason about the behaviour of the resulting code by thinking of g as an ordinary function that can be suspended using a yield statement.

When using generators as threads in this way, typically one is not interested in the values being passed in or out of the yields. However, there are use cases for this as well, where the thread is seen as a producer or consumer of items. The yield from expression allows the logic of the thread to be spread over as many functions as desired, with the production or consumption of items occurring in any subfunction, and the items are automatically routed to or from their ultimate source or destination.

Concerning throw() and close(), it is reasonable to expect that if an exception is thrown into the thread from outside, it should first be raised in the innermost generator where the thread is suspended, and propagate outwards from there; and that if the thread is terminated from outside by calling close(), the chain of active generators should be finalised from the innermost outwards.

Syntax

The particular syntax proposed has been chosen as suggestive of its meaning, while not introducing any new keywords and clearly standing out as being different from a plain yield.

Optimisations

Using a specialised syntax opens up possibilities for optimisation when there is a long chain of generators. Such chains can arise, for instance, when recursively traversing a tree structure. The overhead of passing __next__() calls and yielded values down and up the chain can cause what ought to be an O(n) operation to become, in the worst case, O(n**2).

A possible strategy is to add a slot to generator objects to hold a generator being delegated to. When a __next__() or send() call is made on the generator, this slot is checked first, and if it is nonempty, the generator that it references is resumed instead. If it raises StopIteration, the slot is cleared and the main generator is resumed.

This would reduce the delegation overhead to a chain of C function calls involving no Python code execution. A possible enhancement would be to traverse the whole chain of generators in a loop and directly resume the one at the end, although the handling of StopIteration is then more complicated.

Use of StopIteration to return values

There are a variety of ways that the return value from the generator could be passed back. Some alternatives include storing it as an attribute of the generator-iterator object, or returning it as the value of the close() call to the subgenerator. However, the proposed mechanism is attractive for a couple of reasons:

  • Using a generalization of the StopIteration exception makes it easy for other kinds of iterators to participate in the protocol without having to grow an extra attribute or a close() method.
  • It simplifies the implementation, because the point at which the return value from the subgenerator becomes available is the same point at which the exception is raised. Delaying until any later time would require storing the return value somewhere.
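A minimal sketch of the mechanism: the subgenerator's return statement raises StopIteration carrying the value, which any driver of the iterator can inspect via the value attribute this PEP adds:

```python
def g():
    yield 1
    return 42          # equivalent to raising StopIteration(42)

it = g()
first = next(it)       # -> 1
try:
    next(it)
    returned = None
except StopIteration as exc:
    returned = exc.value   # the return value rides on the exception
# returned is now 42
```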

Rejected Ideas

Some ideas were discussed but rejected.

Suggestion: There should be some way to prevent the initial call to __next__(), or substitute it with a send() call with a specified value, the intention being to support the use of generators wrapped so that the initial __next__() is performed automatically.

Resolution: Outside the scope of the proposal. Such generators should not be used with yield from.

Suggestion: If closing a subiterator raises StopIteration with a value, return that value from the close() call to the delegating generator.

The motivation for this feature is so that the end of a stream of values being sent to a generator can be signalled by closing the generator. The generator would catch GeneratorExit, finish its computation and return a result, which would then become the return value of the close() call.

Resolution: This usage of close() and GeneratorExit would be incompatible with their current role as a bail-out and clean-up mechanism. It would require that when closing a delegating generator, after the subgenerator is closed, the delegating generator be resumed instead of re-raising GeneratorExit. But this is not acceptable, because it would fail to ensure that the delegating generator is finalised properly in the case where close() is being called for cleanup purposes.

Signalling the end of values to a consumer is better addressed by other means, such as sending in a sentinel value or throwing in an exception agreed upon by the producer and consumer. The consumer can then detect the sentinel or exception and respond by finishing its computation and returning normally. Such a scheme behaves correctly in the presence of delegation.
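The exception-based variant of this scheme can be sketched as follows (EndOfStream is an illustrative name for the exception agreed upon by producer and consumer); note that the thrown exception reaches the innermost generator even through a delegating one:

```python
class EndOfStream(Exception):
    # Agreed-upon signal between producer and consumer (illustrative name).
    pass

def consumer():
    items = []
    try:
        while True:
            items.append((yield))
    except EndOfStream:
        return items          # finish the computation and return normally

def delegating(results):
    # The sentinel exception passes through the delegation untouched.
    results.append((yield from consumer()))

results = []
d = delegating(results)
next(d)                       # prime the chain
d.send("a")
d.send("b")
try:
    d.throw(EndOfStream)      # raised in the innermost generator first
except StopIteration:
    pass
# results == [["a", "b"]]
```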

Suggestion: If close() is not to return a value, then raise an exception if StopIteration with a non-None value occurs.

Resolution: No clear reason to do so. Ignoring a return value is not considered an error anywhere else in Python.

Criticisms

Under this proposal, the value of a yield from expression would be derived in a very different way from that of an ordinary yield expression. This suggests that some other syntax not containing the word yield might be more appropriate, but no acceptable alternative has so far been proposed. Rejected alternatives include call, delegate and gcall.

It has been suggested that some mechanism other than return in the subgenerator should be used to establish the value returned by the yield from expression. However, this would interfere with the goal of being able to think of the subgenerator as a suspendable function, since it would not be able to return values in the same way as other functions.

The use of an exception to pass the return value has been criticised as an "abuse of exceptions", without any concrete justification of this claim. In any case, this is only one suggested implementation; another mechanism could be used without losing any essential features of the proposal.

It has been suggested that a different exception, such as GeneratorReturn, should be used instead of StopIteration to return a value. However, no convincing practical reason for this has been put forward, and the addition of a value attribute to StopIteration mitigates any difficulties in extracting a return value from a StopIteration exception that may or may not have one. Also, using a different exception would mean that, unlike ordinary functions, 'return' without a value in a generator would not be equivalent to 'return None'.

Alternative Proposals

Proposals along similar lines have been made before, some using the syntax yield * instead of yield from. While yield * is more concise, it could be argued that it looks too similar to an ordinary yield and the difference might be overlooked when reading code.

To the author's knowledge, previous proposals have focused only on yielding values, and thereby suffered from the criticism that the two-line for-loop they replace is not sufficiently tiresome to write to justify a new syntax. By dealing with the full generator protocol, this proposal provides considerably more benefit.

Additional Material

Some examples of the use of the proposed syntax are available, and also a prototype implementation based on the first optimisation outlined above.

Examples and Implementation [2]

A version of the implementation updated for Python 3.3 is available from tracker issue #11682 [3]

pep-0381 Mirroring infrastructure for PyPI

PEP:381
Title:Mirroring infrastructure for PyPI
Version:$Revision$
Last-Modified:$Date$
Author:Tarek Ziadé <tarek at ziade.org>, Martin v. Löwis <martin at v.loewis.de>
Status:Draft
Type:Standards Track
Content-Type:text/x-rst
Created:21-March-2009
Python-Version:N.A.
Post-History:

Abstract

This PEP describes a mirroring infrastructure for PyPI.

Rationale

PyPI is hosting over 6000 projects and is used on a daily basis by people to build applications. In particular, systems like easy_install and zc.buildout make intensive use of PyPI.

For people making intensive use of PyPI, it can act as a single point of failure. People have started to set up some mirrors, both private and public. Those mirrors are active mirrors, which means that they are browsing PyPI to get synced.

In order to make the system more reliable, this PEP describes:

  • the mirror listing and registering at PyPI
  • the pages a public mirror should maintain. These pages will be used by PyPI, in order to get hit counts and the last modified date.
  • how a mirror should synchronize with PyPI
  • how a client can implement a fail-over mechanism

Mirror listing and registering

People who want to mirror PyPI make a proposal on catalog-SIG. When a mirror is proposed on the mailing list, it is manually added to a mirror list in the PyPI application after it has been checked to be compliant with the mirroring rules.

The mirror list is provided as a list of host names of the form

X.pypi.python.org

The values of X are the sequence a,b,c,...,aa,ab,... a.pypi.python.org is the master server; the mirrors start with b. A CNAME record last.pypi.python.org points to the last host name. Mirror operators should use a static address, and report planned changes to that address in advance to distutils-sig.
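The host-name sequence can be generated with a short spreadsheet-style base-26 enumeration; this is only an illustrative sketch of the naming scheme, not part of the specification:

```python
from itertools import count, islice
from string import ascii_lowercase

def mirror_prefixes():
    # Yields a, b, ..., z, aa, ab, ... (spreadsheet-style base-26 naming).
    for n in count(1):
        s, k = "", n
        while k:
            k, r = divmod(k - 1, 26)
            s = ascii_lowercase[r] + s
        yield s

prefixes = list(islice(mirror_prefixes(), 28))
hosts = [p + ".pypi.python.org" for p in prefixes]
# hosts[0] is the master a.pypi.python.org; mirrors start at hosts[1]
```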

The new mirror also appears at http://pypi.python.org/mirrors which is a human-readable page that gives the list of mirrors. This page also explains how to register a new mirror.

Statistics page

PyPI provides statistics on downloads at /stats. This page is calculated daily by PyPI, by reading all mirrors' local stats and summing them.

The stats are presented in daily or monthly files, under /stats/days and /stats/months. Each file is a bzip2 file with these formats:

  • YYYY-MM-DD.bz2 for daily files
  • YYYY-MM.bz2 for monthly files

Examples:

  • /stats/days/2008-11-06.bz2
  • /stats/days/2008-11-07.bz2
  • /stats/days/2008-11-08.bz2
  • /stats/months/2008-11.bz2
  • /stats/months/2008-10.bz2

Mirror Authenticity

With a distributed mirroring system, clients may want to verify that the mirrored copies are authentic. There are multiple threats to consider:

  1. the central index may get compromised
  2. the central index is assumed to be trusted, but the mirrors might be tampered with.
  3. a man in the middle between the central index and the end user, or between a mirror and the end user might tamper with datagrams.

This specification only deals with the second threat. Some provisions are made to detect man-in-the-middle attacks. To detect the first attack, package authors need to sign their packages using PGP keys, so that users can verify that the package comes from the author they trust.

The central index provides a DSA key at the URL /serverkey, in the PEM format as generated by "openssl dsa -pubout" (i.e. RFC 3280 SubjectPublicKeyInfo, with the algorithm 1.3.14.3.2.12). This URL must not be mirrored, and clients must fetch the official serverkey from PyPI directly, or use the copy that came with the PyPI client software. Mirrors should still download the key, to detect a key rollover.

For each package, a mirrored signature is provided at /serversig/<package>. This is the DSA signature of the parallel URL /simple/<package>, in DER form, using SHA-1 with DSA (i.e. as a RFC 3279 Dsa-Sig-Value, created by algorithm 1.2.840.10040.4.3)

Clients using a mirror need to perform the following steps to verify a package:

  1. download the /simple page, and compute its SHA-1 hash
  2. compute the DSA signature of that hash
  3. download the corresponding /serversig, and compare it (byte-for-byte) with the value computed in step 2.
  4. compute and verify (against the /simple page) the MD-5 hashes of all files they download from the mirror.

An implementation of the verification algorithm is available from https://svn.python.org/packages/trunk/pypi/tools/verify.py

Verification is not needed when downloading from the central index, and should be avoided there to reduce the computation overhead.

About once a year, the key will be replaced with a new one. Mirrors will have to re-fetch all /serversig pages. Clients using mirrors need to find a trusted copy of the new server key. One way to obtain one is to download it from https://pypi.python.org/serverkey. To detect man-in-the-middle attacks, clients need to verify the SSL server certificate, which will be signed by the CACert authority.

Special pages a mirror needs to provide

A mirror is a subset copy of PyPI, so it provides the same structure by copying it.

  • simple: rest version of the package index
  • packages: packages, stored by Python version, and letters
  • serversig: signatures for the simple pages

It also needs to provide two specific elements:

  • last-modified
  • local-stats

Last modified date

CPAN uses a freshness date system where the mirror's last synchronisation date is made available.

For PyPI, each mirror needs to maintain a URL with simple text content that represents the last synchronisation date the mirror maintains.

The date is provided in GMT time, using the ISO 8601 format [3]. Each mirror is responsible for maintaining its own last modified date.

This page must be located at /last-modified and must be a text/plain page.
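A mirror might produce the page's content along these lines (the exact ISO 8601 rendering, including the trailing Z, and the function name are assumptions for illustration):

```python
from datetime import datetime, timezone

def last_modified_text():
    # ISO 8601 timestamp in GMT/UTC, e.g. "2008-11-06T10:28:54Z".
    now = datetime.now(timezone.utc)
    return now.strftime("%Y-%m-%dT%H:%M:%SZ")
```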

Local statistics

Each mirror is responsible for counting all the downloads that were made via it. This is used by PyPI to sum up all downloads, to be able to display the grand total.

These statistics are in CSV-like form, with a header in the first line. It needs to obey PEP 305 [1]. Basically, it should be readable by Python's csv module.

The fields in this file are:

  • package: the distutils id of the package.
  • filename: the filename that has been downloaded.
  • useragent: the User-Agent of the client that has downloaded the package.
  • count: the number of downloads.

The content will look like this:

# package,filename,useragent,count
zc.buildout,zc.buildout-1.6.0.tgz,MyAgent,142
...
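Such a file can be read back with Python's csv module as required; this is an illustrative sketch (read_local_stats is a hypothetical helper, not part of the specification):

```python
import csv
import io

sample = """# package,filename,useragent,count
zc.buildout,zc.buildout-1.6.0.tgz,MyAgent,142
"""

def read_local_stats(text):
    # The header line starts with '#'; strip the marker before handing
    # the field names to csv.DictReader.
    lines = text.splitlines()
    fields = lines[0].lstrip("# ").split(",")
    reader = csv.DictReader(io.StringIO("\n".join(lines[1:])), fieldnames=fields)
    return list(reader)

rows = read_local_stats(sample)
# rows[0]["package"] == "zc.buildout", rows[0]["count"] == "142"
```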

The counting starts the day the mirror is launched, and there is one file per day, compressed using the bzip2 format. Each file is named like the day. For example 2008-11-06.bz2 is the file for the 6th of November 2008.

They are then provided in a folder called days. For example:

  • /local-stats/days/2008-11-06.bz2
  • /local-stats/days/2008-11-07.bz2
  • /local-stats/days/2008-11-08.bz2

This page must be located at /local-stats.

How a mirror should synchronize with PyPI

A mirroring protocol called Simple Index was described and implemented by Martin v. Loewis and Jim Fulton, based on how easy_install works. This section synthesizes it and gives a few relevant links, plus a small part about User-Agent.

The mirroring protocol

Mirrors must reduce the amount of data transferred between the central server and the mirror. To achieve that, they MUST use the changelog() PyPI XML-RPC call, and only refetch the packages that have been changed since the last time. For each package P, they MUST copy documents /simple/P/ and /serversig/P. If a package is deleted on the central server, they MUST delete the package and all associated files. To detect modification of package files, they MAY cache the file's ETag, and MAY request skipping it using the If-None-Match header.

Each mirroring tool MUST identify itself using a descriptive User-Agent header.

The pep381client package [2] provides an application that respects this protocol to browse PyPI.

User-agent request header

In order to be able to differentiate actions taken by clients over PyPI, a specific user agent name should be provided by all mirroring software.

This is also true for all clients like:

XXX user agent registering mechanism at PyPI ?

How a client can use PyPI and its mirrors

Clients that are browsing PyPI should be able to use alternative mirrors, by getting the list of the mirrors using last.pypi.python.org.

Code example:

>>> import socket
>>> socket.gethostbyname_ex('last.pypi.python.org')[0]
'h.pypi.python.org'

The clients so far that could use this mechanism:

  • setuptools
  • zc.buildout (through setuptools)
  • pip

Fail-over mechanism

Clients that are browsing PyPI should be able to use a fail-over mechanism when PyPI or the used mirror is not responding.

It is up to the client to decide which mirror should be used, perhaps by looking at its geographical location and its responsiveness.

This PEP does not describe how this fail-over mechanism should work, but it is strongly encouraged that the clients try to use the nearest mirror.

The clients so far that could use this mechanism:

  • setuptools
  • zc.buildout (through setuptools)
  • pip

Extra package indexes

It is obvious that some packages will not be uploaded to PyPI, either because they are private or because the project maintainer runs their own server where people can get the project's packages. However, it is strongly encouraged that a public package index follow the PyPI and Distutils protocols.

In other words, the register and upload command should be compatible with any package index server out there.

Software that is compatible with PyPI and Distutils so far:

  • PloneSoftwareCenter [7], which is used to run the plone.org products section.
  • EggBasket [8].

An extra package index is not a mirror of PyPI, but can have some mirrors itself.

Merging several indexes

When a client needs to get some packages from several distinct indexes, it should be able to use each one of them as a potential source of packages. The indexes should be defined as an ordered list that the client walks through when looking for a package.

Each independent index can of course provide a list of its mirrors.

XXX define how to get the hostname for the mirrors of an arbitrary index.

That permits all combinations at the client level, for a reliable packaging system with all levels of privacy.

It is up to the client to deal with the merging.

Acknowledgments

Georg Brandl.

pep-0382 Namespace Packages

PEP:382
Title:Namespace Packages
Version:$Revision$
Last-Modified:$Date$
Author:Martin v. Löwis <martin at v.loewis.de>
Status:Rejected
Type:Standards Track
Content-Type:text/x-rst
Created:02-Apr-2009
Python-Version:3.2
Post-History:

Rejection Notice

On the first day of sprints at US PyCon 2012 we had a long and fruitful discussion about PEP 382 and PEP 402. We ended up rejecting both but a new PEP will be written to carry on in the spirit of PEP 402. Martin von Löwis wrote up a summary: [2].

Abstract

Namespace packages are a mechanism for splitting a single Python package across multiple directories on disk. In current Python versions, an algorithm to compute the package's __path__ must be formulated. With the enhancement proposed here, the import machinery itself will construct the list of directories that make up the package. An implementation of this PEP is available at [1].

Terminology

Within this PEP, the term package refers to Python packages as defined by Python's import statement. The term distribution refers to separately installable sets of Python modules as stored in the Python package index, and installed by distutils or setuptools. The term vendor package refers to groups of files installed by an operating system's packaging mechanism (e.g. Debian or Redhat packages install on Linux systems).

The term portion refers to a set of files in a single directory (possibly stored in a zip file) that contribute to a namespace package.

Namespace packages today

Python currently provides pkgutil.extend_path to denote a package as a namespace package. The recommended way of using it is to put:

from pkgutil import extend_path
__path__ = extend_path(__path__, __name__)

in the package's __init__.py. Every distribution needs to provide the same contents in its __init__.py, so that extend_path is invoked independently of which portion of the package gets imported first. As a consequence, the package's __init__.py cannot practically define any names, as it depends on the order of the package fragments on sys.path which portion is imported first. As a special feature, extend_path reads files named <packagename>.pkg which allow declaring additional portions.

setuptools provides a similar function pkg_resources.declare_namespace that is used in the form:

import pkg_resources
pkg_resources.declare_namespace(__name__)

In the portion's __init__.py, no assignment to __path__ is necessary, as declare_namespace modifies the package __path__ through sys.modules. As a special feature, declare_namespace also supports zip files, and registers the package name internally so that future additions to sys.path by setuptools can properly add additional portions to each package.

setuptools allows declaring namespace packages in a distribution's setup.py, so that distribution developers don't need to put the magic __path__ modification into __init__.py themselves.

Rationale

The current imperative approach to namespace packages has led to multiple slightly-incompatible mechanisms for providing namespace packages. For example, pkgutil supports *.pkg files; setuptools doesn't. Likewise, setuptools supports inspecting zip files, and supports adding portions to its _namespace_packages variable, whereas pkgutil doesn't.

In addition, the current approach causes problems for system vendors. Vendor packages typically must not provide overlapping files, and an attempt to install a vendor package that has a file already on disk will fail or cause unpredictable behavior. As vendors might choose to package distributions such that they will all end up in a single directory for the namespace package, all portions would contribute conflicting __init__.py files.

Specification

Rather than using an imperative mechanism for importing packages, a declarative approach is proposed here: A directory whose name ends with .pyp (for Python package) contains a portion of a package.

The import statement is extended so that it computes the package's __path__ attribute for a package named P as consisting of optionally a single directory named P containing a file __init__.py, plus all directories named P.pyp, in the order in which they are found in the parent package's __path__ (or sys.path). If either of these is found, the search for additional portions of the package continues.

A directory may contain both a package in the P/__init__.py and the P.pyp form.

No other change to the importing mechanism is made; searching for modules (including __init__.py) will continue to stop at the first module encountered. In summary, the process of importing a package foo works like this:

  1. sys.path is searched for directories foo or foo.pyp, or a file foo.<ext>. If a file is found and no directory, it is treated as a module, and imported.
  2. If a directory foo is found, a check is made whether it contains __init__.py. If so, the location of the __init__.py is remembered. Otherwise, the directory is skipped. Once an __init__.py is found, further directories called foo are skipped.
  3. For both directories foo and foo.pyp, the directories are added to the package's __path__.
  4. If an __init__ module was found, it is imported, with __path__ being initialized to the path computed from all .pyp directories.
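The steps above can be sketched in pure Python (compute_path is a hypothetical helper; this is a simplified illustration of the proposed algorithm, not the actual import machinery):

```python
import os
import tempfile

def compute_path(name, search_path):
    # Illustrative sketch: collect an optional <name>/ directory containing
    # __init__.py plus every <name>.pyp/ directory, in search-path order.
    path, init_found = [], False
    for entry in search_path:
        plain = os.path.join(entry, name)
        if (not init_found and os.path.isdir(plain)
                and os.path.isfile(os.path.join(plain, "__init__.py"))):
            path.append(plain)
            init_found = True   # further plain directories are skipped
        portion = os.path.join(entry, name + ".pyp")
        if os.path.isdir(portion):
            path.append(portion)
    return path

# Demo with two temporary sys.path-like entries.
base1 = tempfile.mkdtemp()
base2 = tempfile.mkdtemp()
os.makedirs(os.path.join(base1, "foo"))
open(os.path.join(base1, "foo", "__init__.py"), "w").close()
os.makedirs(os.path.join(base2, "foo.pyp"))
pkg_path = compute_path("foo", [base1, base2])
# pkg_path lists base1/foo followed by base2/foo.pyp
```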

Impact on Import Hooks

Both loaders and finders as defined in PEP 302 will need to be changed to support namespace packages. Failure to conform to the protocol below might cause a package not to be recognized as a namespace package; loaders and finders not supporting this protocol must raise AttributeError when the functions below get accessed.

Finders need to support looking for *.pyp directories in step 1 of the above algorithm. To do so, a finder used as a path hook must support a method:

finder.find_package_portion(fullname)

This method will be called in the same manner as find_module, and it must return a string to be added to the package's __path__. If the finder doesn't find a portion of the package, it shall return None. Raising AttributeError from the above call will be treated as non-conformance with this PEP, and the exception will be ignored. All other exceptions are reported.

A finder may report both success from find_module and from find_package_portion, allowing for both a package containing an __init__.py and a portion of the same package.

All strings returned from find_package_portion, along with all path names of .pyp directories are added to the new package's __path__.

Discussion

Original versions of this specification proposed the addition of *.pth files, similar to the way those files are used on sys.path. With a wildcard marker (*), a package could indicate that the entire path is derived by looking at the parent path, searching for properly-named subdirectories.

People then observed that the support for the full .pth syntax is inappropriate, and the .pth files were changed to be mere marker files, indicating that a directory is a package. Peter Tröger suggested that .pth is an unsuitable file extension, as all file extensions related to Python should start with .py. Therefore, the marker file was renamed to be .pyp.

Dinu Gherman then observed that using a marker file is not necessary, and that a directory extension could well serve as a marker itself. This is what this PEP currently proposes.

Phillip Eby designed PEP 402 as an alternative approach to this PEP, after comparing Python's package syntax with that found in other languages. PEP 402 proposes not to use a marker file at all. At the discussion at PyCon DE 2011, people remarked that having an explicit declaration of a directory as contributing to a package is a desirable property, rather than an obstacle. In particular, Jython developers noticed that Jython could easily mistake a directory that is a Java package as being a Python package, if there is no need to declare Python packages.

Packages can stop filling out the namespace package's __init__.py. As a consequence, extend_path and declare_namespace become obsolete.

Namespace packages can start providing non-trivial __init__.py implementations; to do so, it is recommended that a single distribution provides a portion with just the namespace package's __init__.py (and potentially other modules that belong to the namespace package proper).

The mechanism is mostly compatible with the existing namespace mechanisms. extend_path will be adjusted to this specification; any other mechanism might cause portions to get added twice to __path__.

pep-0383 Non-decodable Bytes in System Character Interfaces

PEP:383
Title:Non-decodable Bytes in System Character Interfaces
Version:$Revision$
Last-Modified:$Date$
Author:Martin v. Löwis <martin at v.loewis.de>
Status:Final
Type:Standards Track
Content-Type:text/x-rst
Created:22-Apr-2009
Python-Version:3.1
Post-History:

Abstract

File names, environment variables, and command line arguments are defined as being character data in POSIX; the C APIs however allow passing arbitrary bytes - whether these conform to a certain encoding or not. This PEP proposes a means of dealing with such irregularities by embedding the bytes in character strings in such a way that allows recreation of the original byte string.

Rationale

The C char type is a data type that is commonly used to represent both character data and bytes. Certain POSIX interfaces are specified and widely understood as operating on character data, however, the system call interfaces make no assumption on the encoding of these data, and pass them on as-is. With Python 3, character strings use a Unicode-based internal representation, making it difficult to ignore the encoding of byte strings in the same way that the C interfaces can ignore the encoding.

On the other hand, Microsoft Windows NT has corrected the original design limitation of Unix, and made it explicit in its system interfaces that these data (file names, environment variables, command line arguments) are indeed character data, by providing a Unicode-based API (keeping a C-char-based one for backwards compatibility).

For Python 3, one proposed solution is to provide two sets of APIs: a byte-oriented one, and a character-oriented one, where the character-oriented one would be limited to not being able to represent all data accurately. Unfortunately, for Windows, the situation would be exactly the opposite: the byte-oriented interface cannot represent all data; only the character-oriented API can. As a consequence, libraries and applications that want to support all user data in a cross-platform manner have to accept a mish-mash of bytes and characters exactly in the way that caused endless troubles for Python 2.x.

With this PEP, a uniform treatment of these data as characters becomes possible. The uniformity is achieved by using specific encoding algorithms, meaning that the data can be converted back to bytes on POSIX systems only if the same encoding is used.

Being able to treat such strings uniformly will allow application writers to abstract from details specific to the operating system, and reduces the risk of one API failing when the other API would have worked.

Specification

On Windows, Python uses the wide character APIs to access character-oriented APIs, allowing direct conversion of the environmental data to Python str objects ([1]).

On POSIX systems, Python currently applies the locale's encoding to convert the byte data to Unicode, failing for characters that cannot be decoded. With this PEP, non-decodable bytes >= 128 will be represented as lone surrogate codes U+DC80..U+DCFF. Bytes below 128 will produce exceptions; see the discussion below.

To convert non-decodable bytes, a new error handler ([2]) "surrogateescape" is introduced, which produces these surrogates. On encoding, the error handler converts the surrogate back to the corresponding byte. This error handler will be used in any API that receives or produces file names, command line arguments, or environment variables.
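A minimal round trip through the handler illustrates the mechanism (the byte values are arbitrary examples):

```python
# A Latin-1 byte string whose last byte is invalid as UTF-8.
raw = b"caf\xe9"
text = raw.decode("utf-8", "surrogateescape")
# The undecodable byte 0xE9 becomes the lone surrogate U+DCE9.
assert text == "caf\udce9"
back = text.encode("utf-8", "surrogateescape")
# Encoding with the same error handler recovers the original bytes exactly.
assert back == raw
```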

The error handler interface is extended to allow the encode error handler to return byte strings immediately, in addition to returning Unicode strings which then get encoded again (also see the discussion below).

Byte-oriented interfaces that already exist in Python 3.0 are not affected by this specification. They are neither enhanced nor deprecated.

External libraries that operate on file names (such as GUI file choosers) should also encode them according to the PEP.

Discussion

This surrogateescape encoding is based on Markus Kuhn's idea that he called UTF-8b [3].

While providing a uniform API to non-decodable bytes, this interface has the limitation that the chosen representation only "works" if the data get converted back to bytes with the surrogateescape error handler as well. Encoding the data with the locale's encoding and the (default) strict error handler will raise an exception; encoding them with UTF-8 will produce nonsensical data.

Data obtained from other sources may conflict with data produced by this PEP. Dealing with such conflicts is out of scope of the PEP.

This PEP allows the possibility of "smuggling" bytes in character strings. This would be a security risk if the bytes are security-critical when interpreted as characters on a target system, such as path name separators. For this reason, the PEP rejects smuggling bytes below 128. If the target system uses EBCDIC, such smuggled bytes may still be a security risk, allowing smuggling of e.g. square brackets or the backslash. Python currently does not support EBCDIC, so this should not be a problem in practice. Anybody porting Python to an EBCDIC system might want to adjust the error handlers, or come up with other approaches to address the security risks.

Encodings that are not compatible with ASCII are not supported by this specification; bytes in the ASCII range that fail to decode will cause an exception. It is widely agreed that such encodings should not be used as locale charsets.

For most applications, we assume that they eventually pass data received from a system interface back into the same system interfaces. For example, an application invoking os.listdir() will likely pass the result strings back into APIs like os.stat() or open(), which then encodes them back into their original byte representation. Applications that need to process the original byte strings can obtain them by encoding the character strings with the file system encoding, passing "surrogateescape" as the error handler name. For example, a function that works like os.listdir, except for accepting and returning bytes, would be written as:

import os, sys

def listdir_b(dirname):
    fse = sys.getfilesystemencoding()
    dirname = dirname.decode(fse, "surrogateescape")
    for fn in os.listdir(dirname):
        # fn is now a str object
        yield fn.encode(fse, "surrogateescape")

The extension to the encode error handler interface proposed by this PEP is necessary to implement the 'surrogateescape' error handler, because some required byte sequences cannot be generated from any replacement Unicode. However, the encode error handler interface presently requires replacement Unicode to be provided in lieu of the non-encodable Unicode from the source string, which it then promptly encodes. For some error handlers, such as the 'surrogateescape' handler proposed here, it is simpler and more efficient for the error handler to provide a pre-encoded replacement byte string, rather than forcing it to calculate Unicode from which the encoder would create the desired bytes.
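The round-trip behaviour described above can be observed directly in Python (a minimal illustration; the byte value 0xE9 is arbitrary):

```python
# 0xE9 is not valid UTF-8 here, so strict decoding would fail.
data = b"caf\xe9"

# surrogateescape maps the undecodable byte 0xE9 to the lone surrogate U+DCE9.
s = data.decode("utf-8", "surrogateescape")
assert s == "caf\udce9"

# Encoding with the same handler restores the original bytes exactly.
assert s.encode("utf-8", "surrogateescape") == data

# As noted above, the (default) strict handler refuses to encode the result.
try:
    s.encode("utf-8")
except UnicodeEncodeError:
    pass
else:
    raise AssertionError("expected UnicodeEncodeError")
```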

A few alternative approaches have been proposed:

  • create a new string subclass that supports embedded bytes
  • use different escape schemes, such as escaping with a NUL character, or mapping to infrequent characters.

Of these proposals, the approach of escaping each byte XX with the sequence U+0000 U+00XX has the disadvantage that encoding to UTF-8 will introduce a NUL byte in the UTF-8 sequence. As a consequence, C libraries may interpret this as a string termination, even though the string continues. In particular, the gtk libraries will truncate text in this case; other libraries may show similar problems.

pep-0384 Defining a Stable ABI

PEP:384
Title:Defining a Stable ABI
Version:$Revision$
Last-Modified:$Date$
Author:Martin v. Löwis <martin at v.loewis.de>
Status:Final
Type:Standards Track
Content-Type:text/x-rst
Created:17-May-2009
Python-Version:3.2
Post-History:

Abstract

Currently, each feature release introduces a new name for the Python DLL on Windows, and may cause incompatibilities for extension modules on Unix. This PEP proposes to define a stable set of API functions which are guaranteed to be available for the lifetime of Python 3, and which will also remain binary-compatible across versions. Extension modules and applications embedding Python can work with different feature releases as long as they restrict themselves to this stable ABI.

Rationale

The primary source of ABI incompatibility is changes to the layout of in-memory structures. For example, the way in which string interning works, and the data type used to represent the size of an object, have changed during the life of Python 2.x. As a consequence, extension modules that access fields of strings, lists, or tuples directly would break if their code is loaded into a newer version of the interpreter without recompilation: offsets of fields may have changed, making the extension modules access the wrong data.

In some cases, the incompatibilities only affect internal objects of the interpreter, such as frame or code objects. For example, the way line numbers are represented has changed in the 2.x lifetime, as has the way in which local variables are stored (due to the introduction of closures). Even though most applications probably never used these objects, changing them required changing the PYTHON_API_VERSION.

On Linux, changes to the ABI are often not much of a problem: the system will provide a default Python installation, and many extension modules are already provided pre-compiled for that version. If additional modules are needed, or additional Python versions, users can typically compile them themselves on the system, resulting in modules that use the right ABI.

On Windows, multiple simultaneous installations of different Python versions are common, and extension modules are compiled by their authors, not by end users. To reduce the risk of ABI incompatibilities, Python currently introduces a new DLL name pythonXY.dll for each feature release, whether or not ABI incompatibilities actually exist.

With this PEP, it will be possible to reduce the dependency of binary extension modules on a specific Python feature release, and applications embedding Python can be made to work with different releases.

Specification

The ABI specification falls into two parts: an API specification, specifying which functions (or groups of functions) are available for use with the ABI, and a linkage specification, specifying which libraries to link with. The actual ABI (layout of structures in memory, function calling conventions) is not specified, but implied by the compiler; a specific ABI is merely recommended for selected platforms.

During evolution of Python, new ABI functions will be added. Applications using them will then have a requirement on a minimum version of Python; this PEP provides no mechanism for such applications to fall back when the Python library is too old.

Terminology

Applications and extension modules that want to use this ABI are collectively referred to as "applications" from here on.

Header Files and Preprocessor Definitions

Applications shall only include the header file Python.h (before including any system headers), or, optionally, include pyconfig.h, and then Python.h.

During the compilation of applications, the preprocessor macro Py_LIMITED_API must be defined. Doing so will hide all definitions that are not part of the ABI.

Structures

Only the following structures and structure fields are accessible to applications:

  • PyObject (ob_refcnt, ob_type)
  • PyVarObject (ob_base, ob_size)
  • PyMethodDef (ml_name, ml_meth, ml_flags, ml_doc)
  • PyMemberDef (name, type, offset, flags, doc)
  • PyGetSetDef (name, get, set, doc, closure)
  • PyModuleDefBase (ob_base, m_init, m_index, m_copy)
  • PyModuleDef (m_base, m_name, m_doc, m_size, m_methods, m_traverse, m_clear, m_free)
  • PyStructSequence_Field (name, doc)
  • PyStructSequence_Desc (name, doc, fields, sequence)
  • PyType_Slot (see below)
  • PyType_Spec (see below)

The accessor macros to these fields (Py_REFCNT, Py_TYPE, Py_SIZE) are also available to applications.

The following types are available, but opaque (i.e. incomplete):

  • PyThreadState
  • PyInterpreterState
  • struct _frame
  • struct symtable
  • struct _node
  • PyWeakReference
  • PyLongObject
  • PyTypeObject

Type Objects

The structure of type objects is not available to applications; declaration of "static" type objects is no longer possible (for applications using this ABI). Instead, type objects are created dynamically. To allow easy creation of types (in particular, to be able to fill out function pointers easily), the following structures and functions are available:

typedef struct{
  int slot;    /* slot id, see below */
  void *pfunc; /* function pointer */
} PyType_Slot;

typedef struct{
  const char* name;
  const char* doc;
  int basicsize;
  int itemsize;
  int flags;
  PyType_Slot *slots; /* terminated by slot==0. */
} PyType_Spec;

PyObject* PyType_FromSpec(PyType_Spec*);

To specify a slot, a unique slot id must be provided. New Python versions may introduce new slot ids, but slot ids will never be recycled. Slots may get deprecated, but continue to be supported throughout Python 3.x.

The slot ids are named like the field names of the structures that hold the pointers in Python 3.1, with an added Py_ prefix (i.e. Py_tp_dealloc instead of just tp_dealloc):

  • tp_dealloc, tp_getattr, tp_setattr, tp_repr, tp_hash, tp_call, tp_str, tp_getattro, tp_setattro, tp_doc, tp_traverse, tp_clear, tp_richcompare, tp_iter, tp_iternext, tp_methods, tp_base, tp_descr_get, tp_descr_set, tp_init, tp_alloc, tp_new, tp_is_gc, tp_bases, tp_del
  • nb_add nb_subtract nb_multiply nb_remainder nb_divmod nb_power nb_negative nb_positive nb_absolute nb_bool nb_invert nb_lshift nb_rshift nb_and nb_xor nb_or nb_int nb_float nb_inplace_add nb_inplace_subtract nb_inplace_multiply nb_inplace_remainder nb_inplace_power nb_inplace_lshift nb_inplace_rshift nb_inplace_and nb_inplace_xor nb_inplace_or nb_floor_divide nb_true_divide nb_inplace_floor_divide nb_inplace_true_divide nb_index
  • sq_length sq_concat sq_repeat sq_item sq_ass_item sq_contains sq_inplace_concat sq_inplace_repeat
  • mp_length mp_subscript mp_ass_subscript

The following fields cannot be set during type definition:

  • tp_dict, tp_mro, tp_cache, tp_subclasses, tp_weaklist, tp_print
  • tp_weaklistoffset, tp_dictoffset

typedefs

In addition to the typedefs for structs listed above, the following typedefs are available. Their inclusion in the ABI means that the underlying type must not change on a platform (even though it may differ across platforms).

  • Py_uintptr_t Py_intptr_t Py_ssize_t
  • unaryfunc binaryfunc ternaryfunc inquiry lenfunc ssizeargfunc ssizessizeargfunc ssizeobjargproc ssizessizeobjargproc objobjargproc objobjproc visitproc traverseproc destructor getattrfunc getattrofunc setattrfunc setattrofunc reprfunc hashfunc richcmpfunc getiterfunc iternextfunc descrgetfunc descrsetfunc initproc newfunc allocfunc
  • PyCFunction PyCFunctionWithKeywords PyNoArgsFunction PyCapsule_Destructor
  • getter setter
  • PyOS_sighandler_t
  • PyGILState_STATE
  • Py_UCS4

Most notably, Py_UNICODE is not available as a typedef, since the same Python version may use different definitions of it on the same platform (depending on whether it uses narrow or wide code units). Applications that need to access the contents of a Unicode string can convert it to wchar_t.

Functions and function-like Macros

All functions starting with _Py are not available to applications (see exceptions below). Also, all functions that expect parameter types that are unavailable to applications are excluded from the ABI, such as PyAST_FromNode (which expects a node*).

All other functions are available, unless excluded below.

Function-like macros (in particular, field access macros) remain available to applications, but get replaced by function calls (unless their definition only refers to features of the ABI, such as the various _Check macros).

ABI function declarations will not change their parameters or return types. If a change to the signature becomes necessary, a new function will be introduced. If the new function is source-compatible (e.g. if just the return type changes), an alias macro may get added to redirect calls to the new function when the application is recompiled.

If continued provision of the old function is not possible, it may get deprecated, then removed, in accordance with PEP 7, causing applications that use that function to break.

Excluded Functions

Functions declared in the following header files are not part of the ABI:

  • bytes_methods.h
  • cellobject.h
  • classobject.h
  • code.h
  • compile.h
  • datetime.h
  • dtoa.h
  • frameobject.h
  • funcobject.h
  • genobject.h
  • longintrepr.h
  • parsetok.h
  • pyarena.h
  • pyatomic.h
  • pyctype.h
  • pydebug.h
  • pytime.h
  • symtable.h
  • token.h
  • ucnhash.h

In addition, functions expecting FILE* are not part of the ABI, to avoid depending on a specific version of the Microsoft C runtime DLL on Windows.

Module and type initializer functions are not available (PyByteArray_Init, PyByteArray_Fini, PyBytes_Fini, PyCFunction_Fini, PyDict_Fini, PyFloat_ClearFreeList, PyFloat_Fini, PyFrame_Fini, PyList_Fini, PyMethod_Fini, PyOS_FiniInterrupts, PySet_Fini, PyTuple_Fini).

Several functions dealing with interpreter implementation details are not available:

  • PyInterpreterState_Head, PyInterpreterState_Next, PyInterpreterState_ThreadHead, PyThreadState_Next
  • Py_SubversionRevision, Py_SubversionShortBranch

PyStructSequence_InitType is not available, as it requires the caller to provide a static type object.

Py_FatalError will be moved from pydebug.h into some other header file (e.g. pyerrors.h).

The exact list of functions being available is given in the Windows module definition file for python3.dll [1].

Global Variables

Global variables representing types and exceptions are available to applications. In addition, selected global variables referenced in macros (such as Py_True and Py_False) are available.

A complete list of global variable definitions is given in the python3.def file [1]; those declared DATA denote variables.

Other Macros

All macros defining symbolic constants are available to applications; the numeric values will not change.

In addition, the following macros are available:

  • Py_BEGIN_ALLOW_THREADS, Py_BLOCK_THREADS, Py_UNBLOCK_THREADS, Py_END_ALLOW_THREADS

The Buffer Interface

The buffer interface (type Py_buffer, type slots bf_getbuffer and bf_releasebuffer, etc) has been omitted from the ABI, since the stability of the Py_buffer structure is not clear at this time. Inclusion in the ABI can be considered in future releases.

Signature Changes

A number of functions currently expect a specific struct, even though callers typically have PyObject* available. These have been changed to expect PyObject* as the parameter; this will cause warnings in applications that currently explicitly cast to the parameter type. These functions are PySlice_GetIndices, PySlice_GetIndicesEx, PyUnicode_AsWideChar, and PyEval_EvalCode.

Linkage

On Windows, applications shall link with python3.dll; an import library python3.lib will be available. This DLL will redirect all of its API functions through /export linker options to the full interpreter DLL, i.e. python3y.dll.

On Unix systems, the ABI is typically provided by the python executable itself. PyModule_Create is changed to pass 3 as the API version if the extension module was compiled with Py_LIMITED_API; the version check for the API version will accept either 3 or the current PYTHON_API_VERSION as conforming. If Python is compiled as a shared library, it is installed as both libpython3.so, and libpython3.y.so; applications conforming to this PEP should then link to the former (extension modules can continue to link with no libpython shared object, but rather rely on runtime linking). The ABI version is symbolically available as PYTHON_ABI_VERSION.

Also on Unix, the PEP 3149 tag abi<PYTHON_ABI_VERSION> is accepted in file names of extension modules. No checking is performed that files named in this way are actually restricted to the limited API, and no support for building such files will be added to distutils due to the distutils code freeze.

Implementation Strategy

This PEP will be implemented in a branch [2], allowing users to check whether their modules conform to the ABI. To avoid users having to rewrite their type definitions, a script to convert C source code containing type definitions will be provided [3].

pep-0385 Migrating from Subversion to Mercurial

PEP:385
Title:Migrating from Subversion to Mercurial
Version:$Revision$
Last-Modified:$Date$
Author:Dirkjan Ochtman <dirkjan at ochtman.nl>, Antoine Pitrou <solipsis at pitrou.net>, Georg Brandl <georg at python.org>
Status:Final
Type:Process
Content-Type:text/x-rst
Created:25-May-2009

Motivation

After having decided to switch to the Mercurial DVCS, the actual migration still has to be performed. In the case of an important piece of infrastructure like the version control system for a large, distributed project like Python, this is a significant effort. This PEP is an attempt to describe the steps that must be taken for further discussion. It's somewhat similar to PEP 347 [3], which discussed the migration to SVN.

To make the most of hg, we would like to make a high-fidelity conversion, such that (a) as much of the svn metadata as possible is retained, and (b) all metadata is converted to formats that are common in Mercurial. This way, tools written for Mercurial can be optimally used. In order to do this, we want to use the hgsubversion [4] software to do an initial conversion. This hg extension is focused on providing high-quality conversion from Subversion to Mercurial for use in two-way correspondence, meaning it doesn't throw away as much available metadata as other solutions.

Such a conversion also seems like a good time to reconsider the contents of the repository and determine if some things are still valuable. In this spirit, the following sections also propose discarding some of the older metadata.

Timeline

The current schedule for conversion milestones:

  • 2011-02-24: availability of a test repo at hg.python.org

    Test commits will be allowed (and encouraged) from all committers to the Subversion repository. The test repository and all test commits will be removed once the final conversion is done. The server-side hooks will be installed for the test repository, in order to test buildbot, diff-email and whitespace checking integration.

  • 2011-03-05: final conversion (tentative)

    Commits to the Subversion branches now maintained in Mercurial will be blocked. Developers should refrain from pushing to the Mercurial repositories until all infrastructure is ensured to work after their switch over to the new repository.

Transition plan

Branch strategy

Mercurial has two basic ways of using branches: cloned branches, where each branch is kept in a separate repository, and named branches, where each revision carries metadata noting to which branch it belongs. The former makes it easier to distinguish branches, at the expense of requiring more disk space on the client. The latter makes it a little easier to switch between branches, but all branch names become a persistent part of history. [1]

Differences between named branches and cloned branches:

  • Tags in a different (maintenance) clone aren't available in the local clone
  • Clones with named branches will be larger, since they contain more data

We propose to use named branches for release branches and adopt cloned branches for feature branches.

History management

In order to minimize the loss of information due to the conversion, we propose to provide several repositories as a conversion result:

  • A repository trimmed to the mainline trunk (and py3k), as well as past and present maintenance branches -- this is called the "working" repo and is where development continues. This repository has all the history needed for development work, including annotating source files with changes back up to 1990 and other common history-digging operations.

    The default branch in that repo is what is known as py3k in Subversion, while the Subversion trunk lives on with the branch name legacy-trunk; however in Mercurial this branch will be closed. Release branches are named after their major.minor version, e.g. 3.2.

  • A repository with the full, unedited conversion of the Subversion repository (actually, its /python subdirectory) -- this is called the "historic" or "archive" repo and will be offered as a read-only resource. [2]

  • One more repository per active feature branch; "active" means that at least one core developer asks for the branch to be provided. Each such repository will contain both the feature branch and all ancestor changesets from mainline (coming from trunk and/or py3k in SVN).

Since all branches are present in the historic repo, they can later be extracted as separate repositories at any time should it prove to be necessary.

The final revision map between SVN revision numbers, Mercurial changesets and SVN branch names will be made available in a file stored in the Misc directory. Its format is as follows:

[...]
88483 e65daae6cf4499a0863cb7645109a4798c28d83e issue10276-snowleopard
88484 835cb57abffeceaff0d85c2a3aa0625458dd3e31 py3k
88485 d880f9d8492f597a030772c7485a34aadb6c4ece release32-maint
88486 0c431b8c22f5dbeb591414c154acb7890c1809df py3k
88487 82cda1f21396bbd10db8083ea20146d296cb630b release32-maint
88488 8174d00d07972d6f109ed57efca8273a4d59302c release27-maint
[...]
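A line of this map can be split back into its three fields with a few lines of Python (parse_revmap_line is a hypothetical helper for illustration, not part of the migration tools):

```python
def parse_revmap_line(line):
    """Split one revision-map line into (svn_rev, hg_node, svn_branch)."""
    svn_rev, hg_node, branch = line.split()
    return int(svn_rev), hg_node, branch

# One of the sample lines shown above:
rec = parse_revmap_line("88484 835cb57abffeceaff0d85c2a3aa0625458dd3e31 py3k")
assert rec == (88484, "835cb57abffeceaff0d85c2a3aa0625458dd3e31", "py3k")
```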

Converting tags

The SVN tags directory contains a lot of old stuff. Some of these are not, in fact, full tags, but contain only a smaller subset of the repository. All release tags will be kept; other tags will be included based on requests from the developer community. We propose to make the tag naming scheme consistent, in this style: v3.2.1a2.
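As a sketch, the proposed consistent style could be checked with a regular expression; the exact pattern below is an assumption extrapolated from the single example v3.2.1a2, not a scheme the PEP itself specifies:

```python
import re

# Assumed shape: "v" + major.minor[.micro] + optional pre-release suffix.
TAG_RE = re.compile(r"^v\d+\.\d+(\.\d+)?([abc]|rc)?\d*$")

assert TAG_RE.match("v3.2.1a2")   # the example from the text
assert TAG_RE.match("v3.2")       # plain release tags also fit
assert not TAG_RE.match("r262")   # old-style SVN tag names do not
```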

Author map

In order to provide user names in the form common in hg ('First Last <user@example.org>'), we need an author map that maps cvs and svn user names to real names and email addresses. We have a complete version of such a map in the migration tools repository (not publicly accessible, to avoid leaking addresses to harvesters). The email addresses in it may be out of date; that is bound to happen, although it would be nice to have as many people as possible review it for stale addresses. The current version also still seems to contain some encoding problems.

Generating .hgignore

The .hgignore file can be used in Mercurial repositories to help ignore files that are not eligible for version control. It does this by employing several possible forms of pattern matching. The current Python repository already includes a rudimentary .hgignore file to help with using the hg mirrors.

Since the current Python repository already includes a .hgignore file (for use with hg mirrors), we'll just use that. Generating full history of the file was debated but deemed impractical (because it's relatively hard with fairly little gain, since ignoring is less important for older revisions).

Repository size

A bare conversion result of the current Python repository weighs in at 1.9 GB; although this is smaller than the Subversion repository (2.7 GB), it is still too large to be practical.

The size becomes more manageable through the trimming applied to the working repository, and through a process called "revlog reordering" that optimizes the layout of Mercurial's internal storage.

After all optimizations are done, the size of the working repository is around 180 MB on disk. The amount of data transferred over the network when cloning is estimated to be around 80 MB.

Other repositories

There are a number of other projects hosted in svn.python.org's "projects" repository. The "peps" directory will be converted along with the main Python one. Richard Tew has indicated that he'd like the Stackless repository to also be converted. What other projects in the svn.python.org repository should be converted?

There's now an initial stab at converting the Jython repository. The current tip of hgsubversion unfortunately fails at some point. Pending investigation.

Other repositories that would like to be converted to Mercurial can announce themselves to me after the main Python migration is done, and I'll take care of their needs.

Infrastructure

hg-ssh

Developers should access the repositories through ssh, similar to the current setup. Public keys can be used to grant people access to a shared hg@ account. A hgwebdir instance also has been set up at hg.python.org for easy browsing and read-only access. It is configured so that developers can trivially start new clones (for longer-term features that profit from development in a separate repository).

Also, direct creation of public repositories is allowed for core developers, although it is not yet decided which naming scheme will be enforced:

$ hg init ssh://hg@hg.python.org/sandbox/mywork
repo created, public URL is http://hg.python.org/sandbox/mywork

Hooks

A number of hooks are currently in use. The hg equivalents for these should be developed and deployed. The following hooks are being used:

  • check whitespace: a hook to reject commits in case the whitespace doesn't match the rules for the Python codebase. In a changegroup, only the tip is checked (this allows cleanup commits for changes pulled from third-party repos). We can also offer a whitespace hook for use with client-side repositories that people can use; it could either warn about whitespace issues and/or truncate trailing whitespace from changed lines.
  • push mails: Emails will include diffs for each changeset pushed to the public repository, including the username which pushed the changesets (this is not necessarily the same as the author recorded in the changesets).
  • buildbots: the python.org build master will be notified of each changeset pushed to the cpython repository, and will trigger an appropriate build on every build slave for the branch in which the changeset occurs.

The hooks repository [5] contains ports of these server-side hooks to Mercurial, as well as a couple additional ones:

  • check branch heads: a hook to reject pushes which create a new head on an existing branch. The pusher then has to merge the excess heads and try pushing again.
  • check branches: a hook to reject all changesets not on an allowed named branch. This hook's whitelist will have to be updated when we want to create new maintenance branches.
  • check line endings: a hook, based on the eol extension [6], to reject all changesets committing files with the wrong line endings. The commits then have to be stripped and redone, possibly with the eol extension [6] enabled on the committer's computer.

One additional hook could be beneficial:

  • check contributors: in the current setup, all changesets bear the username of committers, who must have signed the contributor agreement. If we keep a list of registered contributors, we might want to use a hook to check that the committer is among them. The hook could then warn users who push a group of revisions containing changesets from unknown contributors.

End-of-line conversions

Discussion about the lack of end-of-line conversion support in Mercurial, which was provided initially by the win32text extension [7], led to the development of the new eol extension [6] that supports a versioned management of line-ending conventions on a file-by-file basis, akin to Subversion's svn:eol-style properties. This information is kept in a versioned file called .hgeol, and such a file has already been checked into the Subversion repository.

A hook also exists on the server side to reject any changeset introducing inconsistent newline data (see above).

hgwebdir

A more or less stock hgwebdir installation should be set up. We might want to come up with a style to match the Python website.

A small WSGI application has been written that can look up Subversion revisions and redirect to the appropriate hgweb page for the given changeset, regardless in which repository the converted revision ended up (since one big Subversion repository is converted into several Mercurial repositories). It can also look up Mercurial changesets by their hexadecimal ID.

roundup

By pointing Roundup to the URL of the lookup script mentioned above, links to SVN revisions will continue to work, and links to Mercurial changesets can be created as well, without having to give repository and changeset ID.

After migration

Where to get code

After migration, the hgwebdir will live at hg.python.org. This is an accepted standard for many organizations, and an easy parallel to svn.python.org. The working repo might live at http://hg.python.org/cpython/, for example, with the archive repo at http://hg.python.org/cpython-archive/. For write access, developers will have to use ssh, which could be ssh://hg@hg.python.org/cpython/.

code.python.org was also proposed as the hostname. We think that using the VCS name in the hostname is good because it prevents confusion: it should be clear that you can't use svn or bzr for hg.python.org.

hgwebdir can already provide tarballs for every changeset. This obviates the need for daily snapshots; we can just point users to tip.tar.gz instead, meaning they will get the latest. If desired, we could even use buildbot results to point to the last good changeset.

Python-specific documentation

hg comes with good built-in documentation (available through hg help) and a wiki [10] that's full of useful information and recipes, not to mention a popular book [11] (readable online).

In addition to that, the recently overhauled Python Developer's Guide [8] already has a branch with instructions for Mercurial instead of Subversion; an online build of this branch [9] is also available.

Proposed workflow

We propose two workflows for the migration of patches between several branches.

For migration within 2.x or 3.x branches, we propose a patch always gets committed to the oldest branch where it applies first. Then, the resulting changeset can be merged using hg merge to all newer branches within that series (2.x or 3.x). If it does not apply as-is to the newer branch, hg revert can be used to easily revert to the new-branch-native head, patch in some alternative version of the patch (or none, if it's not applicable), then commit the merge. The premise here is that all changesets from an older branch within the series are eventually merged to all newer branches within the series.

The upshot is that this provides for the most painless merging procedure. This means that in the general case, people have to think about the oldest branch to which the patch should be applied before actually applying it. Usually, that is one of only two branches: the latest maintenance branch and the trunk, except for security fixes applicable to older branches in security-fix-only mode.

For merging bug fixes from the 3.x to the 2.7 maintenance branch (2.6 and 2.5 are in security-fix-only mode and their maintenance will continue in the Subversion repository), changesets should be transplanted (not merged) in some other way. The transplant extension, import/export and bundle/unbundle work equally well here.

Choosing this approach allows 3.x not to carry all of the 2.x history-since-it-was-branched, meaning the clone is not as big and the merges not as complicated.

The future of Subversion

What happens to the Subversion repositories after the migration? Since the svn server contains a bunch of repositories, not just the CPython one, it will probably live on for a bit as not every project may want to migrate or it takes longer for other projects to migrate. To prevent people from staying behind, we may want to move migrated projects from the repository to a new, read-only repository with a new name.

Build identification

Python currently provides the sys.subversion tuple to allow Python code to find out exactly what version of Python it's running against. The current version looks something like this:

  • ('CPython', 'tags/r262', '71600')
  • ('CPython', 'trunk', '73128M')

Another value is returned from Py_GetBuildInfo() in the C API, and available to Python code as part of sys.version:

  • 'r262:71600, Jun 2 2009, 09:58:33'
  • 'trunk:73128M, Jun 2 2009, 01:24:14'

I propose that the revision identifier will be the short version of hg's revision hash, for example 'dd3ebf81af43', augmented with '+' (instead of 'M') if the working directory from which it was built was modified. This mirrors the output of the hg id command, which is intended for this kind of usage. The sys.subversion value will also be renamed to sys.mercurial to reflect the change in VCS.

For the tag/branch identifier, I propose that hg check for tags on the currently checked out revision, use the tag if there is one ('tip' doesn't count), and use the branch name otherwise. sys.subversion becomes

  • ('CPython', 'v2.6.2', 'dd3ebf81af43')
  • ('CPython', 'default', 'af694c6a888c+')

and the build info string becomes

  • 'v2.6.2:dd3ebf81af43, Jun 2 2009, 09:58:33'
  • 'default:af694c6a888c+, Jun 2 2009, 01:24:14'

This reflects that the default branch in hg is called 'default' instead of Subversion's 'trunk', and reflects the proposed new tag format.
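A minimal sketch of the proposed formatting, using the example values from the lists above (build_identifier is an illustrative helper, not actual CPython build code):

```python
def build_identifier(branch_or_tag, short_hash, modified=False):
    """Format the proposed sys.mercurial-style tuple and build-info prefix.

    The '+' suffix replaces Subversion's 'M' marker for a modified
    working directory, mirroring the output of `hg id`.
    """
    rev = short_hash + ("+" if modified else "")
    return ("CPython", branch_or_tag, rev), "%s:%s" % (branch_or_tag, rev)

# The examples given in the text:
assert build_identifier("default", "af694c6a888c", modified=True) == \
    (("CPython", "default", "af694c6a888c+"), "default:af694c6a888c+")
```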

Mercurial also makes it possible to find out the latest tag and the number of changesets separating the current changeset from that tag, allowing for a descriptive version string:

$ hg parent --template "{latesttag}+{latesttagdistance}-{node|short}\n"
v3.2+37-4b5d0d260e72
$ hg up 2.7
3316 files updated, 0 files merged, 379 files removed, 0 files unresolved
$ hg parent --template "{latesttag}+{latesttagdistance}-{node|short}\n"
v2.7.1+216-9619d21d8198

Footnotes

[1]The Mercurial book discourages the use of named branches, but it is, in this respect, somewhat outdated. Named branches have gotten much easier to use since that comment was written, due to improvements in hg.
[2]Since the initial working repo is a subset of the archive repo, it would also be feasible to pull changes from the working repo into the archive repo periodically.

pep-0386 Changing the version comparison module in Distutils

PEP:386
Title:Changing the version comparison module in Distutils
Version:$Revision$
Last-Modified:$Date$
Author:Tarek ZiadĂŠ <tarek at ziade.org>
Status:Superseded
Type:Standards Track
Content-Type:text/x-rst
Created:4-June-2009
Superseded-By:440

Abstract

Note: This PEP has been superseded by the version identification and dependency specification scheme defined in PEP 440.

This PEP proposed a new version comparison schema for Distutils.

Motivation

In Python there are no real restrictions yet on how a project should manage its versions, and how they should be incremented.

Distutils provides a version distribution meta-data field, but it is freeform, and current users, such as PyPI, usually consider the latest version pushed as the latest one, regardless of the expected semantics.

Distutils will soon extend its capabilities to allow distributions to express a dependency on other distributions through the Requires-Dist metadata field (see PEP 345) and it will optionally allow use of that field to restrict the dependency to a set of compatible versions. Notice that this field is replacing Requires that was expressing dependencies on modules and packages.

The Requires-Dist field will allow a distribution to define a dependency on another package and optionally restrict this dependency to a set of compatible versions, so one may write:

Requires-Dist: zope.interface (>3.5.0)

This means that the distribution requires zope.interface with a version greater than 3.5.0.
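A tool consuming this field has to split the distribution name from the optional version predicate. A rough sketch under the simplified grammar shown above; the regular expression and function name are my own, and real parsers handle more operators and comma-separated predicates:

```python
import re

# A name, optionally followed by a parenthesized predicate like (>3.5.0).
_REQ = re.compile(
    r'^\s*(?P<name>[\w.\-]+)\s*'
    r'(?:\((?P<op>[<>=!]+)\s*(?P<version>[^)]+)\))?\s*$')

def parse_requires_dist(value):
    """Split a Requires-Dist value into (name, operator, version)."""
    match = _REQ.match(value)
    if match is None:
        raise ValueError('cannot parse %r' % value)
    return match.group('name'), match.group('op'), match.group('version')
```

With this sketch, 'zope.interface (>3.5.0)' splits into the name, the '>' operator, and the version '3.5.0', while a bare name yields no predicate.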

This also means that Python projects will need to follow the same convention as the tool that will be used to install them, so they are able to compare versions.

That is why this PEP proposes, for the sake of interoperability, a standard schema to express version information and its comparison semantics.

Furthermore, this will make OS packagers' work easier when repackaging standards compliant distributions, because as of now it can be difficult to decide how two distribution versions compare.

Requisites and current status

It is not in the scope of this PEP to provide a universal versioning schema intended to support all or even most existing versioning schemas. There will always be competing grammars, whether mandated by distro or project policies or kept for historical reasons that we cannot expect to change.

The proposed schema should be able to express the usual versioning semantics, so it's possible to parse any alternative versioning schema and transform it into a compliant one. This is how OS packagers usually deal with the existing version schemas and is a preferable alternative to supporting an arbitrary set of versioning schemas.

Conformance to usual practice and conventions, as well as simplicity, are a plus, to ease frictionless adoption and painless transition. Practicality beats purity, sometimes.

Projects have very different versioning needs, but the following are widely considered important semantics:

  1. it should be possible to express more than one versioning level (usually this is expressed as major and minor revision and, sometimes, also a micro revision).
  2. a significant number of projects need special meaning versions for "pre-releases" (such as "alpha", "beta", "rc"), and these have widely used aliases ("a" stands for "alpha", "b" for "beta" and "c" for "rc"). And these pre-release versions make it impossible to use a simple alphanumerical ordering of the version string components. (Example: 3.1a1 < 3.1)
  3. some projects also need "post-releases" of regular versions, mainly for installer work which can't be clearly expressed otherwise.
  4. development versions allow packagers of unreleased work to avoid version clash with later regular releases.
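Requisite 2 is easy to demonstrate: plain string comparison treats a pre-release as newer than its final release, the opposite of the intended ordering.

```python
# '3.1a1' extends '3.1', so lexicographic comparison sorts it *after*
# '3.1' -- but the alpha should precede the final release (3.1a1 < 3.1).
assert '3.1' < '3.1a1'                               # lexicographic order
assert sorted(['3.1a1', '3.1']) == ['3.1', '3.1a1']  # alpha sorts last: wrong
```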

For people that want to go further and use a tool to manage their version numbers, the two major ones are:

  • The current Distutils system [1]
  • Setuptools [2]

Distutils

Distutils currently provides a StrictVersion and a LooseVersion class that can be used to manage versions.

The LooseVersion class is quite lax. From Distutils doc:

Version numbering for anarchists and software realists.
Implements the standard interface for version number classes as
described above.  A version number consists of a series of numbers,
separated by either periods or strings of letters.  When comparing
version numbers, the numeric components will be compared
numerically, and the alphabetic components lexically.  The following
are all valid version numbers, in no particular order:

    1.5.1
    1.5.2b2
    161
    3.10a
    8.02
    3.4j
    1996.07.12
    3.2.pl0
    3.1.1.6
    2g6
    11g
    0.960923
    2.2beta29
    1.13++
    5.5.kw
    2.0b1pl0

In fact, there is no such thing as an invalid version number under
this scheme; the rules for comparison are simple and predictable,
but may not always give the results you want (for some definition
of "want").

This class makes any version string valid, and provides an algorithm to sort them numerically then lexically. It means that anything can be used to version your project:

>>> from distutils.version import LooseVersion as V
>>> v1 = V('FunkyVersion')
>>> v2 = V('GroovieVersion')
>>> v1 > v2
False

The problem with this is that while it allows expressing any nesting level, it doesn't allow giving special meaning to versions (pre- and post-releases as well as development versions), as expressed in requisites 2, 3 and 4.

The StrictVersion class is more strict. From the doc:

Version numbering for meticulous retentive and software idealists.
Implements the standard interface for version number classes as
described above.  A version number consists of two or three
dot-separated numeric components, with an optional "pre-release" tag
on the end.  The pre-release tag consists of the letter 'a' or 'b'
followed by a number.  If the numeric components of two version
numbers are equal, then one with a pre-release tag will always
be deemed earlier (lesser) than one without.

The following are valid version numbers (shown in the order that
would be obtained by sorting according to the supplied cmp function):

    0.4       0.4.0  (these two are equivalent)
    0.4.1
    0.5a1
    0.5b3
    0.5
    0.9.6
    1.0
    1.0.4a3
    1.0.4b1
    1.0.4

The following are examples of invalid version numbers:

    1
    2.7.2.2
    1.3.a4
    1.3pl1
    1.3c4

This class enforces a few rules, and makes a decent tool to work with version numbers:

>>> from distutils.version import StrictVersion as V
>>> v2 = V('GroovieVersion')
Traceback (most recent call last):
...
ValueError: invalid version number 'GroovieVersion'
>>> v2 = V('1.1')
>>> v3 = V('1.3')
>>> v2 < v3
True

It adds pre-release versions, and some structure, but lacks a few semantic elements to make it usable, such as development releases or post-release tags, as expressed in requisites 3 and 4.

Also, note that Distutils version classes have been present for years but are not really used in the community.

Setuptools

Setuptools provides another version comparison tool [3], the parse_version function, which does not enforce any rules on the version string but tries to provide a better algorithm for converting strings to sortable keys.

From the doc:

Convert a version string to a chronologically-sortable key

This is a rough cross between Distutils' StrictVersion and LooseVersion;
if you give it versions that would work with StrictVersion, then it behaves
the same; otherwise it acts like a slightly-smarter LooseVersion. It is
*possible* to create pathological version coding schemes that will fool
this parser, but they should be very rare in practice.

The returned value will be a tuple of strings.  Numeric portions of the
version are padded to 8 digits so they will compare numerically, but
without relying on how numbers compare relative to strings.  Dots are
dropped, but dashes are retained.  Trailing zeros between alpha segments
or dashes are suppressed, so that e.g. "2.4.0" is considered the same as
"2.4". Alphanumeric parts are lower-cased.

The algorithm assumes that strings like "-" and any alpha string that
alphabetically follows "final"  represents a "patch level".  So, "2.4-1"
is assumed to be a branch or patch of "2.4", and therefore "2.4.1" is
considered newer than "2.4-1", which in turn is newer than "2.4".

Strings like "a", "b", "c", "alpha", "beta", "candidate" and so on (that
come before "final" alphabetically) are assumed to be pre-release versions,
so that the version "2.4" is considered newer than "2.4a1".

Finally, to handle miscellaneous cases, the strings "pre", "preview", and
"rc" are treated as if they were "c", i.e. as though they were release
candidates, and therefore are not as new as a version string that does not
contain them, and "dev" is replaced with an '@' so that it sorts lower than
than any other pre-release tag.

In other words, parse_version will return a tuple for each version string that is compatible with StrictVersion but also accepts arbitrary versions and handles them so they can be compared:

>>> from pkg_resources import parse_version as V
>>> V('1.2')
('00000001', '00000002', '*final')
>>> V('1.2b2')
('00000001', '00000002', '*b', '00000002', '*final')
>>> V('FunkyVersion')
('*funkyversion', '*final')

In this schema practicality takes priority over purity, but as a result it doesn't enforce any policy and leads to very complex semantics due to the lack of a clear standard. It just tries to adapt to widely used conventions.

Caveats of existing systems

The major problem with the described version comparison tools is that they are too permissive and, at the same time, aren't capable of expressing some of the required semantics. Many of the versions on PyPI [4] are obviously not useful versions, which makes it difficult for users to grok the versioning that a particular package was using and to provide tools on top of PyPI.

Distutils classes are not really used in Python projects, but the Setuptools function is quite widespread because it's used by tools like easy_install [6], pip [5] or zc.buildout [7] to install dependencies of a given project.

While Setuptools does provide a mechanism for comparing/sorting versions, it is much preferable if the versioning spec is such that a human can make a reasonable attempt at that sorting without having to run it against some code.

Also, there's a problem with the use of dates as the "major" version number (e.g. a version string "20090421") with RPMs: any attempt to switch to a more typical "major.minor..." version scheme is problematic because it will always sort less than "20090421".
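The trap is easy to reproduce with a naive numeric key (a toy comparison, not RPM's actual algorithm):

```python
def naive_key(version):
    """Toy key: compare dot-separated numeric components as integers."""
    return tuple(int(part) for part in version.split('.'))

# A date-based "major" number dwarfs any later major.minor scheme:
assert naive_key('20090421') > naive_key('3.1')
```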

Lastly, the meaning of '-' is specific to Setuptools, while it is avoided in some packaging systems, such as the ones used by Debian or Ubuntu.

The new versioning algorithm

During Pycon, members of the Python, Ubuntu and Fedora community worked on a version standard that would be acceptable for everyone.

It's currently called verlib and a prototype lives at [10].

The pseudo-format supported is:

N.N[.N]+[{a|b|c|rc}N[.N]+][.postN][.devN]

The real regular expression is:

expr = r"""^
(?P<version>\d+\.\d+)         # minimum 'N.N'
(?P<extraversion>(?:\.\d+)*)  # any number of extra '.N' segments
(?:
    (?P<prerel>[abc]|rc)         # 'a' = alpha, 'b' = beta
                                 # 'c' or 'rc' = release candidate
    (?P<prerelversion>\d+(?:\.\d+)*)
)?
(?P<postdev>(\.post(?P<post>\d+))?(\.dev(?P<dev>\d+))?)?
$"""
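The expression can be compiled with re.VERBOSE and exercised directly against examples of the pseudo-format:

```python
import re

# The verlib expression from above, compiled in verbose mode so the
# whitespace and comments in the pattern are ignored.
expr = r"""^
(?P<version>\d+\.\d+)         # minimum 'N.N'
(?P<extraversion>(?:\.\d+)*)  # any number of extra '.N' segments
(?:
    (?P<prerel>[abc]|rc)         # 'a' = alpha, 'b' = beta
                                 # 'c' or 'rc' = release candidate
    (?P<prerelversion>\d+(?:\.\d+)*)
)?
(?P<postdev>(\.post(?P<post>\d+))?(\.dev(?P<dev>\d+))?)?
$"""
version_re = re.compile(expr, re.VERBOSE)

# The named groups decompose a compliant version into its components,
# while non-compliant strings simply fail to match.
match = version_re.match('1.0a2.dev456')
```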

Some examples probably make it clearer:

>>> from verlib import NormalizedVersion as V
>>> (V('1.0a1')
...  < V('1.0a2.dev456')
...  < V('1.0a2')
...  < V('1.0a2.1.dev456')
...  < V('1.0a2.1')
...  < V('1.0b1.dev456')
...  < V('1.0b2')
...  < V('1.0b2.post345')
...  < V('1.0c1.dev456')
...  < V('1.0c1')
...  < V('1.0.dev456')
...  < V('1.0')
...  < V('1.0.post456.dev34')
...  < V('1.0.post456'))
True

The trailing .dev123 is for pre-releases. The .post123 is for post-releases -- which apparently are used by a number of projects out there (e.g. Twisted [8]). For example after a 1.2.0 release there might be a 1.2.0-r678 release. We used post instead of r because the r is ambiguous as to whether it indicates a pre- or post-release.

.post456.dev34 indicates a dev marker for a post release, that sorts before a .post456 marker. This can be used to do development versions of post releases.

Pre-releases can use a for "alpha", b for "beta" and c for "release candidate". rc is an alternative notation for "release candidate" that is added to make the version scheme compatible with Python's own version scheme. rc sorts after c:

>>> from verlib import NormalizedVersion as V
>>> (V('1.0a1')
...  < V('1.0a2')
...  < V('1.0b3')
...  < V('1.0c1')
...  < V('1.0rc2')
...  < V('1.0'))
True

Note that c is the preferred marker for third party projects.

verlib provides a NormalizedVersion class and a suggest_normalized_version function.

NormalizedVersion

The NormalizedVersion class is used to hold a version and to compare it with others. It takes a string as an argument, that contains the representation of the version:

>>> from verlib import NormalizedVersion
>>> version = NormalizedVersion('1.0')

The version can be represented as a string:

>>> str(version)
'1.0'

Or compared with others:

>>> NormalizedVersion('1.0') > NormalizedVersion('0.9')
True
>>> NormalizedVersion('1.0') < NormalizedVersion('1.1')
True

A class method called from_parts is available if you want to create an instance by providing the parts that compose the version.

Examples

>>> version = NormalizedVersion.from_parts((1, 0))
>>> str(version)
'1.0'

>>> version = NormalizedVersion.from_parts((1, 0), ('c', 4))
>>> str(version)
'1.0c4'

>>> version = NormalizedVersion.from_parts((1, 0), ('c', 4), ('dev', 34))
>>> str(version)
'1.0c4.dev34'

suggest_normalized_version

suggest_normalized_version is a function that suggests a normalized version close to the given version string. If you have a version string that isn't normalized (i.e. NormalizedVersion doesn't like it) then you might be able to get an equivalent (or close) normalized version from this function.

This does a number of simple normalizations to the given string, based on an observation of versions currently in use on PyPI.

Given a dump of those versions on January 6th, 2010, the function gave these results for the 8821 distributions then on PyPI:

  • 7822 (88.67%) already match NormalizedVersion without any change
  • 717 (8.13%) match when using this suggestion method
  • 282 (3.20%) don't match at all.

Most of the 3.20% of projects that are incompatible with NormalizedVersion and cannot be transformed into a compatible form use date-based version schemes, versions with custom markers, or dummy versions. Examples:

  • working proof of concept
  • 1 (first draft)
  • unreleased.unofficialdev
  • 0.1.alphadev
  • 2008-03-29_r219
  • etc.

When a tool needs to work with versions, a strategy is to use suggest_normalized_version on the version string. If this function returns None, it means that the provided version is not close enough to the standard scheme. If it returns a version that slightly differs from the original, it's a suggested normalized version. Finally, if it returns the same string, the version already matches the scheme.

Here's an example of usage:

>>> from verlib import suggest_normalized_version, NormalizedVersion
>>> import warnings
>>> def validate_version(version):
...     rversion = suggest_normalized_version(version)
...     if rversion is None:
...         raise ValueError('Cannot work with "%s"' % version)
...     if rversion != version:
...         warnings.warn('"%s" is not a normalized version.\n'
...                       'It has been transformed into "%s" '
...                       'for interoperability.' % (version, rversion))
...     return NormalizedVersion(rversion)
...

>>> validate_version('2.4-rc1')
__main__:8: UserWarning: "2.4-rc1" is not a normalized version.
It has been transformed into "2.4c1" for interoperability.
NormalizedVersion('2.4c1')

>>> validate_version('2.4c1')
NormalizedVersion('2.4c1')

>>> validate_version('foo')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 4, in validate_version
ValueError: Cannot work with "foo"

Roadmap

Distutils will deprecate its existing version classes in favor of NormalizedVersion. The verlib module presented in this PEP will be renamed to version and placed into the distutils package.

Acknowledgments

Trent Mick, Matthias Klose, Phillip Eby, David Lyon, and many people at Pycon and Distutils-SIG.

pep-0387 Backwards Compatibility Policy

PEP:387
Title:Backwards Compatibility Policy
Version:$Revision$
Last-Modified:$Date$
Author:Benjamin Peterson <benjamin at python.org>
Status:Draft
Type:Process
Content-Type:text/x-rst
Created:18-Jun-2009
Post-History:19-Jun-2009

Abstract

This PEP outlines Python's backwards compatibility policy.

Rationale

As one of the most used programming languages today [1], the Python core language and its standard library play a critical role in thousands of applications and libraries. This is fantastic; it is probably one of a language designer's fondest dreams. However, it means the development team must be very careful not to break this existing 3rd party code with new releases.

Backwards Compatibility Rules

This policy applies to all public APIs. These include:

  • Syntax and behavior of these constructs as defined by the reference manual
  • The C-API
  • Function, class, module, attribute, and method names and types.
  • Given a set of arguments, the return value, side effects, and raised exceptions of a function. This does not preclude changes from reasonable bug fixes.
  • The position and expected types of arguments and returned values.
  • Behavior of classes with regards to subclasses: the conditions under which overridden methods are called.

Others are explicitly not part of the public API. They can change or be removed at any time in any way. These include:

  • Function, class, module, attribute, method, and C-API names and types that are prefixed by "_" (except special names). The contents of these are also not subject to the policy.
  • Inheritance patterns of internal classes.
  • Test suites. (Anything in the Lib/test directory or test subdirectories of packages.)

This is the basic policy for backwards compatibility:

  • Unless it is going through the deprecation process below, the behavior of an API must not change between any two consecutive releases.
  • Similarly a feature cannot be removed without notice between any two consecutive releases.
  • Addition of a feature which breaks 3rd party libraries or applications should have a large benefit to breakage ratio, and/or the incompatibility should be trivial to fix in broken code. For example, adding a stdlib module with the same name as a third party package is not acceptable. Adding a method or attribute that conflicts with 3rd party code through inheritance, however, is likely reasonable.

Making Incompatible Changes

It's a fact: design mistakes happen. Thus it is important to be able to change APIs or remove misguided features. This is accomplished through a gradual process over several releases:

  1. Discuss the change. Depending on the size of the incompatibility, this could be on the bug tracker, python-dev, python-list, or the appropriate SIG. A PEP or similar document may be written. Hopefully users of the affected API will pipe up to comment.
  2. Add a warning [2]. If behavior is changing, the API may gain a new function or method to perform the new behavior; old usage should raise the warning. If an API is being removed, simply warn whenever it is entered. DeprecationWarning is the usual warning category to use, but PendingDeprecationWarning may be used in special cases where the old and new versions of the API will coexist for many releases.
  3. Wait for a release of whichever branch contains the warning.
  4. See if there's any feedback. Users not involved in the original discussions may comment now after seeing the warning. Perhaps reconsider.
  5. The behavior change or feature removal may now be made default or permanent in the next release. Remove the old version and warning.
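Step 2 of the process above typically looks like the following in library code; the function names here are illustrative, not from the PEP:

```python
import warnings

def new_api():
    """The replacement behavior introduced alongside the warning."""
    return 42

def old_api():
    """Deprecated entry point: warn as required by step 2, then delegate."""
    warnings.warn('old_api() is deprecated; use new_api() instead',
                  DeprecationWarning, stacklevel=2)
    return new_api()
```

stacklevel=2 makes the warning point at the caller of old_api() rather than at the warn() call itself, which is what users need to locate the code they must update.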

pep-0389 argparse - New Command Line Parsing Module

PEP:389
Title:argparse - New Command Line Parsing Module
Version:$Revision$
Last-Modified:$Date$
Author:Steven Bethard <steven.bethard at gmail.com>
Status:Final
Type:Standards Track
Content-Type:text/x-rst
Created:25-Sep-2009
Python-Version:2.7 and 3.2
Post-History:27-Sep-2009, 24-Oct-2009

Acceptance

This PEP was approved by Guido on python-dev on February 21, 2010 [17].

Abstract

This PEP proposes inclusion of the argparse [1] module in the Python standard library in Python 2.7 and 3.2.

Motivation

The argparse module is a command line parsing library which provides more functionality than the existing command line parsing modules in the standard library, getopt [2] and optparse [3]. It includes support for positional arguments (not just options), subcommands, required options, options syntaxes like "/f" and "+rgb", zero-or-more and one-or-more style arguments, and many other features the other two lack.

The argparse module is also already a popular third-party replacement for these modules. It is used in projects like IPython (the Scipy Python shell) [4], is included in Debian testing and unstable [5], and since 2007 has had various requests for its inclusion in the standard library [6] [7] [8]. This popularity suggests it may be a valuable addition to the Python libraries.

Why aren't getopt and optparse enough?

One argument against adding argparse is that there are "already two different option parsing modules in the standard library" [9]. The following is a list of features provided by argparse but not present in getopt or optparse:

  • While it is true there are two option parsing libraries, there are no full command line parsing libraries -- both getopt and optparse support only options and have no support for positional arguments. The argparse module handles both, and as a result, is able to generate better help messages, avoiding redundancies like the usage= string usually required by optparse.
  • The argparse module values practicality over purity. Thus, argparse allows required options and customization of which characters are used to identify options, while optparse explicitly states "the phrase 'required option' is self-contradictory" and that the option syntaxes -pf, -file, +f, +rgb, /f and /file "are not supported by optparse, and they never will be".
  • The argparse module allows options to accept a variable number of arguments using nargs='?', nargs='*' or nargs='+'. The optparse module provides an untested recipe for some part of this functionality [10] but admits that "things get hairy when you want an option to take a variable number of arguments."
  • The argparse module supports subcommands, where a main command line parser dispatches to other command line parsers depending on the command line arguments. This is a common pattern in command line interfaces, e.g. svn co and svn up.
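Both the variable-arity and subcommand features above fit in a few lines; this is a sketch modeled on the svn example, not code from the PEP:

```python
import argparse

parser = argparse.ArgumentParser(prog='svn')
subparsers = parser.add_subparsers(dest='command')

# 'svn co' takes one or more positional paths (nargs='+').
co = subparsers.add_parser('co')
co.add_argument('paths', nargs='+')

# The main parser dispatches to the 'co' subparser.
args = parser.parse_args(['co', 'trunk', 'branches'])
```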

Why isn't the functionality just being added to optparse?

Clearly all the above features offer improvements over what is available through optparse. A reasonable question then is why these features are not simply provided as patches to optparse, instead of introducing an entirely new module. In fact, the original development of argparse intended to do just that, but because of various fairly constraining design decisions of optparse, this wasn't really possible. Some of the problems included:

  • The optparse module exposes the internals of its parsing algorithm. In particular, parser.largs and parser.rargs are guaranteed to be available to callbacks [11]. This makes it extremely difficult to improve the parsing algorithm as was necessary in argparse for proper handling of positional arguments and variable length arguments. For example, nargs='+' in argparse is matched using regular expressions and thus has no notion of things like parser.largs.

  • The optparse extension APIs are extremely complex. For example, just to use a simple custom string-to-object conversion function, you have to subclass Option, hack class attributes, and then specify your custom option type to the parser, like this:

    class MyOption(Option):
        TYPES = Option.TYPES + ("mytype",)
        TYPE_CHECKER = copy(Option.TYPE_CHECKER)
        TYPE_CHECKER["mytype"] = check_mytype
    parser = optparse.OptionParser(option_class=MyOption)
    parser.add_option("-m", type="mytype")
    

    For comparison, argparse simply allows conversion functions to be used as type= arguments directly, e.g.:

    parser = argparse.ArgumentParser()
    parser.add_argument("-m", type=check_mytype)
    

    But given the baroque customization APIs of optparse, it is unclear how such a feature should interact with those APIs, and it is quite possible that introducing the simple argparse API would break existing custom Option code.

  • Both optparse and argparse parse command line arguments and assign them as attributes to an object returned by parse_args. However, the optparse module guarantees that the take_action method of custom actions will always be passed a values object which provides an ensure_value method [12], while the argparse module allows attributes to be assigned to any object, e.g.:

    foo_object = ...
    parser.parse_args(namespace=foo_object)
    foo_object.some_attribute_parsed_from_command_line
    

    Modifying optparse to allow any object to be passed in would be difficult because simply passing the foo_object around instead of a Values instance will break existing custom actions that depend on the ensure_value method.

Because of issues like these, which made it unreasonably difficult for argparse to stay compatible with the optparse APIs, argparse was developed as an independent module. Given these issues, merging all the argparse features into optparse with no backwards incompatibilities seems unlikely.

Deprecation of optparse

Because all of optparse's features are available in argparse, the optparse module will be deprecated. However, because of the widespread use of optparse, the deprecation strategy contains only documentation changes and warnings that will not be visible by default:

  • Python 2.7+ and 3.2+ -- The following note will be added to the optparse documentation:

    The optparse module is deprecated and will not be developed further; development will continue with the argparse module.

  • Python 2.7+ -- If the Python 3 compatibility flag, -3, is provided at the command line, then importing optparse will issue a DeprecationWarning. Otherwise no warnings will be issued.

  • Python 3.2+ -- Importing optparse will issue a PendingDeprecationWarning, which is not displayed by default.

Note that no removal date is proposed for optparse.

Updates to getopt documentation

The getopt module will not be deprecated. However, its documentation will be updated to point to argparse in a couple of places. At the top of the module, the following note will be added:

The getopt module is a parser for command line options whose API is designed to be familiar to users of the C getopt function. Users who are unfamiliar with the C getopt function or who would like to write less code and get better help and error messages should consider using the argparse module instead.

Additionally, after the final getopt example, the following note will be added:

Note that an equivalent command line interface could be produced with less code by using the argparse module:

import argparse

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('-o', '--output')
    parser.add_argument('-v', dest='verbose', action='store_true')
    args = parser.parse_args()
    # ... do something with args.output ...
    # ... do something with args.verbose ...

Deferred: string formatting

The argparse module supports Python from 2.3 up through 3.2 and as a result relies on traditional %(foo)s style string formatting. It has been suggested that it might be better to use the new style {foo} string formatting [13]. There was some discussion about how best to do this for modules in the standard library [14] and several people are developing functions for automatically converting %-formatting to {}-formatting [15] [16]. When one of these is added to the standard library, argparse will use them to support both formatting styles.

Rejected: getopt compatibility methods

Previously, when this PEP was suggesting the deprecation of getopt as well as optparse, there was some talk of adding a method like:

ArgumentParser.add_getopt_arguments(options[, long_options])

However, this method will not be added for a number of reasons:

  • The getopt module is not being deprecated, so there is less need.
  • This method would not actually ease the transition for any getopt users who were already maintaining usage messages, because the API above gives no way of adding help messages to the arguments.
  • Some users of getopt consider it very important that only a single function call is necessary. The API above does not satisfy this requirement because both ArgumentParser() and parse_args() must also be called.

Out of Scope: Various Feature Requests

Several feature requests for argparse were made in the discussion of this PEP:

  • Support argument defaults from environment variables
  • Support argument defaults from configuration files
  • Support "foo --help subcommand" in addition to the currently supported "foo subcommand --help"

These are all reasonable feature requests for the argparse module, but are out of the scope of this PEP, and have been redirected to the argparse issue tracker.

Discussion: sys.stderr and sys.exit

There were some concerns that argparse by default always writes to sys.stderr and always calls sys.exit when invalid arguments are provided. This is the desired behavior for the vast majority of argparse use cases which revolve around simple command line interfaces. However, in some cases, it may be desirable to keep argparse from exiting, or to have it write its messages to something other than sys.stderr. These use cases can be supported by subclassing ArgumentParser and overriding the exit or _print_message methods. The latter is an undocumented implementation detail, but could be officially exposed if this turns out to be a common need.
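Such a subclass might look like the following sketch, which overrides only the documented exit method and raises instead of terminating; the class and exception names are illustrative:

```python
import argparse

class ParserError(Exception):
    """Raised in place of terminating the process."""

class NonExitingParser(argparse.ArgumentParser):
    """A parser that raises instead of calling sys.exit().

    error() routes through exit(), so overriding exit() alone is
    enough to intercept invalid-argument handling.
    """
    def exit(self, status=0, message=None):
        raise ParserError(status, message)
```

Note that --help also routes through exit(), so an embedding application would catch a ParserError with status 0 in that case as well.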

References

[1]argparse (http://code.google.com/p/argparse/)
[2]getopt (http://docs.python.org/library/getopt.html)
[3]optparse (http://docs.python.org/library/optparse.html)
[4]argparse in IPython (http://mail.scipy.org/pipermail/ipython-dev/2009-April/005102.html)
[5]argparse in Debian (http://packages.debian.org/search?keywords=argparse)
[6](1, 2) 2007-01-03 request for argparse in the standard library (http://mail.python.org/pipermail/python-list/2007-January/472276.html)
[7]2009-06-09 request for argparse in the standard library (http://bugs.python.org/issue6247)
[8]2009-09-10 request for argparse in the standard library (http://mail.python.org/pipermail/stdlib-sig/2009-September/000342.html)
[9]Fredrik Lundh response to [6] (http://mail.python.org/pipermail/python-list/2007-January/1086892.html)
[10]optparse variable args (http://docs.python.org/library/optparse.html#callback-example-6-variable-arguments)
[11]parser.largs and parser.rargs (http://docs.python.org/library/optparse.html#how-callbacks-are-called)
[12]take_action values argument (http://docs.python.org/library/optparse.html#adding-new-actions)
[13]use {}-formatting instead of %-formatting (http://bugs.python.org/msg89279)
[14]transitioning from % to {} formatting (http://mail.python.org/pipermail/python-dev/2009-September/092326.html)
[15]Vinay Sajip's %-to-{} converter (http://gist.github.com/200936)
[16]Benjamin Peterson's %-to-{} converter (http://bazaar.launchpad.net/~gutworth/+junk/mod2format/files)
[17]Guido's approval (http://mail.python.org/pipermail/python-dev/2010-February/097839.html)

pep-0390 Static metadata for Distutils

PEP:390
Title:Static metadata for Distutils
Version:$Revision$
Last-Modified:$Date$
Author:Tarek ZiadĂŠ <tarek at ziade.org>
BDFL-Delegate:Nick Coghlan
Discussions-To:<distutils-sig at python.org>
Status:Rejected
Type:Standards Track
Content-Type:text/x-rst
Created:10-October-2009
Python-Version:2.7 and 3.2
Post-History:
Resolution:http://mail.python.org/pipermail/distutils-sig/2013-April/020597.html

Abstract

This PEP describes a new section and a new format for the setup.cfg file that allow the metadata of a package to be described without using setup.py.

Rejection Notice

As distutils2 is no longer going to be incorporated into the standard library, this PEP was rejected by Nick Coghlan in late April, 2013.

A replacement PEP based on PEP 426 (metadata 2.0) will be created that defines the minimum amount of information needed to generate an sdist archive given a source tarball or VCS checkout.

Rationale

Today, if you want to list all the Metadata of a distribution (see PEP 314) that is not installed, you need to use the setup.py command line interface.

So, basically, you download it, and run:

$ python setup.py --name
Distribute

$ python setup.py --version
0.6.4

Where name and version are metadata fields. This works fine, but as soon as the developers add more code to setup.py, this feature might break or, in the worst case, might do unwanted things on the target system.

Moreover, when an OS packager wants to get the metadata of a distribution they are re-packaging, they may have trouble understanding the setup.py file they are working with.

So the rationale of this PEP is to provide a way to declare the metadata in a static configuration file alongside setup.py that doesn't require any third party code to run.

Adding a metadata section in setup.cfg

The first thing we want to introduce is a [metadata] section, in the setup.cfg file, that may contain any field from the Metadata:

[metadata]
name = Distribute
version = 0.6.4

The setup.cfg file is used to avoid adding yet another configuration file to work with in Distutils.

This file is already read by Distutils when a command is executed, and if the metadata section is found, it will be used to fill the metadata fields. If an option that corresponds to a Metadata field is given to setup(), it will override the value that was possibly present in setup.cfg.

Notice that setup.py is still used and can be required to define some options that are not part of the Metadata fields. For instance, the sdist command can use options like packages or scripts.

Multi-line values

Some Metadata fields can have multiple values. To keep setup.cfg compatible with ConfigParser and the RFC 822 LONG HEADER FIELDS (see section 3.1.1), these are expressed with comma-separated values:

requires = pywin32, bar > 1.0, foo

When this variable is read, the values are parsed and transformed into a list: ['pywin32', 'bar > 1.0', 'foo'].
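As a sketch of that transformation (parse_multivalue is an illustrative helper name, not part of the proposal), the parsing amounts to a split and strip:

```python
def parse_multivalue(raw):
    """Split a comma-separated setup.cfg value into a list,
    stripping surrounding whitespace from each item and dropping
    empty entries."""
    return [item.strip() for item in raw.split(',') if item.strip()]

requires = parse_multivalue("pywin32, bar > 1.0, foo")
# ['pywin32', 'bar > 1.0', 'foo']
```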

Context-dependent sections

The metadata section will also be able to use context-dependent sections.

A context-dependent section is a section with a condition about the execution environment. Here are some examples:

[metadata]
name = Distribute
version = 0.6.4

[metadata:sys_platform == 'win32']
requires = pywin32, bar > 1.0
obsoletes = pywin31

[metadata:os_machine == 'i386']
requires = foo

[metadata:python_version == '2.4' or python_version == '2.5']
requires = bar

[metadata:'linux' in sys_platform]
requires = baz

Every [metadata:condition] section will be used only if the condition is met when the file is read. The background motivation for these context-dependent sections is to be able to define requirements that vary depending on the platform the distribution might be installed on (see PEP 314).

The micro-language behind this is the simplest possible: it compares only strings, with the == and in operators (and their opposites), and with the ability to combine expressions. It is also easy for non-Pythoneers to understand.

The pseudo-grammar is

EXPR [in|==|!=|not in] EXPR [or|and] ...

where EXPR belongs to any of those:

  • python_version = '%s.%s' % (sys.version_info[0], sys.version_info[1])
  • os_name = os.name
  • sys_platform = sys.platform
  • platform_version = platform.version()
  • platform_machine = platform.machine()
  • a free string, like 2.4 or win32

Notice that in is restricted to strings, meaning that it is not possible to use other sequences like tuples or lists on the right side.
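The whole micro-language can be evaluated with plain string operations. The following sketch (evaluate and its helpers are illustrative names, not the proposed implementation; it does no real validation) handles the comparison and combination operators described above:

```python
import re

def evaluate(condition, env):
    """Evaluate a [metadata:...] condition against an execution
    environment, using only ==, !=, in, not in, and/or on strings.
    A minimal sketch; a real implementation would need a proper
    parser and stricter validation."""
    def lookup(token):
        token = token.strip()
        if token[:1] in ('"', "'"):         # quoted free string
            return token[1:-1]
        return env.get(token, token)        # marker name or bare string

    def compare(expr):
        m = re.match(r"(.+?)\s+(==|!=|not in|in)\s+(.+)", expr.strip())
        left, op, right = lookup(m.group(1)), m.group(2), lookup(m.group(3))
        if op == '==':
            return left == right
        if op == '!=':
            return left != right
        if op == 'in':
            return left in right
        return left not in right

    # combine comparisons with 'and'/'or'; 'or' binds loosest
    return any(all(compare(c) for c in part.split(' and '))
               for part in condition.split(' or '))

env = {'sys_platform': 'linux2', 'python_version': '2.5'}
evaluate("python_version == '2.4' or python_version == '2.5'", env)  # True
evaluate("'linux' in sys_platform", env)                             # True
```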

Distutils will provide a function that is able to generate the metadata of a distribution, given a setup.cfg file, for the execution environment:

>>> from distutils.util import local_metadata
>>> local_metadata('setup.cfg')
<DistributionMetadata instance>

This means that a vanilla Python will be able to read the metadata of a package without running any third party code.

Notice that this feature is not restricted to the metadata namespace. Consequently, any other section can be extended with such context-dependent sections.

Impact on PKG-INFO generation and PEP 314

When PKG-INFO is generated by Distutils, every field that relies on a condition will have that condition written at the end of the line, after a ; separator:

Metadata-Version: 1.2
Name: distribute
Version: 0.6.4
...
Requires: pywin32, bar > 1.0; sys_platform == 'win32'
Requires: foo; os_machine == 'i386'
Requires: bar; python_version == '2.4' or python_version == '2.5'
Requires: baz; 'linux' in sys_platform
Obsoletes: pywin31; sys_platform == 'win32'
...
Classifier: Development Status :: 5 - Production/Stable
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: Python Software Foundation License

Notice that this file can be opened with the DistributionMetadata class. This class will be able to use the micro-language using the execution environment.

Let's run it on a Python 2.5 i386 Linux system:

>>> from distutils.dist import DistributionMetadata
>>> metadata = DistributionMetadata('PKG_INFO')
>>> metadata.get_requires()
['foo', 'bar', 'baz']

The execution environment can be overridden in case we want to get the metadata for another environment:

>>> env = {'python_version': '2.4',
...        'os_name': 'nt',
...        'sys_platform': 'win32',
...        'platform_version': 'MVCC++ 6.0',
...        'platform_machine': 'i386'}
...
>>> metadata = DistributionMetadata('PKG_INFO', environment=env)
>>> metadata.get_requires()
['bar > 1.0', 'foo', 'bar']

PEP 314 is changed accordingly, meaning that each field will be able to have that extra condition marker.

Compatibility

This change is based on a new metadata 1.2 format, meaning that Distutils will be able to distinguish old PKG-INFO files from new ones.

The setup.cfg file change will stay ConfigParser-compatible and will not break existing setup.cfg files.

Limitations

We are not providing < and > operators at this time, and python_version is a regular string. This implies using or operators when a section needs to be restricted to a couple of Python versions. However, if PEP 386 is accepted, python_version could be changed internally into something comparable with strings, and the < and > operators introduced.

Last, if a distribution is unable to set all metadata fields in setup.cfg, that's fine; the fields will be set to UNKNOWN when local_metadata is called. Getting UNKNOWN values will mean that it might be necessary to run the setup.py command line interface to get the whole set of metadata.

Acknowledgments

The Distutils-SIG.

pep-0391 Dictionary-Based Configuration For Logging

PEP:391
Title:Dictionary-Based Configuration For Logging
Version:$Revision$
Last-Modified:$Date$
Author:Vinay Sajip <vinay_sajip at red-dove.com>
Status:Final
Type:Standards Track
Content-Type:text/x-rst
Created:15-Oct-2009
Python-Version:2.7, 3.2
Post-History:

Abstract

This PEP describes a new way of configuring logging using a dictionary to hold configuration information.

Rationale

The present means for configuring Python's logging package is either by using the logging API to configure logging programmatically, or else by means of ConfigParser-based configuration files.

Programmatic configuration, while offering maximal control, fixes the configuration in Python code. This does not facilitate changing it easily at runtime, and, as a result, the ability to flexibly turn the verbosity of logging up and down for different parts of an application is lost. This limits the usability of logging as an aid to diagnosing problems - and sometimes, logging is the only diagnostic aid available in production environments.

The ConfigParser-based configuration system is usable, but does not allow its users to configure all aspects of the logging package. For example, Filters cannot be configured using this system. Furthermore, the ConfigParser format appears to engender dislike (sometimes strong dislike) in some quarters. Though it was chosen because it was the only configuration format supported in the Python standard library at that time, many people regard it (or perhaps just the particular schema chosen for logging's configuration) as 'crufty' or 'ugly', in some cases apparently on purely aesthetic grounds.

Recent versions of Python include JSON support in the standard library, and this is also usable as a configuration format. In other environments, such as Google App Engine, YAML is used to configure applications, and usually the configuration of logging would be considered an integral part of the application configuration. Although the standard library does not contain YAML support at present, support for both JSON and YAML can be provided in a common way because both of these serialization formats allow deserialization to Python dictionaries.

By providing a way to configure logging by passing the configuration in a dictionary, logging will be easier to configure not only for users of JSON and/or YAML, but also for users of custom configuration methods, by providing a common format in which to describe the desired configuration.

Another drawback of the current ConfigParser-based configuration system is that it does not support incremental configuration: a new configuration completely replaces the existing configuration. Although full flexibility for incremental configuration is difficult to provide in a multi-threaded environment, the new configuration mechanism will allow the provision of limited support for incremental configuration.

Specification

The specification consists of two parts: the API and the format of the dictionary used to convey configuration information (i.e. the schema to which it must conform).

Naming

Historically, the logging package has not been PEP 8 conformant [1]. At some future time, this will be corrected by changing method and function names in the package in order to conform with PEP 8. However, in the interests of uniformity, the proposed additions to the API use a naming scheme which is consistent with the present scheme used by logging.

API

The logging.config module will have the following addition:

  • A function, called dictConfig(), which takes a single argument - the dictionary holding the configuration. Exceptions will be raised if there are errors while processing the dictionary.

It will be possible to customize this API - see the section on API Customization. Incremental configuration is covered in its own section.

Dictionary Schema - Overview

Before describing the schema in detail, it is worth saying a few words about object connections, support for user-defined objects and access to external and internal objects.

Object connections

The schema is intended to describe a set of logging objects - loggers, handlers, formatters, filters - which are connected to each other in an object graph. Thus, the schema needs to represent connections between the objects. For example, say that, once configured, a particular logger has attached to it a particular handler. For the purposes of this discussion, we can say that the logger represents the source, and the handler the destination, of a connection between the two. Of course in the configured objects this is represented by the logger holding a reference to the handler. In the configuration dict, this is done by giving each destination object an id which identifies it unambiguously, and then using the id in the source object's configuration to indicate that a connection exists between the source and the destination object with that id.

So, for example, consider the following YAML snippet:

formatters:
  brief:
    # configuration for formatter with id 'brief' goes here
  precise:
    # configuration for formatter with id 'precise' goes here
handlers:
  h1: #This is an id
   # configuration of handler with id 'h1' goes here
   formatter: brief
  h2: #This is another id
   # configuration of handler with id 'h2' goes here
   formatter: precise
loggers:
  foo.bar.baz:
    # other configuration for logger 'foo.bar.baz'
    handlers: [h1, h2]

(Note: YAML will be used in this document as it is a little more readable than the equivalent Python source form for the dictionary.)

The ids for loggers are the logger names which would be used programmatically to obtain a reference to those loggers, e.g. foo.bar.baz. The ids for Formatters and Filters can be any string value (such as brief, precise above) and they are transient, in that they are only meaningful for processing the configuration dictionary and used to determine connections between objects, and are not persisted anywhere when the configuration call is complete.

Handler ids are treated specially, see the section on Handler Ids, below.

The above snippet indicates that the logger named foo.bar.baz should have two handlers attached to it, which are described by the handler ids h1 and h2. The formatter for h1 is that described by id brief, and the formatter for h2 is that described by id precise.

User-defined objects

The schema should support user-defined objects for handlers, filters and formatters. (Loggers do not need to have different types for different instances, so there is no support - in the configuration - for user-defined logger classes.)

Objects to be configured will typically be described by dictionaries which detail their configuration. In some places, the logging system will be able to infer from the context how an object is to be instantiated, but when a user-defined object is to be instantiated, the system will not know how to do this. In order to provide complete flexibility for user-defined object instantiation, the user will need to provide a 'factory' - a callable which is called with a configuration dictionary and which returns the instantiated object. This will be signalled by an absolute import path to the factory being made available under the special key '()'. Here's a concrete example:

formatters:
  brief:
    format: '%(message)s'
  default:
    format: '%(asctime)s %(levelname)-8s %(name)-15s %(message)s'
    datefmt: '%Y-%m-%d %H:%M:%S'
  custom:
      (): my.package.customFormatterFactory
      bar: baz
      spam: 99.9
      answer: 42

The above YAML snippet defines three formatters. The first, with id brief, is a standard logging.Formatter instance with the specified format string. The second, with id default, has a longer format and also defines the time format explicitly, and will result in a logging.Formatter initialized with those two format strings. Shown in Python source form, the brief and default formatters have configuration sub-dictionaries:

{
  'format' : '%(message)s'
}

and:

{
  'format' : '%(asctime)s %(levelname)-8s %(name)-15s %(message)s',
  'datefmt' : '%Y-%m-%d %H:%M:%S'
}

respectively, and as these dictionaries do not contain the special key '()', the instantiation is inferred from the context: as a result, standard logging.Formatter instances are created. The configuration sub-dictionary for the third formatter, with id custom, is:

{
  '()' : 'my.package.customFormatterFactory',
  'bar' : 'baz',
  'spam' : 99.9,
  'answer' : 42
}

and this contains the special key '()', which means that user-defined instantiation is wanted. In this case, the specified factory callable will be used. If it is an actual callable it will be used directly - otherwise, if you specify a string (as in the example) the actual callable will be located using normal import mechanisms. The callable will be called with the remaining items in the configuration sub-dictionary as keyword arguments. In the above example, the formatter with id custom will be assumed to be returned by the call:

my.package.customFormatterFactory(bar='baz', spam=99.9, answer=42)

The key '()' has been used as the special key because it is not a valid keyword parameter name, and so will not clash with the names of the keyword arguments used in the call. The '()' also serves as a mnemonic that the corresponding value is a callable.
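The resolution logic can be sketched in a few lines (instantiate is an illustrative name, and the real configurator performs more validation; the Point class and the timedelta example are just demonstrations):

```python
from importlib import import_module

def instantiate(config):
    """Instantiate an object from a configuring dict.  If the special
    '()' key is present, resolve it to a factory callable and call it
    with the remaining keys as keyword arguments.  A sketch only; the
    real configurator also handles contextual instantiation."""
    config = dict(config)              # don't mutate the caller's dict
    factory = config.pop('()')
    if isinstance(factory, str):       # locate via normal import mechanisms
        module, _, name = factory.rpartition('.')
        factory = getattr(import_module(module), name)
    return factory(**config)

class Point:
    def __init__(self, x, y):
        self.x, self.y = x, y

p = instantiate({'()': Point, 'x': 1, 'y': 2})        # direct callable
d = instantiate({'()': 'datetime.timedelta', 'days': 1})  # dotted path
```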

Access to external objects

There are times where a configuration will need to refer to objects external to the configuration, for example sys.stderr. If the configuration dict is constructed using Python code then this is straightforward, but a problem arises when the configuration is provided via a text file (e.g. JSON, YAML). In a text file, there is no standard way to distinguish sys.stderr from the literal string 'sys.stderr'. To facilitate this distinction, the configuration system will look for certain special prefixes in string values and treat them specially. For example, if the literal string 'ext://sys.stderr' is provided as a value in the configuration, then the ext:// will be stripped off and the remainder of the value processed using normal import mechanisms.

The handling of such prefixes will be done in a way analogous to protocol handling: there will be a generic mechanism to look for prefixes which match the regular expression ^(?P<prefix>[a-z]+)://(?P<suffix>.*)$ whereby, if the prefix is recognised, the suffix is processed in a prefix-dependent manner and the result of the processing replaces the string value. If the prefix is not recognised, then the string value will be left as-is.

The implementation will provide for a set of standard prefixes such as ext:// but it will be possible to disable the mechanism completely or provide additional or different prefixes for special handling.
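A sketch of the prefix mechanism, using the regular expression given above (the converter registry and function names here are illustrative, not the proposed API):

```python
import re

PREFIX_PATTERN = re.compile(r'^(?P<prefix>[a-z]+)://(?P<suffix>.*)$')

def convert_ext(suffix):
    """Resolve 'a.b.c' via normal import mechanisms (sketch)."""
    module, _, name = suffix.rpartition('.')
    return getattr(__import__(module, fromlist=[name]), name)

CONVERTERS = {'ext': convert_ext}   # illustrative prefix registry

def convert(value):
    """Replace a recognised 'prefix://suffix' string with the object
    it names; leave unrecognised prefixes and plain strings as-is."""
    if isinstance(value, str):
        m = PREFIX_PATTERN.match(value)
        if m and m.group('prefix') in CONVERTERS:
            return CONVERTERS[m.group('prefix')](m.group('suffix'))
    return value

stream = convert('ext://sys.stderr')    # the sys.stderr object itself
literal = convert('no-prefix string')   # left untouched
```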

Access to internal objects

As well as external objects, there is sometimes also a need to refer to objects in the configuration. This will be done implicitly by the configuration system for things that it knows about. For example, the string value 'DEBUG' for a level in a logger or handler will automatically be converted to the value logging.DEBUG, and the handlers, filters and formatter entries will take an object id and resolve to the appropriate destination object.

However, a more generic mechanism needs to be provided for the case of user-defined objects which are not known to logging. For example, consider logging.handlers.MemoryHandler, which takes a target which is another handler to delegate to. Since the system already knows about this class, in the configuration the given target just needs to be the object id of the relevant target handler, and the system will resolve the handler from the id. If, however, a user defines a my.package.MyHandler which has an alternate handler, the configuration system would not know that the alternate referred to a handler. To cater for this, a generic resolution system will be provided which allows the user to specify:

handlers:
  file:
    # configuration of file handler goes here

  custom:
    (): my.package.MyHandler
    alternate: cfg://handlers.file

The literal string 'cfg://handlers.file' will be resolved in an analogous way to the strings with the ext:// prefix, but looking in the configuration itself rather than the import namespace. The mechanism will allow access by dot or by index, in a similar way to that provided by str.format. Thus, given the following snippet:

handlers:
  email:
    class: logging.handlers.SMTPHandler
    mailhost: localhost
    fromaddr: my_app@domain.tld
    toaddrs:
      - support_team@domain.tld
      - dev_team@domain.tld
    subject: Houston, we have a problem.

in the configuration, the string 'cfg://handlers' would resolve to the dict with key handlers, the string 'cfg://handlers.email' would resolve to the dict with key email in the handlers dict, and so on. The string 'cfg://handlers.email.toaddrs[1]' would resolve to 'dev_team@domain.tld' and the string 'cfg://handlers.email.toaddrs[0]' would resolve to the value 'support_team@domain.tld'. The subject value could be accessed using either 'cfg://handlers.email.subject' or, equivalently, 'cfg://handlers.email[subject]'. The latter form only needs to be used if the key contains spaces or non-alphanumeric characters. If an index value consists only of decimal digits, access will be attempted using the corresponding integer value, falling back to the string value if needed.

Given a string cfg://handlers.myhandler.mykey.123, this will resolve to config_dict['handlers']['myhandler']['mykey']['123']. If the string is specified as cfg://handlers.myhandler.mykey[123], the system will attempt to retrieve the value from config_dict['handlers']['myhandler']['mykey'][123], and fall back to config_dict['handlers']['myhandler']['mykey']['123'] if that fails.
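The dot/index access rules can be sketched as follows (cfg_resolve is an illustrative helper, simplified from what a real configurator would need, e.g. it omits error reporting):

```python
import re

WORD_PATTERN = re.compile(r'^\s*(\w+)\s*')
INDEX_PATTERN = re.compile(r'^\[\s*(\w+)\s*\]')

def cfg_resolve(config, path):
    """Resolve the suffix of a cfg:// reference against a config dict,
    allowing access by dot or by [index], and falling back from an
    integer to a string key for all-digit indices."""
    m = WORD_PATTERN.match(path)
    node = config[m.group(1)]
    rest = path[m.end():]
    while rest:
        if rest.startswith('.'):                  # dotted access
            m = WORD_PATTERN.match(rest[1:])
            node = node[m.group(1)]
            rest = rest[1 + m.end():]
        else:                                     # [index] access
            m = INDEX_PATTERN.match(rest)
            key = m.group(1)
            if key.isdigit():
                try:
                    node = node[int(key)]
                except (TypeError, KeyError, IndexError):
                    node = node[key]              # string-key fallback
            else:
                node = node[key]
            rest = rest[m.end():]
    return node

config = {'handlers': {'email': {
    'toaddrs': ['support_team@domain.tld', 'dev_team@domain.tld'],
    'subject': 'Houston, we have a problem.'}}}
cfg_resolve(config, 'handlers.email.toaddrs[0]')  # 'support_team@domain.tld'
cfg_resolve(config, 'handlers.email[subject]')    # same as .subject
```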

Handler Ids

Some specific logging configurations require the use of handler levels to achieve the desired effect. However, unlike loggers which can always be identified by their names, handlers have no persistent handles whereby levels can be changed via an incremental configuration call.

Therefore, this PEP proposes to add an optional name property to handlers. If used, this will add an entry in a dictionary which maps the name to the handler. (The entry will be removed when the handler is closed.) When an incremental configuration call is made, handlers will be looked up in this dictionary to set the handler level according to the value in the configuration. See the section on incremental configuration for more details.

In theory, such a "persistent name" facility could also be provided for Filters and Formatters. However, there is not a strong case to be made for being able to configure these incrementally. On the basis that practicality beats purity, only Handlers will be given this new name property. The id of a handler in the configuration will become its name.

The handler name lookup dictionary is for configuration use only and will not become part of the public API for the package.

Dictionary Schema - Detail

The dictionary passed to dictConfig() must contain the following keys:

  • version - to be set to an integer value representing the schema version. The only valid value at present is 1, but having this key allows the schema to evolve while still preserving backwards compatibility.

All other keys are optional, but if present they will be interpreted as described below. In all cases below where a 'configuring dict' is mentioned, it will be checked for the special '()' key to see if a custom instantiation is required. If so, the mechanism described above is used to instantiate; otherwise, the context is used to determine how to instantiate.

  • formatters - the corresponding value will be a dict in which each key is a formatter id and each value is a dict describing how to configure the corresponding Formatter instance.

    The configuring dict is searched for keys format and datefmt (with defaults of None) and these are used to construct a logging.Formatter instance.

  • filters - the corresponding value will be a dict in which each key is a filter id and each value is a dict describing how to configure the corresponding Filter instance.

    The configuring dict is searched for key name (defaulting to the empty string) and this is used to construct a logging.Filter instance.

  • handlers - the corresponding value will be a dict in which each key is a handler id and each value is a dict describing how to configure the corresponding Handler instance.

    The configuring dict is searched for the following keys:

    • class (mandatory). This is the fully qualified name of the handler class.
    • level (optional). The level of the handler.
    • formatter (optional). The id of the formatter for this handler.
    • filters (optional). A list of ids of the filters for this handler.

    All other keys are passed through as keyword arguments to the handler's constructor. For example, given the snippet:

    handlers:
      console:
        class : logging.StreamHandler
        formatter: brief
        level   : INFO
        filters: [allow_foo]
        stream  : ext://sys.stdout
      file:
        class : logging.handlers.RotatingFileHandler
        formatter: precise
        filename: logconfig.log
        maxBytes: 1024
        backupCount: 3
    

    the handler with id console is instantiated as a logging.StreamHandler, using sys.stdout as the underlying stream. The handler with id file is instantiated as a logging.handlers.RotatingFileHandler with the keyword arguments filename='logconfig.log', maxBytes=1024, backupCount=3.

  • loggers - the corresponding value will be a dict in which each key is a logger name and each value is a dict describing how to configure the corresponding Logger instance.

    The configuring dict is searched for the following keys:

    • level (optional). The level of the logger.
    • propagate (optional). The propagation setting of the logger.
    • filters (optional). A list of ids of the filters for this logger.
    • handlers (optional). A list of ids of the handlers for this logger.

    The specified loggers will be configured according to the level, propagation, filters and handlers specified.

  • root - this will be the configuration for the root logger. Processing of the configuration will be as for any logger, except that the propagate setting will not be applicable.

  • incremental - whether the configuration is to be interpreted as incremental to the existing configuration. This value defaults to False, which means that the specified configuration replaces the existing configuration with the same semantics as used by the existing fileConfig() API.

    If the specified value is True, the configuration is processed as described in the section on Incremental Configuration, below.

  • disable_existing_loggers - whether any existing loggers are to be disabled. This setting mirrors the parameter of the same name in fileConfig(). If absent, this parameter defaults to True. This value is ignored if incremental is True.

A Working Example

The following is an actual working configuration in YAML format (except that the email addresses are bogus):

formatters:
  brief:
    format: '%(levelname)-8s: %(name)-15s: %(message)s'
  precise:
    format: '%(asctime)s %(name)-15s %(levelname)-8s %(message)s'
filters:
  allow_foo:
    name: foo
handlers:
  console:
    class : logging.StreamHandler
    formatter: brief
    level   : INFO
    stream  : ext://sys.stdout
    filters: [allow_foo]
  file:
    class : logging.handlers.RotatingFileHandler
    formatter: precise
    filename: logconfig.log
    maxBytes: 1024
    backupCount: 3
  debugfile:
    class : logging.FileHandler
    formatter: precise
    filename: logconfig-detail.log
    mode: a
  email:
    class: logging.handlers.SMTPHandler
    mailhost: localhost
    fromaddr: my_app@domain.tld
    toaddrs:
      - support_team@domain.tld
      - dev_team@domain.tld
    subject: Houston, we have a problem.
loggers:
  foo:
    level : ERROR
    handlers: [debugfile]
  spam:
    level : CRITICAL
    handlers: [debugfile]
    propagate: no
  bar.baz:
    level: WARNING
root:
  level     : DEBUG
  handlers  : [console, file]

Incremental Configuration

It is difficult to provide complete flexibility for incremental configuration. For example, because objects such as filters and formatters are anonymous, once a configuration is set up, it is not possible to refer to such anonymous objects when augmenting a configuration.

Furthermore, there is not a compelling case for arbitrarily altering the object graph of loggers, handlers, filters, formatters at run-time, once a configuration is set up; the verbosity of loggers and handlers can be controlled just by setting levels (and, in the case of loggers, propagation flags). Changing the object graph arbitrarily in a safe way is problematic in a multi-threaded environment; while not impossible, the benefits are not worth the complexity it adds to the implementation.

Thus, when the incremental key of a configuration dict is present and is True, the system will ignore any formatters and filters entries completely, and process only the level settings in the handlers entries, and the level and propagate settings in the loggers and root entries.
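For example, an update that only adjusts verbosity might look like this (the logger names are illustrative):

```python
import logging
import logging.config

# First, a normal (replacing) configuration.
logging.config.dictConfig({
    'version': 1,
    'loggers': {'foo': {'level': 'ERROR'}},
    'root': {'level': 'DEBUG'},
})

# Later, an incremental update: only level (and propagate) settings
# are honoured; any formatters/filters entries would be ignored.
logging.config.dictConfig({
    'version': 1,
    'incremental': True,
    'loggers': {'foo': {'level': 'DEBUG'}},
    'root': {'level': 'WARNING'},
})
```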

It's certainly possible to provide incremental configuration by other means, for example making dictConfig() take an incremental keyword argument which defaults to False. The reason for suggesting that a value in the configuration dict be used is that it allows for configurations to be sent over the wire as pickled dicts to a socket listener. Thus, the logging verbosity of a long-running application can be altered over time with no need to stop and restart the application.

Note: Feedback on incremental configuration needs based on your practical experience will be particularly welcome.

API Customization

The bare-bones dictConfig() API will not be sufficient for all use cases. Provision for customization of the API will be made by providing the following:

  • A class, called DictConfigurator, whose constructor is passed the dictionary used for configuration, and which has a configure() method.
  • A callable, called dictConfigClass, which will (by default) be set to DictConfigurator. This is provided so that if desired, DictConfigurator can be replaced with a suitable user-defined implementation.

The dictConfig() function will call dictConfigClass passing the specified dictionary, and then call the configure() method on the returned object to actually put the configuration into effect:

def dictConfig(config):
    dictConfigClass(config).configure()

This should cater to all customization needs. For example, a subclass of DictConfigurator could call DictConfigurator.__init__() in its own __init__(), then set up custom prefixes which would be usable in the subsequent configure() call. The dictConfigClass would be bound to the subclass, and then dictConfig() could be called exactly as in the default, uncustomized state.
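For example, a subclass could register an extra value prefix before configuration proceeds (the env:// prefix, the EnvConfigurator name, and the value_converters registry hook shown here are illustrative of the stdlib implementation of this proposal, not mandated by the PEP):

```python
import logging.config
import os

class EnvConfigurator(logging.config.DictConfigurator):
    """Sketch of a customized configurator: adds a hypothetical
    'env://' prefix that resolves to environment variables."""
    def __init__(self, config):
        logging.config.DictConfigurator.__init__(self, config)
        # extend the standard prefix registry (ext://, cfg://)
        self.value_converters = dict(self.value_converters,
                                     env='env_convert')

    def env_convert(self, suffix):
        return os.environ.get(suffix, '')

# Bind the custom class; dictConfig() will now use it.
logging.config.dictConfigClass = EnvConfigurator

os.environ['PEP391_DEMO'] = 'hello'
value = EnvConfigurator({'version': 1}).convert('env://PEP391_DEMO')
```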

Change to Socket Listener Implementation

The existing socket listener implementation will be modified as follows: when a configuration message is received, an attempt will be made to deserialize to a dictionary using the json module. If this step fails, the message will be assumed to be in the fileConfig format and processed as before. If deserialization is successful, then dictConfig() will be called to process the resulting dictionary.

Configuration Errors

If an error is encountered during configuration, the system will raise a ValueError, TypeError, AttributeError or ImportError with a suitably descriptive message. The following is a (possibly incomplete) list of conditions which will raise an error:

  • A level which is not a string or which is a string not corresponding to an actual logging level
  • A propagate value which is not a boolean
  • An id which does not have a corresponding destination
  • A non-existent handler id found during an incremental call
  • An invalid logger name
  • Inability to resolve to an internal or external object
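For example, with the implementation that later landed in the standard library (Python 3.2 and later), an unknown level name is rejected with a ValueError:

```python
import logging.config

bad_config = {
    "version": 1,
    "loggers": {"app": {"level": "NO_SUCH_LEVEL"}},  # not a real logging level
}
try:
    logging.config.dictConfig(bad_config)
except ValueError as exc:
    print("rejected:", exc)
```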

Discussion in the community

The PEP has been announced on python-dev and python-list. While there hasn't been a huge amount of discussion, this is perhaps to be expected for a niche topic.

Discussion threads on python-dev:

http://mail.python.org/pipermail/python-dev/2009-October/092695.html
http://mail.python.org/pipermail/python-dev/2009-October/092782.html
http://mail.python.org/pipermail/python-dev/2009-October/093062.html

And on python-list:

http://mail.python.org/pipermail/python-list/2009-October/1223658.html
http://mail.python.org/pipermail/python-list/2009-October/1224228.html

There have been some comments in favour of the proposal, no objections to the proposal as a whole, and some questions and objections about specific details. These are believed by the author to have been addressed by making changes to the PEP.

Reference implementation

A reference implementation of the changes is available as a module dictconfig.py with accompanying unit tests in test_dictconfig.py, at:

http://bitbucket.org/vinay.sajip/dictconfig

This incorporates all features other than the socket listener change.

References

[1]PEP 8, Style Guide for Python Code, van Rossum, Warsaw (http://www.python.org/dev/peps/pep-0008)

pep-0392 Python 3.2 Release Schedule

PEP:392
Title:Python 3.2 Release Schedule
Version:$Revision$
Last-Modified:$Date$
Author:Georg Brandl <georg at python.org>
Status:Active
Type:Informational
Content-Type:text/x-rst
Created:30-Dec-2009
Python-Version:3.2

Abstract

This document describes the development and release schedule for the Python 3.2 series. The schedule primarily concerns itself with PEP-sized items.

Release Manager and Crew

  • 3.2 Release Manager: Georg Brandl
  • Windows installers: Martin v. Loewis
  • Mac installers: Ronald Oussoren
  • Documentation: Georg Brandl

3.2 Lifespan

3.2 will receive bugfix updates approximately every 4-6 months for approximately 18 months. After the release of 3.3.0 final (see PEP 398), a final 3.2 bugfix update will be released. After that, security updates (source only) will be released until 5 years after the release of 3.2 final, which will be February 2016.

Release Schedule

3.2 schedule

  • 3.2 alpha 1: August 1, 2010
  • 3.2 alpha 2: September 6, 2010
  • 3.2 alpha 3: October 12, 2010
  • 3.2 alpha 4: November 16, 2010
  • 3.2 beta 1: December 6, 2010

(No new features beyond this point.)

  • 3.2 beta 2: December 20, 2010
  • 3.2 candidate 1: January 16, 2011
  • 3.2 candidate 2: January 31, 2011
  • 3.2 candidate 3: February 14, 2011
  • 3.2 final: February 20, 2011

3.2.1 schedule

  • 3.2.1 beta 1: May 8, 2011
  • 3.2.1 candidate 1: May 17, 2011
  • 3.2.1 candidate 2: July 3, 2011
  • 3.2.1 final: July 11, 2011

3.2.2 schedule

  • 3.2.2 candidate 1: August 14, 2011
  • 3.2.2 final: September 4, 2011

3.2.3 schedule

  • 3.2.3 candidate 1: February 25, 2012
  • 3.2.3 candidate 2: March 18, 2012
  • 3.2.3 final: April 11, 2012

3.2.4 schedule

  • 3.2.4 candidate 1: March 23, 2013
  • 3.2.4 final: April 6, 2013

3.2.5 schedule (regression fix release)

  • 3.2.5 final: May 13, 2013

-- Only security releases after 3.2.5 --

3.2.6 schedule

  • 3.2.6 candidate 1 (source-only release): October 4, 2014
  • 3.2.6 final (source-only release): October 11, 2014

Features for 3.2

Note that PEP 3003 [1] is in effect: no changes to language syntax and no additions to the builtins may be made.

No large-scale changes have been recorded yet.

pep-0393 Flexible String Representation

PEP:393
Title:Flexible String Representation
Version:$Revision$
Last-Modified:$Date$
Author:Martin v. Löwis <martin at v.loewis.de>
Status:Final
Type:Standards Track
Content-Type:text/x-rst
Created:24-Jan-2010
Python-Version:3.3
Post-History:

Abstract

The Unicode string type is changed to support multiple internal representations, depending on the character with the largest Unicode ordinal (1, 2, or 4 bytes). This will allow a space-efficient representation in common cases, but give access to full UCS-4 on all systems. For compatibility with existing APIs, several representations may exist in parallel; over time, this compatibility should be phased out. The distinction between narrow and wide Unicode builds is dropped. An implementation of this PEP is available at [1].

Rationale

There are two classes of complaints about the current implementation of the unicode type: on systems only supporting UTF-16, users complain that non-BMP characters are not properly supported. On systems using UCS-4 internally (and also sometimes on systems using UCS-2), there is a complaint that Unicode strings take up too much memory - especially compared to Python 2.x, where the same code would often use ASCII strings (i.e. ASCII-encoded byte strings). With the proposed approach, ASCII-only Unicode strings will again use only one byte per character; while still allowing efficient indexing of strings containing non-BMP characters (as strings containing them will use 4 bytes per character).

One problem with the approach is support for existing applications (e.g. extension modules). For compatibility, redundant representations may be computed. Applications are encouraged to phase out reliance on a specific internal representation if possible. As interaction with other libraries will often require some sort of internal representation, the specification chooses UTF-8 as the recommended way of exposing strings to C code.

For many strings (e.g. ASCII), multiple representations may actually share memory (e.g. the shortest form may be shared with the UTF-8 form if all characters are ASCII). With such sharing, the overhead of compatibility representations is reduced. If representations do share data, it is also possible to omit structure fields, reducing the base size of string objects.

Specification

Unicode structures are now defined as a hierarchy of structures, namely:

typedef struct {
  PyObject_HEAD
  Py_ssize_t length;
  Py_hash_t hash;
  struct {
      unsigned int interned:2;
      unsigned int kind:2;
      unsigned int compact:1;
      unsigned int ascii:1;
      unsigned int ready:1;
  } state;
  wchar_t *wstr;
} PyASCIIObject;

typedef struct {
  PyASCIIObject _base;
  Py_ssize_t utf8_length;
  char *utf8;
  Py_ssize_t wstr_length;
} PyCompactUnicodeObject;

typedef struct {
  PyCompactUnicodeObject _base;
  union {
      void *any;
      Py_UCS1 *latin1;
      Py_UCS2 *ucs2;
      Py_UCS4 *ucs4;
  } data;
} PyUnicodeObject;

Objects for which both size and maximum character are known at creation time are called "compact" unicode objects; character data immediately follow the base structure. If the maximum character is less than 128, they use the PyASCIIObject structure, and the UTF-8 data, the UTF-8 length and the wstr length are the same as the length of the ASCII data. For non-ASCII strings, the PyCompactUnicodeObject structure is used. Resizing compact objects is not supported.

Objects for which the maximum character is not given at creation time are called "legacy" objects, created through PyUnicode_FromStringAndSize(NULL, length). They use the PyUnicodeObject structure. Initially, their data is only in the wstr pointer; when PyUnicode_READY is called, the data pointer (union) is allocated. Resizing is possible as long as PyUnicode_READY has not been called.

The fields have the following interpretations:

  • length: number of code points in the string (result of sq_length)

  • interned: interned-state (SSTATE_*) as in 3.2

  • kind: form of string
    • 00 => str is not initialized (data are in wstr)
    • 01 => 1 byte (Latin-1)
    • 10 => 2 byte (UCS-2)
    • 11 => 4 byte (UCS-4);
  • compact: the object uses one of the compact representations (implies ready)

  • ascii: the object uses the PyASCIIObject representation (implies compact and ready)

  • ready: the canonical representation is ready to be accessed through PyUnicode_DATA and PyUnicode_GET_LENGTH. This is set either if the object is compact, or the data pointer and length have been initialized.

  • wstr_length, wstr: representation in the platform's wchar_t (null-terminated). If wchar_t is 16-bit, this form may use surrogate pairs (in which case wstr_length differs from length). wstr_length differs from length only if there are surrogate pairs in the representation.

  • utf8_length, utf8: UTF-8 representation (null-terminated).

  • data: shortest-form representation of the unicode string. The string is null-terminated (in its respective representation).

All three representations are optional, although the data form is considered the canonical representation which can be absent only while the string is being created. If the representation is absent, the pointer is NULL, and the corresponding length field may contain arbitrary data.

The Py_UNICODE type is still supported but deprecated. It is always defined as a typedef for wchar_t, so the wstr representation can double as Py_UNICODE representation.

The data and utf8 pointers point to the same memory if the string uses only ASCII characters (using only Latin-1 is not sufficient). The data and wstr pointers point to the same memory if the string happens to fit exactly to the wchar_t type of the platform (i.e. uses some BMP-not-Latin-1 characters if sizeof(wchar_t) is 2, and uses some non-BMP characters if sizeof(wchar_t) is 4).

String Creation

The recommended way to create a Unicode object is to use the function PyUnicode_New:

PyObject* PyUnicode_New(Py_ssize_t size, Py_UCS4 maxchar);

Both parameters must denote the eventual size/range of the string. In particular, codecs using this API must compute both the number of characters and the maximum character in advance. A string is allocated according to the specified size and character range and is null-terminated; the actual characters in it may be uninitialized.

PyUnicode_FromString and PyUnicode_FromStringAndSize remain supported for processing UTF-8 input; the input is decoded, and the UTF-8 representation is not yet set for the string.

PyUnicode_FromUnicode remains supported but is deprecated. If the Py_UNICODE pointer is non-null, the data representation is set. If the pointer is NULL, a properly-sized wstr representation is allocated, which can be modified until PyUnicode_READY() is called (explicitly or implicitly). Resizing a Unicode string remains possible until it is finalized.

PyUnicode_READY() converts a string containing only a wstr representation into the canonical representation. Unless wstr and data can share the memory, the wstr representation is discarded after the conversion. The macro returns 0 on success and -1 on failure, which happens in particular if the memory allocation fails.

String Access

The canonical representation can be accessed using two macros, PyUnicode_KIND and PyUnicode_DATA. PyUnicode_KIND gives one of the values PyUnicode_WCHAR_KIND (0), PyUnicode_1BYTE_KIND (1), PyUnicode_2BYTE_KIND (2), or PyUnicode_4BYTE_KIND (3). PyUnicode_DATA gives the void pointer to the data. Access to individual characters should use PyUnicode_{READ|WRITE}[_CHAR]:

  • PyUnicode_READ(kind, data, index)
  • PyUnicode_WRITE(kind, data, index, value)
  • PyUnicode_READ_CHAR(unicode, index)

All these macros assume that the string is in canonical form; callers need to ensure this by calling PyUnicode_READY.

A new function PyUnicode_AsUTF8 is provided to access the UTF-8 representation. It is thus identical to the existing _PyUnicode_AsString, which is removed. The function will compute the UTF-8 representation when first called. Since this representation will consume memory until the string object is released, applications should use the existing PyUnicode_AsUTF8String where possible (which generates a new bytes object every time). APIs that implicitly convert a string to a char* (such as the ParseTuple functions) will use PyUnicode_AsUTF8 to compute the conversion.

New API

This section summarizes the API additions.

Macros to access the internal representation of a Unicode object (read-only):

  • PyUnicode_IS_COMPACT_ASCII(o), PyUnicode_IS_COMPACT(o), PyUnicode_IS_READY(o)
  • PyUnicode_GET_LENGTH(o)
  • PyUnicode_KIND(o), PyUnicode_CHARACTER_SIZE(o), PyUnicode_MAX_CHAR_VALUE(o)
  • PyUnicode_DATA(o), PyUnicode_1BYTE_DATA(o), PyUnicode_2BYTE_DATA(o), PyUnicode_4BYTE_DATA(o)

Character access macros:

  • PyUnicode_READ(kind, data, index), PyUnicode_READ_CHAR(o, index)
  • PyUnicode_WRITE(kind, data, index, value)

Other macros:

  • PyUnicode_READY(o)
  • PyUnicode_CONVERT_BYTES(from_type, to_type, begin, end, to)

String creation functions:

  • PyUnicode_New(size, maxchar)
  • PyUnicode_FromKindAndData(kind, data, size)
  • PyUnicode_Substring(o, start, end)

Character access utility functions:

  • PyUnicode_GetLength(o), PyUnicode_ReadChar(o, index), PyUnicode_WriteChar(o, index, character)
  • PyUnicode_CopyCharacters(to, to_start, from, from_start, how_many)
  • PyUnicode_FindChar(str, ch, start, end, direction)

Representation conversion:

  • PyUnicode_AsUCS4(o, buffer, buflen)
  • PyUnicode_AsUCS4Copy(o)
  • PyUnicode_AsUnicodeAndSize(o, size_out)
  • PyUnicode_AsUTF8(o)
  • PyUnicode_AsUTF8AndSize(o, size_out)

UCS4 utility functions:

  • Py_UCS4_{strlen, strcpy, strcat, strncpy, strcmp, strncmp, strchr, strrchr}

Stable ABI

The following functions are added to the stable ABI (PEP 384), as they are independent of the actual representation of Unicode objects: PyUnicode_New, PyUnicode_Substring, PyUnicode_GetLength, PyUnicode_ReadChar, PyUnicode_WriteChar, PyUnicode_Find, PyUnicode_FindChar.

GDB Debugging Hooks

Tools/gdb/libpython.py contains debugging hooks that embed knowledge about the internals of CPython's data types, including PyUnicodeObject instances. It has been updated to track the change.

Deprecations, Removals, and Incompatibilities

While the Py_UNICODE representation and APIs are deprecated with this PEP, no removal of the respective APIs is scheduled. The APIs should remain available at least five years after the PEP is accepted; before they are removed, existing extension modules should be studied to find out whether a sufficient majority of the open-source code on PyPI has been ported to the new API. A reasonable motivation for using the deprecated API even in new code is code that must work on both Python 2 and Python 3.

The following macros and functions are deprecated:

  • PyUnicode_FromUnicode
  • PyUnicode_GET_SIZE, PyUnicode_GetSize, PyUnicode_GET_DATA_SIZE,
  • PyUnicode_AS_UNICODE, PyUnicode_AsUnicode, PyUnicode_AsUnicodeAndSize
  • PyUnicode_COPY, PyUnicode_FILL, PyUnicode_MATCH
  • PyUnicode_Encode, PyUnicode_EncodeUTF7, PyUnicode_EncodeUTF8, PyUnicode_EncodeUTF16, PyUnicode_EncodeUTF32, PyUnicode_EncodeUnicodeEscape, PyUnicode_EncodeRawUnicodeEscape, PyUnicode_EncodeLatin1, PyUnicode_EncodeASCII, PyUnicode_EncodeCharmap, PyUnicode_TranslateCharmap, PyUnicode_EncodeMBCS, PyUnicode_EncodeDecimal, PyUnicode_TransformDecimalToASCII
  • Py_UNICODE_{strlen, strcat, strcpy, strcmp, strchr, strrchr}
  • PyUnicode_AsUnicodeCopy
  • PyUnicode_GetMax

_PyUnicode_AsDefaultEncodedString is removed. It previously returned a borrowed reference to a UTF-8-encoded bytes object. Since the unicode object can no longer cache such a reference, implementing it without leaking memory is not possible. No deprecation phase is provided, since it was an API for internal use only.

Extension modules using the legacy API may inadvertently call PyUnicode_READY, by calling some API that requires that the object is ready, and then continue accessing the (now invalid) Py_UNICODE pointer. Such code will break with this PEP. The code was already flawed in 3.2, as there was no explicit guarantee that the PyUnicode_AS_UNICODE result would stay valid after an API call (due to the possibility of string resizing). Modules that face this issue need to re-fetch the Py_UNICODE pointer after API calls; doing so will continue to work correctly in earlier Python versions.

Discussion

Several concerns have been raised about the approach presented here:

It makes the implementation more complex. That's true, but considered worth it given the benefits.

The Py_UNICODE representation is not instantaneously available, slowing down applications that request it. While this is also true, applications that care about this problem can be rewritten to use the data representation.

Performance

Performance of this patch must be considered for both memory consumption and runtime efficiency. For memory consumption, the expectation is that applications that have many large strings will see a reduction in memory usage. For small strings, the effects depend on the pointer size of the system, and the size of the Py_UNICODE/wchar_t type. The following table demonstrates this for various small ASCII and Latin-1 string sizes and platforms.

string size   Python 3.2                           This PEP
              16-bit wchar_t    32-bit wchar_t     ASCII             Latin-1
              32-bit  64-bit    32-bit  64-bit     32-bit  64-bit    32-bit  64-bit
1             32      64        40      64         32      56        40      80
2             40      64        40      72         32      56        40      80
3             40      64        48      72         32      56        40      80
4             40      72        48      80         32      56        48      80
5             40      72        56      80         32      56        48      80
6             48      72        56      88         32      56        48      80
7             48      72        64      88         32      56        48      80
8             48      80        64      96         40      64        48      88

The runtime effect is significantly affected by the API being used. After porting the relevant pieces of code to the new API, the iobench, stringbench, and json benchmarks typically show slowdowns of 1% to 30%; on specific benchmarks, speedups may occur, as may significantly larger slowdowns.

In actual measurements of a Django application ([2]), significant reductions of memory usage could be found. For example, the storage for Unicode objects reduced to 2216807 bytes, down from 6378540 bytes for a wide Unicode build, and down from 3694694 bytes for a narrow Unicode build (all on a 32-bit system). This reduction came from the prevalence of ASCII strings in this application; out of 36,000 strings (with 1,310,000 chars), 35,713 were ASCII strings (with 1,300,000 chars). The sources of these strings were not analysed further; many of them likely originate from identifiers in the library and string constants in Django's source code.

In comparison to Python 2, both Unicode and byte strings need to be accounted for. In the test application, Unicode and byte strings combined had a length of 2,046,000 units (bytes/chars) in 2.x, and 2,200,000 units in 3.x. On a 32-bit system, where the 2.x build used 32-bit wchar_t/Py_UNICODE, the 2.x test used 3,620,000 bytes, and the 3.x build 3,340,000 bytes. This reduction in 3.x using the PEP compared to 2.x only occurs when comparing with a wide unicode build.

Porting Guidelines

Only a small fraction of C code is affected by this PEP, namely code that needs to look "inside" unicode strings. That code doesn't necessarily need to be ported to this API, as the existing API will continue to work correctly. In particular, modules that need to support both Python 2 and Python 3 might get too complicated when simultaneously supporting this new API and the old Unicode API.

In order to port modules to the new API, try to eliminate the use of these API elements:

  • the Py_UNICODE type,
  • PyUnicode_AS_UNICODE and PyUnicode_AsUnicode,
  • PyUnicode_GET_SIZE and PyUnicode_GetSize, and
  • PyUnicode_FromUnicode.

When iterating over an existing string, or looking at specific characters, use indexing operations rather than pointer arithmetic; indexing works well with PyUnicode_READ(_CHAR) and PyUnicode_WRITE. Use void* as the buffer type for characters to let the compiler detect invalid dereferencing operations. If you do want to use pointer arithmetic (e.g. when converting existing code), use (unsigned) char* as the buffer type, and keep the element size (1, 2, or 4) in a variable. Note that (1<<(kind-1)) will produce the element size given a buffer kind.

When creating new strings, it was common in Python to start off with a heuristic buffer size and then grow or shrink it if the heuristic failed. With this PEP, this is now less practical, as you need a heuristic not only for the length of the string, but also for the maximum character.

In order to avoid heuristics, you need to make two passes over the input: once to determine the output length, and the maximum character; then allocate the target string with PyUnicode_New and iterate over the input a second time to produce the final output. While this may sound expensive, it could actually be cheaper than having to copy the result again as in the following approach.

If you take the heuristic route, avoid allocating a string meant to be resized, as resizing strings won't work for their canonical representation. Instead, allocate a separate buffer to collect the characters, and then construct a unicode object from that buffer using PyUnicode_FromKindAndData. One option is to use Py_UCS4 as the buffer element, assuming the worst case for character ordinals. This allows pointer arithmetic, but may require a lot of memory. Alternatively, start with a 1-byte buffer and increase the element size as you encounter larger characters. In any case, PyUnicode_FromKindAndData will scan over the buffer to verify the maximum character.

For common tasks, direct access to the string representation may not be necessary: PyUnicode_Find, PyUnicode_FindChar, PyUnicode_Ord, and PyUnicode_CopyCharacters help in analyzing and creating string objects, operating on indexes instead of data pointers.

pep-0394 The "python" Command on Unix-Like Systems

PEP:394
Title:The "python" Command on Unix-Like Systems
Version:$Revision$
Last-Modified:$Date$
Author:Kerrick Staley <mail at kerrickstaley.com>, Nick Coghlan <ncoghlan at gmail.com>, Barry Warsaw <barry at python.org>
Status:Active
Type:Informational
Content-Type:text/x-rst
Created:02-Mar-2011
Post-History:04-Mar-2011, 20-Jul-2011, 16-Feb-2012, 30-Sep-2014
Resolution:http://mail.python.org/pipermail/python-dev/2012-February/116594.html

Abstract

This PEP provides a convention to ensure that Python scripts can continue to be portable across *nix systems, regardless of the default version of the Python interpreter (i.e. the version invoked by the python command).

  • python2 will refer to some version of Python 2.x.
  • python3 will refer to some version of Python 3.x.
  • for the time being, all distributions should ensure that python refers to the same target as python2.
  • however, end users should be aware that python refers to python3 on at least Arch Linux (that change is what prompted the creation of this PEP), so python should be used in the shebang line only for scripts that are source compatible with both Python 2 and 3.
  • in preparation for an eventual change in the default version of Python, Python 2 only scripts should either be updated to be source compatible with Python 3 or else to use python2 in the shebang line.

Recommendation

  • Unix-like software distributions (including systems like Mac OS X and Cygwin) should install the python2 command into the default path whenever a version of the Python 2 interpreter is installed, and the same for python3 and the Python 3 interpreter.
  • When invoked, python2 should run some version of the Python 2 interpreter, and python3 should run some version of the Python 3 interpreter.
  • The more general python command should be installed whenever any version of Python 2 is installed and should invoke the same version of Python as the python2 command (however, note that some distributions have already chosen to have python implement the python3 command; see the Rationale and Migration Notes below).
  • The Python 2.x idle, pydoc, and python-config commands should likewise be available as idle2, pydoc2, and python2-config, with the original commands invoking these versions by default, but possibly invoking the Python 3.x versions instead if configured to do so by the system administrator.
  • In order to tolerate differences across platforms, all new code that needs to invoke the Python interpreter should not specify python, but rather should specify either python2 or python3 (or the more specific python2.x and python3.x versions; see the Migration Notes). This distinction should be made in shebangs, when invoking from a shell script, when invoking via the system() call, or when invoking in any other context.
  • One exception to this is scripts that are deliberately written to be source compatible with both Python 2.x and 3.x. Such scripts may continue to use python on their shebang line without affecting their portability.
  • When reinvoking the interpreter from a Python script, querying sys.executable to avoid hardcoded assumptions regarding the interpreter location remains the preferred approach.

These recommendations are the outcome of the relevant python-dev discussions in March and July 2011 ([1], [2]), February 2012 ([4]) and September 2014 ([6]).
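The sys.executable guidance above can be sketched as follows; re-launching the interpreter this way guarantees the child runs the same version as the parent, rather than whatever python resolves to on the system PATH:

```python
import subprocess
import sys

# Re-invoke the exact interpreter that is currently running.
result = subprocess.run(
    [sys.executable, "-c", "import sys; print(sys.version_info[0])"],
    capture_output=True, text=True, check=True,
)
print("child major version:", result.stdout.strip())
```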

Rationale

This recommendation is needed as, even though the majority of distributions still alias the python command to Python 2, some now alias it to Python 3 ([5]). As some of the former distributions did not provide a python2 command by default, there was previously no way for Python 2 code (or any code that invokes the Python 2 interpreter directly rather than via sys.executable) to reliably run on all Unix-like systems without modification, as the python command would invoke the wrong interpreter version on some systems, and the python2 command would fail completely on others. The recommendations in this PEP provide a very simple mechanism to restore cross-platform support, with minimal additional work required on the part of distribution maintainers.

Future Changes to this Recommendation

It is anticipated that there will eventually come a time where the third party ecosystem surrounding Python 3 is sufficiently mature for this recommendation to be updated to suggest that the python symlink refer to python3 rather than python2.

This recommendation will be periodically reviewed over the next few years, and updated when the core development team judges it appropriate. As a point of reference, regular maintenance releases for the Python 2.7 series will continue until at least 2020.

Migration Notes

This section does not contain any official recommendations from the core CPython developers. It's merely a collection of notes regarding various aspects of migrating to Python 3 as the default version of Python for a system. They will hopefully be helpful to any distributions considering making such a change.

  • The main barrier to a distribution switching the python command from python2 to python3 isn't breakage within the distribution, but instead breakage of private third party scripts developed by sysadmins and other users. Updating the python command to invoke python3 by default indicates that a distribution is willing to break such scripts with errors that are potentially quite confusing for users that aren't yet familiar with the backwards incompatible changes in Python 3. For example, while the change of print from a statement to a builtin function is relatively simple for automated converters to handle, the SyntaxError from attempting to use the Python 2 notation in versions of Python 3 prior to 3.4.2 is thoroughly confusing if you aren't already aware of the change:

    $ python3 -c 'print "Hello, world!"'
      File "<string>", line 1
        print "Hello, world!"
                            ^
    SyntaxError: invalid syntax
    

    (In Python 3.4.2+, that generic error message has been replaced with the more explicit "SyntaxError: Missing parentheses in call to 'print'")

  • Avoiding breakage of such third party scripts is the key reason this PEP recommends that python continue to refer to python2 for the time being. Until the conventions described in this PEP are more widely adopted, having python invoke python2 will remain the recommended option.

  • The pythonX.X (e.g. python2.6) commands exist on some systems, on which they invoke specific minor versions of the Python interpreter. It can be useful for distribution-specific packages to take advantage of these utilities if they exist, since it will prevent code breakage if the default minor version of a given major version is changed. However, scripts intending to be cross-platform should not rely on the presence of these utilities, but rather should be tested on several recent minor versions of the target major version, compensating, if necessary, for the small differences that exist between minor versions. This prevents the need for sysadmins to install many very similar versions of the interpreter.

  • When the pythonX.X binaries are provided by a distribution, the python2 and python3 commands should refer to one of those files rather than being provided as a separate binary file.

  • It is suggested that even distribution-specific packages follow the python2/python3 convention, even in code that is not intended to operate on other distributions. This will reduce problems if the distribution later decides to change the version of the Python interpreter that the python command invokes, or if a sysadmin installs a custom python command with a different major version than the distribution default. Distributions can test whether they are fully following this convention by changing the python interpreter on a test box and checking to see if anything breaks.

  • If the above point is adhered to and sysadmins are permitted to change the python command, then the python command should always be implemented as a link to the interpreter binary (or a link to a link) and not vice versa. That way, if a sysadmin does decide to replace the installed python file, they can do so without inadvertently deleting the previously installed binary.

  • If the Python 2 interpreter becomes uncommon, scripts should nevertheless continue to use the python3 convention rather than just python. This will ease transition in the event that yet another major version of Python is released.

  • If these conventions are adhered to, it will become the case that the python command is only executed in an interactive manner as a user convenience, or to run scripts that are source compatible with both Python 2 and Python 3.

Backwards Compatibility

A potential problem can arise if a script adhering to the python2/python3 convention is executed on a system not supporting these commands. This is mostly a non-issue, since the sysadmin can simply create these symbolic links and avoid further problems. It is a significantly more obvious breakage than the sometimes cryptic errors that can arise when attempting to execute a script containing Python 2 specific syntax with a Python 3 interpreter.

Application to the CPython Reference Interpreter

While technically a new feature, the make install and make bininstall commands in the 2.7 version of CPython were adjusted to create the following chains of symbolic links in the relevant bin directory (the final item listed in the chain is the actual installed binary, preceding items are relative symbolic links):

python -> python2 -> python2.7
python-config -> python2-config -> python2.7-config

Similar adjustments were made to the Mac OS X binary installer.

This feature first appeared in the default installation process in CPython 2.7.3.

The installation commands in the CPython 3.x series already create the appropriate symlinks. For example, CPython 3.2 creates:

python3 -> python3.2
idle3 -> idle3.2
pydoc3 -> pydoc3.2
python3-config -> python3.2-config

And CPython 3.3 creates:

python3 -> python3.3
idle3 -> idle3.3
pydoc3 -> pydoc3.3
python3-config -> python3.3-config
pysetup3 -> pysetup3.3

The implementation progress of these features in the default installers was managed on the tracker as issue #12627 ([3]).

Impact on PYTHON* Environment Variables

The choice of target for the python command implicitly affects a distribution's expected interpretation of the various Python related environment variables. The use of *.pth files in the relevant site-packages folder, the "per-user site packages" feature (see python -m site) or more flexible tools such as virtualenv are all more tolerant of the presence of multiple versions of Python on a system than the direct use of PYTHONPATH.

Exclusion of MS Windows

This PEP deliberately excludes any proposals relating to Microsoft Windows, as devising an equivalent solution for Windows was deemed too complex to handle here. PEP 397 and the related discussion on the python-dev mailing list address this issue (like this PEP, the PEP 397 launcher invokes Python 2 by default if versions of both Python 2 and 3 are installed on the system).

References

[1]Support the /usr/bin/python2 symlink upstream (with bonus grammar class!) (http://mail.python.org/pipermail/python-dev/2011-March/108491.html)
[2]Rebooting PEP 394 (aka Support the /usr/bin/python2 symlink upstream) (http://mail.python.org/pipermail/python-dev/2011-July/112322.html)
[3]Implement PEP 394 in the CPython Makefile (http://bugs.python.org/issue12627)
[4]PEP 394 request for pronouncement (python2 symlink in *nix systems) (http://mail.python.org/pipermail/python-dev/2012-February/116435.html)
[5]Arch Linux announcement that their "python" link now refers to Python 3 (https://www.archlinux.org/news/python-is-now-python-3/)
[6]PEP 394 - Clarification of what "python" command should invoke (https://mail.python.org/pipermail/python-dev/2014-September/136374.html)

pep-0395 Qualified Names for Modules

PEP:395
Title:Qualified Names for Modules
Version:$Revision$
Last-Modified:$Date$
Author:Nick Coghlan <ncoghlan at gmail.com>
Status:Withdrawn
Type:Standards Track
Content-Type:text/x-rst
Created:4-Mar-2011
Python-Version:3.4
Post-History:5-Mar-2011, 19-Nov-2011

PEP Withdrawal

This PEP was withdrawn by the author in December 2013, as other significant changes in the time since it was written have rendered several aspects obsolete. Most notably, PEP 420 (namespace packages) rendered some of the proposals related to package detection unworkable, and PEP 451 (module specifications) resolved the multiprocessing issues and provided a possible means to tackle the pickle compatibility issues.

A future PEP to resolve the remaining issues would still be appropriate, but it's worth starting any such effort as a fresh PEP restating the remaining problems in an updated context rather than trying to build on this one directly.

Abstract

This PEP proposes new mechanisms that eliminate some longstanding traps for the unwary when dealing with Python's import system, as well as serialisation and introspection of functions and classes.

It builds on the "Qualified Name" concept defined in PEP 3155.

Relationship with Other PEPs

Most significantly, this PEP is currently deferred as it requires significant changes in order to be made compatible with the removal of mandatory __init__.py files in PEP 420 (which has been implemented and released in Python 3.3).

This PEP builds on the "qualified name" concept introduced by PEP 3155, and also shares in that PEP's aim of fixing some ugly corner cases when dealing with serialisation of arbitrary functions and classes.

It also builds on PEP 366, which took initial tentative steps towards making explicit relative imports from the main module work correctly in at least some circumstances.

Finally, PEP 328 eliminated implicit relative imports from imported modules. This PEP proposes that the de facto implicit relative imports from main modules that are provided by the current initialisation behaviour for sys.path[0] also be eliminated.

What's in a __name__?

Over time, a module's __name__ attribute has come to be used to handle a number of different tasks.

The key use cases identified for this module attribute are:

  1. Flagging the main module in a program, using the if __name__ == "__main__": convention.
  2. As the starting point for relative imports.
  3. To identify the location of function and class definitions within the running application.
  4. To identify the location of classes for serialisation into pickle objects which may be shared with other interpreter instances.

Traps for the Unwary

The overloading of the semantics of __name__, along with some historically associated behaviour in the initialisation of sys.path[0], has resulted in several traps for the unwary. These traps can be quite annoying in practice, as they are highly unobvious (especially to beginners) and can cause quite confusing behaviour.

Why are my imports broken?

There's a general principle that applies when modifying sys.path: never put a package directory directly on sys.path. The reason this is problematic is that every module in that directory is now potentially accessible under two different names: as a top level module (since the package directory is on sys.path) and as a submodule of the package (if the higher level directory containing the package itself is also on sys.path).
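
The trap can be reproduced in a few lines (a hypothetical scratch layout with a package named example containing a module foo):

```python
import importlib
import os
import sys
import tempfile

# Build a tiny package: project/example/{__init__.py,foo.py}
project = tempfile.mkdtemp()
pkg = os.path.join(project, "example")
os.mkdir(pkg)
open(os.path.join(pkg, "__init__.py"), "w").close()
with open(os.path.join(pkg, "foo.py"), "w") as f:
    f.write("state = []\n")

# The mistake: the package directory itself goes on sys.path...
sys.path.insert(0, pkg)
# ...as well as its parent directory.
sys.path.insert(0, project)

foo_top = importlib.import_module("foo")          # top-level copy
foo_pkg = importlib.import_module("example.foo")  # submodule copy

print(foo_top is foo_pkg)  # False - two distinct module objects
```

Any mutation of foo_top.state is invisible through foo_pkg.state, which is exactly the confusing behaviour described above.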

As an example, Django (up to and including version 1.3) is guilty of setting up exactly this situation for site-specific applications - the application ends up being accessible as both app and site.app in the module namespace, and these are actually two different copies of the module. This is a recipe for confusion if there is any meaningful mutable module level state, so this behaviour is being eliminated from the default site set up in version 1.4 (site-specific apps will always be fully qualified with the site name).

However, it's hard to blame Django for this, when the same part of Python responsible for setting __name__ = "__main__" in the main module commits the exact same error when determining the value for sys.path[0].

The impact of this can be seen relatively frequently if you follow the "python" and "import" tags on Stack Overflow. When I had the time to follow it myself, I regularly encountered people struggling to understand the behaviour of straightforward package layouts like the following (I actually use package layouts along these lines in my own projects):

project/
    setup.py
    example/
        __init__.py
        foo.py
        tests/
            __init__.py
            test_foo.py

While I would often see it without the __init__.py files first, that's a trivial fix to explain. What's hard to explain is that all of the following ways to invoke test_foo.py probably won't work due to broken imports (either failing to find example for absolute imports, complaining about relative imports in a non-package or beyond the top-level package for explicit relative imports, or issuing even more obscure errors if some other submodule happens to shadow the name of a top-level module, such as an example.json module that handled serialisation or an example.tests.unittest test runner):

# These commands will most likely *FAIL*, even if the code is correct

# working directory: project/example/tests
./test_foo.py
python test_foo.py
python -m example.tests.test_foo
python -c "from example.tests.test_foo import main; main()"

# working directory: project/example
tests/test_foo.py
python tests/test_foo.py
python -m example.tests.test_foo
python -c "from example.tests.test_foo import main; main()"

# working directory: project
example/tests/test_foo.py
python example/tests/test_foo.py

# working directory: project/..
project/example/tests/test_foo.py
python project/example/tests/test_foo.py
# The -m and -c approaches don't work from here either, but the failure
# to find 'example' correctly is easier to explain in this case

That's right, that long list is of all the methods of invocation that will almost certainly break if you try them, and the error messages won't make any sense if you're not already intimately familiar not only with the way Python's import system works, but also with how it gets initialised.

For a long time, the only way to get sys.path right with that kind of setup was to either set it manually in test_foo.py itself (hardly something a novice, or even many veteran, Python programmers are going to know how to do) or else to make sure to import the module instead of executing it directly:

# working directory: project
python -c "from example.tests.test_foo import main; main()"

Since the implementation of PEP 366 (which defined a mechanism that allows relative imports to work correctly when a module inside a package is executed via the -m switch), the following also works properly:

# working directory: project
python -m example.tests.test_foo

The fact that most methods of invoking Python code from the command line break when that code is inside a package, and the two that do work are highly sensitive to the current working directory is all thoroughly confusing for a beginner. I personally believe it is one of the key factors leading to the perception that Python packages are complicated and hard to get right.

This problem isn't even limited to the command line - if test_foo.py is open in Idle and you attempt to run it by pressing F5, or if you try to run it by clicking on it in a graphical file browser, then it will fail in just the same way it would if run directly from the command line.

There's a reason the general "no package directories on sys.path" guideline exists, and the fact that the interpreter itself doesn't follow it when determining sys.path[0] is the root cause of all sorts of grief.

In the past, this couldn't be fixed due to backwards compatibility concerns. However, scripts potentially affected by this problem will already require fixes when porting to Python 3.x (due to the elimination of implicit relative imports when importing modules normally). This provides a convenient opportunity to implement a corresponding change in the initialisation semantics for sys.path[0].

Importing the main module twice

Another venerable trap is the issue of importing __main__ twice. This occurs when the main module is also imported under its real name, effectively creating two instances of the same module under different names.

If the state stored in __main__ is significant to the correct operation of the program, or if there is top-level code in the main module that has non-idempotent side effects, then this duplication can cause obscure and surprising errors.
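
The duplication can be demonstrated with a short script (a hypothetical file named dual.py) that imports itself under its real name while running as __main__:

```python
import os
import subprocess
import sys
import tempfile

# dual.py (hypothetical): imported under its real name while also running
# as __main__, which duplicates its module-level state.
code = '''\
registry = []

if __name__ == "__main__":
    import dual                     # loads a SECOND copy as module "dual"
    dual.registry.append("hello")
    print(len(registry), len(dual.registry))
'''
d = tempfile.mkdtemp()
with open(os.path.join(d, "dual.py"), "w") as f:
    f.write(code)

out = subprocess.run([sys.executable, os.path.join(d, "dual.py")],
                     capture_output=True, text=True)
print(out.stdout.strip())  # 0 1 - the two copies have independent state
```

The __main__ copy's registry stays empty while the "dual" copy's registry grows, which is the obscure divergence described above.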

In a bit of a pickle

Something many users may not realise is that the pickle module sometimes relies on the __module__ attribute when serialising instances of arbitrary classes. So instances of classes defined in __main__ are pickled with __main__ recorded as their module, and won't be unpickled correctly by another Python instance that only imported that module instead of running it directly. This behaviour is the underlying reason for the advice from many Python veterans to do as little as possible in the __main__ module in any application that involves any form of object serialisation and persistence.
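
What pickle actually records can be inspected directly: the serialised stream stores classes by reference as module name plus qualified name (Point is a hypothetical class for illustration):

```python
import pickle

class Point:
    def __init__(self, x, y):
        self.x, self.y = x, y

blob = pickle.dumps(Point(1, 2))
# The stream embeds the *defining module* of Point. When that module is
# __main__, a process that merely imported the real module cannot resolve
# the reference and unpickling fails.
print(Point.__module__.encode() in blob)  # True
```

If this file is run directly, the bytes b"__main__" appear in the stream; run via import, the real module name appears instead.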

Similarly, when creating a pseudo-module (see next paragraph), pickles rely on the name of the module where a class is actually defined, rather than the officially documented location for that class in the module hierarchy.

For the purposes of this PEP, a "pseudo-module" is a package designed like the Python 3.2 unittest and concurrent.futures packages. These packages are documented as if they were single modules, but are in fact internally implemented as a package. This is supposed to be an implementation detail that users and other implementations don't need to worry about, but, thanks to pickle (and serialisation in general), the details are often exposed and can effectively become part of the public API.

While this PEP focuses specifically on pickle as the principal serialisation scheme in the standard library, this issue may also affect other mechanisms that support serialisation of arbitrary class instances and rely on __module__ attributes to determine how to handle deserialisation.

Where's the source?

Some sophisticated users of the pseudo-module technique described above recognise the problem with implementation details leaking out via the pickle module, and choose to address it by altering __name__ to refer to the public location for the module before defining any functions or classes (or else by modifying the __module__ attributes of those objects after they have been defined).

This approach is effective at eliminating the leakage of information via pickling, but comes at the cost of breaking introspection for functions and classes (as their __module__ attribute now points to the wrong place).
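
The second variant of the trick can be sketched in a few lines (the names _impl, mylib and Thing are hypothetical):

```python
# Imagine this code lives in mylib/_impl.py, while the documented public
# location for Thing is the top-level "mylib" namespace.
class Thing:
    pass

# Point pickle at the public location by overwriting the recorded module...
Thing.__module__ = "mylib"

# ...at the cost of misleading introspection tools, which will now look
# for Thing's source in "mylib" rather than where it is really defined.
print(Thing.__module__)  # mylib
```

Tools such as inspect.getsource() follow __module__, so after this adjustment they can no longer locate the implementation file.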

Forkless Windows

To get around the lack of os.fork on Windows, the multiprocessing module attempts to re-execute Python with the same main module, but skipping over any code guarded by if __name__ == "__main__": checks. It does the best it can with the information it has, but is forced to make assumptions that simply aren't valid whenever the main module isn't an ordinary directly executed script or top-level module. Packages and non-top-level modules executed via the -m switch, as well as directly executed zipfiles or directories, are likely to make multiprocessing on Windows do the wrong thing (either quietly or noisily, depending on application details) when spawning a new process.

While this issue currently only affects Windows directly, it also impacts any proposals to provide Windows-style "clean process" invocation via the multiprocessing module on other platforms.

Qualified Names for Modules

To make it feasible to fix these problems once and for all, it is proposed to add a new module level attribute: __qualname__. This abbreviation of "qualified name" is taken from PEP 3155, where it is used to store the naming path to a nested class or function definition relative to the top level module.

For modules, __qualname__ will normally be the same as __name__, just as it is for top-level functions and classes in PEP 3155. However, it will differ in some situations so that the above problems can be addressed.

Specifically, whenever __name__ is modified for some other purpose (such as to denote the main module), then __qualname__ will remain unchanged, allowing code that needs it to access the original unmodified value.

If a module loader does not initialise __qualname__ itself, then the import system will add it automatically (setting it to the same value as __name__).

Alternative Names

Two alternative names were also considered for the new attribute: "full name" (__fullname__) and "implementation name" (__implname__).

Either of those would actually be valid for the use case in this PEP. However, as a meta-issue, PEP 3155 is also adding a new attribute (for functions and classes) that is "like __name__, but different in some cases where __name__ is missing necessary information" and those terms aren't accurate for the PEP 3155 function and class use case.

PEP 3155 deliberately omits the module information, so the term "full name" is simply untrue, and "implementation name" implies that it may specify an object other than that specified by __name__, and that is never the case for PEP 3155 (in that PEP, __name__ and __qualname__ always refer to the same function or class, it's just that __name__ is insufficient to accurately identify nested functions and classes).

Since it seems needlessly inconsistent to add two new terms for attributes that only exist because backwards compatibility concerns keep us from changing the behaviour of __name__ itself, this PEP instead chose to adopt the PEP 3155 terminology.

If the relative inscrutability of "qualified name" and __qualname__ encourages interested developers to look them up at least once rather than assuming they know what they mean just from the name and guessing wrong, that's not necessarily a bad outcome.

Besides, 99% of Python developers should never need to care that these extra attributes even exist - they're really an implementation detail to let us fix a few problematic behaviours exhibited by imports, pickling and introspection, not something people are going to be dealing with on a regular basis.

Eliminating the Traps

The following changes are interrelated and make the most sense when considered together. They collectively either completely eliminate the traps for the unwary noted above, or else provide straightforward mechanisms for dealing with them.

A rough draft of some of the concepts presented here was first posted on the python-ideas list ([1]), but they have evolved considerably since first being discussed in that thread. Further discussion has subsequently taken place on the import-sig mailing list ([2], [3]).

Fixing main module imports inside packages

To eliminate this trap, it is proposed that an additional filesystem check be performed when determining a suitable value for sys.path[0]. This check will look for Python's explicit package directory markers and use them to find the appropriate directory to add to sys.path.

The current algorithm for setting sys.path[0] in relevant cases is roughly as follows:

# Interactive prompt, -m switch, -c switch
sys.path.insert(0, '')
# Valid sys.path entry execution (i.e. directory and zip execution)
sys.path.insert(0, sys.argv[0])
# Direct script execution
sys.path.insert(0, os.path.dirname(sys.argv[0]))

It is proposed that this initialisation process be modified to take package details stored on the filesystem into account:

# Interactive prompt, -m switch, -c switch
in_package, path_entry, _ignored = split_path_module(os.getcwd(), '')
if in_package:
    sys.path.insert(0, path_entry)
else:
    sys.path.insert(0, '')

# Start interactive prompt or run -c command as usual
#   __main__.__qualname__ is set to "__main__"

# The -m switch uses the same sys.path[0] calculation, but:
#   modname is the argument to the -m switch
#   modname is passed to ``runpy._run_module_as_main()`` as usual
#   __main__.__qualname__ is set to modname
# Valid sys.path entry execution (i.e. directory and zip execution)
modname = "__main__"
in_package, path_entry, modname = split_path_module(sys.argv[0], modname)
sys.path.insert(0, path_entry)

# modname (possibly adjusted) is passed to ``runpy._run_module_as_main()``
# __main__.__qualname__ is set to modname
# Direct script execution
in_package, path_entry, modname = split_path_module(sys.argv[0])
sys.path.insert(0, path_entry)
if in_package:
    # Pass modname to ``runpy._run_module_as_main()``
else:
    # Run script directly
# __main__.__qualname__ is set to modname

The split_path_module() supporting function used in the above pseudo-code would have the following semantics:

def _splitmodname(fspath):
    path_entry, fname = os.path.split(fspath)
    modname = os.path.splitext(fname)[0]
    return path_entry, modname

def _is_package_dir(fspath):
    return any(os.path.exists(os.path.join(fspath, "__init__" + info[0]))
                   for info in imp.get_suffixes())

def split_path_module(fspath, modname=None):
    """Given a filesystem path and a relative module name, determine an
       appropriate sys.path entry and a fully qualified module name.

       Returns a 3-tuple of (package_depth, fspath, modname). A reported
       package depth of 0 indicates that this would be a top level import.

       If no relative module name is given, it is derived from the final
       component in the supplied path with the extension stripped.
    """
    if modname is None:
        fspath, modname = _splitmodname(fspath)
    package_depth = 0
    while _is_package_dir(fspath):
        package_depth += 1
        fspath, pkg = _splitmodname(fspath)
        modname = pkg + '.' + modname
    return package_depth, fspath, modname

This PEP also proposes that the split_path_module() functionality be exposed directly to Python users via the runpy module.
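
The proposed helper can be exercised directly. The sketch below re-implements it in a simplified, self-contained form (checking only for __init__.py rather than the full imp.get_suffixes() list) and applies it to the example layout described earlier in the PEP:

```python
import os
import tempfile

def _splitmodname(fspath):
    path_entry, fname = os.path.split(fspath)
    return path_entry, os.path.splitext(fname)[0]

def _is_package_dir(fspath):
    # Simplified: a directory is a package if it contains __init__.py
    return os.path.exists(os.path.join(fspath, "__init__.py"))

def split_path_module(fspath, modname=None):
    if modname is None:
        fspath, modname = _splitmodname(fspath)
    package_depth = 0
    while _is_package_dir(fspath):
        package_depth += 1
        fspath, pkg = _splitmodname(fspath)
        modname = pkg + '.' + modname
    return package_depth, fspath, modname

# Build the project/example/tests layout in a scratch directory
project = tempfile.mkdtemp()
tests = os.path.join(project, "example", "tests")
os.makedirs(tests)
for d in (os.path.join(project, "example"), tests):
    open(os.path.join(d, "__init__.py"), "w").close()
open(os.path.join(tests, "test_foo.py"), "w").close()

depth, path_entry, modname = split_path_module(
    os.path.join(tests, "test_foo.py"))
print(depth, modname)  # 2 example.tests.test_foo
```

The package markers let the helper walk up two levels, so the project directory (not the tests directory) becomes the sys.path entry.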

With this fix in place, and the same simple package layout described earlier, all of the following commands would invoke the test suite correctly:

# working directory: project/example/tests
./test_foo.py
python test_foo.py
python -m example.tests.test_foo
python -c "from .test_foo import main; main()"
python -c "from ..tests.test_foo import main; main()"
python -c "from example.tests.test_foo import main; main()"

# working directory: project/example
tests/test_foo.py
python tests/test_foo.py
python -m example.tests.test_foo
python -c "from .tests.test_foo import main; main()"
python -c "from example.tests.test_foo import main; main()"

# working directory: project
example/tests/test_foo.py
python example/tests/test_foo.py
python -m example.tests.test_foo
python -c "from example.tests.test_foo import main; main()"

# working directory: project/..
project/example/tests/test_foo.py
python project/example/tests/test_foo.py
# The -m and -c approaches still don't work from here, but the failure
# to find 'example' correctly is pretty easy to explain in this case

With these changes, clicking on Python modules in a graphical file browser should always execute them correctly, even if they live inside a package. Depending on the details of how it invokes the script, Idle would likely also be able to run test_foo.py correctly with F5, without needing any Idle-specific fixes.

Optional addition: command line relative imports

With the above changes in place, it would be a fairly minor addition to allow explicit relative imports as arguments to the -m switch:

# working directory: project/example/tests
python -m .test_foo
python -m ..tests.test_foo

# working directory: project/example/
python -m .tests.test_foo

With this addition, system initialisation for the -m switch would change as follows:

# -m switch (permitting explicit relative imports)
in_package, path_entry, pkg_name = split_path_module(os.getcwd(), '')
qualname = <<arguments to -m switch>>
if qualname.startswith('.'):
    modname = qualname
    while modname.startswith('.'):
        modname = modname[1:]
        pkg_name, sep, _ignored = pkg_name.rpartition('.')
        if not sep:
            raise ImportError("Attempted relative import beyond top level package")
    qualname = pkg_name + '.' + modname
if in_package:
    sys.path.insert(0, path_entry)
else:
    sys.path.insert(0, '')

# qualname is passed to ``runpy._run_module_as_main()``
# __main__.__qualname__ is set to qualname

Compatibility with PEP 382

Making this proposal compatible with the PEP 382 namespace packaging PEP is trivial. The semantics of _is_package_dir() are merely changed to be:

def _is_package_dir(fspath):
    return (fspath.endswith(".pyp") or
            any(os.path.exists(os.path.join(fspath, "__init__" + info[0]))
                    for info in imp.get_suffixes()))

Incompatibility with PEP 402

PEP 402 proposes the elimination of explicit markers in the file system for Python packages. This fundamentally breaks the proposed concept of being able to take a filesystem path and a Python module name and work out an unambiguous mapping to the Python module namespace. Instead, the appropriate mapping would depend on the current values in sys.path, rendering it impossible to ever fix the problems described above with the calculation of sys.path[0] when the interpreter is initialised.

While some aspects of this PEP could probably be salvaged if PEP 402 were adopted, the core concept of making import semantics from main and other modules more consistent would no longer be feasible.

This incompatibility is discussed in more detail in the relevant import-sig threads ([2], [3]).

Potential incompatibilities with scripts stored in packages

The proposed change to sys.path[0] initialisation may break some existing code. Specifically, it will break scripts stored in package directories that rely on the implicit relative imports from __main__ in order to run correctly under Python 3.

While such scripts could be imported in Python 2 (due to implicit relative imports) it is already the case that they cannot be imported in Python 3, as implicit relative imports are no longer permitted when a module is imported.

By disallowing implicit relative imports from the main module as well, such modules won't even work as scripts with this PEP. Switching them over to explicit relative imports will then get them working again as both executable scripts and as importable modules.

To support earlier versions of Python, a script could be written to use different forms of import based on the Python version:

import sys

if __name__ == "__main__" and sys.version_info < (3, 3):
    import peer  # Implicit relative import
else:
    from . import peer  # Explicit relative import

Fixing dual imports of the main module

Given the above proposal to get __qualname__ consistently set correctly in the main module, one simple change is proposed to eliminate the problem of dual imports of the main module: the addition of a sys.meta_path hook that detects attempts to import __main__ under its real name and returns the original main module instead:

class AliasImporter:
    def __init__(self, module, alias):
        self.module = module
        self.alias = alias

    def __repr__(self):
        fmt = "{0.__class__.__name__}({0.module.__name__}, {0.alias})"
        return fmt.format(self)

    def find_module(self, fullname, path=None):
        if path is None and fullname == self.alias:
            return self
        return None

    def load_module(self, fullname):
        if fullname != self.alias:
            raise ImportError("{!r} cannot load {!r}".format(self, fullname))
        return self.module

This sys.meta_path hook would be added automatically during import system initialisation based on the following logic:

main = sys.modules["__main__"]
if main.__name__ != main.__qualname__:
    sys.meta_path.append(AliasImporter(main, main.__qualname__))

This is probably the least important proposal in the PEP - it just closes off the last mechanism that is likely to lead to module duplication after the configuration of sys.path[0] at interpreter startup is addressed.

Fixing pickling without breaking introspection

To fix this problem, it is proposed to make use of the new module level __qualname__ attributes to determine the real module location when __name__ has been modified for any reason.

In the main module, __qualname__ will automatically be set to the main module's "real" name (as described above) by the interpreter.

Pseudo-modules that adjust __name__ to point to the public namespace will leave __qualname__ untouched, so the implementation location remains readily accessible for introspection.

If __name__ is adjusted at the top of a module, then this will automatically adjust the __module__ attribute for all functions and classes subsequently defined in that module.

Since multiple submodules may be set to use the same "public" namespace, functions and classes will be given a new __qualmodule__ attribute that refers to the __qualname__ of their module.

This isn't strictly necessary for functions (you could find out their module's qualified name by looking in their globals dictionary), but it is needed for classes, since they don't hold a reference to the globals of their defining module. Once a new attribute is added to classes, it is more convenient to keep the API consistent and add a new attribute to functions as well.
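
The asymmetry between functions and classes is easy to verify (f and C below are hypothetical throwaway definitions):

```python
def f():
    pass

class C:
    pass

# Functions carry a reference to their module's globals...
print("__name__" in f.__globals__)   # True
# ...but classes hold no such reference, so a qualified-module attribute
# would have to be stored on the class explicitly:
print(hasattr(C, "__globals__"))     # False
```

This is why the PEP proposes storing __qualmodule__ on both kinds of object rather than relying on a globals lookup.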

These changes mean that adjusting __name__ (and, either directly or indirectly, the corresponding function and class __module__ attributes) becomes the officially sanctioned way to implement a namespace as a package, while exposing the API as if it were still a single module.

All serialisation code that currently uses __name__ and __module__ attributes will then avoid exposing implementation details by default.

To correctly handle serialisation of items from the main module, the class and function definition logic will be updated to also use __qualname__ for the __module__ attribute in the case where __name__ == "__main__".

With __name__ and __module__ being officially blessed as being used for the public names of things, the introspection tools in the standard library will be updated to use __qualname__ and __qualmodule__ where appropriate. For example:

  • pydoc will report both public and qualified names for modules
  • inspect.getsource() (and similar tools) will use the qualified names that point to the implementation of the code
  • additional pydoc and/or inspect APIs may be provided that report all modules with a given public __name__.

Fixing multiprocessing on Windows

With __qualname__ now available to tell multiprocessing the real name of the main module, it will be able to simply include it in the serialised information passed to the child process, eliminating the need for the current dubious introspection of the __file__ attribute.

For older Python versions, multiprocessing could be improved by applying the split_path_module() algorithm described above when attempting to work out how to execute the main module based on its __file__ attribute.

Explicit relative imports

This PEP proposes that __package__ be unconditionally defined in the main module as __qualname__.rpartition('.')[0]. Aside from that, it proposes that the behaviour of explicit relative imports be left alone.
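
The proposed derivation is a one-liner; the qualified name below is the hypothetical value the interpreter would set for the earlier test module:

```python
# Hypothetical __main__.__qualname__ under this proposal:
qualname = "example.tests.test_foo"

# Proposed: __package__ in the main module is everything before the last dot
package = qualname.rpartition('.')[0]
print(package)  # example.tests
```

For a genuinely top-level main module, rpartition returns an empty string, matching the existing convention for top-level modules.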

In particular, if __package__ is not set in a module when an explicit relative import occurs, the automatically cached value will continue to be derived from __name__ rather than __qualname__. This minimises any backwards incompatibilities with existing code that deliberately manipulates relative imports by adjusting __name__ rather than setting __package__ directly.

This PEP does not propose that __package__ be deprecated. While it is technically redundant following the introduction of __qualname__, it just isn't worth the hassle of deprecating it within the lifetime of Python 3.x.

References

[1]Module aliases and/or "real names" (http://mail.python.org/pipermail/python-ideas/2011-January/008983.html)
[2]PEP 395 (Module aliasing) and the namespace PEPs (http://mail.python.org/pipermail/import-sig/2011-November/000382.html)
[3]Updated PEP 395 (aka "Implicit Relative Imports Must Die!") (http://mail.python.org/pipermail/import-sig/2011-November/000397.html)
[4]Elaboration of compatibility problems between this PEP and PEP 402 (http://mail.python.org/pipermail/import-sig/2011-November/000403.html)

pep-0396 Module Version Numbers

PEP:396
Title:Module Version Numbers
Version:65628
Last-Modified:2008-08-10 09:59:20 -0400 (Sun, 10 Aug 2008)
Author:Barry Warsaw <barry at python.org>
Status:Deferred
Type:Informational
Content-Type:text/x-rst
Created:2011-03-16
Post-History:2011-04-05

Abstract

Given that it is useful and common to specify version numbers for Python modules, and given that different ways of doing this have grown organically within the Python community, it is useful to establish standard conventions for module authors to adhere to and reference. This informational PEP describes best practices for Python module authors who want to define the version number of their Python module.

Conformance with this PEP is optional, however other Python tools (such as distutils2 [1]) may be adapted to use the conventions defined here.

PEP Deferral

Further exploration of the concepts covered in this PEP has been deferred for lack of a current champion interested in promoting the goals of the PEP and collecting and incorporating feedback, and with sufficient available time to do so effectively.

User Stories

Alice is writing a new module, called alice, which she wants to share with other Python developers. alice is a simple module and lives in one file, alice.py. Alice wants to specify a version number so that her users can tell which version they are using. Because her module lives entirely in one file, she wants to add the version number to that file.

Bob has written a module called bob which he has shared with many users. bob.py contains a version number for the convenience of his users. Bob learns about the Cheeseshop [2], and adds some simple packaging using classic distutils so that he can upload The Bob Bundle to the Cheeseshop. Because bob.py already specifies a version number which his users can access programmatically, he wants the same API to continue to work even though his users now get it from the Cheeseshop.

Carol maintains several namespace packages, each of which is independently developed and distributed. In order for her users to properly specify dependencies on the right versions of her packages, she specifies the version numbers in the namespace package's setup.py file. Because Carol wants to update only one version number per package, she specifies the version number in her module and has the setup.py extract the module version number when she builds the sdist archive.

David maintains a package in the standard library, and also produces standalone versions for other versions of Python. The standard library copy defines the version number in the module, and this same version number is used for the standalone distributions as well.

Rationale

Python modules, both in the standard library and available from third parties, have long included version numbers. There are established de-facto standards for describing version numbers, and many ad-hoc ways have grown organically over the years. Often, version numbers can be retrieved from a module programmatically, by importing the module and inspecting an attribute. Classic Python distutils setup() functions [3] describe a version argument where the release's version number can be specified. PEP 8 [4] describes the use of a module attribute called __version__ for recording "Subversion, CVS, or RCS" version strings using keyword expansion. In the PEP author's own email archives, the earliest example of the use of an __version__ module attribute by independent module developers dates back to 1995.

Another example of version information is the sqlite3 [5] module with its sqlite_version_info, version, and version_info attributes. It may not be immediately obvious which attribute contains a version number for the module, and which contains a version number for the underlying SQLite3 library.

This informational PEP codifies established practice, and recommends standard ways of describing module version numbers, along with some use cases for when -- and when not -- to include them. Its adoption by module authors is purely voluntary; packaging tools in the standard library will provide optional support for the standards defined herein, and other tools in the Python universe may comply as well.

Specification

  1. In general, modules in the standard library SHOULD NOT have version numbers. They implicitly carry the version number of the Python release they are included in.
  2. On a case-by-case basis, standard library modules which are also released in standalone form for other Python versions MAY include a module version number when included in the standard library, and SHOULD include a version number when packaged separately.
  3. When a module (or package) includes a version number, the version SHOULD be available in the __version__ attribute.
  4. For modules which live inside a namespace package, the module SHOULD include the __version__ attribute. The namespace package itself SHOULD NOT include its own __version__ attribute.
  5. The __version__ attribute's value SHOULD be a string.
  6. Module version numbers SHOULD conform to the normalized version format specified in PEP 386 [6].
  7. Module version numbers SHOULD NOT contain version control system supplied revision numbers, or any other semantically different version numbers (e.g. underlying library version number).
  8. The version attribute in a classic distutils setup.py file, or the PEP 345 [7] Version metadata field SHOULD be derived from the __version__ field, or vice versa.

Examples

Retrieving the version number from a third party package:

>>> import bzrlib
>>> bzrlib.__version__
'2.3.0'

Retrieving the version number from a standard library package that is also distributed as a standalone module:

>>> import email
>>> email.__version__
'5.1.0'

Version numbers for namespace packages:

>>> import flufl.i18n
>>> import flufl.enum
>>> import flufl.lock

>>> print flufl.i18n.__version__
1.0.4
>>> print flufl.enum.__version__
3.1
>>> print flufl.lock.__version__
2.1

>>> import flufl
>>> flufl.__version__
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'module' object has no attribute '__version__'
>>>

Deriving

Module version numbers can appear in at least two places, and sometimes more. For example, in accordance with this PEP, they are available programmatically on the module's __version__ attribute. In a classic distutils setup.py file, the setup() function takes a version argument, while the distutils2 setup.cfg file has a version key. The version number must also get into the PEP 345 metadata, preferably when the sdist archive is built. It's desirable for module authors to only have to specify the version number once, and have all the other uses derive from this single definition.

This could be done in any number of ways, a few of which are outlined below. These are included for illustrative purposes only and are not intended to be definitive, complete, or all-encompassing. Other approaches are possible, and some included below may have limitations that prevent their use in some situations.

Let's say Elle adds this attribute to her module file elle.py:

__version__ = '3.1.1'

Classic distutils

In classic distutils, the simplest way to add the version string to the setup() function in setup.py is to do something like this:

from elle import __version__
setup(name='elle', version=__version__)

In the PEP author's experience however, this can fail in some cases, such as when the module uses automatic Python 3 conversion via the 2to3 program (because setup.py is executed by Python 3 before the elle module has been converted).

In that case, it's not much more difficult to write a little code to parse the __version__ from the file rather than importing it. Without providing too much detail, it's likely that modules such as distutils2 will provide a way to parse version strings from files. E.g.:

from distutils2 import get_version
setup(name='elle', version=get_version('elle.py'))
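As an illustration only (this is not distutils2's actual get_version implementation), such a parser can be sketched in a few lines of Python, using a regular expression to find the __version__ assignment and ast.literal_eval to evaluate its right-hand side safely:

```python
import ast
import re

def parse_version(path):
    """Extract __version__ from a Python source file without importing it."""
    with open(path) as f:
        source = f.read()
    # look for a top-level assignment of the form: __version__ = <literal>
    match = re.search(r"^__version__\s*=\s*(.+)$", source, re.MULTILINE)
    if match is None:
        raise ValueError("no __version__ found in %s" % path)
    # literal_eval safely evaluates the literal on the right-hand side
    return ast.literal_eval(match.group(1))
```

This avoids importing the module (and thus the 2to3 problem described above), at the cost of only handling simple literal assignments.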

Distutils2

Because the distutils2 style setup.cfg is declarative, we can't run any code to extract the __version__ attribute, either via import or via parsing.

In consultation with the distutils-sig [9], two options are proposed. Both entail storing the version number in a file, and declaring that file in the setup.cfg. When the file's entire contents are the version number, the version-file key will be used:

[metadata]
version-file: version.txt

When the version number is contained within a larger file, e.g. of Python code, such that the file must be parsed to extract the version, the key version-from-file will be used:

[metadata]
version-from-file: elle.py

A parsing method similar to that described above will be performed on the file named after the colon. The exact recipe for doing this will be discussed in the appropriate distutils2 development forum.

An alternative is to only define the version number in setup.cfg and use the pkgutil module [8] to make it available programmatically. E.g. in elle.py:

from distutils2._backport import pkgutil
__version__ = pkgutil.get_distribution('elle').metadata['version']

PEP 376 metadata

PEP 376 [10] defines a standard for static metadata, but doesn't describe the process by which this metadata gets created. It is highly desirable for the derived version information to be placed into the PEP 376 .dist-info metadata at build-time rather than install-time. This way, the metadata will be available for introspection even when the code is not installed.

pep-0397 Python launcher for Windows

PEP: 397
Title: Python launcher for Windows
Version: a57419aee37d
Last-Modified:  2012/06/19 15:13:49
Author: Mark Hammond <mhammond at skippinet.com.au>, Martin v. Löwis <martin at v.loewis.de>
Status: Final
Type: Standards Track
Content-Type: text/plain
Created: 15-Mar-2011
Post-History: 21-July-2011, 17-May-2011, 15-Mar-2011
Resolution: http://mail.python.org/pipermail/python-dev/2012-June/120505.html

Abstract

    This PEP describes a Python launcher for the Windows platform.  A 
    Python launcher is a single executable which uses a number of 
    heuristics to locate a Python executable and launch it with a
    specified command line.


Rationale

    Windows provides "file associations" so an executable can be associated
    with an extension, allowing for scripts to be executed directly in some
    contexts (eg., double-clicking the file in Windows Explorer.)  Until now,
    a strategy of "last installed Python wins" has been used and while not
    ideal, has generally been workable due to the conservative changes in
    Python 2.x releases.  As Python 3.x scripts are often syntactically
    incompatible with Python 2.x scripts, a different strategy must be used
    to allow files with a '.py' extension to use a different executable based
    on the Python version the script targets.  This will be done by borrowing
    the existing practices of another operating system - scripts will be able
    to nominate the version of Python they need by way of a "shebang" line, as
    described below.

    Unix-like operating systems (referred to simply as "Unix" in this
    PEP) allow scripts to be executed as if they were executable images
    by examining the script for a "shebang" line which specifies the
    actual executable to be used to run the script.  This is described in
    detail in the execve(2) man page [1] and while user documentation will
    be created for this feature, for the purposes of this PEP that man
    page describes a valid shebang line.

    Additionally, these operating systems provide symbolic-links to
    Python executables in well-known directories.  For example, many
    systems will have a link /usr/bin/python which references a
    particular version of Python installed under the operating-system.
    These symbolic links allow Python to be executed without regard for
    where Python is actually installed on the machine (eg., without
    requiring the path where Python is actually installed to be
    referenced in the shebang line or in the PATH.)  PEP 394 'The "python"
    command on Unix-Like Systems' [2] describes additional conventions
    for more fine-grained specification of a particular Python version.

    These 2 facilities combined allow for a portable and somewhat 
    predictable way of both starting Python interactively and for allowing
    Python scripts to execute.  This PEP describes an implementation of a 
    launcher which can offer the same benefits for Python on the Windows 
    platform and therefore allows the launcher to be the executable
    associated with '.py' files to support multiple Python versions
    concurrently.

    While this PEP offers the ability to use a shebang line which should
    work on both Windows and Unix, this is not the primary motivation for
    this PEP - the primary motivation is to allow a specific version to be
    specified without inventing new syntax or conventions to describe
    it.

Specification

    This PEP specifies features of the launcher; a prototype
    implementation is provided in [3] which will be distributed
    together with the Windows installer of Python, but will also be
    available separately (but released along with the Python
    installer). New features may be added to the launcher as
    long as the features prescribed here continue to work.

Installation

    The launcher comes in 2 versions - one which is a console program and
    one which is a "windows" (ie., GUI) program.  These 2 launchers correspond
    to the 'python.exe' and 'pythonw.exe' executables which currently ship
    with Python.  The console launcher will be named 'py.exe' and the Windows
    one named 'pyw.exe'.  The "windows" (ie., GUI) version of the launcher
    will attempt to locate and launch pythonw.exe even if a virtual shebang
    line nominates simply "python" - in fact, the trailing 'w' notation is
    not supported in the virtual shebang line at all.

    The launcher is installed into the Windows directory (see
    discussion below) if installed by a privileged user. The
    stand-alone installer asks for an alternative location of the
    launcher, and adds that location to the user's PATH.

    The installation in the Windows directory is a 32-bit executable
    (see discussion); the standalone installer may also offer to install
    64-bit versions of the launcher.

    The launcher installation is registered in
    HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows\CurrentVersion\SharedDLLs
    with a reference counter.
    It contains a version resource matching the version number of the
    pythonXY.dll with which it is distributed. Independent
    installations will overwrite older versions
    of the launcher with newer versions. Stand-alone releases use
    a release level of 0x10 in FIELD3 of the CPython release on which
    they are based.

    Once installed, the "console" version of the launcher is
    associated with .py files and the "windows" version associated with .pyw
    files.

    The launcher is not tied to a specific version of Python - eg., a
    launcher distributed with Python 3.3 should be capable of locating and
    executing any Python 2.x and Python 3.x version. However, the
    launcher binaries have a version resource that is the same as the
    version resource in the Python binaries that they are released with.

Python Script Launching

    The launcher is restricted to launching Python scripts.
    It is not intended as a general-purpose script launcher or
    shebang processor.

    The launcher supports the syntax of shebang lines as described
    in [1], including all restrictions listed.

    The launcher supports shebang lines referring to Python
    executables with any of the (regex) prefixes "/usr/bin/", "/usr/local/bin"
    and "/usr/bin/env *", as well as binaries specified without any of these
    prefixes (eg., simply "python").

    For example, a shebang line of '#! /usr/bin/python' should work even 
    though there is unlikely to be an executable in the relative Windows 
    directory "\usr\bin".  This means that many scripts can use a single
    shebang line and be likely to work on both Unix and Windows without
    modification.

    The launcher will support fully-qualified paths to executables.
    While this will make the script inherently non-portable, it is a
    feature offered by Unix and would be useful for Windows users in
    some cases.

    The launcher will be capable of supporting implementations other than
    CPython, such as Jython and IronPython, but given both the absence of
    common links on Unix (such as "/usr/bin/jython") and the inability for the
    launcher to automatically locate the installation location of these
    implementations on Windows, the launcher will support this via
    customization options.  Scripts taking advantage of this will not be
    portable (as these customization options must be set to reflect the
    configuration of the machine on which the launcher is running) but this
    ability is nonetheless considered worthwhile.

    On Unix, the user can control which specific version of Python is used
    by adjusting the links in /usr/bin to point to the desired version.  As
    the launcher on Windows will not use Windows links, customization options
    (exposed via both environment variables and INI files) will be used to
    override the semantics for determining what version of Python will be
    used.  For example, while a shebang line of "/usr/bin/python2" will
    automatically locate a Python 2.x implementation, an environment variable
    can override exactly which Python 2.x implementation will be chosen.
    Similarly for "/usr/bin/python" and "/usr/bin/python3".  This is
    specified in detail later in this PEP.

Shebang line parsing

    If the first command-line argument does not start with a dash ('-')
    character, an attempt will be made to open that argument as a file
    and parse it for a shebang line according to the rules in [1]::

        #! interpreter [optional-arg]

    Once parsed, the command will be categorized according to the following rules:

    * If the command starts with the definition of a customized command
      followed by a whitespace character (including a newline), the customized
      command will be used.  See below for a description of customized
      commands.

    * The launcher will define a set of prefixes which are considered Unix
      compatible commands to launch Python, namely "/usr/bin/python",
      "/usr/local/bin/python", "/usr/bin/env python", and "python".
      If a command starts with one of these strings, it will be treated as a
      'virtual command' and the rules described in Python Version Qualifiers
      (below) will be used to locate the executable to use.

    * Otherwise the command is assumed to be directly ready to execute - ie.
      a fully-qualified path (or a reference to an executable on the PATH)
      optionally followed by arguments.  The contents of the string will not
      be parsed - it will be passed directly to the Windows CreateProcess
      function after appending the name of the script and the launcher
      command-line arguments.  This means that the rules used by
      CreateProcess will be used, including how relative path names and
      executable references without extensions are treated.  Notably, the
      Windows command processor will not be used, so special rules used by the
      command processor (such as automatic appending of extensions other than
      '.exe', support for batch files, etc) will not be used.

    The use of 'virtual' shebang lines is encouraged as this should
    allow for portable shebang lines to be specified which work on
    multiple operating systems and different installations of the same
    operating system.
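    The classification rules above can be sketched in Python (a hypothetical
    illustration only - the real launcher is a C program; `customized` here
    stands in for the commands loaded from the .ini file):

```python
# Prefixes the launcher treats as 'virtual commands' (per the rules above).
VIRTUAL_PREFIXES = ("/usr/bin/python", "/usr/local/bin/python",
                    "/usr/bin/env python", "python")

def classify(command, customized):
    """Classify a parsed shebang command string.

    `customized` maps customized command names to their configured
    command lines (as defined in the [commands] section of the .ini file).
    """
    parts = command.split(None, 1)
    name = parts[0] if parts else ""
    if name in customized:
        # customized command: substitute the configured command line
        return "customized", customized[name]
    if any(command.startswith(p) for p in VIRTUAL_PREFIXES):
        # virtual command: resolved via the Python Version Qualifier rules
        return "virtual", command
    # otherwise: passed directly to CreateProcess, unparsed
    return "direct", command
```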

    If the first argument can not be opened as a file or if no valid
    shebang line can be found, the launcher will act as if a shebang line of
    '#!python' was found - ie., a default Python interpreter will be
    located and the arguments passed to that.  However, if a valid
    shebang line is found but the process specified by that line can not
    be started, the default interpreter will not be started - the failure
    to create the specified child process will cause the launcher to display
    an appropriate message and terminate with a specific exit code.

Configuration file

    Two .ini files will be searched by the launcher - ``py.ini`` in the
    current user's "application data" directory (i.e. the directory returned
    by calling the Windows function SHGetFolderPath with CSIDL_LOCAL_APPDATA,
    %USERPROFILE%\AppData\Local on Vista+,
    %USERPROFILE%\Local Settings\Application Data on XP)
    and ``py.ini`` in the same directory as the launcher.  The same .ini
    files are used for both the 'console' version of the launcher (i.e.
    py.exe) and for the 'windows' version (i.e. pyw.exe).


    Customization specified in the "application data" directory will have
    precedence over the py.ini next to the executable, so a user who may not
    have write access to the .ini file next to the launcher can override
    commands in that global .ini file.
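    As an illustrative sketch of this precedence rule only (not launcher
    code), treating each .ini file's settings as a dictionary:

```python
def effective_settings(launcher_ini, user_ini):
    """Merge settings: the per-user py.ini overrides the global one."""
    settings = dict(launcher_ini)   # defaults from the file beside py.exe
    settings.update(user_ini)       # per-user values take precedence
    return settings
```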
 

Virtual commands in shebang lines

    Virtual Commands are shebang lines which start with strings which would
    be expected to work on Unix platforms - examples include
    '/usr/bin/python', '/usr/bin/env python' and 'python'.  Optionally, the 
    virtual command may be suffixed with a version qualifier (see below),
    such as '/usr/bin/python2' or '/usr/bin/python3.2'.  The command executed
    is based on the rules described in Python Version Qualifiers
    below.

Customized Commands

    The launcher will support the ability to define "Customized Commands" in a
    Windows .ini file (ie, a file which can be parsed by the Windows function
    GetPrivateProfileString).  A section called '[commands]' can be created 
    with key names defining the virtual command and the value specifying the
    actual command-line to be used for this virtual command.

    For example, if an INI file has the contents:

    [commands]
    vpython=c:\bin\vpython.exe -foo

    Then a shebang line of '#! vpython' in a script named 'doit.py' will 
    result in the launcher using the command-line 'c:\bin\vpython.exe -foo 
    doit.py'

    The precise details about the names, locations and search order of the
    .ini files are in the launcher documentation [4].

Python Version Qualifiers

    Some of the features described allow an optional Python version qualifier 
    to be used.

    A version qualifier starts with a major version number and can optionally
    be followed by a period ('.') and a minor version specifier.  If the minor
    qualifier is specified, it may optionally be followed by "-32" to indicate
    the 32bit implementation of that version be used.  Note that no "-64"
    qualifier is necessary as this is the default implementation (see below).
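    The qualifier syntax can be illustrated with a small hypothetical parser
    (not part of the launcher), which also enforces that "-32" is only valid
    after a minor version:

```python
import re

def parse_qualifier(qualifier):
    """Parse a qualifier like '3', '2.6', or '2.6-32'
    into (major, minor, want_32bit)."""
    m = re.match(r"^(\d+)(?:\.(\d+)(-32)?)?$", qualifier)
    if m is None:
        raise ValueError("invalid version qualifier: %r" % qualifier)
    major, minor, bits = m.groups()
    return (int(major), int(minor) if minor else None, bits == "-32")
```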

    On 64bit Windows with both 32bit and 64bit implementations of the
    same (major.minor) Python version installed, the 64bit version will
    always be preferred.  This will be true for both 32bit and 64bit
    implementations of the launcher - a 32bit launcher will prefer to
    execute a 64bit Python installation of the specified version if
    available.  This is so the behavior of the launcher can be predicted
    knowing only what versions are installed on the PC and without
    regard to the order in which they were installed (ie, without knowing
    whether a 32 or 64bit version of Python and corresponding launcher was
    installed last).  As noted above, an optional "-32" suffix can be used
    on a version specifier to change this behaviour.

    If no version qualifiers are found in a command, the environment variable
    ``PY_PYTHON`` can be set to specify the default version qualifier - the default
    value is "2". Note this value could specify just a major version (e.g. "2") or
    a major.minor qualifier (e.g. "2.6"), or even major.minor-32.

    If no minor version qualifiers are found, the environment variable
    ``PY_PYTHON{major}`` (where ``{major}`` is the current major version qualifier
    as determined above) can be set to specify the full version. If no such option
    is found, the launcher will enumerate the installed Python versions and use
    the latest minor release found for the major version, which is likely,
    although not guaranteed, to be the most recently installed version in that
    family.

    In addition to environment variables, the same settings can be configured
    in the .INI file used by the launcher.  The section in the INI file is
    called ``[defaults]`` and the key name will be the same as the
    environment variables without the leading ``PY_`` prefix (and note that
    the key names in the INI file are case insensitive.)  The contents of
    an environment variable will override things specified in the INI file.

Command-line handling

    Only the first command-line argument will be checked for a shebang line
    and only if that argument does not start with a '-'.

    If the only command-line argument is "-h" or "--help", the launcher will
    print a small banner and command-line usage, then pass the argument to
    the default Python.  This will cause help for the launcher to be printed
    followed by help for Python itself.  The output from the launcher will
    clearly indicate the extended help information is coming from the
    launcher and not Python.

    As a concession to interactively launching Python, the launcher will
    support the first command-line argument optionally being a dash ("-")
    followed by a version qualifier, as described above, to nominate a
    specific version be used.  For example, while "py.exe" may locate and
    launch the latest Python 2.x implementation installed, a command-line such
    as "py.exe -3" could specify the latest Python 3.x implementation be
    launched, while "py.exe -2.6-32" could specify a 32bit implementation
    Python 2.6 be located and launched.  If a Python 2.x implementation is
    desired to be launched with the -3 flag, the command-line would need to be
    similar to "py.exe -2 -3" (or the specific version of Python could
    obviously be launched manually without use of this launcher.)  Note that
    this feature can not be combined with shebang processing: the file scanned
    for a shebang line and this argument would both need to be the first
    argument, so they are mutually exclusive.

    All other arguments will be passed untouched to the child Python process.

Process Launching

    The launcher offers some conveniences for Python developers working
    interactively - for example, starting the launcher with no command-line
    arguments will launch the default Python with no command-line arguments.
    Further, command-line arguments will be supported to allow a specific
    Python version to be launched interactively - however, these conveniences
    must not detract from the primary purpose of launching scripts and must
    be easy to avoid if desired.

    The launcher creates a subprocess to start the actual
    interpreter. See Discussion below for the rationale.


Discussion

    It may be surprising that the launcher is installed into the
    Windows directory, and not the System32 directory. The reason is
    that the System32 directory is not on the Path of a 32-bit process
    running on a 64-bit system. However, the Windows directory is
    always on the path.

    The launcher that is installed into the Windows directory is a 32-bit
    executable so that the 32-bit CPython installer can provide the same
    binary for both 32-bit and 64-bit Windows installations.

    Ideally, the launcher process would execute Python directly inside
    the same process, primarily so the parent of the launcher process could
    terminate the launcher and have the Python interpreter terminate.  If the
    launcher executes Python as a sub-process and the parent of the launcher
    terminates the launcher, the Python process will be unaffected.

    However, there are a number of practical problems associated with this
    approach.  Windows does not support the execv* family of Unix functions,
    so this could only be done by the launcher dynamically loading the Python
    DLL, but this would have a number of side-effects.  The most serious
    side effect of this is that the value of sys.executable would refer to the
    launcher instead of the Python implementation.  Many Python scripts use the
    value of sys.executable to launch child processes, and these scripts may
    fail to work as expected if the launcher is used.  Consider a "parent"
    script with a shebang line of '#! /usr/bin/python3' which attempts to
    launch a child script (with no shebang) via sys.executable - currently the
    child is launched using the exact same version running the parent script.
    If sys.executable referred to the launcher, the child would likely be
    executed using a Python 2.x version and fail with a SyntaxError.

    Another hurdle is the support for alternative Python implementations
    using the "customized commands" feature described above, where loading
    the command dynamically into a running executable is not possible.

    The final hurdle is the rules above regarding 64bit and 32bit programs -
    a 32bit launcher would be unable to load the 64bit version of Python and
    vice-versa.

    Given these considerations, the launcher will execute its command in a
    child process, remaining alive while the child process is executing, then
    terminate with the same exit code as returned by the child.  To address
    concerns regarding the termination of the launcher not killing the child,
    the Win32 Job API will be used to arrange so that the child process is
    automatically killed when the parent is terminated (although children of
    that child process will continue as is the case now.)  As this Windows API
    is available in Windows XP and later, this launcher will not work on 
    Windows 2000 or earlier.
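    A rough Python analogy of the chosen behavior (hypothetical - the real
    launcher is a C program using CreateProcess and the Win32 Job API) is to
    run the resolved command as a child process, wait for it, and exit with
    the child's exit code:

```python
import subprocess
import sys

def launch(argv):
    """Run the resolved command, wait, and return its exit code unchanged."""
    return subprocess.call(argv)

# e.g. sys.exit(launch(resolved_command)) propagates the child's exit code
```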

References

    [1] http://linux.die.net/man/2/execve

    [2] http://www.python.org/dev/peps/pep-0394/

    [3] https://bitbucket.org/vinay.sajip/pylauncher

    [4] https://bitbucket.org/vinay.sajip/pylauncher/src/tip/Doc/launcher.rst

Copyright

    This document has been placed in the public domain.



pep-0398 Python 3.3 Release Schedule

PEP:398
Title:Python 3.3 Release Schedule
Version:$Revision$
Last-Modified:$Date$
Author:Georg Brandl <georg at python.org>
Status:Active
Type:Informational
Content-Type:text/x-rst
Created:23-Mar-2011
Python-Version:3.3

Abstract

This document describes the development and release schedule for Python 3.3. The schedule primarily concerns itself with PEP-sized items.

Release Manager and Crew

  • 3.3 Release Manager: Georg Brandl
  • Windows installers: Martin v. Löwis
  • Mac installers: Ronald Oussoren/Ned Deily
  • Documentation: Georg Brandl

3.3 Lifespan

3.3 will receive bugfix updates approximately every 4-6 months for approximately 18 months. After the release of 3.4.0 final, a final 3.3 bugfix update will be released. After that, security updates (source only) will be released until 5 years after the release of 3.3 final, which will be September 2017.

Release Schedule

3.3.0 schedule

  • 3.3.0 alpha 1: March 5, 2012
  • 3.3.0 alpha 2: April 2, 2012
  • 3.3.0 alpha 3: May 1, 2012
  • 3.3.0 alpha 4: May 31, 2012
  • 3.3.0 beta 1: June 27, 2012

(No new features beyond this point.)

  • 3.3.0 beta 2: August 12, 2012
  • 3.3.0 candidate 1: August 24, 2012
  • 3.3.0 candidate 2: September 9, 2012
  • 3.3.0 candidate 3: September 24, 2012
  • 3.3.0 final: September 29, 2012

3.3.1 schedule

  • 3.3.1 candidate 1: March 23, 2013
  • 3.3.1 final: April 6, 2013

3.3.2 schedule

  • 3.3.2 final: May 13, 2013

3.3.3 schedule

  • 3.3.3 candidate 1: October 27, 2013
  • 3.3.3 candidate 2: November 9, 2013
  • 3.3.3 final: November 16, 2013

3.3.4 schedule

  • 3.3.4 candidate 1: January 26, 2014
  • 3.3.4 final: February 9, 2014

3.3.5 schedule

Python 3.3.5 was the last regular maintenance release before 3.3 entered security-fix only mode.

  • 3.3.5 candidate 1: February 22, 2014
  • 3.3.5 candidate 2: March 1, 2014
  • 3.3.5 final: March 8, 2014

3.3.6 schedule

  • 3.3.6 candidate 1 (source-only release): October 4, 2014
  • 3.3.6 final (source-only release): October 11, 2014

Features for 3.3

Implemented / Final PEPs:

  • PEP 362: Function Signature Object
  • PEP 380: Syntax for Delegating to a Subgenerator
  • PEP 393: Flexible String Representation
  • PEP 397: Python launcher for Windows
  • PEP 399: Pure Python/C Accelerator Module Compatibility Requirements
  • PEP 405: Python Virtual Environments
  • PEP 409: Suppressing exception context
  • PEP 412: Key-Sharing Dictionary
  • PEP 414: Explicit Unicode Literal for Python 3.3
  • PEP 415: Implement context suppression with exception attributes
  • PEP 417: Including mock in the Standard Library
  • PEP 418: Add monotonic time, performance counter, and process time functions
  • PEP 420: Implicit Namespace Packages
  • PEP 421: Adding sys.implementation
  • PEP 3118: Revising the buffer protocol (protocol semantics finalised)
  • PEP 3144: IP Address manipulation library
  • PEP 3151: Reworking the OS and IO exception hierarchy
  • PEP 3155: Qualified name for classes and functions

Other final large-scale changes:

  • Addition of the "faulthandler" module
  • Addition of the "lzma" module, and lzma/xz support in tarfile
  • Implementing __import__ using importlib
  • Addition of the C decimal implementation
  • Switch of Windows build toolchain to VS 2010

Candidate PEPs:

  • None

Other planned large-scale changes:

  • None

Deferred to post-3.3:

  • PEP 395: Qualified Names for Modules
  • PEP 3143: Standard daemon process library
  • PEP 3154: Pickle protocol version 4
  • Breaking out standard library and docs in separate repos
  • Addition of the "packaging" module, deprecating "distutils"
  • Addition of the "regex" module
  • Email version 6
  • A standard event-loop interface (PEP by Jim Fulton pending)

pep-0399 Pure Python/C Accelerator Module Compatibility Requirements

PEP:399
Title:Pure Python/C Accelerator Module Compatibility Requirements
Version:88219
Last-Modified:2011-01-27 13:47:00 -0800 (Thu, 27 Jan 2011)
Author:Brett Cannon <brett at python.org>
Status:Final
Type:Informational
Content-Type:text/x-rst
Created:04-Apr-2011
Python-Version:3.3
Post-History:04-Apr-2011, 12-Apr-2011, 17-Jul-2011, 15-Aug-2011, 01-Jan-2013

Abstract

The Python standard library under CPython contains various instances of modules implemented in both pure Python and C (either entirely or partially). This PEP requires that, in these instances, the C code pass the test suite used for the pure Python code, so as to act as much like a drop-in replacement as reasonably possible (C- and VM-specific tests are exempt). It also requires that new C-based modules lacking a pure Python equivalent implementation obtain special permission before being added to the standard library.

Rationale

Python has grown beyond the CPython virtual machine (VM). IronPython [1], Jython [2], and PyPy [3] are all currently viable alternatives to the CPython VM. The VM ecosystem that has sprung up around the Python programming language has led to Python being used in many different areas where CPython cannot be used, e.g., Jython allowing Python to be used in Java applications.

A problem all of the VMs other than CPython face is handling modules from the standard library that are implemented (to some extent) in C. Since other VMs do not typically support the entire C API of CPython [4], they are unable to use the code used to create the module. Often this leads these other VMs to re-implement the modules either in pure Python or in the programming language used to implement the VM itself (e.g., in C# for IronPython). This duplication of effort between CPython, PyPy, Jython, and IronPython is extremely unfortunate, as implementing a module at least in pure Python would help mitigate it.

The purpose of this PEP is to minimize this duplicate effort by mandating that all new modules added to Python's standard library must have a pure Python implementation unless special dispensation is given. This makes sure that a module in the stdlib is available to all VMs and not just to CPython (pre-existing modules that do not meet this requirement are exempt, although there is nothing preventing someone from adding in a pure Python implementation retroactively).

Re-implementing parts (or all) of a module in C (in the case of CPython) is still allowed for performance reasons, but any such accelerated code must pass the same test suite (sans VM- or C-specific tests) to verify semantics and prevent divergence. To accomplish this, the test suite for the module must have comprehensive coverage of the pure Python implementation before the acceleration code may be added.

Details

Starting in Python 3.3, any modules added to the standard library must have a pure Python implementation. This rule can only be ignored if the Python development team grants a special exemption for the module. Typically the exemption will be granted only when a module wraps a specific C-based library (e.g., sqlite3 [5]). In granting an exemption it will be recognized that the module will be considered exclusive to CPython and not part of Python's standard library that other VMs are expected to support. Usage of ctypes to provide an API for a C library will continue to be frowned upon as ctypes lacks compiler guarantees that C code typically relies upon to prevent certain errors from occurring (e.g., API changes).

Even though a pure Python implementation is mandated by this PEP, it does not preclude the use of a companion acceleration module. If an acceleration module is provided it is to be named the same as the module it is accelerating with an underscore attached as a prefix, e.g., _warnings for warnings. The common pattern to access the accelerated code from the pure Python implementation is to import it with an import *, e.g., from _warnings import *. This is typically done at the end of the module to allow it to overwrite specific Python objects with their accelerated equivalents. This kind of import can also be done before the end of the module when needed, e.g., an accelerated base class is provided but is then subclassed by Python code. This PEP does not mandate that pre-existing modules in the stdlib that lack a pure Python equivalent gain such a module. But if people do volunteer to provide and maintain a pure Python equivalent (e.g., the PyPy team volunteering their pure Python implementation of the csv module and maintaining it) then such code will be accepted. In those instances the C version is considered the reference implementation in terms of expected semantics.
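The underscore-prefixed naming and the end-of-module import * pattern described above can be sketched as follows, using the PEP's own warnings/_warnings example (a minimal illustration; the fallback body here is invented for demonstration and is not the real implementation):

```python
# warnings.py-style module tail (sketch).  The pure Python definitions come
# first; the optional C accelerator, if importable, overwrites them at the end.

def warn(message, category=UserWarning, stacklevel=1):
    """Pure Python fallback (illustrative body only)."""
    print("%s: %s" % (category.__name__, message))

try:
    # The accelerated module is named after the module it accelerates, with a
    # leading underscore; importing * overwrites the definitions above.
    from _warnings import *
except ImportError:
    pass  # other VMs simply keep the pure Python version
```

On CPython the try block replaces warn() with the C version; on a VM without _warnings, the ImportError is swallowed and the pure Python definition remains.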

Any new accelerated code must act as a drop-in replacement that stays as close to the pure Python implementation as is reasonable. Technical details of the VM providing the accelerated code are allowed to differ as necessary, e.g., a class being a type when implemented in C. To verify that the Python and equivalent C code operate as similarly as possible, both code bases must be tested using the same tests which apply to the pure Python code (tests specific to the C code or to any VM do not fall under this requirement). The test suite is expected to be extensive in order to verify expected semantics.

Acting as a drop-in replacement also dictates that no public API be provided in accelerated code that does not exist in the pure Python code. Without this requirement people could accidentally come to rely on a detail in the accelerated code which is not made available to other VMs that use the pure Python implementation. To help verify that the contract of semantic equivalence is being met, a module must be tested both with and without its accelerated code as thoroughly as possible.

As an example, to write tests which exercise both the pure Python and C accelerated versions of a module, a basic idiom can be followed:

from test.support import import_fresh_module
import unittest

c_heapq = import_fresh_module('heapq', fresh=['_heapq'])
py_heapq = import_fresh_module('heapq', blocked=['_heapq'])


class ExampleTest:

    def test_example(self):
        self.assertTrue(hasattr(self.module, 'heapify'))


class PyExampleTest(ExampleTest, unittest.TestCase):
    module = py_heapq


@unittest.skipUnless(c_heapq, 'requires the C _heapq module')
class CExampleTest(ExampleTest, unittest.TestCase):
    module = c_heapq


if __name__ == '__main__':
    unittest.main()

The test module defines a base class (ExampleTest) with test methods that access the heapq module through a self.module class attribute, and two subclasses that set this attribute to either the Python or the C version of the module. Note that only the two subclasses inherit from unittest.TestCase -- this prevents the ExampleTest class from being detected as a TestCase subclass by unittest test discovery. A skipUnless decorator can be added to the class that tests the C code in order to have these tests skipped when the C module is not available.

If this test were to provide extensive coverage for heapq.heappop() in the pure Python implementation then the accelerated C code would be allowed to be added to CPython's standard library. If it did not, then the test suite would need to be updated until proper coverage was provided before the accelerated C code could be added.

To also help with compatibility, C code should use abstract APIs on objects to prevent accidental dependence on specific types. For instance, if a function accepts a sequence then the C code should default to using PyObject_GetItem() instead of something like PyList_GetItem(). C code is allowed to have a fast path if the proper PyList_CheckExact() is used, but otherwise APIs should work with any object that duck types to the proper interface instead of a specific type.

pep-0400 Deprecate codecs.StreamReader and codecs.StreamWriter

PEP:400
Title:Deprecate codecs.StreamReader and codecs.StreamWriter
Version:$Revision$
Last-Modified:$Date$
Author:Victor Stinner <victor.stinner at gmail.com>
Status:Deferred
Type:Standards Track
Content-Type:text/x-rst
Created:28-May-2011
Python-Version:3.3

Abstract

io.TextIOWrapper and codecs.StreamReaderWriter offer the same API [1]. TextIOWrapper has more features and is faster than StreamReaderWriter. Duplicate code means that bugs must be fixed twice, and that subtle differences may creep in between the two implementations.

The codecs module was introduced in Python 2.0 (see PEP 100). The io module was introduced in Python 2.6 and 3.0 (see PEP 3116), and reimplemented in C in Python 2.7 and 3.1.

PEP Deferral

Further exploration of the concepts covered in this PEP has been deferred for lack of a current champion interested in promoting the goals of the PEP and collecting and incorporating feedback, and with sufficient available time to do so effectively.

Motivation

When the Python I/O model was updated for 3.0, the concept of a "stream-with-known-encoding" was introduced in the form of io.TextIOWrapper. As this class is critical to the performance of text-based I/O in Python 3, it has an optimised C version which CPython uses by default. Many corner cases in handling buffering, stateful codecs and universal newlines have been dealt with since the release of Python 3.0.

This new interface overlaps heavily with the legacy codecs.StreamReader, codecs.StreamWriter and codecs.StreamReaderWriter interfaces that were part of the original codec interface design in PEP 100. These interfaces are organised around the principle of an encoding with an associated stream (i.e. the reverse of the arrangement in the io module), so the original PEP 100 design required that codec writers provide appropriate StreamReader and StreamWriter implementations in addition to the core codec encode() and decode() methods. This places a heavy burden on codec authors providing these specialised implementations to correctly handle many of the corner cases (see Appendix A) that have now been dealt with by io.TextIOWrapper. While deeper integration between the codec and the stream allows for additional optimisations in theory, these optimisations have in practice either not been carried out, or else the associated code duplication means that the corner cases that have been fixed in io.TextIOWrapper are still not handled correctly in the various StreamReader and StreamWriter implementations.

Accordingly, this PEP proposes that:

  • codecs.open() be updated to delegate to the builtin open() in Python 3.3;
  • the legacy codecs.Stream* interfaces, including the streamreader and streamwriter attributes of codecs.CodecInfo be deprecated in Python 3.3.

Rationale

StreamReader and StreamWriter issues

  • StreamReader is unable to translate newlines.
  • StreamWriter doesn't support "line buffering" (flush if the input text contains a newline).
  • The StreamReader classes of the CJK encodings (e.g. GB18030) only support UNIX newlines ('\n').
  • StreamReader and StreamWriter are stateful codecs but don't expose functions to control their state (getstate() or setstate()). Each codec has to handle corner cases, see Appendix A.
  • StreamReader and StreamWriter are very similar to IncrementalDecoder and IncrementalEncoder; some code is duplicated for stateful codecs (e.g. UTF-16).
  • Each codec has to reimplement its own StreamReader and StreamWriter class, even if it's trivial (just call the encoder/decoder).
  • codecs.open(filename, "r") creates an io.TextIOWrapper object.
  • No codec implements an optimized method in StreamReader or StreamWriter based on the specificities of the codec.

Issues in the bug tracker:

  • Issue #5445 (2009-03-08): codecs.StreamWriter.writelines problem when passed generator
  • Issue #7262 (2009-11-04): codecs.open() + eol (windows)
  • Issue #8260 (2010-03-29): When I use codecs.open(...) and f.readline() follow up by f.read() return bad result
  • Issue #8630 (2010-05-05): Keepends param in codec readline(s)
  • Issue #10344 (2010-11-06): codecs.readline doesn't care buffering
  • Issue #11461 (2011-03-10): Reading UTF-16 with codecs.readline() breaks on surrogate pairs
  • Issue #12446 (2011-06-30): StreamReader Readlines behavior odd
  • Issue #12508 (2011-07-06): Codecs Anomaly
  • Issue #12512 (2011-07-07): codecs: StreamWriter issues with stateful codecs after a seek or with append mode
  • Issue #12513 (2011-07-07): codec.StreamReaderWriter: issues with interlaced read-write

TextIOWrapper features

  • TextIOWrapper supports any kind of newline for both reading and writing, including translating newlines (to UNIX newlines).
  • TextIOWrapper reuses codecs incremental encoders and decoders (no duplication of code).
  • The io module (TextIOWrapper) is faster than the codecs module (StreamReader). It is implemented in C, whereas codecs is implemented in Python.
  • TextIOWrapper has a readahead algorithm which speeds up small reads: reading character by character or line by line (io is 10x to 25x faster than codecs on these operations).
  • TextIOWrapper has a write buffer.
  • TextIOWrapper.tell() is optimized.
  • TextIOWrapper supports random access (read+write) using a single class, which makes it possible to optimize interlaced read-write (though no such optimization is implemented).
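As a quick illustration of the newline feature listed above, open() (which returns an io.TextIOWrapper) translates every newline convention to '\n' on read by default (the file name here is arbitrary):

```python
# With the default newline=None, TextIOWrapper translates '\r\n', '\r' and
# '\n' all to '\n' on read ("universal newlines").
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "newlines.txt")
with open(path, "wb") as f:
    f.write(b"a\r\nb\rc\n")        # Windows, old Mac and Unix line endings

with open(path, "r", encoding="ascii") as f:
    text = f.read()

assert text == "a\nb\nc\n"          # everything normalized to '\n'
```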

TextIOWrapper issues

  • Issue #12215 (2011-05-30): TextIOWrapper: issues with interlaced read-write

Possible improvements of StreamReader and StreamWriter

By adding codec state read/write functions to the StreamReader and StreamWriter classes, it would become possible to fix issues with stateful codecs in a base class instead of in each stateful StreamReader and StreamWriter class.

It would be possible to change StreamReader and StreamWriter to make them use IncrementalDecoder and IncrementalEncoder.
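Such a refactoring could look roughly like this (an illustrative sketch only; the IncrementalStreamWriter class and its name are invented and are not part of the codecs module):

```python
# A single generic writer built on the codec's IncrementalEncoder, instead of
# one hand-written StreamWriter per codec (sketch).
import codecs
import io

class IncrementalStreamWriter:
    def __init__(self, stream, encoding, errors='strict'):
        self.stream = stream
        self.encoder = codecs.getincrementalencoder(encoding)(errors)

    def write(self, text):
        # The incremental encoder keeps the codec state between calls.
        self.stream.write(self.encoder.encode(text))

buf = io.BytesIO()
writer = IncrementalStreamWriter(buf, 'utf-16')
writer.write('ab')   # the stateful encoder emits the BOM only once...
writer.write('c')    # ...so no second BOM is written here
assert len(buf.getvalue()) == 8   # one 2-byte BOM + three 2-byte code units
```

Because the state lives in the incremental encoder, corner cases such as the duplicated-BOM bugs listed in Appendix A would only need to be fixed in one place.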

A codec can implement variants which are optimized for the specific encoding, or intercept certain stream methods to add functionality or improve the encoding/decoding performance. TextIOWrapper cannot implement such optimizations, but TextIOWrapper uses incremental encoders and decoders and uses read and write buffers, so the overhead of incomplete inputs is low or nil.

A lot more could be done for other variable-length encodings, e.g. UTF-8, since these often have problems near the end of a read due to missing bytes. The UTF-32-BE/LE codecs could simply multiply the character position by 4 to get the byte position.

Usage of StreamReader and StreamWriter

These classes are rarely used directly, but rather indirectly through codecs.open(). They are not used in the Python 3 standard library (except in the codecs module itself).

Some projects implement their own codecs with StreamReader and StreamWriter classes (as the original codec API requires), but never actually use these classes.

Backwards Compatibility

Keep the public API, codecs.open

codecs.open() can be replaced by the builtin open() function. open() has a similar API but also has more options. Both functions return file-like objects (with the same API).

codecs.open() was the only way to open a text file in Unicode mode until Python 2.6. Many Python 2 programs use this function. Removing codecs.open() implies more work to port programs from Python 2 to Python 3, especially for projects that use the same code base for both Python versions (without using the 2to3 program).

codecs.open() is kept for backward compatibility with Python 2.
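The equivalence for the common text-mode case can be sketched as follows (simplified; codecs.open() opens the underlying file in binary mode and its defaults differ slightly, e.g. no newline translation):

```python
import codecs
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "data.txt")

with codecs.open(path, 'w', encoding='utf-8') as f:   # legacy spelling
    f.write('h\u00e9llo')

with open(path, 'r', encoding='utf-8') as f:          # builtin replacement
    assert f.read() == 'h\u00e9llo'
```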

Deprecate StreamReader and StreamWriter

Instantiating StreamReader or StreamWriter must emit a DeprecationWarning in Python 3.3. Defining a subclass doesn't emit a DeprecationWarning.

codecs.open() will be changed to reuse the builtin open() function (TextIOWrapper) to read-write text files.
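The proposed warning behaviour can be sketched like this (illustrative only, not the actual CPython patch): the warning is emitted from __init__, so merely defining a subclass, which runs no __init__, stays silent.

```python
import warnings

class StreamWriter:
    def __init__(self, stream, errors='strict'):
        # Warn on instantiation, as the PEP proposes.
        warnings.warn("codecs.StreamWriter is deprecated; use io.TextIOWrapper",
                      DeprecationWarning, stacklevel=2)
        self.stream = stream
        self.errors = errors

class MyWriter(StreamWriter):        # defining a subclass: no warning yet
    pass

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    MyWriter(None)                   # instantiating does warn
assert any(issubclass(w.category, DeprecationWarning) for w in caught)
```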

Alternative Approach

An alternative to the deprecation of the codecs.Stream* classes is to rename codecs.open() to codecs.open_stream(), and to create a new codecs.open() function that reuses the builtin open() and thus io.TextIOWrapper.

Appendix A: Issues with stateful codecs

It is difficult to use a stateful codec correctly with a stream. Some cases are supported by the codecs module, while io has no remaining known bugs related to stateful codecs. The main difference between the codecs and io modules is that, for codecs, bugs have to be fixed in the StreamReader and/or StreamWriter classes of each codec, whereas for io they can be fixed once in io.TextIOWrapper. Here are some examples of issues with stateful codecs.

Stateful codecs

Python supports the following stateful codecs:

  • cp932
  • cp949
  • cp950
  • euc_jis_2004
  • euc_jisx0213
  • euc_jp
  • euc_kr
  • gb18030
  • gbk
  • hz
  • iso2022_jp
  • iso2022_jp_1
  • iso2022_jp_2
  • iso2022_jp_2004
  • iso2022_jp_3
  • iso2022_jp_ext
  • iso2022_kr
  • shift_jis
  • shift_jis_2004
  • shift_jisx0213
  • utf_8_sig
  • utf_16
  • utf_32

Read and seek(0)

with open(filename, 'w', encoding='utf-16') as f:
    f.write('abc')
    f.write('def')
    f.seek(0)
    assert f.read() == 'abcdef'
    f.seek(0)
    assert f.read() == 'abcdef'

The io and codecs modules both support this use case correctly.

seek(n)

with open(filename, 'w', encoding='utf-16') as f:
    f.write('abc')
    pos = f.tell()
with open(filename, 'w', encoding='utf-16') as f:
    f.seek(pos)
    f.write('def')
    f.seek(0)
    f.write('###')
with open(filename, 'r', encoding='utf-16') as f:
    assert f.read() == '###def'

The io module supports this use case, whereas codecs fails because it writes a new BOM on the second write (issue #12512).

Append mode

with open(filename, 'w', encoding='utf-16') as f:
    f.write('abc')
with open(filename, 'a', encoding='utf-16') as f:
    f.write('def')
with open(filename, 'r', encoding='utf-16') as f:
    assert f.read() == 'abcdef'

The io module supports this use case, whereas codecs fails because it writes a new BOM on the second write (issue #12512).

Footnotes

[1]StreamReaderWriter has two more attributes than TextIOWrapper: reader and writer.

pep-0401 BDFL Retirement

PEP:401
Title:BDFL Retirement
Version:$Revision$
Last-Modified:$Date: 2009-04-01 00:00:00 -0400 (Wed, 1 Apr 2009)$
Author:Barry Warsaw, Brett Cannon
Status:April Fool!
Type:Process
Content-Type:text/x-rst
Created:01-Apr-2009
Post-History:01-Apr-2009

Abstract

The BDFL, having shepherded Python development for 20 years, officially announces his retirement, effective immediately. Following a unanimous vote, his replacement is named.

Rationale

Guido wrote the original implementation of Python in 1989, and after nearly 20 years of leading the community, has decided to step aside as its Benevolent Dictator For Life. His official title is now Benevolent Dictator Emeritus Vacationing Indefinitely from the Language (BDEVIL). Guido leaves Python in the good hands of its new leader and its vibrant community, in order to train for his lifelong dream of climbing Mount Everest.

After unanimous vote of the Python Steering Union (not to be confused with the Python Secret Underground, which emphatically does not exist) at the 2009 Python Conference (PyCon [7] 2009), Guido's successor has been chosen: Barry Warsaw, or as he is affectionately known, Uncle Barry. Uncle Barry's official title is Friendly Language Uncle For Life (FLUFL).

Official Acts of the FLUFL

FLUFL Uncle Barry enacts the following decisions, in order to demonstrate his intention to lead the community in the same responsible and open manner as his predecessor, whose name escapes him:

  • Recognized that the selection of Hg as the DVCS of choice was clear proof of the onset of the BDEVIL's insanity, and reverting this decision to switch to Bzr instead, the only true choice.
  • Recognized that the != inequality operator in Python 3.0 was a horrible, finger pain inducing mistake, the FLUFL reinstates the <> diamond operator as the sole spelling. This change is important enough to be implemented for, and released in Python 3.1. To help transition to this feature, a new future statement, from __future__ import barry_as_FLUFL has been added.
  • Recognized that the print function in Python 3.0 was a horrible, pain-inducing mistake, the FLUFL reinstates the print statement. This change is important enough to be implemented for, and released in Python 3.0.2.
  • Recognized that the disappointing adoption curve of Python 3.0 signals its abject failure, all work on Python 3.1 and subsequent Python 3.x versions is hereby terminated. All features in Python 3.0 shall be back ported to Python 2.7 which will be the official and sole next release. The Python 3.0 string and bytes types will be back ported to Python 2.6.2 for the convenience of developers.
  • Recognized that C is a 20th century language with almost universal rejection by programmers under the age of 30, the CPython implementation will terminate with the release of Python 2.6.2 and 3.0.2. Thereafter, the reference implementation of Python will target the Parrot [1] virtual machine. Alternative implementations of Python (e.g. Jython [2], IronPython [3], and PyPy [4]) are officially discouraged but tolerated.
  • Recognized that the Python Software Foundation [5] having fulfilled its mission admirably, is hereby disbanded. The Python Steering Union [6] (not to be confused with the Python Secret Underground, which emphatically does not exist), is now the sole steward for all of Python's intellectual property. All PSF funds are hereby transferred to the PSU (not that PSU, the other PSU).

pep-0402 Simplified Package Layout and Partitioning

PEP:402
Title:Simplified Package Layout and Partitioning
Version:$Revision$
Last-Modified:$Date$
Author:P.J. Eby
Status:Rejected
Type:Standards Track
Content-Type:text/x-rst
Created:12-Jul-2011
Python-Version:3.3
Post-History:20-Jul-2011
Replaces:382

Rejection Notice

On the first day of sprints at US PyCon 2012 we had a long and fruitful discussion about PEP 382 and PEP 402. We ended up rejecting both, but a new PEP will be written to carry on in the spirit of PEP 402. Martin von Löwis wrote up a summary: [3].

Abstract

This PEP proposes an enhancement to Python's package importing to:

  • Surprise users of other languages less,
  • Make it easier to convert a module into a package, and
  • Support dividing packages into separately installed components (a la "namespace packages", as described in PEP 382)

The proposed enhancements do not change the semantics of any currently-importable directory layouts, but make it possible for packages to use a simplified directory layout (that is not importable currently).

However, the proposed changes do NOT add any performance overhead to the importing of existing modules or packages, and performance for the new directory layout should be about the same as that of previous "namespace package" solutions (such as pkgutil.extend_path()).

The Problem

"Most packages are like modules. Their contents are highly interdependent and can't be pulled apart. [However,] some packages exist to provide a separate namespace. ... It should be possible to distribute sub-packages or submodules of these [namespace packages] independently."

—Jim Fulton, shortly before the release of Python 2.3 [1]

When new users come to Python from other languages, they are often confused by Python's package import semantics. At Google, for example, Guido received complaints from "a large crowd with pitchforks" [2] that the requirement for packages to contain an __init__ module was a "misfeature", and should be dropped.

In addition, users coming from languages like Java or Perl are sometimes confused by a difference in Python's import path searching.

In most other languages that have a similar path mechanism to Python's sys.path, a package is merely a namespace that contains modules or classes, and can thus be spread across multiple directories in the language's path. In Perl, for instance, a Foo::Bar module will be searched for in Foo/ subdirectories all along the module include path, not just in the first such subdirectory found.

Worse, this is not just a problem for new users: it prevents anyone from easily splitting a package into separately-installable components. In Perl terms, it would be as if every possible Net:: module on CPAN had to be bundled up and shipped in a single tarball!

For that reason, various workarounds for this latter limitation exist, circulated under the term "namespace packages". The Python standard library has provided one such workaround since Python 2.3 (via the pkgutil.extend_path() function), and the "setuptools" package provides another (via pkg_resources.declare_namespace()).

The workarounds themselves, however, fall prey to a third issue with Python's way of laying out packages in the filesystem.

Because a package must contain an __init__ module, any attempt to distribute modules for that package must necessarily include that __init__ module, if those modules are to be importable.

However, the very fact that each distribution of modules for a package must contain this (duplicated) __init__ module means that OS vendors who package up these module distributions must somehow handle the conflict caused by several module distributions installing that __init__ module to the same location in the filesystem.

This led to the proposal of PEP 382 ("Namespace Packages") - a way to signal to Python's import machinery that a directory was importable, using unique filenames per module distribution.

However, there was more than one downside to this approach. Performance for all import operations would be affected, and the process of designating a package became even more complex. New terminology had to be invented to explain the solution, and so on.

As terminology discussions continued on the Import-SIG, it soon became apparent that the main reason it was so difficult to explain the concepts related to "namespace packages" was because Python's current way of handling packages is somewhat underpowered, when compared to other languages.

That is, in other popular languages with package systems, no special term is needed to describe "namespace packages", because all packages generally behave in the desired fashion.

Rather than being an isolated single directory with a special marker module (as in Python), packages in other languages are typically just the union of appropriately-named directories across the entire import or inclusion path.

In Perl, for example, the module Foo is always found in a Foo.pm file, and a module Foo::Bar is always found in a Foo/Bar.pm file. (In other words, there is One Obvious Way to find the location of a particular module.)

This is because Perl considers a module to be different from a package: the package is purely a namespace in which other modules may reside, and is only coincidentally the name of a module as well.

In current versions of Python, however, the module and the package are more tightly bound together. Foo is always a module -- whether it is found in Foo.py or Foo/__init__.py -- and it is tightly linked to its submodules (if any), which must reside in the exact same directory where the __init__.py was found.

On the positive side, this design choice means that a package is quite self-contained, and can be installed, copied, etc. as a unit just by performing an operation on the package's root directory.

On the negative side, however, it is non-intuitive for beginners, and requires a more complex step to turn a module into a package. If Foo begins its life as Foo.py, then it must be moved and renamed to Foo/__init__.py.

Conversely, if you intend to create a Foo.Bar module from the start, but have no particular module contents to put in Foo itself, then you have to create an empty and seemingly-irrelevant Foo/__init__.py file, just so that Foo.Bar can be imported.

(And these issues don't just confuse newcomers to the language, either: they annoy many experienced developers as well.)

So, after some discussion on the Import-SIG, this PEP was created as an alternative to PEP 382, in an attempt to solve all of the above problems, not just the "namespace package" use cases.

And, as a delightful side effect, the solution proposed in this PEP does not affect the import performance of ordinary modules or self-contained (i.e. __init__-based) packages.

The Solution

In the past, various proposals have been made to allow more intuitive approaches to package directory layout. However, most of them failed because of an apparent backward-compatibility problem.

That is, if the requirement for an __init__ module were simply dropped, it would open up the possibility for a directory named, say, string on sys.path, to block importing of the standard library string module.

Paradoxically, however, the failure of this approach does not arise from the elimination of the __init__ requirement!

Rather, the failure arises because the underlying approach takes for granted that a package is just ONE thing, instead of two.

In truth, a package comprises two separate, but related entities: a module (with its own, optional contents), and a namespace where other modules or packages can be found.

In current versions of Python, however, the module part (found in __init__) and the namespace for submodule imports (represented by the __path__ attribute) are both initialized at the same time, when the package is first imported.

And, if you assume this is the only way to initialize these two things, then there is no way to drop the need for an __init__ module, while still being backwards-compatible with existing directory layouts.

After all, as soon as you encounter a directory on sys.path matching the desired name, that means you've "found" the package, and must stop searching, right?

Well, not quite.

A Thought Experiment

Let's hop into the time machine for a moment, and pretend we're back in the early 1990s, shortly before Python packages and __init__.py were invented. But imagine that we are familiar with Perl-like package imports, and we want to implement a similar system in Python.

We'd still have Python's module imports to build on, so we could certainly conceive of having Foo.py as a parent Foo module for a Foo package. But how would we implement submodule and subpackage imports?

Well, if we didn't have the idea of __path__ attributes yet, we'd probably just search sys.path looking for Foo/Bar.py.

But we'd only do it when someone actually tried to import Foo.Bar.

NOT when they imported Foo.

And that lets us get rid of the backwards-compatibility problem of dropping the __init__ requirement, back here in 2011.

How?

Well, when we import Foo, we're not even looking for Foo/ directories on sys.path, because we don't care yet. The only point at which we care, is the point when somebody tries to actually import a submodule or subpackage of Foo.

That means that if Foo is a standard library module (for example), and I happen to have a Foo directory on sys.path (without an __init__.py, of course), then nothing breaks. The Foo module is still just a module, and it's still imported normally.

Self-Contained vs. "Virtual" Packages

Of course, in today's Python, trying to import Foo.Bar will fail if Foo is just a Foo.py module (and thus lacks a __path__ attribute).

So, this PEP proposes to dynamically create a __path__, in the case where one is missing.

That is, if I try to import Foo.Bar, the proposed change to the import machinery will notice that the Foo module lacks a __path__, and will therefore try to build one before proceeding.

And it will do this by making a list of all the existing Foo/ subdirectories of the directories listed in sys.path.

If the list is empty, the import will fail with ImportError, just like today. But if the list is not empty, then it is saved in a new Foo.__path__ attribute, making the module a "virtual package".

That is, because it now has a valid __path__, we can proceed to import submodules or subpackages in the normal way.
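Concretely, the path-building step can be sketched as a straight filesystem scan (a simplified illustration only; compute_virtual_path is a hypothetical helper name, and the real machinery would consult PEP 302 importers rather than hitting the filesystem directly):

```python
import os
import sys

def compute_virtual_path(name, search_path=None):
    # Collect every existing <name>/ subdirectory of the path entries;
    # an empty result means the import fails exactly as it does today.
    if search_path is None:
        search_path = sys.path
    return [os.path.join(entry, name)
            for entry in search_path
            if entry and os.path.isdir(os.path.join(entry, name))]
```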

Now, notice that this change does not affect "classic", self-contained packages that have an __init__ module in them. Such packages already have a __path__ attribute (initialized at import time) so the import machinery won't try to create another one later.

This means that (for example) the standard library email package will not be affected in any way by you having a bunch of unrelated directories named email on sys.path. (Even if they contain *.py files.)

But it does mean that if you want to turn your Foo module into a Foo package, all you have to do is add a Foo/ directory somewhere on sys.path, and start adding modules to it.

But what if you only want a "namespace package"? That is, a package that is only a namespace for various separately-distributed submodules and subpackages?

For example, if you're Zope Corporation, distributing dozens of separate tools like zc.buildout, each in packages under the zc namespace, you don't want to have to make and include an empty zc.py in every tool you ship. (And, if you're a Linux or other OS vendor, you don't want to deal with the package installation conflicts created by trying to install ten copies of zc.py to the same location!)

No problem. All we have to do is make one more minor tweak to the import process: if the "classic" import process fails to find a self-contained module or package (e.g., if import zc fails to find a zc.py or zc/__init__.py), then we once more try to build a __path__ by searching for all the zc/ directories on sys.path, and putting them in a list.

If this list is empty, we raise ImportError. But if it's non-empty, we create an empty zc module, and put the list in zc.__path__. Congratulations: zc is now a namespace-only, "pure virtual" package! It has no module contents, but you can still import submodules and subpackages from it, regardless of where they're located on sys.path.
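In module terms, a pure virtual package is nothing more than an empty module object carrying the computed path list. Roughly (make_pure_virtual is a hypothetical helper, shown outside the real import machinery):

```python
import sys
import types

def make_pure_virtual(name, path):
    # Create an empty, namespace-only module, attach the computed
    # __path__, and register it so submodule imports can proceed.
    mod = types.ModuleType(name)
    mod.__path__ = path
    sys.modules[name] = mod
    return mod
```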

(By the way, both of these additions to the import protocol (i.e. the dynamically-added __path__, and dynamically-created modules) apply recursively to child packages, using the parent package's __path__ in place of sys.path as a basis for generating a child __path__. This means that self-contained and virtual packages can contain each other without limitation, with the caveat that if you put a virtual package inside a self-contained one, it's gonna have a really short __path__!)

Backwards Compatibility and Performance

Notice that these two changes only affect import operations that today would result in ImportError. As a result, the performance of imports that do not involve virtual packages is unaffected, and potential backward compatibility issues are very restricted.

Today, if you try to import submodules or subpackages from a module with no __path__, it's an immediate error. And of course, if you don't have a zc.py or zc/__init__.py somewhere on sys.path today, import zc would likewise fail.

Thus, the only potential backwards-compatibility issues are:

  1. Tools that expect package directories to have an __init__ module, that expect directories without an __init__ module to be unimportable, or that expect __path__ attributes to be static, will not recognize virtual packages as packages.

    (In practice, this just means that tools will need updating to support virtual packages, e.g. by using pkgutil.walk_packages() instead of using hardcoded filesystem searches.)

  2. Code that expects certain imports to fail may now do something unexpected. This should be fairly rare in practice, as most sane, non-test code does not import things that are expected not to exist!

The biggest likely exception to the above would be when a piece of code tries to check whether some package is installed by importing it. If this is done only by importing a top-level module (i.e., not checking for a __version__ or some other attribute), and there is a directory of the same name as the sought-for package on sys.path somewhere, and the package is not actually installed, then such code could be fooled into thinking a package is installed that really isn't.

For example, suppose someone writes a script (datagen.py) containing the following code:

try:
    import json
except ImportError:
    import simplejson as json

And runs it in a directory laid out like this:

datagen.py
json/
    foo.js
    bar.js

If import json succeeded due to the mere presence of the json/ subdirectory, the code would incorrectly believe that the json module was available, and proceed to fail with an error.
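Such code can defend itself by probing an expected attribute as well as performing the import; a minimal sketch (the dumps attribute is just an example probe, not something this PEP requires):

```python
# A directory-only "virtual" json package would import but lack dumps,
# so check for an attribute in addition to the bare import succeeding.
try:
    import json
    json.dumps
except (ImportError, AttributeError):
    import simplejson as json
```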

However, we can prevent corner cases like these from arising, simply by making one small change to the algorithm presented so far. Instead of allowing you to import a "pure virtual" package (like zc), we allow only importing of the contents of virtual packages.

That is, a statement like import zc should raise ImportError if there is no zc.py or zc/__init__.py on sys.path. But, doing import zc.buildout should still succeed, as long as there's a zc/buildout.py or zc/buildout/__init__.py on sys.path.

In other words, we don't allow pure virtual packages to be imported directly, only modules and self-contained packages. (This is an acceptable limitation, because there is no functional value to importing such a package by itself. After all, the module object will have no contents until you import at least one of its subpackages or submodules!)

Once zc.buildout has been successfully imported, though, there will be a zc module in sys.modules, and trying to import it will of course succeed. We are only preventing an initial import from succeeding, in order to prevent false-positive import successes when clashing subdirectories are present on sys.path.

So, with this slight change, the datagen.py example above will work correctly. When it does import json, the mere presence of a json/ directory will simply not affect the import process at all, even if it contains .py files. The json/ directory will still only be searched in the case where an import like import json.converter is attempted.

Meanwhile, tools that expect to locate packages and modules by walking a directory tree can be updated to use the existing pkgutil.walk_packages() API, and tools that need to inspect packages in memory should use the other APIs described in the Standard Library Changes/Additions section below.

Specification

A change is made to the existing import process, when importing names containing at least one . -- that is, imports of modules that have a parent package.

Specifically, if the parent package does not exist, or exists but lacks a __path__ attribute, an attempt is first made to create a "virtual path" for the parent package (following the algorithm described in the section on virtual paths, below).

If the computed "virtual path" is empty, an ImportError results, just as it would today. However, if a non-empty virtual path is obtained, the normal import of the submodule or subpackage proceeds, using that virtual path to find the submodule or subpackage. (Just as it would have with the parent's __path__, if the parent package had existed and had a __path__.)

When a submodule or subpackage is found (but not yet loaded), the parent package is created and added to sys.modules (if it didn't exist before), and its __path__ is set to the computed virtual path (if it wasn't already set).

In this way, when the actual loading of the submodule or subpackage occurs, it will see a parent package existing, and any relative imports will work correctly. However, if no submodule or subpackage exists, then the parent package will not be created, nor will a standalone module be converted into a package (by the addition of a spurious __path__ attribute).

Note, by the way, that this change must be applied recursively: that is, if foo and foo.bar are pure virtual packages, then import foo.bar.baz must wait until foo.bar.baz is found before creating module objects for both foo and foo.bar, and then create both of them together, properly setting the foo module's .bar attribute to point to the foo.bar module.
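That recursive creation step might be sketched as follows (bind_parents is a hypothetical helper; in reality this wiring would happen inside the import system, only after the leaf module has been found):

```python
import sys
import types

def bind_parents(leaf_name, leaf_module):
    # Create module objects for every missing ancestor of the found
    # leaf, wiring each parent's attribute to its child, innermost first.
    parts = leaf_name.split('.')
    child = leaf_module
    for i in reversed(range(1, len(parts))):
        parent_name = '.'.join(parts[:i])
        parent = sys.modules.get(parent_name)
        if parent is None:
            parent = types.ModuleType(parent_name)
            sys.modules[parent_name] = parent
        setattr(parent, parts[i], child)
        child = parent
```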

In this way, pure virtual packages are never directly importable: an import foo or import foo.bar by itself will fail, and the corresponding modules will not appear in sys.modules until they are needed to point to a successfully imported submodule or self-contained subpackage.

Virtual Paths

A virtual path is created by obtaining a PEP 302 "importer" object for each of the path entries found in sys.path (for a top-level module) or the parent __path__ (for a submodule).

(Note: because sys.meta_path importers are not associated with sys.path or __path__ entry strings, such importers do not participate in this process.)

Each importer is checked for a get_subpath() method, and if present, the method is called with the full name of the module/package the path is being constructed for. The return value is either a string representing a subdirectory for the requested package, or None if no such subdirectory exists.

The strings returned by the importers are added to the path list being built, in the same order as they are found. (None values and missing get_subpath() methods are simply skipped.)

The resulting list (whether empty or not) is then stored in a sys.virtual_package_paths dictionary, keyed by module name.

This dictionary has two purposes. First, it serves as a cache, in the event that more than one attempt is made to import a submodule of a virtual package.

Second, and more importantly, the dictionary can be used by code that extends sys.path at runtime to update imported packages' __path__ attributes accordingly. (See Standard Library Changes/Additions below for more details.)

In Python code, the virtual path construction algorithm would look something like this:

import pkgutil
import sys

def get_virtual_path(modulename, parent_path=None):

    if modulename in sys.virtual_package_paths:
        return sys.virtual_package_paths[modulename]

    if parent_path is None:
        parent_path = sys.path

    path = []

    for entry in parent_path:
        # Obtain a PEP 302 importer object - see pkgutil module
        importer = pkgutil.get_importer(entry)

        if hasattr(importer, 'get_subpath'):
            subpath = importer.get_subpath(modulename)
            if subpath is not None:
                path.append(subpath)

    sys.virtual_package_paths[modulename] = path
    return path

And a function like this one should be exposed in the standard library as e.g. imp.get_virtual_path(), so that people creating __import__ replacements or sys.meta_path hooks can reuse it.

Standard Library Changes/Additions

The pkgutil module should be updated to handle this specification appropriately, including any necessary changes to extend_path(), iter_modules(), etc.

Specifically the proposed changes and additions to pkgutil are:

  • A new extend_virtual_paths(path_entry) function, to extend existing, already-imported virtual packages' __path__ attributes to include any portions found in a new sys.path entry. This function should be called by applications extending sys.path at runtime, e.g. when adding a plugin directory or an egg to the path.

    The implementation of this function does a simple top-down traversal of sys.virtual_package_paths, and performs any necessary get_subpath() calls to identify what path entries need to be added to the virtual path for that package, given that path_entry has been added to sys.path. (Or, in the case of sub-packages, adding a derived subpath entry, based on their parent package's virtual path.)

    (Note: this function must update both the path values in sys.virtual_package_paths as well as the __path__ attributes of any corresponding modules in sys.modules, even though in the common case they will both be the same list object.)

  • A new iter_virtual_packages(parent='') function to allow top-down traversal of virtual packages from sys.virtual_package_paths, by yielding the child virtual packages of parent. For example, calling iter_virtual_packages("zope") might yield zope.app and zope.products (if they are virtual packages listed in sys.virtual_package_paths), but not zope.foo.bar. (This function is needed to implement extend_virtual_paths(), but is also potentially useful for other code that needs to inspect imported virtual packages.)

  • ImpImporter.iter_modules() should be changed to also detect and yield the names of modules found in virtual packages.
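The traversal performed by extend_virtual_paths() might look roughly like this filesystem-only sketch (an illustration under simplifying assumptions: real code would call the importers' get_subpath() methods, and derive sub-package entries from each parent's virtual path rather than joining dotted names onto the new entry):

```python
import os

def extend_virtual_paths(path_entry, virtual_package_paths, modules):
    # For each cached virtual package, see whether the new path entry
    # contributes a matching subdirectory; if so, append it both to the
    # cached list and to the imported module's __path__ (in the common
    # case these are the same list object, so one append suffices).
    for name, vpath in virtual_package_paths.items():
        subdir = os.path.join(path_entry, *name.split('.'))
        if os.path.isdir(subdir) and subdir not in vpath:
            vpath.append(subdir)
            mod = modules.get(name)
            if mod is not None and mod.__path__ is not vpath:
                mod.__path__.append(subdir)
```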

In addition to the above changes, the zipimport importer should have its iter_modules() implementation similarly changed. (Note: current versions of Python implement this via a shim in pkgutil, so technically this is also a change to pkgutil.)

Last, but not least, the imp module (or importlib, if appropriate) should expose the algorithm described in the virtual paths section above, as a get_virtual_path(modulename, parent_path=None) function, so that creators of __import__ replacements can use it.

Implementation Notes

For users, developers, and distributors of virtual packages:

  • While virtual packages are easy to set up and use, there is still a time and place for using self-contained packages. While it's not strictly necessary, adding an __init__ module to your self-contained packages lets users of the package (and Python itself) know that all of the package's code will be found in that single subdirectory. In addition, it lets you define __all__, expose a public API, provide a package-level docstring, and do other things that make more sense for a self-contained project than for a mere "namespace" package.

  • sys.virtual_package_paths is allowed to contain entries for non-existent or not-yet-imported package names; code that uses its contents should not assume that every key in this dictionary is also present in sys.modules or that importing the name will necessarily succeed.

  • If you are changing a currently self-contained package into a virtual one, it's important to note that you can no longer use its __file__ attribute to locate data files stored in a package directory. Instead, you must search __path__ or use the __file__ of a submodule adjacent to the desired files, or of a self-contained subpackage that contains the desired files.

    (Note: this caveat is already true for existing users of "namespace packages" today. That is, it is an inherent result of being able to partition a package, that you must know which partition the desired data file lives in. We mention it here simply so that new users converting from self-contained to virtual packages will also be aware of it.)

  • XXX what is the __file__ of a "pure virtual" package? None? Some arbitrary string? The path of the first directory with a trailing separator? No matter what we put, some code is going to break, but the last choice might allow some code to accidentally work. Is that good or bad?

For those implementing PEP 302 importer objects:

  • Importers that support the iter_modules() method (used by pkgutil to locate importable modules and packages) and want to add virtual package support should modify their iter_modules() method so that it discovers and lists virtual packages as well as standard modules and packages. To do this, the importer should simply list all immediate subdirectory names in its jurisdiction that are valid Python identifiers.

    XXX This might list a lot of not-really-packages. Should we require importable contents to exist? If so, how deep do we search, and how do we prevent e.g. link loops, or traversing onto different filesystems, etc.? Ick. Also, if virtual packages are listed, they still can't be imported, which is a problem for the way that pkgutil.walk_packages() is currently implemented.

  • "Meta" importers (i.e., importers placed on sys.meta_path) do not need to implement get_subpath(), because the method is only called on importers corresponding to sys.path entries and __path__ entries. If a meta importer wishes to support virtual packages, it must do so entirely within its own find_module() implementation.

    Unfortunately, it is unlikely that any such implementation will be able to merge its package subpaths with those of other meta importers or sys.path importers, so the meaning of "supporting virtual packages" for a meta importer is currently undefined!

    (However, since the intended use case for meta importers is to replace Python's normal import process entirely for some subset of modules, and the number of such importers currently implemented is quite small, this seems unlikely to be a big issue in practice.)

References

[1]"namespace" vs "module" packages (mailing list thread) (http://mail.zope.org/pipermail/zope3-dev/2002-December/004251.html)
[2]"Dropping __init__.py requirement for subpackages" (http://mail.python.org/pipermail/python-dev/2006-April/064400.html)
[3]Namespace Packages resolution (http://mail.python.org/pipermail/import-sig/2012-March/000421.html)

pep-0403 General purpose decorator clause (aka "@in" clause)

PEP:403
Title:General purpose decorator clause (aka "@in" clause)
Version:$Revision$
Last-Modified:$Date$
Author:Nick Coghlan <ncoghlan at gmail.com>
Status:Deferred
Type:Standards Track
Content-Type:text/x-rst
Created:2011-10-13
Python-Version:3.4
Post-History:2011-10-13
Resolution:TBD

Abstract

This PEP proposes the addition of a new @in decorator clause that makes it possible to override the name binding step of a function or class definition.

The new clause accepts a single simple statement that can make a forward reference to the decorated function or class definition.

This new clause is designed to be used whenever a "one-shot" function or class is needed, and placing the function or class definition before the statement that uses it actually makes the code harder to read. It also avoids any name shadowing concerns by making sure the new name is visible only to the statement in the @in clause.

This PEP is based heavily on many of the ideas in PEP 3150 (Statement Local Namespaces) so some elements of the rationale will be familiar to readers of that PEP. Both PEPs remain deferred for the time being, primarily due to the lack of compelling real world use cases in either PEP.

Basic Examples

Before diving into the long history of this problem and the detailed rationale for this specific proposed solution, here are a few simple examples of the kind of code it is designed to simplify.

As a trivial example, a weakref callback could be defined as follows:

@in x = weakref.ref(target, report_destruction)
def report_destruction(obj):
    print("{} is being destroyed".format(obj))

This contrasts with the current (conceptually) "out of order" syntax for this operation:

def report_destruction(obj):
    print("{} is being destroyed".format(obj))

x = weakref.ref(target, report_destruction)

That structure is OK when you're using the callable multiple times, but it's irritating to be forced into it for one-off operations.

If the repetition of the name seems especially annoying, then a throwaway name like f can be used instead:

@in x = weakref.ref(target, f)
def f(obj):
    print("{} is being destroyed".format(obj))

Similarly, a sorted operation on a particularly poorly defined type could now be defined as:

@in sorted_list = sorted(original, key=f)
def f(item):
    try:
        return item.calc_sort_order()
    except NotSortableError:
        return float('inf')

Rather than:

def force_sort(item):
    try:
        return item.calc_sort_order()
    except NotSortableError:
        return float('inf')

sorted_list = sorted(original, key=force_sort)

And early binding semantics in a list comprehension could be attained via:

@in funcs = [adder(i) for i in range(10)]
def adder(i):
    return lambda x: x + i

Proposal

This PEP proposes the addition of a new @in clause that is a variant of the existing class and function decorator syntax.

The new @in clause precedes the decorator lines, and allows forward references to the trailing function or class definition.

The trailing function or class definition is always named - the name of the trailing definition is then used to make the forward reference from the @in clause.

The @in clause is allowed to contain any simple statement (including those that don't make any sense in that context, such as pass - while such code would be legal, there wouldn't be any point in writing it). This permissive structure is easier to define and easier to explain, but a more restrictive approach that only permits operations that "make sense" would also be possible (see PEP 3150 for a list of possible candidates).

The @in clause will not create a new scope - all name binding operations aside from the trailing function or class definition will affect the containing scope.

The name used in the trailing function or class definition is only visible from the associated @in clause, and behaves as if it was an ordinary variable defined in that scope. If any nested scopes are created in either the @in clause or the trailing function or class definition, those scopes will see the trailing function or class definition rather than any other bindings for that name in the containing scope.

In a very real sense, this proposal is about making it possible to override the implicit "name = <defined function or class>" name binding operation that is part of every function or class definition, specifically in those cases where the local name binding isn't actually needed.

Under this PEP, an ordinary class or function definition:

@deco2
@deco1
def name():
    ...

can be explained as being roughly equivalent to:

@in name = deco2(deco1(name))
def name():
    ...

Syntax Change

Syntactically, only one new grammar rule is needed:

in_stmt: '@in' simple_stmt decorated

Grammar: http://hg.python.org/cpython/file/default/Grammar/Grammar

Design Discussion

Background

The question of "multi-line lambdas" has been a vexing one for many Python users for a very long time, and it took an exploration of Ruby's block functionality for me to finally understand why this bugs people so much: Python's demand that the function be named and introduced before the operation that needs it breaks the developer's flow of thought. They get to a point where they go "I need a one-shot operation that does <X>", and instead of being able to just say that directly, they instead have to back up, name a function to do <X>, then call that function from the operation they actually wanted to do in the first place. Lambda expressions can help sometimes, but they're no substitute for being able to use a full suite.

Ruby's block syntax also heavily inspired the style of the solution in this PEP, by making it clear that even when limited to one anonymous function per statement, anonymous functions could still be incredibly useful. Consider how many constructs Python has where one expression is responsible for the bulk of the heavy lifting:

  • comprehensions, generator expressions, map(), filter()
  • key arguments to sorted(), min(), max()
  • partial function application
  • provision of callbacks (e.g. for weak references or asynchronous IO)
  • array broadcast operations in NumPy

However, adopting Ruby's block syntax directly won't work for Python, since the effectiveness of Ruby's blocks relies heavily on various conventions in the way functions are defined (specifically, using Ruby's yield syntax to call blocks directly and the &arg mechanism to accept a block as a function's final argument).

Since Python has relied on named functions for so long, the signatures of APIs that accept callbacks are far more diverse, thus requiring a solution that allows one-shot functions to be slotted in at the appropriate location.

The approach taken in this PEP is to retain the requirement to name the function explicitly, but allow the relative order of the definition and the statement that references it to be changed to match the developer's flow of thought. The rationale is essentially the same as that used when introducing decorators, but covering a broader set of applications.

Relation to PEP 3150

PEP 3150 (Statement Local Namespaces) describes its primary motivation as being to elevate ordinary assignment statements to be on par with class and def statements where the name of the item to be defined is presented to the reader in advance of the details of how the value of that item is calculated. This PEP achieves the same goal in a different way, by allowing the simple name binding of a standard function definition to be replaced with something else (like assigning the result of the function to a value).

Despite having the same author, the two PEPs are in direct competition with each other. This PEP represents a minimalist approach that attempts to achieve useful functionality with a minimum of change from the status quo. PEP 3150 instead aims for a more flexible standalone statement design, which requires a larger degree of change to the language.

Note that where this PEP is better suited to explaining the behaviour of generator expressions correctly, PEP 3150 is better able to explain the behaviour of decorator clauses in general. Both PEPs support adequate explanations for the semantics of container comprehensions.

Keyword Choice

The proposal definitely requires some kind of prefix to avoid parsing ambiguity and backwards compatibility problems with existing constructs. It also needs to be clearly highlighted to readers, since it declares that the following piece of code is going to be executed only after the trailing function or class definition has been executed.

The in keyword was chosen as an existing keyword that can be used to denote the concept of a forward reference.

The @ prefix was included in order to exploit the fact that Python programmers are already used to decorator syntax as an indication of out of order execution, where the function or class is actually defined first and then decorators are applied in reverse order.

For functions, the construct is intended to be read as "in <this statement that references NAME> define NAME as a function that does <operation>".

The mapping to English prose isn't as obvious for the class definition case, but the concept remains the same.

Better Debugging Support for Functions and Classes with Short Names

One of the objections to widespread use of lambda expressions is that they have a negative effect on traceback intelligibility and other aspects of introspection. Similar objections are raised regarding constructs that promote short, cryptic function names (including this one, which requires that the name of the trailing definition be supplied at least twice, encouraging the use of shorthand placeholder names like f).

However, the introduction of qualified names in PEP 3155 means that even anonymous classes and functions will now have different representations if they occur in different scopes. For example:

>>> def f():
...     return lambda: y
...
>>> f()
<function f.<locals>.<lambda> at 0x7f6f46faeae0>

Anonymous functions (or functions that share a name) within the same scope will still share representations (aside from the object ID), but this is still a major improvement over the historical situation where everything except the object ID was identical.
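The qualified-name behaviour described above is easy to verify directly (a minimal sketch; __qualname__ requires Python 3.3 or later):

```python
def outer():
    return lambda: None

def other():
    return lambda: None

# The qualified names record the defining scope, even for lambdas,
# so the two otherwise-anonymous functions are distinguishable.
print(outer().__qualname__)   # outer.<locals>.<lambda>
print(other().__qualname__)   # other.<locals>.<lambda>
```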

Possible Implementation Strategy

This proposal has at least one titanic advantage over PEP 3150: implementation should be relatively straightforward.

The @in clause will be included in the AST for the associated function or class definition and the statement that references it. When the @in clause is present, it will be emitted in place of the local name binding operation normally implied by a function or class definition.

The one potentially tricky part is changing the meaning of the references to the statement local function or namespace while within the scope of the in statement, but that shouldn't be too hard to address by maintaining some additional state within the compiler (it's much easier to handle this for a single name than it is for an unknown number of names in a full nested suite).

Explaining Container Comprehensions and Generator Expressions

One interesting feature of the proposed construct is that it can be used as a primitive to explain the scoping and execution order semantics of both generator expressions and container comprehensions:

seq2 = [x for y in seq if p(y) for x in y if q(x)]

# would be equivalent to

@in seq2 = f(seq)
def f(seq):
    result = []
    for y in seq:
        if p(y):
            for x in y:
                if q(x):
                    result.append(x)
    return result

The important point in this expansion is that it explains why comprehensions appear to misbehave at class scope: only the outermost iterator is evaluated at class scope, while all predicates, nested iterators and value expressions are evaluated inside a nested scope.

An equivalent expansion is possible for generator expressions:

gen = (x for y in seq if p(y) for x in y if q(x))

# would be equivalent to

@in gen = g(seq)
def g(seq):
    for y in seq:
        if p(y):
            for x in y:
                if q(x):
                    yield x

More Examples

Calculating attributes without polluting the local namespace (from os.py):

# Current Python (manual namespace cleanup)
def _createenviron():
    ... # 27 line function

environ = _createenviron()
del _createenviron

# Becomes:
@in environ = _createenviron()
def _createenviron():
    ... # 27 line function

Loop early binding:

# Current Python (default argument hack)
funcs = [(lambda x, i=i: x + i) for i in range(10)]

# Becomes:
@in funcs = [adder(i) for i in range(10)]
def adder(i):
    return lambda x: x + i

# Or even:
@in funcs = [adder(i) for i in range(10)]
def adder(i):
    @in return incr
    def incr(x):
        return x + i

A trailing class can be used as a statement local namespace:

# Evaluate subexpressions only once
@in c = math.sqrt(x.a*x.a + x.b*x.b)
class x:
    a = calculate_a()
    b = calculate_b()

A function can be bound directly to a location which isn't a valid identifier:

@in dispatch[MyClass] = f
def f():
    ...

Constructs that verge on decorator abuse can be eliminated:

# Current Python
@call
def f():
    ...

# Becomes:
@in f()
def f():
    ...

Acknowledgements

Huge thanks to Gary Bernhardt for being blunt in pointing out that I had no idea what I was talking about in criticising Ruby's blocks, kicking off a rather enlightening process of investigation.

Rejected Concepts

To avoid retreading previously covered ground, some rejected alternatives are documented in this section.

Omitting the decorator prefix character

Earlier versions of this proposal omitted the @ prefix. However, without that prefix, the bare in keyword didn't associate the clause strongly enough with the subsequent function or class definition. Reusing the decorator prefix and explicitly characterising the new construct as a kind of decorator clause is intended to help users link the two concepts and see them as two variants of the same idea.

Anonymous Forward References

A previous incarnation of this PEP (see [1]) proposed a syntax where the new clause was introduced with : and the forward reference was written using @. Feedback on this variant was almost universally negative, as it was considered both ugly and excessively magical:

:x = weakref.ref(target, @)
def report_destruction(obj):
    print("{} is being destroyed".format(obj))

A more recent variant always used ... for forward references, along with genuinely anonymous function and class definitions. However, this degenerated quickly into a mass of unintelligible dots in more complex cases:

in funcs = [...(i) for i in range(10)]
def ...(i):
    in return ...
    def ...(x):
        return x + i

in c = math.sqrt(....a*....a + ....b*....b)
class ...:
    a = calculate_a()
    b = calculate_b()

Using a nested suite

The problems with using a full nested suite are best described in PEP 3150. It is comparatively difficult to implement properly, the scoping semantics are harder to explain, and it creates quite a few situations where there are two ways to do it without clear guidelines for choosing between them (as almost any construct that can be expressed with ordinary imperative code could instead be expressed using a given statement). While the PEP does propose some new PEP 8 guidelines to help address that last problem, the difficulties in implementation are not so easily dealt with.

By contrast, the decorator inspired syntax in this PEP explicitly limits the new feature to cases where it should actually improve readability, rather than harming it. As in the case of the original introduction of decorators, the idea of this new syntax is that if it can be used (i.e. the local name binding of the function is completely unnecessary) then it probably should be used.

Another possible variant of this idea is to keep the decorator based semantics of this PEP, while adopting the prettier syntax from PEP 3150:

x = weakref.ref(target, report_destruction) given:
    def report_destruction(obj):
        print("{} is being destroyed".format(obj))

There are a couple of problems with this approach. The main issue is that this syntax variant uses something that looks like a suite, but really isn't one. A secondary concern is that it's not clear how the compiler will know which name(s) in the leading expression are forward references (although that could potentially be addressed through a suitable definition of the suite-that-is-not-a-suite in the language grammar).

However, a nested suite has not yet been ruled out completely. The latest version of PEP 3150 uses explicit forward reference and name binding schemes that greatly simplify the semantics of the statement, and it does offer the advantage of allowing the definition of arbitrary subexpressions rather than being restricted to a single function or class definition.

pep-0404 Python 2.8 Un-release Schedule

PEP:404
Title:Python 2.8 Un-release Schedule
Version:$Revision$
Last-Modified:$Date$
Author:Barry Warsaw <barry at python.org>
Status:Final
Type:Informational
Content-Type:text/x-rst
Created:2011-11-09
Python-Version:2.8

Abstract

This document describes the un-development and un-release schedule for Python 2.8.

Un-release Manager and Crew

Position                  Name
2.8 Un-release Manager    Cardinal Biggles

Un-release Schedule

The current un-schedule is:

  • 2.8 final Never

Official pronouncement

Rule number six: there is no official Python 2.8 release. There never will be an official Python 2.8 release. It is an ex-release. Python 2.7 is the end of the Python 2 line of development.

Upgrade path

The official upgrade path from Python 2.7 is to Python 3.

And Now For Something Completely Different

In all seriousness, there are important reasons why there won't be an official Python 2.8 release, and why you should plan to migrate instead to Python 3.

Python is (as of this writing) more than 20 years old, and Guido and the community have learned a lot in those intervening years. Guido's original concept for Python 3 was to make changes to the language primarily to remove the warts that had grown in the preceding versions. Python 3 was not to be a complete redesign, but instead an evolution of the language, and while maintaining full backward compatibility with Python 2 was explicitly off the table, neither were gratuitous changes in syntax or semantics acceptable. In most cases, Python 2 code can be translated fairly easily to Python 3, sometimes entirely mechanically by such tools as 2to3 [1] (there's also a non-trivial subset of the language that will run without modification on both 2.7 and 3.x).

Because maintaining multiple versions of Python is a significant drag on the resources of the Python developers, and because the improvements to the language and libraries embodied in Python 3 are so important, it was decided to end the Python 2 lineage with Python 2.7. Thus, all new development occurs in the Python 3 line of development, and there will never be an official Python 2.8 release. Python 2.7 will however be maintained for longer than the usual period of time.

Here are some highlights of the significant improvements in Python 3. You can read in more detail on the differences [2] between Python 2 and Python 3. There are also many good guides on porting [3] from Python 2 to Python 3.

Strings and bytes

Python 2's basic original strings are called 8-bit strings, and they play a dual role in Python 2 as both ASCII text and as byte sequences. While Python 2 also has a unicode string type, the fundamental ambiguity of the core string type, coupled with Python 2's default behavior of supporting automatic coercion from 8-bit strings to unicode objects when the two are combined, often leads to UnicodeErrors. Python 3's standard string type is Unicode based, and Python 3 adds a dedicated bytes type, but critically, no automatic coercion between bytes and unicode strings is provided. The closest the language gets to implicit coercion is a few text-based APIs that assume a default encoding (usually UTF-8) if no encoding is explicitly stated. Thus, the core interpreter, its I/O libraries, module names, etc. are clear in their distinction between unicode strings and bytes. Python 3's unicode support even extends to the filesystem, so that non-ASCII file names are natively supported.

This string/bytes clarity is often a source of difficulty in transitioning existing code to Python 3, because many third party libraries and applications are themselves ambiguous in this distinction. Once migrated though, most UnicodeErrors can be eliminated.
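The distinction is easy to demonstrate in Python 3; a minimal sketch:

```python
# Python 3: text (str) and binary data (bytes) are distinct types
text = "héllo"
data = text.encode("utf-8")      # explicit str -> bytes conversion
assert isinstance(data, bytes)
assert data.decode("utf-8") == text

# No implicit coercion: mixing the two raises TypeError
mixed = "unset"
try:
    text + data
except TypeError:
    mixed = None
assert mixed is None
```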

Numbers

Python 2 has two basic integer types, a native machine-sized int type, and an arbitrary length long type. These have been merged in Python 3 into a single int type analogous to Python 2's long type.

In addition, true division (the / operator) between integers now always produces a floating point result, even when the result is integral; truncating integer division remains available via the // operator.
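Both changes can be seen directly; a minimal sketch:

```python
# A single arbitrary-precision int type replaces Python 2's int/long pair
big = 2 ** 100
assert isinstance(big, int)
assert big == 1267650600228229401496703205376

# True division always yields a float; floor division keeps integers
assert 5 / 2 == 2.5
assert 5 // 2 == 2
assert 4 / 2 == 2.0 and isinstance(4 / 2, float)
```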

Classes

Python 2 has two core class hierarchies, often called classic classes and new-style classes. The latter allow for such things as inheriting from the builtin basic types, support descriptor based tools like the property builtin and provide a generally more sane and coherent system for dealing with multiple inheritance. Python 3 provided the opportunity to completely drop support for classic classes, so all classes in Python 3 automatically use the new-style semantics (although that's a misnomer now). There is no need to explicitly inherit from object or set the default metaclass to enable them (in fact, setting a default metaclass at the module level is no longer supported; the default metaclass is always type).

The mechanism for explicitly specifying a metaclass has also changed to use a metaclass keyword argument in the class header line rather than a __metaclass__ magic attribute in the class body.
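A minimal sketch of both points in Python 3 syntax:

```python
# Every Python 3 class is new-style; no explicit object base needed
class Plain:
    pass

assert isinstance(Plain, type)
assert issubclass(Plain, object)

# The metaclass is named in the class header, not via a
# __metaclass__ attribute in the class body
class Meta(type):
    def __new__(mcls, name, bases, ns):
        ns["tagged"] = True  # inject an attribute at class creation
        return super().__new__(mcls, name, bases, ns)

class Tagged(metaclass=Meta):
    pass

assert Tagged.tagged is True
assert type(Tagged) is Meta
```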

Multiple spellings

There are many cases in Python 2 where multiple spellings of some constructs exist, such as repr() and backticks, or the two inequality operators != and <>. In all cases, Python 3 has chosen exactly one spelling and removed the other (e.g. repr() and != were kept).

Imports

In Python 3, implicit relative imports within packages are no longer available - only absolute imports and explicit relative imports are supported. In addition, star imports (e.g. from x import *) are only permitted in module level code.

Also, some areas of the standard library have been reorganized to make the naming scheme more intuitive. Some rarely used builtins have been relocated to standard library modules.

Iterators and views

Many APIs that returned concrete lists in Python 2 now return iterators or lightweight views in Python 3.
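Dictionary methods and builtins like map illustrate the change; a minimal sketch:

```python
d = {"a": 1, "b": 2}
keys = d.keys()               # a dynamic view, not a list as in Python 2
assert not isinstance(keys, list)

d["c"] = 3                    # views reflect later modifications
assert sorted(keys) == ["a", "b", "c"]

doubled = map(lambda x: x * 2, [1, 2, 3])   # a lazy iterator, not a list
assert not isinstance(doubled, list)
assert list(doubled) == [2, 4, 6]
```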

pep-0405 Python Virtual Environments

PEP:405
Title:Python Virtual Environments
Version:$Revision$
Last-Modified:$Date$
Author:Carl Meyer <carl at oddbird.net>
BDFL-Delegate:Nick Coghlan
Status:Final
Type:Standards Track
Content-Type:text/x-rst
Created:13-Jun-2011
Python-Version:3.3
Post-History:24-Oct-2011, 28-Oct-2011, 06-Mar-2012, 24-May-2012
Resolution:http://mail.python.org/pipermail/python-dev/2012-May/119668.html

Abstract

This PEP proposes to add to Python a mechanism for lightweight "virtual environments" with their own site directories, optionally isolated from system site directories. Each virtual environment has its own Python binary (allowing creation of environments with various Python versions) and can have its own independent set of installed Python packages in its site directories, but shares the standard library with the base installed Python.

Motivation

The utility of Python virtual environments has already been well established by the popularity of existing third-party virtual-environment tools, primarily Ian Bicking's virtualenv [1]. Virtual environments are already widely used for dependency management and isolation, ease of installing and using Python packages without system-administrator access, and automated testing of Python software across multiple Python versions, among other uses.

Existing virtual environment tools suffer from a lack of support in the behavior of Python itself. Tools such as rvirtualenv [2], which do not copy the Python binary into the virtual environment, cannot provide reliable isolation from system site directories. Virtualenv, which does copy the Python binary, is forced to duplicate much of Python's site module and manually symlink/copy an ever-changing set of standard-library modules into the virtual environment in order to perform a delicate bootstrapping dance at every startup. (Virtualenv must copy the binary in order to provide isolation, as Python dereferences a symlinked executable before searching for sys.prefix.)

The PYTHONHOME environment variable, Python's only existing built-in solution for virtual environments, requires copying/symlinking the entire standard library into every environment. Copying the whole standard library is not a lightweight solution, and cross-platform support for symlinks remains inconsistent (even on Windows platforms that do support them, creating them often requires administrator privileges).

A virtual environment mechanism integrated with Python and drawing on years of experience with existing third-party tools can lower maintenance, raise reliability, and be more easily available to all Python users.

Specification

When the Python binary is executed, it attempts to determine its prefix (which it stores in sys.prefix), which is then used to find the standard library and other key files, and by the site module to determine the location of the site-package directories. Currently the prefix is found (assuming PYTHONHOME is not set) by first walking up the filesystem tree looking for a marker file (os.py) that signifies the presence of the standard library, and if none is found, falling back to the build-time prefix hardcoded in the binary.

This PEP proposes to add a new first step to this search. If a pyvenv.cfg file is found either adjacent to the Python executable or one directory above it (if the executable is a symlink, it is not dereferenced), this file is scanned for lines of the form key = value. If a home key is found, this signifies that the Python binary belongs to a virtual environment, and the value of the home key is the directory containing the Python executable used to create this virtual environment.
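The scan described above amounts to a simple key/value parse. A hypothetical sketch (read_pyvenv_home is illustrative; the real scan happens in the interpreter's startup code, not in a Python helper like this):

```python
import os
import tempfile

def read_pyvenv_home(executable):
    """Return the 'home' value from a pyvenv.cfg located next to the
    executable or one directory above it, or None if absent."""
    exe_dir = os.path.dirname(os.path.abspath(executable))
    for candidate in (exe_dir, os.path.dirname(exe_dir)):
        cfg = os.path.join(candidate, "pyvenv.cfg")
        if not os.path.isfile(cfg):
            continue
        with open(cfg) as f:
            for line in f:
                key, sep, value = line.partition("=")
                if sep and key.strip() == "home":
                    return value.strip()
    return None

# Demonstrate against a throwaway directory layout
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "bin"))
with open(os.path.join(root, "pyvenv.cfg"), "w") as f:
    f.write("home = /usr/local/bin\n")
home = read_pyvenv_home(os.path.join(root, "bin", "python3"))
assert home == "/usr/local/bin"
```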

In this case, prefix-finding continues as normal using the value of the home key as the effective Python binary location, which finds the prefix of the base installation. sys.base_prefix is set to this value, while sys.prefix is set to the directory containing pyvenv.cfg.

(If pyvenv.cfg is not found or does not contain the home key, prefix-finding continues normally, and sys.prefix will be equal to sys.base_prefix.)

Also, sys.base_exec_prefix is added, and handled similarly with regard to sys.exec_prefix. (sys.exec_prefix is the equivalent of sys.prefix, but for platform-specific files; by default it has the same value as sys.prefix.)
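Both new attributes can be checked at runtime in Python 3.3 and later; a minimal sketch:

```python
import sys

# Outside a venv, sys.prefix == sys.base_prefix; inside a venv,
# sys.prefix points at the environment and sys.base_prefix at the
# base installation (likewise for the exec_prefix pair)
assert isinstance(sys.base_prefix, str)
assert isinstance(sys.base_exec_prefix, str)
in_venv = sys.prefix != sys.base_prefix
assert isinstance(in_venv, bool)
```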

The site and sysconfig standard-library modules are modified such that the standard library and header files are found relative to sys.base_prefix / sys.base_exec_prefix, while site-package directories ("purelib" and "platlib", in sysconfig terms) are still found relative to sys.prefix / sys.exec_prefix.

Thus, a Python virtual environment in its simplest form would consist of nothing more than a copy or symlink of the Python binary accompanied by a pyvenv.cfg file and a site-packages directory.

Isolation from system site-packages

By default, a virtual environment is entirely isolated from the system-level site-packages directories.

If the pyvenv.cfg file also contains a key include-system-site-packages with a value of true (not case sensitive), the site module will also add the system site directories to sys.path after the virtual environment site directories. Thus system-installed packages will still be importable, but a package of the same name installed in the virtual environment will take precedence.

PEP 370 user-level site-packages are considered part of the system site-packages for venv purposes: they are not available from an isolated venv, but are available from an include-system-site-packages = true venv.
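Putting the two keys together, a pyvenv.cfg for an environment that can see system site-packages would contain (the home path is illustrative):

```
home = /usr/bin
include-system-site-packages = true
```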

Creating virtual environments

This PEP also proposes adding a new venv module to the standard library which implements the creation of virtual environments. This module can be executed using the -m flag:

python3 -m venv /path/to/new/virtual/environment

A pyvenv installed script is also provided to make this more convenient:

pyvenv /path/to/new/virtual/environment

Running this command creates the target directory (creating any parent directories that don't exist already) and places a pyvenv.cfg file in it with a home key pointing to the Python installation the command was run from. It also creates a bin/ (or Scripts on Windows) subdirectory containing a copy (or symlink) of the python3 executable, and the pysetup3 script from the packaging standard library module (to facilitate easy installation of packages from PyPI into the new venv). And it creates an (initially empty) lib/pythonX.Y/site-packages (or Lib\site-packages on Windows) subdirectory.

If the target directory already exists an error will be raised, unless the --clear option was provided, in which case the target directory will be deleted and virtual environment creation will proceed as usual.

The created pyvenv.cfg file also includes the include-system-site-packages key, set to true if pyvenv is run with the --system-site-packages option, false by default.

Multiple paths can be given to pyvenv, in which case an identical venv will be created, according to the given options, at each provided path.

The venv module also places "shell activation scripts" for POSIX and Windows systems in the bin or Scripts directory of the venv. These scripts simply add the virtual environment's bin (or Scripts) directory to the front of the user's shell PATH. This is not strictly necessary for use of a virtual environment (as an explicit path to the venv's python binary or scripts can just as well be used), but it is convenient.

In order to allow pysetup and other Python package managers to install packages into the virtual environment the same way they would install into a normal Python installation, and avoid special-casing virtual environments in sysconfig beyond using sys.base_prefix in place of sys.prefix where appropriate, the internal virtual environment layout mimics the layout of the Python installation itself on each platform. So a typical virtual environment layout on a POSIX system would be:

pyvenv.cfg
bin/python3
bin/python
bin/pysetup3
include/
lib/python3.3/site-packages/

While on a Windows system:

pyvenv.cfg
Scripts/python.exe
Scripts/python3.dll
Scripts/pysetup3.exe
Scripts/pysetup3-script.py
        ... other DLLs and pyds...
Include/
Lib/site-packages/

Third-party packages installed into the virtual environment will have their Python modules placed in the site-packages directory, and their executables placed in bin/ or Scripts.

Note

On a normal Windows system-level installation, the Python binary itself wouldn't go inside the "Scripts/" subdirectory, as it does in the default venv layout. This is useful in a virtual environment so that a user only has to add a single directory to their shell PATH in order to effectively "activate" the virtual environment.

Note

On Windows, it is necessary to also copy or symlink DLLs and pyd files from compiled stdlib modules into the env, because if the venv is created from a non-system-wide Python installation, Windows won't be able to find the Python installation's copies of those files when Python is run from the venv.

Sysconfig install schemes and user-site

This approach explicitly chooses not to introduce a new sysconfig install scheme for venvs. Rather, by modifying sys.prefix we ensure that existing install schemes which base locations on sys.prefix will simply work in a venv. Installation to other install schemes (for instance, the user-site schemes) whose paths are not relative to sys.prefix will not be affected by a venv at all.

It may be feasible to create an alternative implementation of Python virtual environments based on a virtual-specific sysconfig scheme, but it would be less robust, as it would require more code to be aware of whether it is operating within a virtual environment or not.

Include files

Current virtualenv handles include files in this way:

On POSIX systems where the installed Python's include files are found in ${base_prefix}/include/pythonX.X, virtualenv creates ${venv}/include/ and symlinks ${base_prefix}/include/pythonX.X to ${venv}/include/pythonX.X. On Windows, where Python's include files are found in {{ sys.prefix }}/Include and symlinks are not reliably available, virtualenv copies {{ sys.prefix }}/Include to ${venv}/Include. This ensures that extension modules built and installed within the virtualenv will always find the Python header files they need in the expected location relative to sys.prefix.

This solution is not ideal when an extension module installs its own header files, as the default installation location for those header files may be a symlink to a system directory that may not be writable. One installer, pip, explicitly works around this by installing header files to a nonstandard location ${venv}/include/site/pythonX.X/, as in Python there's currently no standard abstraction for a site-specific include directory.

This PEP proposes a slightly different approach, though one with essentially the same effect and the same set of advantages and disadvantages. Rather than symlinking or copying include files into the venv, we simply modify the sysconfig schemes so that header files are always sought relative to base_prefix rather than prefix. (We also create an include/ directory within the venv, so installers have somewhere to put include files installed within the env).

Better handling of include files in distutils/packaging and, by extension, pyvenv, is an area that may deserve its own future PEP. For now, we note that the behavior of virtualenv has thus far proved to be at least "good enough" in practice.

API

The high-level method described above makes use of a simple API which provides mechanisms for third-party virtual environment creators to customize environment creation according to their needs.

The venv module contains an EnvBuilder class which accepts the following keyword arguments on instantiation:

  • system_site_packages - A Boolean value indicating that the system Python site-packages should be available to the environment. Defaults to False.
  • clear - A Boolean value which, if true, will delete any existing target directory instead of raising an exception. Defaults to False.
  • symlinks - A Boolean value indicating whether to attempt to symlink the Python binary (and any necessary DLLs or other binaries, e.g. pythonw.exe), rather than copying. Defaults to False.

The instantiated env-builder has a create method, which takes as required argument the path (absolute or relative to the current directory) of the target directory which is to contain the virtual environment. The create method either creates the environment in the specified directory, or raises an appropriate exception.

The venv module also provides a module-level create function as a convenience:

def create(env_dir,
           system_site_packages=False, clear=False, symlinks=False):
    builder = EnvBuilder(
        system_site_packages=system_site_packages,
        clear=clear,
        symlinks=symlinks)
    builder.create(env_dir)

Creators of third-party virtual environment tools are free to use the provided EnvBuilder class as a base class.

The create method of the EnvBuilder class illustrates the hooks available for customization:

def create(self, env_dir):
    """
    Create a virtualized Python environment in a directory.

    :param env_dir: The target directory to create an environment in.

    """
    env_dir = os.path.abspath(env_dir)
    context = self.create_directories(env_dir)
    self.create_configuration(context)
    self.setup_python(context)
    self.post_setup(context)

Each of the methods create_directories, create_configuration, setup_python, and post_setup can be overridden. The functions of these methods are:

  • create_directories - creates the environment directory and all necessary directories, and returns a context object. This is just a holder for attributes (such as paths), for use by the other methods.
  • create_configuration - creates the pyvenv.cfg configuration file in the environment.
  • setup_python - creates a copy of the Python executable (and, under Windows, DLLs) in the environment.
  • post_setup - A (no-op by default) hook method which can be overridden in third party subclasses to pre-install packages or install scripts in the virtual environment.
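A hypothetical third-party subclass using the post_setup hook might look like this (LoggingEnvBuilder and the stand-in context object are illustrative; the venv module here is the one this PEP proposes, available since Python 3.3):

```python
import types
import venv

class LoggingEnvBuilder(venv.EnvBuilder):
    """Record each environment directory as it is set up."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.created = []

    def post_setup(self, context):
        # post_setup is the designated no-op hook for third-party
        # customisation such as pre-installing packages or scripts
        self.created.append(context.env_dir)

# Exercise the hook directly with a stand-in context object
builder = LoggingEnvBuilder()
builder.post_setup(types.SimpleNamespace(env_dir="/tmp/example-env"))
assert builder.created == ["/tmp/example-env"]
```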

In addition, EnvBuilder provides a utility method that can be called from post_setup in subclasses to assist in installing custom scripts into the virtual environment. The method install_scripts accepts as arguments the context object (see above) and a path to a directory. The directory should contain subdirectories "common", "posix", "nt", each containing scripts destined for the bin directory in the environment. The contents of "common" and the directory corresponding to os.name are copied after doing some text replacement of placeholders:

  • __VENV_DIR__ is replaced with absolute path of the environment directory.
  • __VENV_NAME__ is replaced with the environment name (final path segment of environment directory).
  • __VENV_BIN_NAME__ is replaced with the name of the bin directory (either bin or Scripts).
  • __VENV_PYTHON__ is replaced with the absolute path of the environment's executable.
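The placeholder substitution amounts to simple text replacement. A hypothetical sketch (the helper name and sample script are illustrative, not the venv module's actual internals):

```python
import os

def substitute_placeholders(text, env_dir, bin_name, env_exe):
    # Replace each documented placeholder with its concrete value
    replacements = {
        "__VENV_DIR__": env_dir,
        "__VENV_NAME__": os.path.basename(env_dir),
        "__VENV_BIN_NAME__": bin_name,
        "__VENV_PYTHON__": env_exe,
    }
    for placeholder, value in replacements.items():
        text = text.replace(placeholder, value)
    return text

script = 'exec "__VENV_PYTHON__" "$@"  # env: __VENV_NAME__'
result = substitute_placeholders(
    script, "/envs/demo", "bin", "/envs/demo/bin/python")
assert result == 'exec "/envs/demo/bin/python" "$@"  # env: demo'
```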

The DistributeEnvBuilder subclass in the reference implementation illustrates how the customization hook can be used in practice to pre-install Distribute into the virtual environment. It's not envisaged that DistributeEnvBuilder will be actually added to Python core, but it makes the reference implementation more immediately useful for testing and exploratory purposes.

Backwards Compatibility

Splitting the meanings of sys.prefix

Any virtual environment tool along these lines (which attempts to isolate site-packages, while still making use of the base Python's standard library with no need for it to be symlinked into the virtual environment) is proposing a split between two different meanings (among others) that are currently both wrapped up in sys.prefix: the answers to the questions "Where is the standard library?" and "Where is the site-packages location where third-party modules should be installed?"

This split could be handled by introducing a new sys attribute for either the former prefix or the latter prefix. Either option potentially introduces some backwards-incompatibility with software written to assume the other meaning for sys.prefix. (Such software should preferably be using the APIs in the site and sysconfig modules to answer these questions rather than using sys.prefix directly, in which case there is no backwards-compatibility issue, but in practice sys.prefix is sometimes used.)

The documentation [7] for sys.prefix describes it as "A string giving the site-specific directory prefix where the platform independent Python files are installed," and specifically mentions the standard library and header files as found under sys.prefix. It does not mention site-packages.

Maintaining this documented definition would mean leaving sys.prefix pointing to the base system installation (which is where the standard library and header files are found), and introducing a new value in sys (something like sys.site_prefix) to point to the prefix for site-packages. This would maintain the documented semantics of sys.prefix, but risk breaking isolation if third-party code uses sys.prefix rather than sys.site_prefix or the appropriate site API to find site-packages directories.

The most notable case is probably setuptools [3] and its fork distribute [4], which mostly use distutils and sysconfig APIs, but do use sys.prefix directly to build up a list of site directories for pre-flight checking where pth files can usefully be placed.

Otherwise, a Google Code Search [5] turns up what appears to be a roughly even mix of usage between packages using sys.prefix to build up a site-packages path and packages using it to e.g. eliminate the standard-library from code-execution tracing.

Although it requires modifying the documented definition of sys.prefix, this PEP prefers to have sys.prefix point to the virtual environment (where site-packages is found), and introduce sys.base_prefix to point to the standard library and Python header files. Rationale for this choice:

  • It is preferable to err on the side of greater isolation of the virtual environment.
  • Virtualenv already modifies sys.prefix to point at the virtual environment, and in practice this has not been a problem.
  • No modification is required to setuptools/distribute.

Impact on other Python implementations

The majority of this PEP's changes occur in the standard library, which is shared by other Python implementations and should not present any problem.

Other Python implementations will need to replicate the new sys.prefix-finding behavior of the interpreter bootstrap, including locating and parsing the pyvenv.cfg file, if it is present.

Reference Implementation

The reference implementation is found in a clone of the CPython Mercurial repository [6]. To test it, build and run bin/pyvenv /path/to/new/venv to create a virtual environment.

pep-0406 Improved Encapsulation of Import State

PEP:406
Title:Improved Encapsulation of Import State
Version:$Revision$
Last-Modified:$Date$
Author:Nick Coghlan <ncoghlan at gmail.com>, Greg Slodkowicz <jergosh at gmail.com>
Status:Withdrawn
Type:Standards Track
Content-Type:text/x-rst
Created:4-Jul-2011
Python-Version:3.4
Post-History:31-Jul-2011, 13-Nov-2011, 4-Dec-2011

Abstract

This PEP proposes the introduction of a new 'ImportEngine' class as part of importlib which would encapsulate all state related to importing modules into a single object. Creating new instances of this object would then provide an alternative to completely replacing the built-in implementation of the import statement, by overriding the __import__() function. To work with the builtin import functionality and importing via import engine objects, this PEP proposes a context management based approach to temporarily replacing the global import state.

The PEP also proposes inclusion of a GlobalImportEngine subclass and a globally accessible instance of that class, which "writes through" to the process global state. This provides a backwards compatible bridge between the proposed encapsulated API and the legacy process global state, and allows straightforward support for related state updates (e.g. selectively invalidating path cache entries when sys.path is modified).

PEP Withdrawal

The import system has seen substantial changes since this PEP was originally written, as part of PEP 420 in Python 3.3 and PEP 451 in Python 3.4.

While providing an encapsulation of the import state is still highly desirable, it is better tackled in a new PEP using PEP 451 as a foundation, and permitting only the use of PEP 451 compatible finders and loaders (as those avoid many of the issues of direct manipulation of global state associated with the previous loader API).

Rationale

Currently, most state related to the import system is stored as module level attributes in the sys module. The one exception is the import lock, which is not accessible directly, but only via the related functions in the imp module. The current process global import state comprises:

  • sys.modules
  • sys.path
  • sys.path_hooks
  • sys.meta_path
  • sys.path_importer_cache
  • the import lock (imp.lock_held()/acquire_lock()/release_lock())
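All of this state except the import lock is visible directly as attributes of the sys module; a minimal sketch:

```python
import sys

# The process-global import state enumerated above
assert isinstance(sys.modules, dict)
assert isinstance(sys.path, list)
assert isinstance(sys.path_hooks, list)
assert isinstance(sys.meta_path, list)
assert isinstance(sys.path_importer_cache, dict)
```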

Isolating this state would allow multiple import states to be conveniently stored within a process. Placing the import functionality in a self-contained object would also allow subclassing to add additional features (e.g. module import notifications or fine-grained control over which modules can be imported). The engine would also be subclassed to make it possible to use the import engine API to interact with the existing process-global state.

The namespace PEPs (especially PEP 402) raise a potential need for additional process global state, in order to correctly update package paths as sys.path is modified.

Finally, providing a coherent object for all this state makes it feasible to also provide context management features that allow the import state to be temporarily substituted.

Proposal

We propose introducing an ImportEngine class to encapsulate import functionality. This includes an __import__() method which can be used as an alternative to the built-in __import__() when desired and also an import_module() method, equivalent to importlib.import_module() [3].

Since there are global import state invariants that are assumed and should be maintained, we introduce a GlobalImportState class with an interface identical to ImportEngine but directly accessing the current global import state. This can be easily implemented using class properties.

Specification

ImportEngine API

The proposed extension consists of the following objects:

importlib.engine.ImportEngine

from_engine(self, other)

Create a new import object from another ImportEngine instance. The new object is initialised with a copy of the state in other. When called on importlib.engine.sysengine, from_engine() can be used to create an ImportEngine object with a copy of the global import state.

__import__(self, name, globals={}, locals={}, fromlist=[], level=0)

Reimplementation of the builtin __import__() function. The import of a module will proceed using the state stored in the ImportEngine instance rather than the global import state. For full documentation of __import__ functionality, see [2]. __import__() from ImportEngine and its subclasses can be used to customise the behaviour of the import statement by replacing __builtin__.__import__ with ImportEngine().__import__.

import_module(name, package=None)

A reimplementation of importlib.import_module() which uses the import state stored in the ImportEngine instance. See [3] for a full reference.

modules, path, path_hooks, meta_path, path_importer_cache

Instance-specific versions of their process-global sys equivalents.

importlib.engine.GlobalImportEngine(ImportEngine)

Convenience class to provide engine-like access to the global state. Provides __import__(), import_module() and from_engine() methods like ImportEngine but writes through to the global state in sys.

To support various namespace package mechanisms, when sys.path is altered, tools like pkgutil.extend_path should be used to also modify other parts of the import state (in this case, package __path__ attributes). The path importer cache should also be invalidated when a variety of changes are made.

The ImportEngine API will provide convenience methods that automatically make related import state updates as part of a single operation.

Global variables

importlib.engine.sysengine

A precreated instance of GlobalImportEngine. Intended for use by importers and loaders that have been updated to accept optional engine parameters, and for use with ImportEngine.from_engine(sysengine) to start with a copy of the process global import state.

No changes to finder/loader interfaces

Rather than attempting to update the PEP 302 APIs to accept additional state, this PEP proposes that ImportEngine support the context management protocol (similar to the context substitution mechanisms in the decimal module).

The context management mechanism for ImportEngine would:

  • On entry:
      • Acquire the import lock
      • Substitute the global import state with the import engine's own state
  • On exit:
      • Restore the previous global import state
      • Release the import lock

The precise API for this is TBD (but will probably use a distinct context management object, along the lines of that created by decimal.localcontext).

Open Issues

API design for falling back to global import state

The current proposal relies on the from_engine() API to fall back to the global import state. It may be desirable to offer a variant that instead falls back to the global import state dynamically.

However, one big advantage of starting with an "as isolated as possible" design is that it becomes possible to experiment with subclasses that blur the boundaries between the engine instance state and the process global state in various ways.

Builtin and extension modules must be process global

Due to platform limitations, only one copy of each builtin and extension module can readily exist in each process. Accordingly, it is impossible for each ImportEngine instance to load such modules independently.

The simplest solution is for ImportEngine to refuse to load such modules, raising ImportError. GlobalImportEngine would be able to load them normally.

ImportEngine will still return such modules from a prepopulated module cache - it's only loading them directly which causes problems.
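The cache-hit-versus-fresh-load distinction can be made concrete with a small sketch (the function name and structure are ours, for illustration only):

```python
import sys

# Sketch of the policy above: serve builtin/extension modules from the
# engine's prepopulated module cache, but refuse to load a fresh,
# per-engine copy of them.
def load_in_engine(engine_modules, name):
    if name in engine_modules:
        return engine_modules[name]  # cached process-global copy is fine
    if name in sys.builtin_module_names:
        raise ImportError("%r is a builtin module and is process-global; "
                          "it cannot be loaded per-engine" % name)
    # ...the normal per-engine import machinery would run here...
    raise ImportError(name)  # placeholder for the real lookup
```

A GlobalImportEngine would not need the guard, since loading into the process-global state is exactly what it is for.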

Scope of substitution

Related to the previous open issue is the question of what state to substitute when using the context management API. It is currently the case that replacing sys.modules can be unreliable due to cached references and there's the underlying fact that having independent copies of some modules is simply impossible due to platform limitations.

As part of this PEP, it will be necessary to document explicitly:

  • Which parts of the global import state can be substituted (and declare code which caches references to that state without dealing with the substitution case buggy)
  • Which parts must be modified in-place (and hence are not substituted by the ImportEngine context management API, or otherwise scoped to ImportEngine instances)

Reference Implementation

A reference implementation [4] for an earlier draft of this PEP, based on Brett Cannon's importlib, has been developed by Greg Slodkowicz as part of the 2011 Google Summer of Code. Note that the current implementation avoids modifying existing code, and hence duplicates a lot of things unnecessarily. An actual implementation would just modify any such affected code in place.

That earlier draft of the PEP proposed changing the PEP 302 APIs to support passing in an optional engine instance. This had the (serious) downside of not correctly affecting further imports from the imported module, hence the change to the context management based proposal for substituting the global state.

References

[1]PEP 302, New Import Hooks, J van Rossum, Moore (http://www.python.org/dev/peps/pep-0302)
[2]__import__() builtin function, The Python Standard Library documentation (http://docs.python.org/library/functions.html#__import__)
[3](1, 2) Importlib documentation, Cannon (http://docs.python.org/dev/library/importlib)
[4]Reference implementation (https://bitbucket.org/jergosh/gsoc_import_engine/src/default/Lib/importlib/engine.py)

pep-0407 New release cycle and introducing long-term support versions

PEP:407
Title:New release cycle and introducing long-term support versions
Version:$Revision$
Last-Modified:$Date$
Author:Antoine Pitrou <solipsis at pitrou.net>, Georg Brandl <georg at python.org>, Barry Warsaw <barry at python.org>
Status:Deferred
Type:Process
Content-Type:text/x-rst
Created:2012-01-12
Post-History:http://mail.python.org/pipermail/python-dev/2012-January/115838.html
Resolution:TBD

Abstract

Finding a release cycle for an open-source project is a delicate exercise in managing mutually contradicting constraints: developer manpower, availability of release management volunteers, ease of maintenance for users and third-party packagers, quick availability of new features (and behavioural changes), availability of bug fixes without pulling in new features or behavioural changes.

The current release cycle errs on the conservative side. It is adequate for people who value stability over reactivity. This PEP is an attempt to keep the stability that has become a Python trademark, while offering a more fluid release of features, by introducing the notion of long-term support versions.

Scope

This PEP doesn't try to change the maintenance period or release scheme for the 2.7 branch. Only 3.x versions are considered.

Proposal

Under the proposed scheme, there would be two kinds of feature versions (sometimes dubbed "minor versions", for example 3.2 or 3.3): normal feature versions and long-term support (LTS) versions.

Normal feature versions would get either zero or at most one bugfix release; the latter only if needed to fix critical issues. Security fix handling for these branches needs to be decided.

LTS versions would get regular bugfix releases until the next LTS version is out. They then would go into security fixes mode, up to a termination date at the release manager's discretion.

Periodicity

A new feature version would be released every X months. We tentatively propose X = 6 months.

LTS versions would be one out of N feature versions. We tentatively propose N = 4.

With these figures, a new LTS version would be out every 24 months, and remain supported until the next LTS version 24 months later. This is mildly similar to today's 18-month bugfix cycle for every feature version.
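The arithmetic behind these figures is simple enough to check directly:

```python
# Working through the proposed cadence: a feature release every X = 6
# months, with every Nth feature release (N = 4) being an LTS.
X, N = 6, 4
lts_interval_months = X * N
assert lts_interval_months == 24  # a new LTS every 24 months
# Each LTS is supported until the next one appears, i.e. for 24 months,
# somewhat longer than today's 18-month bugfix period per feature version.
assert lts_interval_months > 18
```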

Pre-release versions

More frequent feature releases imply a smaller number of disruptive changes per release. Therefore, the number of pre-release builds (alphas and betas) can be brought down considerably. Two alpha builds and a single beta build would probably be enough in the regular case. The number of release candidates depends, as usual, on the number of last-minute fixes before final release.

Effects

Effect on development cycle

More feature releases might mean more stress on the development and release management teams. This is quantitatively alleviated by the smaller number of pre-release versions; and qualitatively by the lesser amount of disruptive changes (meaning less potential for breakage). The shorter feature freeze period (after the first beta build until the final release) is easier to accept. The rush for adding features just before feature freeze should also be much smaller.

Effect on bugfix cycle

The effect on fixing bugs should be minimal with the proposed figures. The same number of branches would be simultaneously open for bugfix maintenance (two until 2.x is terminated, then one).

Effect on workflow

The workflow for new features would be the same: developers would only commit them on the default branch.

The workflow for bug fixes would be slightly updated: developers would commit bug fixes to the current LTS branch (for example 3.3) and then merge them into default.

If some critical fixes are needed to a non-LTS version, they can be grafted from the current LTS branch to the non-LTS branch, just like fixes are ported from 3.x to 2.7 today.

Effect on the community

People who value stability can just synchronize on the LTS releases which, with the proposed figures, would give a similar support cycle (both in duration and in stability).

People who value reactivity and access to new features (without taking the risk to install alpha versions or Mercurial snapshots) would get much more value from the new release cycle than currently.

People who want to contribute new features or improvements would be more motivated to do so, knowing that their contributions will be more quickly available to normal users. Also, a smaller feature freeze period makes it less cumbersome to interact with contributors of features.

Discussion

These are open issues that should be worked out during discussion:

  • Decide on X (months between feature releases) and N (feature releases per LTS release) as defined above.
  • For given values of X and N, is the no-bugfix-releases policy for non-LTS versions feasible?
  • What is the policy for security fixes?
  • Restrict new syntax and similar changes (i.e. everything that was prohibited by PEP 3003) to LTS versions?
  • What is the effect on packagers such as Linux distributions?
  • How will release version numbers or other identifying and marketing material make it clear to users which versions are normal feature releases and which are LTS releases? How do we manage user expectations?
  • Does the faster release cycle mean we could some day reach 3.10 and above? Some people expressed a tacit expectation that version numbers always fit in one decimal digit.

A community poll or survey to collect opinions from the greater Python community would be valuable before making a final decision.

pep-0408 Standard library __preview__ package

PEP:408
Title:Standard library __preview__ package
Version:$Revision$
Last-Modified:$Date$
Author:Nick Coghlan <ncoghlan at gmail.com>, Eli Bendersky <eliben at gmail.com>
Status:Rejected
Type:Standards Track
Content-Type:text/x-rst
Created:2012-01-07
Python-Version:3.3
Post-History:2012-01-27
Resolution:http://mail.python.org/pipermail/python-dev/2012-January/115962.html

Abstract

The process of including a new module into the Python standard library is hindered by the API lock-in and promise of backward compatibility implied by a module being formally part of Python. This PEP proposes a transitional state for modules - inclusion in a special __preview__ package for the duration of a minor release (roughly 18 months) prior to full acceptance into the standard library. On one hand, this state provides the module with the benefits of being formally part of the Python distribution. On the other hand, the core development team explicitly states that no promises are made with regards to the module's eventual full inclusion into the standard library, or to the stability of its API, which may change for the next release.

PEP Rejection

Based on his experience with a similar "labs" namespace in Google App Engine, Guido has rejected this PEP [3] in favour of the simpler alternative of explicitly marking provisional modules as such in their documentation.

If a module is otherwise considered suitable for standard library inclusion, but some concerns remain regarding maintainability or certain API details, then the module can be accepted on a provisional basis. While it is considered an unlikely outcome, such modules may be removed from the standard library without a deprecation period if the lingering concerns prove well-founded.

As part of the same announcement, Guido explicitly accepted Matthew Barnett's 'regex' module [4] as a provisional addition to the standard library for Python 3.3 (using the 'regex' name, rather than as a drop-in replacement for the existing 're' module).

Proposal - the __preview__ package

Whenever the Python core development team decides that a new module should be included into the standard library, but isn't entirely sure about whether the module's API is optimal, the module can be placed in a special package named __preview__ for a single minor release.

In the next minor release, the module may either be "graduated" into the standard library (and occupy its natural place within its namespace, leaving the __preview__ package), or be rejected and removed entirely from the Python source tree. If the module ends up graduating into the standard library after spending a minor release in __preview__, its API may be changed according to accumulated feedback. The core development team explicitly makes no guarantees about API stability and backward compatibility of modules in __preview__.

Entry into the __preview__ package marks the start of a transition of the module into the standard library. It means that the core development team assumes responsibility for the module, similarly to any other module in the standard library.

Which modules should go through __preview__

We expect most modules proposed for addition into the Python standard library to go through a minor release in __preview__. There may, however, be some exceptions, such as modules that use a pre-defined API (for example lzma, which generally follows the API of the existing bz2 module), or modules with an API that has wide acceptance in the Python development community.

In any case, modules that are proposed to be added to the standard library, whether via __preview__ or directly, must fulfill the acceptance conditions set by PEP 2.

It is important to stress that the aim of this proposal is not to make the process of adding new modules to the standard library more difficult. On the contrary, it tries to provide a means to add more useful libraries. Modules which are obvious candidates for entry can be added as before. Modules which, due to uncertainties about the API, could be stalled for a long time now have a means to still be distributed with Python, via an incubation period in the __preview__ package.

Criteria for "graduation"

In principle, most modules in the __preview__ package should eventually graduate to the stable standard library. Some reasons for not graduating are:

  • The module may prove to be unstable or fragile, without sufficient developer support to maintain it.
  • A much better alternative module may be found during the preview release

Essentially, the decision will be made by the core developers on a per-case basis. The point to emphasize here is that a module's appearance in the __preview__ package in some release does not guarantee it will continue being part of Python in the next release.

Example

Suppose the example module is a candidate for inclusion in the standard library, but some Python developers aren't convinced that it presents the best API for the problem it intends to solve. The module can then be added to the __preview__ package in release 3.X, importable via:

from __preview__ import example

Assuming the module is then promoted to the standard library proper in release 3.X+1, it will be moved to a permanent location in the library:

import example

And importing it from __preview__ will no longer work.
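Code targeting both releases could have papered over the move with a small fallback helper. This is purely a sketch under the proposal's assumptions (the helper name is ours, and the __preview__ package was never actually added):

```python
import importlib

def import_preview_or_final(name):
    """Import a module from __preview__ if it is still incubating there,
    otherwise from its final top-level location."""
    for qualified in ("__preview__." + name, name):
        try:
            return importlib.import_module(qualified)
        except ImportError:
            continue
    raise ImportError(
        "cannot import %r from __preview__ or the top level" % name)
```

Since __preview__ does not exist, the first candidate always fails today and the helper simply imports the module from its normal location.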

Rationale

Benefits for the core development team

Currently, the core developers are really reluctant to add new interfaces to the standard library. This is because as soon as they're published in a release, API design mistakes get locked in due to backward compatibility concerns.

By gating all major API additions through some kind of a preview mechanism for a full release, we get one full release cycle of community feedback before we lock in the APIs with our standard backward compatibility guarantee.

We can also start integrating preview modules with the rest of the standard library early, so long as we make it clear to packagers that the preview modules should not be considered optional. The only difference between preview APIs and the rest of the standard library is that preview APIs are explicitly exempted from the usual backward compatibility guarantees.

Essentially, the __preview__ package is intended to lower the risk of locking in minor API design mistakes for extended periods of time. Currently, this concern can block new additions, even when the core development team consensus is that a particular addition is a good idea in principle.

Benefits for end users

For future end users, the broadest benefit lies in a better "out-of-the-box" experience - rather than being told "oh, the standard library tools for task X are horrible, download this 3rd party library instead", those superior tools are more likely to be just an import away.

For environments where developers are required to conduct due diligence on their upstream dependencies (severely harming the cost-effectiveness of, or even ruling out entirely, much of the material on PyPI), the key benefit lies in ensuring that anything in the __preview__ package is clearly under python-dev's aegis from at least the following perspectives:

  • Licensing: Redistributed by the PSF under a Contributor Licensing Agreement.
  • Documentation: The documentation of the module is published and organized via the standard Python documentation tools (i.e. ReST source, output generated with Sphinx and published on http://docs.python.org).
  • Testing: The module test suites are run on the python.org buildbot fleet and results published via http://www.python.org/dev/buildbot.
  • Issue management: Bugs and feature requests are handled on http://bugs.python.org
  • Source control: The master repository for the software is published on http://hg.python.org.

Candidates for inclusion into __preview__

For Python 3.3, there are a number of clear current candidates:

Other possible future use cases include:

  • Improved HTTP modules (e.g. requests)
  • HTML 5 parsing support (e.g. html5lib)
  • Improved URL/URI/IRI parsing
  • A standard image API (PEP 368)
  • Encapsulation of the import state (PEP 406)
  • Standard event loop API (PEP 3153)
  • A binary version of WSGI for Python 3 (e.g. PEP 444)
  • Generic function support (e.g. simplegeneric)

Relationship with PEP 407

PEP 407 proposes a change to the core Python release cycle to permit interim releases every 6 months (perhaps limited to standard library updates). If such a change to the release cycle is made, the following policy for the __preview__ namespace is suggested:

  • For long term support releases, the __preview__ namespace would always be empty.
  • New modules would be accepted into the __preview__ namespace only in interim releases that immediately follow a long term support release.
  • All modules added will either be migrated to their final location in the standard library or dropped entirely prior to the next long term support release.

Rejected alternatives and variations

Using __future__

Python already has a "forward-looking" namespace in the form of the __future__ module, so it's reasonable to ask why that can't be re-used for this new purpose.

There are two reasons why doing so is not appropriate:

1. The __future__ module is actually linked to a separate compiler directives feature that can change the way the Python interpreter compiles a module. We don't want that for the preview package - we just want an ordinary Python package.

2. The __future__ module comes with an express promise that names will be maintained in perpetuity, long after the associated features have become the compiler's default behaviour. Again, this is precisely the opposite of what is intended for the preview package - it is almost certain that all names added to the preview will be removed at some point, most likely due to their being moved to a permanent home in the standard library, but also potentially due to their being reverted to third party package status (if community feedback suggests the proposed addition is irredeemably broken).

Versioning the package

One proposed alternative [1] was to add explicit versioning to the __preview__ package, i.e. __preview34__. We think that it's better to simply define that a module being in __preview__ in Python 3.X will either graduate to the normal standard library namespace in Python 3.X+1 or will disappear from the Python source tree altogether. Versioning the __preview__ package complicates the process and does not align well with the main intent of this proposal.

Using a package name without leading and trailing underscores

It was proposed [1] to use a package name like preview or exp, instead of __preview__. This was rejected in the discussion due to the special meaning a "dunder" package name (that is, a name with leading and trailing double-underscores) conveys in Python. Besides, a non-dunder name would suggest normal standard library API stability guarantees, which is not the intention of the __preview__ package.

Preserving pickle compatibility

A pickled class instance based on a module in __preview__ in release 3.X won't be unpickle-able in release 3.X+1, where the module won't be in __preview__. Special code may be added to make this work, but this goes against the intent of this proposal, since it implies backward compatibility. Therefore, this PEP does not propose to preserve pickle compatibility.

Credits

Dj Gilcrease initially proposed the idea of having a __preview__ package in Python [2]. Although his original proposal uses the name __experimental__, we feel that __preview__ conveys the meaning of this package in a better way.

pep-0409 Suppressing exception context

PEP:409
Title:Suppressing exception context
Version:$Revision$
Last-Modified:$Date$
Author:Ethan Furman <ethan at stoneleaf.us>
Status:Final
Type:Standards Track
Content-Type:text/x-rst
Created:26-Jan-2012
Post-History:30-Aug-2002, 01-Feb-2012, 03-Feb-2012
Superseded-By:415
Resolution:http://mail.python.org/pipermail/python-dev/2012-February/116136.html

Abstract

One of the open issues from PEP 3134 is suppressing context: currently there is no way to do it. This PEP proposes one.

Rationale

There are two basic ways to generate exceptions:

  1. Python does it (buggy code, missing resources, ending loops, etc.)
  2. manually (with a raise statement)

When writing libraries, or even just custom classes, it can become necessary to raise exceptions; moreover it can be useful, even necessary, to change from one exception to another. To take an example from my dbf module:

try:
    value = int(value)
except Exception:
    raise DbfError(...)

Whatever the original exception was (ValueError, TypeError, or something else) is irrelevant. The exception from this point on is a DbfError, and the original exception is of no value. However, if this exception is printed, we would currently see both.

Alternatives

Several possibilities have been put forth:

  • raise as NewException()

    Reuses the as keyword; can be confusing since we are not really reraising the originating exception

  • raise NewException() from None

    Follows existing syntax of explicitly declaring the originating exception

  • exc = NewException(); exc.__context__ = None; raise exc

A very verbose equivalent of the previous method

  • raise NewException.no_context(...)

    Make context suppression a class method.

All of the above options will require changes to the core.

Proposal

I propose going with the second option:

raise NewException from None

It has the advantage of using the existing pattern of explicitly setting the cause:

raise KeyError() from NameError()

but because the cause is None the previous context is not displayed by the default exception printing routines.
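The accepted form can be demonstrated directly (runnable on Python 3.3+, where this PEP and its successor PEP 415 are implemented):

```python
# raise ... from None suppresses the implicit context when the traceback
# is printed, without actually discarding the chained exception.
try:
    try:
        int("five")  # raises ValueError
    except ValueError:
        raise KeyError("bad value") from None  # suppress the context
except KeyError as exc:
    assert exc.__cause__ is None          # no explicit cause was set
    assert exc.__context__ is not None    # the ValueError is still recorded
    assert exc.__suppress_context__       # ...but will not be displayed
```

Note that __suppress_context__ is the PEP 415 mechanism that ultimately implemented this behaviour; the original ValueError remains reachable for custom error-reporting code.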

Implementation Discussion

Note: after acceptance of this PEP, a cleaner implementation mechanism was proposed and accepted in PEP 415. Refer to that PEP for more details on the implementation actually used in Python 3.3.

Currently, None is the default for both __context__ and __cause__. In order to support raise ... from None (which would set __cause__ to None) we need a different default value for __cause__. Several ideas were put forth on how to implement this at the language level:

  • Overwrite the previous exception information (side-stepping the issue and leaving __cause__ at None).

    Rejected as this can seriously hinder debugging due to poor error messages [1].

  • Use one of the boolean values in __cause__: False would be the default value, and would be replaced when from ... was used with the explicitly chained exception or None.

    Rejected as this encourages the use of two different object types for __cause__ with one of them (boolean) not allowed to have the full range of possible values (True would never be used).

  • Create a special exception class, __NoException__.

    Rejected as possibly confusing, possibly being mistakenly raised by users, and not being a truly unique value as None, True, and False are.

  • Use Ellipsis as the default value (the ... singleton).

    Accepted.

    Ellipses are commonly used in English as placeholders when words are omitted. This works in our favor here as a signal that __cause__ is omitted, so look in __context__ for more details.

    Ellipsis is not an exception, so cannot be raised.

    There is only one Ellipsis, so no unused values.

    Error information is not thrown away, so custom code can trace the entire exception chain even if the default code does not.

Language Details

To support raise Exception from None, __context__ will stay as it is, but __cause__ will start out as Ellipsis and will change to None when the raise Exception from None form is used.

form                                  __context__           __cause__
raise                                 None                  Ellipsis
reraise                               previous exception    Ellipsis
reraise from None | ChainedException  previous exception    None | explicitly chained exception

The default exception printing routine will then:

  • If __cause__ is Ellipsis, the __context__ (if any) will be printed.
  • If __cause__ is None, the __context__ will not be printed.
  • If __cause__ is anything else, __cause__ will be printed.

In both of the latter cases the exception chain will stop being followed.

Because the default value for __cause__ is now Ellipsis, and raise Exception from Cause is simply syntactic sugar for:

_exc = NewException()
_exc.__cause__ = Cause()
raise _exc

Ellipsis, as well as None, is now allowed as a cause:

raise Exception from Ellipsis

Patches

There is a patch for CPython implementing this attached to Issue 6210 [2].

pep-0410 Use decimal.Decimal type for timestamps

PEP:410
Title:Use decimal.Decimal type for timestamps
Version:$Revision$
Last-Modified:$Date$
Author:Victor Stinner <victor.stinner at gmail.com>
Status:Rejected
Type:Standards Track
Content-Type:text/x-rst
Created:01-February-2012
Python-Version:3.3
Resolution:http://mail.python.org/pipermail/python-dev/2012-February/116837.html

Abstract

Decimal becomes the official type for high-resolution timestamps to make Python support new functions using a nanosecond resolution without loss of precision.

Rationale

Python 2.3 introduced float timestamps to support sub-second resolutions. os.stat() uses float timestamps by default since Python 2.5. Python 3.3 introduced functions supporting nanosecond resolutions:

  • os module: futimens(), utimensat()
  • time module: clock_gettime(), clock_getres(), monotonic(), wallclock()

os.stat() reads nanosecond timestamps but returns timestamps as float.

The Python float type uses the binary64 format of the IEEE 754 standard. With a resolution of one nanosecond (10^-9), float timestamps lose precision for values bigger than 2^24 seconds (194 days: 1970-07-14 for an Epoch timestamp).
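This loss is easy to demonstrate:

```python
# The binary64 float has a 52-bit mantissa, so at 2**24 seconds the spacing
# between adjacent representable values (2**-28, about 3.7e-9 s) already
# exceeds one nanosecond.
big = 2.0 ** 24
assert big + 1e-9 == big     # the added nanosecond is rounded away entirely
small = 1.0
assert small + 1e-9 > small  # near the Epoch, nanoseconds still survive
```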

Nanosecond resolution is required to set the exact modification time on filesystems supporting nanosecond timestamps (e.g. ext4, btrfs, NTFS, ...). It helps also to compare the modification time to check if a file is newer than another file. Use cases: copy the modification time of a file using shutil.copystat(), create a TAR archive with the tarfile module, manage a mailbox with the mailbox module, etc.

An arbitrary resolution is preferred over a fixed resolution (like nanoseconds) so that the API does not have to change when a better resolution is required. For example, the NTP protocol uses fractions of 2^32 seconds (approximately 2.3 × 10^-10 second), whereas the NTP protocol version 4 uses fractions of 2^64 seconds (5.4 × 10^-20 second).

Note

With a resolution of 1 microsecond (10^-6), float timestamps lose precision for values bigger than 2^33 seconds (272 years: 2242-03-16 for an Epoch timestamp). With a resolution of 100 nanoseconds (10^-7, the resolution used on Windows), float timestamps lose precision for values bigger than 2^29 seconds (17 years: 1987-01-05 for an Epoch timestamp).

Specification

Add decimal.Decimal as a new type for timestamps. Decimal supports any timestamp resolution, supports arithmetic operations, and is comparable. It is possible to coerce a Decimal to float, even if the conversion may lose precision. The clock resolution can also be stored in a Decimal object.
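A brief sketch of the proposed (ultimately rejected) behaviour, showing the properties claimed above:

```python
from decimal import Decimal

# A Decimal holds the full nanosecond timestamp exactly, supports
# arithmetic and comparison, and can still be coerced to float at the
# cost of precision.
ts = Decimal("1326972924.123456789")
assert ts + Decimal("1E-9") > ts   # exact arithmetic and comparison
assert Decimal(float(ts)) != ts    # round-tripping through float is lossy
```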

Add an optional timestamp argument to:

  • os module: fstat(), fstatat(), lstat(), stat() (st_atime, st_ctime and st_mtime fields of the stat structure), sched_rr_get_interval(), times(), wait3() and wait4()
  • resource module: ru_utime and ru_stime fields of getrusage()
  • signal module: getitimer(), setitimer()
  • time module: clock(), clock_gettime(), clock_getres(), monotonic(), time() and wallclock()

The timestamp argument value can be float or Decimal, float is still the default for backward compatibility. The following functions support Decimal as input:

  • datetime module: date.fromtimestamp(), datetime.fromtimestamp() and datetime.utcfromtimestamp()
  • os module: futimes(), futimesat(), lutimes(), utime()
  • select module: epoll.poll(), kqueue.control(), select()
  • signal module: setitimer(), sigtimedwait()
  • time module: ctime(), gmtime(), localtime(), sleep()

The os.stat_float_times() function is deprecated: use an explicit cast using int() instead.

Note

The decimal module is implemented in Python and is slower than float, but there is a new C implementation which is almost ready for inclusion in CPython.

Backwards Compatibility

The default timestamp type (float) is unchanged, so there is no impact on backward compatibility nor on performance. The new timestamp type, decimal.Decimal, is only returned when requested explicitly.

Objection: clocks accuracy

Computer clocks and operating systems are inaccurate and fail to provide nanosecond accuracy in practice. A nanosecond is what it takes to execute a couple of CPU instructions. Even on a real-time operating system, a nanosecond-precise measurement is already obsolete when it starts being processed by the higher-level application. A single cache miss in the CPU will make the precision worthless.

Note

Linux actually is able to measure time in nanosecond precision, even though it is not able to keep its clock synchronized to UTC with a nanosecond accuracy.

Alternatives: Timestamp types

To support timestamps with an arbitrary or nanosecond resolution, the following types have been considered:

  • decimal.Decimal
  • number of nanoseconds
  • 128-bits float
  • datetime.datetime
  • datetime.timedelta
  • tuple of integers
  • timespec structure

Criteria:

  • Doing arithmetic on timestamps must be possible
  • Timestamps must be comparable
  • An arbitrary resolution, or at least a resolution of one nanosecond without losing precision
  • It should be possible to coerce the new timestamp to float for backward compatibility

A resolution of one nanosecond is enough to support all current C functions.

The best resolution used by operating systems is one nanosecond. In practice, most clocks have an accuracy closer to one microsecond than one nanosecond. A fixed resolution of one nanosecond therefore sounds reasonable.

Number of nanoseconds (int)

A nanosecond resolution is enough for all current C functions, so a timestamp can simply be a number of nanoseconds: an integer, not a float.

The number-of-nanoseconds format has been rejected because it would require adding new specialized functions for this format: it is not possible to differentiate a number of nanoseconds from a number of seconds just by checking the object type.
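The ambiguity is easy to demonstrate: two plain ints representing the same instant in different units are indistinguishable by type alone.

```python
# The same instant expressed in two units; nothing in the object type
# says which unit the caller intended.
t_seconds = 1_341_253_595
t_nanoseconds = t_seconds * 10**9

assert type(t_seconds) is type(t_nanoseconds) is int
```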

128-bit float

Add a new IEEE 754-2008 quad-precision binary float type. The IEEE 754-2008 quad-precision format has 1 sign bit, 15 exponent bits and 112 mantissa bits. The 128-bit float is supported by the GCC (4.3), Clang and ICC compilers.

Python must be portable and so cannot rely on a type available only on some platforms. For example, Visual C++ 2008 does not support the 128-bit float, whereas it is used to build the official Windows executables. Another example: GCC 4.3 does not support __float128 in 32-bit mode on x86 (but GCC 4.4 does).

There is also a license issue: GCC uses the MPFR library for 128-bit floats, a library distributed under the GNU LGPL. This license is not compatible with the Python license.

Note

The x87 floating point unit of Intel CPUs supports 80-bit floats. This format is not supported by the SSE instruction set, which is now preferred over x87, especially on x86_64. Other CPU vendors don't support 80-bit floats.

datetime.datetime

The datetime.datetime type is the natural choice for a timestamp because it is clear that this type contains a timestamp, whereas int, float and Decimal are raw numbers. It is an absolute timestamp and so is well defined. It gives direct access to the year, month, day, hours, minutes and seconds. It has time-related methods, for example to format the timestamp as a string (e.g. datetime.datetime.strftime).

The major issue is that, except for os.stat(), time.time() and time.clock_gettime(time.CLOCK_REALTIME), all time functions have an unspecified starting point and no timezone information, and so cannot be converted to datetime.datetime.

datetime.datetime also has issues with timezones. For example, a datetime object without a timezone (naive) and a datetime with a timezone (aware) cannot be compared. There is also an ordering issue with daylight saving time (DST) in the duplicated hour when switching from DST back to normal time.

datetime.datetime has been rejected because it cannot be used for functions using an unspecified starting point like os.times() or time.clock().

For time.time() and time.clock_gettime(time.CLOCK_REALTIME): it is already possible to get the current time as a datetime.datetime object using:

datetime.datetime.now(datetime.timezone.utc)

For os.stat(), it is simple to create a datetime.datetime object from a decimal.Decimal timestamp in the UTC timezone:

datetime.datetime.fromtimestamp(value, datetime.timezone.utc)

Note

datetime.datetime only supports microsecond resolution, but could be enhanced to support nanosecond resolution.
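The microsecond limit can be observed directly: converting a nanosecond-resolution timestamp (here via float, using a made-up value) rounds away the extra digits.

```python
import datetime
from decimal import Decimal

ts = Decimal("1341253595.123456789")  # hypothetical nanosecond-resolution timestamp

# datetime stores at most microseconds: the last three digits are rounded away.
dt = datetime.datetime.fromtimestamp(float(ts), datetime.timezone.utc)
assert dt.microsecond == 123457
```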

datetime.timedelta

datetime.timedelta is the natural choice for a relative timestamp because it is clear that this type contains a timestamp, whereas int, float and Decimal are raw numbers. It can be used with datetime.datetime to get an absolute timestamp when the starting point is known.

datetime.timedelta has been rejected because it cannot be coerced to float and has a fixed resolution. One new standard timestamp type is enough; Decimal is preferred over datetime.timedelta. Converting a datetime.timedelta to float requires an explicit call to the datetime.timedelta.total_seconds() method.

Note

datetime.timedelta only supports microsecond resolution, but could be enhanced to support nanosecond resolution.
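The need for an explicit conversion can be shown in two lines: float(timedelta) raises TypeError, so code must call total_seconds() instead.

```python
import datetime

delta = datetime.timedelta(seconds=1, microseconds=500_000)

# timedelta defines no __float__, so the coercion must be explicit.
try:
    float(delta)
except TypeError:
    coerced = delta.total_seconds()

assert coerced == 1.5
```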

Tuple of integers

To expose C functions in Python, a tuple of integers is the natural choice to store a timestamp because the C language uses structures with integer fields (e.g. the timeval and timespec structures). Using only integers avoids any loss of precision (Python supports integers of arbitrary length). Creating and parsing a tuple of integers is simple and fast.

Depending on the exact format of the tuple, the precision can be arbitrary or fixed. The precision can be chosen so that the loss of precision stays smaller than an arbitrary limit like one nanosecond.

Different formats have been proposed:

  • A: (numerator, denominator)
    • value = numerator / denominator
    • resolution = 1 / denominator
    • denominator > 0
  • B: (seconds, numerator, denominator)
    • value = seconds + numerator / denominator
    • resolution = 1 / denominator
    • 0 <= numerator < denominator
    • denominator > 0
  • C: (intpart, floatpart, base, exponent)
    • value = intpart + floatpart / base^exponent
    • resolution = 1 / base^exponent
    • 0 <= floatpart < base^exponent
    • base > 0
    • exponent >= 0
  • D: (intpart, floatpart, exponent)
    • value = intpart + floatpart / 10^exponent
    • resolution = 1 / 10^exponent
    • 0 <= floatpart < 10^exponent
    • exponent >= 0
  • E: (sec, nsec)
    • value = sec + nsec × 10^-9
    • resolution = 10^-9 (one nanosecond)
    • 0 <= nsec < 10^9

All formats support an arbitrary resolution, except format (E).

The format (D) may not be able to store the exact value (loss of precision) if the clock frequency is arbitrary and cannot be expressed as a power of 10. The format (C) has a similar issue, but in that case it is possible to use base=frequency and exponent=1.

The formats (C), (D) and (E) allow optimization for conversion to float if the base is 2 and to decimal.Decimal if the base is 10.

The format (A) is a simple fraction. It supports arbitrary precision, is simple (only two fields), only requires a simple division to get the floating point value, and is already used by float.as_integer_ratio().

To simplify the implementation (especially the C implementation, to avoid integer overflow), a numerator bigger than the denominator can be accepted. The tuple may be normalized later.
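Format (A) is exactly what float.as_integer_ratio() already produces, so recovering the floating point value takes a single division:

```python
# float.as_integer_ratio() yields a format (A) pair: (numerator, denominator).
numerator, denominator = (1.5).as_integer_ratio()
assert (numerator, denominator) == (3, 2)

# One division recovers the floating point value.
assert numerator / denominator == 1.5
```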

Tuples of integers have been rejected because they don't support arithmetic operations.

Note

On Windows, the QueryPerformanceCounter() clock uses the frequency of the processor, which is an arbitrary number and so may not be a power of 2 or 10. The frequency can be read using QueryPerformanceFrequency().

timespec structure

timespec is the C structure used to store a timestamp with a nanosecond resolution. Python can use a type with the same structure: (seconds, nanoseconds). For convenience, arithmetic operations on timespec are supported.

Example of an incomplete timespec type supporting addition, subtraction and coercion to float:

class timespec(tuple):
    def __new__(cls, sec, nsec):
        if not isinstance(sec, int):
            raise TypeError
        if not isinstance(nsec, int):
            raise TypeError
        # Normalize so that 0 <= nsec < 10**9
        asec, nsec = divmod(nsec, 10 ** 9)
        sec += asec
        obj = tuple.__new__(cls, (sec, nsec))
        obj.sec = sec
        obj.nsec = nsec
        return obj

    def __float__(self):
        # Lossy coercion to float, for backward compatibility
        return self.sec + self.nsec * 1e-9

    def total_nanoseconds(self):
        return self.sec * 10 ** 9 + self.nsec

    def __add__(self, other):
        if not isinstance(other, timespec):
            raise TypeError
        ns_sum = self.total_nanoseconds() + other.total_nanoseconds()
        return timespec(*divmod(ns_sum, 10 ** 9))

    def __sub__(self, other):
        if not isinstance(other, timespec):
            raise TypeError
        ns_diff = self.total_nanoseconds() - other.total_nanoseconds()
        return timespec(*divmod(ns_diff, 10 ** 9))

    def __str__(self):
        # Negative timestamps are stored normalized (sec < 0, 0 <= nsec < 10**9);
        # undo the normalization to display a single leading minus sign.
        if self.sec < 0 and self.nsec:
            sec = abs(1 + self.sec)
            nsec = 10 ** 9 - self.nsec
            return '-%i.%09u' % (sec, nsec)
        else:
            return '%i.%09u' % (self.sec, self.nsec)

    def __repr__(self):
        return '<timespec(%s, %s)>' % (self.sec, self.nsec)

The timespec type is similar to the format (E) of tuples of integer, except that it supports arithmetic and coercion to float.

The timespec type was rejected because it only supports nanosecond resolution and requires implementing each arithmetic operation, whereas the Decimal type is already implemented and well tested.
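For comparison, Decimal needs no new operator implementations; arithmetic and comparison already work at nanosecond (indeed arbitrary) resolution, shown here with made-up values:

```python
from decimal import Decimal

t1 = Decimal("1341253595.123456789")  # hypothetical timestamp
dt = Decimal("0.000000001")           # one nanosecond

# Arithmetic and comparison work out of the box, exactly.
assert t1 + dt == Decimal("1341253595.123456790")
assert t1 - t1 == 0
assert t1 > t1 - dt
```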

Alternatives: API design

Add a string argument to specify the return type

Add a string argument to functions returning timestamps, for example time.time(format="datetime"). A string is more extensible than a type: it is possible to request a format that has no matching type, like a tuple of integers.

This API was rejected because it would require implicitly importing modules to instantiate objects (e.g. importing datetime to create a datetime.datetime). Importing a module may raise an exception and may be slow; such behaviour is unexpected and surprising.

Add a global flag to change the timestamp type

A global flag like os.stat_decimal_times(), similar to os.stat_float_times(), can be added to set the timestamp type globally.

A global flag may cause issues with libraries and applications expecting float instead of Decimal. Decimal is not fully compatible with float: float + Decimal raises a TypeError, for example. The os.stat_float_times() case is different because an int can be coerced to float, and int + float gives float.

Add a protocol to create a timestamp

Instead of hard coding how timestamps are created, a new protocol can be added to create a timestamp from a fraction.

For example, time.time(timestamp=type) would call the class method type.__fromfraction__(numerator, denominator) to create a timestamp object of the specified type. If the type doesn't support the protocol, a fallback is used: type(numerator) / type(denominator).

A variant is to use a "converter" callback to create a timestamp. Example creating a float timestamp:

def timestamp_to_float(numerator, denominator):
    return float(numerator) / float(denominator)

Common converters can be provided by time, datetime and other modules, or maybe a specific "hires" module. Users can define their own converters.
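Under the sketched callback signature (each converter receives the timestamp as a (numerator, denominator) fraction), other converters follow the same pattern; the function names here are hypothetical:

```python
from decimal import Decimal
from fractions import Fraction

# Hypothetical converters matching the proposed callback signature.
def timestamp_to_decimal(numerator, denominator):
    return Decimal(numerator) / Decimal(denominator)

def timestamp_to_fraction(numerator, denominator):
    return Fraction(numerator, denominator)

# A made-up value: nanoseconds expressed as a fraction of seconds.
num, den = 1_341_253_595_123_456_789, 10**9
assert timestamp_to_decimal(num, den) == Decimal("1341253595.123456789")
assert timestamp_to_fraction(num, den) == Fraction(num, den)
```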

Such a protocol has a limitation: the timestamp structure has to be decided once and cannot be changed later. For example, adding a timezone or the absolute start of the timestamp would break the API.

The protocol proposition was rejected as being excessive given the requirements, but the specific syntax proposed (time.time(timestamp=type)) allows it to be introduced later if compelling use cases are discovered.

Note

Other formats may be used instead of a fraction: see the tuple of integers section for example.

Add new fields to os.stat

To get the creation, modification and access time of a file with a nanosecond resolution, three fields can be added to the os.stat() structure.

The new fields can be timestamps with nanosecond resolution (e.g. Decimal) or the nanosecond part of each timestamp (int).

If the new fields are timestamps with nanosecond resolution, populating the extra fields would be time-consuming. Any call to os.stat() would be slower, even when os.stat() is only called to check whether a file exists. A parameter could be added to os.stat() to make these fields optional, but then the structure would have a variable number of fields.

If the new fields only contain the fractional part (nanoseconds), os.stat() would stay efficient. These fields would always be present, set to zero if the operating system does not support sub-second resolution. Splitting a timestamp into two parts, seconds and nanoseconds, is similar to the timespec type and tuples of integers, and so has the same drawbacks.

Adding new fields to the os.stat() structure does not solve the nanosecond issue in other modules (e.g. the time module).
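As a side note (a later development, not part of this proposal): modern CPython did end up exposing integer nanosecond timestamps on os.stat() results, alongside the float fields.

```python
import os
import tempfile

# st_mtime_ns (and st_atime_ns, st_ctime_ns) are plain ints holding
# the full timestamp in nanoseconds; st_mtime is the float equivalent.
with tempfile.NamedTemporaryFile() as f:
    st = os.stat(f.name)

assert isinstance(st.st_mtime_ns, int)
assert abs(st.st_mtime - st.st_mtime_ns / 10**9) < 1e-6
```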

Add a boolean argument

Because we only need one new type (Decimal), a simple boolean flag could be added, for example time.time(decimal=True) or time.time(hires=True).

Such a flag would require a hidden import, which is considered bad practice.

The boolean argument API was rejected because it is not "pythonic". Changing the return type with a parameter value is preferred over a boolean parameter (a flag).

Add new functions

Add new functions for each type, examples:

  • time.clock_decimal()
  • time.time_decimal()
  • os.stat_decimal()
  • os.stat_timespec()
  • etc.

Adding a new function for each function that creates timestamps duplicates a lot of code and would be a pain to maintain.

Add a new hires module

Add a new module called "hires" with the same API as the time module, except that it would return timestamps with high resolution, e.g. decimal.Decimal. Adding a new module avoids linking low-level modules like time or os to the decimal module.

This idea was rejected because it would require duplicating most of the code of the time module, would be a pain to maintain, and timestamps are used in modules other than the time module, for example signal.sigtimedwait(), select.select(), resource.getrusage(), os.stat(), etc. Duplicating the code in each module is not acceptable.

pep-0411 Provisional packages in the Python standard library

PEP:411
Title:Provisional packages in the Python standard library
Version:$Revision$
Last-Modified:$Date$
Author:Nick Coghlan <ncoghlan at gmail.com>, Eli Bendersky <eliben at gmail.com>
Status:Accepted
Type:Informational
Content-Type:text/x-rst
Created:2012-02-10
Python-Version:3.3
Post-History:2012-02-10, 2012-03-24

Abstract

The process of including a new package into the Python standard library is hindered by the API lock-in and promise of backward compatibility implied by a package being formally part of Python. This PEP describes a methodology for marking a standard library package "provisional" for the period of a single feature release. A provisional package may have its API modified prior to "graduating" into a "stable" state. On one hand, this state provides the package with the benefits of being formally part of the Python distribution. On the other hand, the core development team explicitly states that no promises are made with regard to the stability of the package's API, which may change for the next release. While it is considered an unlikely outcome, such packages may even be removed from the standard library without a deprecation period if the concerns regarding their API or maintenance prove well-founded.

Proposal - a documented provisional state

Whenever the Python core development team decides that a new package should be included into the standard library, but isn't entirely sure about whether the package's API is optimal, the package can be included and marked as "provisional".

In the next feature release, the package may either be "graduated" into a normal "stable" state in the standard library, remain in provisional state, or be rejected and removed entirely from the Python source tree. If the package ends up graduating into the stable state after being provisional, its API may be changed according to accumulated feedback. The core development team explicitly makes no guarantees about API stability and backward compatibility of provisional packages.

Marking a package provisional

A package will be marked provisional by a notice in its documentation page and its docstring. The following paragraph will be added as a note at the top of the documentation page:

The <X> package has been included in the standard library on a provisional basis. Backwards incompatible changes (up to and including removal of the package) may occur if deemed necessary by the core developers.

The phrase "provisional basis" will then be a link to the glossary term "provisional package", defined as:

A provisional package is one which has been deliberately excluded from the standard library's backwards compatibility guarantees. While major changes to such packages are not expected, as long as they are marked provisional, backwards incompatible changes (up to and including removal of the package) may occur if deemed necessary by core developers. Such changes will not be made gratuitously -- they will occur only if serious flaws are uncovered that were missed prior to the inclusion of the package.

This process allows the standard library to continue to evolve over time, without locking in problematic design errors for extended periods of time. See PEP 411 for more details.

The following will be added to the start of the package's docstring:

The API of this package is currently provisional. Refer to the documentation for details.

Moving a package from the provisional to the stable state simply implies removing these notes from its documentation page and docstring.

Which packages should go through the provisional state

We expect most packages proposed for addition into the Python standard library to go through a feature release in the provisional state. There may, however, be some exceptions, such as packages that use a pre-defined API (for example lzma, which generally follows the API of the existing bz2 package), or packages with an API that has wide acceptance in the Python development community.

In any case, packages that are proposed to be added to the standard library, whether via the provisional state or directly, must fulfill the acceptance conditions set by PEP 2.

Criteria for "graduation"

In principle, most provisional packages should eventually graduate to the stable standard library. Some reasons for not graduating are:

  • The package may prove to be unstable or fragile, without sufficient developer support to maintain it.
  • A much better alternative package may be found during the preview release.

Essentially, the decision will be made by the core developers on a per-case basis. The point to emphasize here is that a package's inclusion in the standard library as "provisional" in some release does not guarantee it will continue being part of Python in the next release. At the same time, the bar for making changes in a provisional package is quite high. We expect that most of the API of most provisional packages will be unchanged at graduation. Withdrawals are expected to be rare.

Rationale

Benefits for the core development team

Currently, the core developers are really reluctant to add new interfaces to the standard library. This is because as soon as they're published in a release, API design mistakes get locked in due to backward compatibility concerns.

By gating all major API additions through some kind of a provisional mechanism for a full release, we get one full release cycle of community feedback before we lock in the APIs with our standard backward compatibility guarantee.

We can also start integrating provisional packages with the rest of the standard library early, so long as we make it clear to packagers that the provisional packages should not be considered optional. The only difference between provisional APIs and the rest of the standard library is that provisional APIs are explicitly exempted from the usual backward compatibility guarantees.

Benefits for end users

For future end users, the broadest benefit lies in a better "out-of-the-box" experience - rather than being told "oh, the standard library tools for task X are horrible, download this 3rd party library instead", those superior tools are more likely to be just an import away.

For environments where developers are required to conduct due diligence on their upstream dependencies (severely harming the cost-effectiveness of, or even ruling out entirely, much of the material on PyPI), the key benefit lies in ensuring that all packages in the provisional state are clearly under python-dev's aegis from at least the following perspectives:

  • Licensing: Redistributed by the PSF under a Contributor Licensing Agreement.
  • Documentation: The documentation of the package is published and organized via the standard Python documentation tools (i.e. ReST source, output generated with Sphinx and published on http://docs.python.org).
  • Testing: The package test suites are run on the python.org buildbot fleet and results published via http://www.python.org/dev/buildbot.
  • Issue management: Bugs and feature requests are handled on http://bugs.python.org
  • Source control: The master repository for the software is published on http://hg.python.org.

Candidates for provisional inclusion into the standard library

For Python 3.3, there are a number of clear current candidates:

Other possible future use cases include:

  • Improved HTTP modules (e.g. requests)
  • HTML 5 parsing support (e.g. html5lib)
  • Improved URL/URI/IRI parsing
  • A standard image API (PEP 368)
  • Improved encapsulation of import state (PEP 406)
  • Standard event loop API (PEP 3153)
  • A binary version of WSGI for Python 3 (e.g. PEP 444)
  • Generic function support (e.g. simplegeneric)

pep-0412 Key-Sharing Dictionary

PEP:412
Title:Key-Sharing Dictionary
Version:$Revision$
Last-Modified:$Date$
Author:Mark Shannon <mark at hotpy.org>
Status:Final
Type:Standards Track
Content-Type:text/x-rst
Created:08-Feb-2012
Python-Version:3.3 or 3.4
Post-History:08-Feb-2012

Abstract

This PEP proposes a change in the implementation of the builtin dictionary type dict. The new implementation allows dictionaries which are used as attribute dictionaries (the __dict__ attribute of an object) to share keys with other attribute dictionaries of instances of the same class.

Motivation

The current dictionary implementation uses more memory than is necessary when used as a container for object attributes as the keys are replicated for each instance rather than being shared across many instances of the same class. Despite this, the current dictionary implementation is finely tuned and performs very well as a general-purpose mapping object.

By separating the keys (and hashes) from the values it is possible to share the keys between multiple dictionaries and improve memory use. By ensuring that keys are separated from the values only when beneficial, it is possible to retain the high-performance of the current dictionary implementation when used as a general-purpose mapping object.

Behaviour

The new dictionary behaves in the same way as the old implementation. It fully conforms to the Python API, the C API and the ABI.

Performance

Memory Usage

Reduction in memory use is directly related to the number of dictionaries with shared keys in existence at any time. These dictionaries are typically half the size of the current dictionary implementation.

Benchmarking shows that memory use is reduced by 10% to 20% for object-oriented programs with no significant change in memory use for other programs.

Speed

The performance of the new implementation is dominated by memory locality effects. When keys are not shared (for example in module dictionaries and dictionaries explicitly created by dict() or {}), performance is unchanged (within a percent or two) from the current implementation.

For the shared keys case, the new implementation tends to separate keys from values, but reduces total memory usage. This will improve performance in many cases as the effects of reduced memory usage outweigh the loss of locality, but some programs may show a small slow down.

Benchmarking shows no significant change of speed for most benchmarks. Object-oriented benchmarks show small speed ups when they create large numbers of objects of the same class (the gcbench benchmark shows a 10% speed up; this is likely to be an upper limit).

Implementation

Both the old and new dictionaries consist of a fixed-sized dict struct and a re-sizeable table. In the new dictionary the table can be further split into a keys table and a values array. The keys table holds the keys and hashes and (for non-split tables) the values as well. It differs from the original implementation only in that it contains a number of fields that were previously in the dict struct. If a table is split, the values in the keys table are ignored; instead, the values are held in a separate array.

Split-Table dictionaries

When dictionaries are created to fill the __dict__ slot of an object, they are created in split form. The keys table is cached in the type, potentially allowing all attribute dictionaries of instances of one class to share keys. In the event of the keys of these dictionaries starting to diverge, individual dictionaries will lazily convert to the combined-table form. This ensures good memory use in the common case, and correctness in all cases.
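The effect is observable from pure Python (illustrative sketch; the class name is made up): every instance __dict__ of a class with uniform attributes has the same, shared-key layout, so their reported sizes are identical.

```python
import sys

class Point:
    def __init__(self, x, y):
        self.x = x
        self.y = y

# All instances set the same attributes in __init__, so their attribute
# dictionaries can share one keys table and end up the same size.
points = [Point(i, i) for i in range(3)]
dict_sizes = {sys.getsizeof(p.__dict__) for p in points}
assert len(dict_sizes) == 1
```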

When resizing a split dictionary, it is converted to a combined table. If resizing is a result of storing an instance attribute, and there is only one instance of the class, then the dictionary will be re-split immediately. Since most OO code sets attributes in the __init__ method, all attributes will be set before a second instance is created, and no more resizing will be necessary, as all further instance dictionaries will have the correct size. For more complex use patterns, it is impossible to know which approach is best, so the implementation allows extra insertions up to the point of a resize, when it reverts to the combined table (non-shared keys).

A deletion from a split dictionary does not change the keys table, it simply removes the value from the values array.

Combined-Table dictionaries

Explicit dictionaries (dict() or {}), module dictionaries and most other dictionaries are created as combined-table dictionaries. A combined-table dictionary never becomes a split-table dictionary. Combined tables are laid out in much the same way as the tables in the old dictionary, resulting in very similar performance.

Implementation

The new dictionary implementation is available at [1].

Pros and Cons

Pros

Significant memory savings for object-oriented applications. Small improvement to speed for programs which create lots of similar objects.

Cons

Change to data structures: Third party modules which meddle with the internals of the dictionary implementation will break.

Changes to repr() output and iteration order: For most cases, this will be unchanged. However for some split-table dictionaries the iteration order will change.

Neither of these cons should be a problem. Modules which meddle with the internals of the dictionary implementation are already broken and should be fixed to use the API. The iteration order of dictionaries was never defined and has always been arbitrary; it is different for Jython and PyPy.

Alternative Implementation

An alternative implementation for split tables, which could save even more memory, is to store an index in the value field of the keys table (instead of ignoring the value field). This index would state explicitly where in the values array to look. The values array would then only require one field for each usable slot in the keys table, rather than one for each slot.

This "indexed" version would reduce the size of the values array by about one third. The keys table would need an extra "values_size" field, increasing the size of combined dicts by one word. The extra indirection adds more complexity to the code, potentially reducing performance a little.

The "indexed" version will not be included in this implementation, but should be considered deferred rather than rejected, pending further experimentation.

pep-0413 Faster evolution of the Python Standard Library

PEP:413
Title:Faster evolution of the Python Standard Library
Version:$Revision$
Last-Modified:$Date$
Author:Nick Coghlan <ncoghlan at gmail.com>
Status:Withdrawn
Type:Process
Content-Type:text/x-rst
Created:2012-02-24
Post-History:2012-02-24, 2012-02-25
Resolution:TBD

PEP Withdrawal

The acceptance of PEP 453 means that pip will be available to most new Python users by default; this will hopefully reduce the pressure to add new modules to the standard library before they are sufficiently mature.

The last couple of years have also seen increased usage of the model where a standard library package also has an equivalent available from the Python Package Index that also supports older versions of Python.

Given these two developments and the level of engagement throughout the Python 3.4 release cycle, the PEP author no longer feels it would be appropriate to make such a fundamental change to the standard library development process.

Abstract

This PEP proposes the adoption of a separate versioning scheme for the standard library (distinct from, but coupled to, the existing language versioning scheme) that allows accelerated releases of the Python standard library, while maintaining (or even slowing down) the current rate of change in the core language definition.

Like PEP 407, it aims to adjust the current balance between measured change that allows the broader community time to adapt and being able to keep pace with external influences that evolve more rapidly than the current release cycle can handle (this problem is particularly notable for standard library elements that relate to web technologies).

However, it's more conservative in its aims than PEP 407, seeking to restrict the increased pace of development to builtin and standard library interfaces, without affecting the rate of change for other elements such as the language syntax and version numbering as well as the CPython binary API and bytecode format.

Rationale

To quote the PEP 407 abstract:

Finding a release cycle for an open-source project is a delicate exercise in managing mutually contradicting constraints: developer manpower, availability of release management volunteers, ease of maintenance for users and third-party packagers, quick availability of new features (and behavioural changes), availability of bug fixes without pulling in new features or behavioural changes.

The current release cycle errs on the conservative side. It is adequate for people who value stability over reactivity. This PEP is an attempt to keep the stability that has become a Python trademark, while offering a more fluid release of features, by introducing the notion of long-term support versions.

I agree with the PEP 407 authors that the current release cycle of the standard library is too slow to effectively cope with the pace of change in some key programming areas (specifically, web protocols and related technologies, including databases, templating and serialisation formats).

However, I have written this competing PEP because I believe that the approach proposed in PEP 407 of offering full, potentially binary incompatible releases of CPython every 6 months places too great a burden on the wider Python ecosystem.

Under the current CPython release cycle, distributors of key binary extensions will often support Python releases even after the CPython branches enter "security fix only" mode (for example, Twisted currently ships binaries for 2.5, 2.6 and 2.7, NumPy and SciPy support those three along with 3.1 and 3.2, PyGame adds a 2.4 binary release, wxPython provides both 32-bit and 64-bit binaries for 2.6 and 2.7, etc).

If CPython were to triple (or more) its rate of releases, the developers of those libraries (many of which are even more resource starved than CPython) would face an unpalatable choice: either adopt the faster release cycle themselves (up to 18 simultaneous binary releases for PyGame!), drop older Python versions more quickly, or else tell their users to stick to the CPython LTS releases (thus defeating the entire point of speeding up the CPython release cycle in the first place).

Similarly, many support tools for Python (e.g. syntax highlighters) can take quite some time to catch up with language level changes.

At a cultural level, the Python community is also accustomed to a certain meaning for Python version numbers - they're linked to deprecation periods, support periods, all sorts of things. PEP 407 proposes that collective knowledge all be swept aside, without offering a compelling rationale for why such a course of action is actually necessary (aside from, perhaps, making the lives of the CPython core developers a little easier at the expense of everyone else).

However, if we go back to the primary rationale for increasing the pace of change (i.e. more timely support for web protocols and related technologies), we can note that those only require standard library changes. That means many (perhaps even most) of the negative effects on the wider community can be avoided by explicitly limiting which parts of CPython are affected by the new release cycle, and allowing other parts to evolve at their current, more sedate, pace.

Proposal

This PEP proposes the introduction of a new kind of CPython release: "standard library releases". As with PEP 407, this will give CPython three kinds of release:

  • Language release: "x.y.0"
  • Maintenance release: "x.y.z" (where z > 0)
  • Standard library release: "x.y (xy.z)" (where z > 0)

Under this scheme, an unqualified version reference (such as "3.3") would always refer to the most recent corresponding language or maintenance release. It will never be used without qualification to refer to a standard library release (at least, not by python-dev - obviously, we can only set an example, not force the rest of the Python ecosystem to go along with it).

Language releases will continue as they are now, as new versions of the Python language definition, along with a new version of the CPython interpreter and the Python standard library. Accordingly, a language release may contain any and all of the following changes:

  • new language syntax
  • new standard library changes (see below)
  • new deprecation warnings
  • removal of previously deprecated features
  • changes to the emitted bytecode
  • changes to the AST
  • any other significant changes to the compilation toolchain
  • changes to the core interpreter eval loop
  • binary incompatible changes to the C ABI (although the PEP 384 stable ABI must still be preserved)
  • bug fixes

Maintenance releases will also continue as they do today, being strictly limited to bug fixes for the corresponding language release. No new features or radical internal changes are permitted.

The new standard library releases will occur in parallel with each maintenance release and will be qualified with a new version identifier documenting the standard library version. Standard library releases may include the following changes:

  • new features in pure Python modules
  • new features in C extension modules (subject to PEP 399 compatibility requirements)
  • new features in language builtins (provided the C ABI remains unaffected)
  • bug fixes from the corresponding maintenance release

Standard library version identifiers are constructed by combining the major and minor version numbers for the Python language release into a single two-digit number and then appending a sequential standard library version identifier.
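As a concrete illustration, the mapping just described can be sketched in a few lines (the helper name is hypothetical; the scheme is the one defined above):

```python
# Hypothetical helper illustrating the proposed identifier scheme:
# the language release's major and minor numbers are fused into a
# two-digit prefix, and a sequential stdlib serial is appended.
def stdlib_version(major, minor, serial):
    """Return the proposed standard library version string."""
    return "{}{}.{}".format(major, minor, serial)

# The first standard library release after Python 3.3 would be "33.1".
print(stdlib_version(3, 3, 1))  # -> 33.1
```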

Release Cycle

When maintenance releases are created, two new versions of Python would actually be published on python.org (using the first 3.3 maintenance release, planned for February 2013, as an example):

3.3.1       # Maintenance release
3.3 (33.1)  # Standard library release

A further 6 months later, the next 3.3 maintenance release would again be accompanied by a new standard library release:

3.3.2       # Maintenance release
3.3 (33.2)  # Standard library release

Again, the standard library release would be binary compatible with the previous language release, merely offering additional features at the Python level.

Finally, 18 months after the release of 3.3, a new language release would be made around the same time as the final 3.3 maintenance and standard library releases:

3.3.3       # Maintenance release
3.3 (33.3)  # Standard library release
3.4.0       # Language release

The 3.4 release cycle would then follow a similar pattern to that for 3.3:

3.4.1       # Maintenance release
3.4 (34.1)  # Standard library release

3.4.2       # Maintenance release
3.4 (34.2)  # Standard library release

3.4.3       # Maintenance release
3.4 (34.3)  # Standard library release
3.5.0       # Language release

Programmatic Version Identification

To expose the new version details programmatically, this PEP proposes the addition of a new sys.stdlib_info attribute that records the new standard library version above and beyond the underlying interpreter version. Using the initial Python 3.3 release as an example:

sys.stdlib_info(python=33, version=0, releaselevel='final', serial=0)

This information would also be included in the sys.version string:

Python 3.3.0 (33.0, default, Feb 17 2012, 23:03:41)
[GCC 4.6.1]
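Since sys.stdlib_info is proposed rather than existing, the structure above can be modelled for experimentation with a named tuple, mirroring the existing sys.version_info (a sketch, not the proposed implementation):

```python
from collections import namedtuple

# Mock of the proposed sys.stdlib_info attribute, modelled on the
# existing sys.version_info named tuple.  Field values are those the
# PEP gives for the initial Python 3.3 release.
StdlibInfo = namedtuple("StdlibInfo", "python version releaselevel serial")
stdlib_info = StdlibInfo(python=33, version=0, releaselevel="final", serial=0)

print(stdlib_info)
# -> StdlibInfo(python=33, version=0, releaselevel='final', serial=0)
```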

Security Fixes and Other "Out of Cycle" Releases

For maintenance releases, the process of handling out-of-cycle releases (for example, to fix a security issue or resolve a critical bug in a new release) remains the same as it is now: the micro version number is incremented and a new release is made incorporating the required bug fixes, as well as any other bug fixes that have been committed since the previous release.

For standard library releases, the process is essentially the same, but the corresponding "What's New?" document may require some tidying up for the release (as the standard library release may incorporate new features, not just bug fixes).

User Scenarios

The versioning scheme proposed above is based on a number of user scenarios that are likely to be encountered if this scheme is adopted. In each case, the scenario is described for the status quo (i.e. the current slow release cycle), for the versioning scheme in this PEP, and for the freewheeling minor version number scheme proposed in PEP 407.

To give away the ending, the point of using a separate version number is that for almost all scenarios, the important number is the language version, not the standard library version. Most users won't even need to care that the standard library version number exists. In the two identified cases where it matters, providing it as a separate number is actually clearer and more explicit than embedding the two different kinds of number into a single sequence and then tagging some of the numbers in the unified sequence as special.

Novice user, downloading Python from python.org in March 2013

Status quo: must choose between 3.3 and 2.7

This PEP: must choose between 3.3 (33.1), 3.3 and 2.7.

PEP 407: must choose between 3.4, 3.3 (LTS) and 2.7.

Verdict: explaining the meaning of a Long Term Support release is about as complicated as explaining the meaning of the proposed standard library release version numbers. I call this a tie.

Novice user, attempting to judge currency of third party documentation

Status quo: minor version differences indicate 18-24 months of language evolution

This PEP: same as status quo for language core, standard library version numbers indicate 6 months of standard library evolution.

PEP 407: minor version differences indicate 18-24 months of language evolution up to 3.3, then 6 months of language evolution thereafter.

Verdict: Since language changes and deprecations can have a much bigger effect on the accuracy of third party documentation than the addition of new features to the standard library, I'm calling this a win for the scheme in this PEP.

Novice user, looking for an extension module binary release

Status quo: look for the binary corresponding to the Python version you are running.

This PEP: same as status quo.

PEP 407 (full releases): same as status quo, but corresponding binary version is more likely to be missing (or, if it does exist, has to be found amongst a much larger list of alternatives).

PEP 407 (ABI updates limited to LTS releases): all binary release pages will need to tell users that Python 3.3, 3.4 and 3.5 all need the 3.3 binary.

Verdict: I call this a clear win for the scheme in this PEP. Absolutely nothing changes from the current situation, since the standard library version is actually irrelevant in this case (only binary extension compatibility is important).

Extension module author, deciding whether or not to make a binary release

Status quo: unless using the PEP 384 stable ABI, a new binary release is needed every time the minor version number changes.

This PEP: same as status quo.

PEP 407 (full releases): same as status quo, but becomes a far more frequent occurrence.

PEP 407 (ABI updates limited to LTS releases): before deciding, must first look up whether the new release is an LTS release or an interim release. If it is an LTS release, then a new build is necessary.

Verdict: I call this another clear win for the scheme in this PEP. As with the end user facing side of this problem, the standard library version is actually irrelevant in this case. Moving that information out to a separate number avoids creating unnecessary confusion.

Python developer, deciding priority of eliminating a Deprecation Warning

Status quo: code that triggers deprecation warnings is not guaranteed to run on a version of Python with a higher minor version number.

This PEP: same as status quo

PEP 407: unclear, as the PEP doesn't currently spell this out. Assuming the deprecation cycle is linked to LTS releases, then upgrading to a non-LTS release is safe but upgrading to the next LTS release may require avoiding the deprecated construct.

Verdict: another clear win for the scheme in this PEP since, once again, the standard library version is irrelevant in this scenario.

Alternative interpreter implementor, updating with new features

Status quo: new Python versions arrive infrequently, but are a mish-mash of standard library updates and core language definition and interpreter changes.

This PEP: standard library updates, which are easier to integrate, are made available more frequently in a form that is clearly and explicitly compatible with the previous version of the language definition. This means that, once an alternative implementation catches up to Python 3.3, it should have a much easier time incorporating standard library features as they happen (especially pure Python changes), leaving minor version number updates as the only task that requires updates to its core compilation and execution components.

PEP 407 (full releases): same as status quo, but becomes a far more frequent occurrence.

PEP 407 (language updates limited to LTS releases): unclear, as the PEP doesn't currently spell out a specific development strategy. Assuming a 3.3 compatibility branch is adopted (as proposed in this PEP), then the outcome would be much the same, but the version number signalling would be slightly less clear (since you would have to check to see if a particular release was an LTS release or not).

Verdict: while not as clear cut as some previous scenarios, I'm still calling this one in favour of the scheme in this PEP. Explicit is better than implicit, and the scheme in this PEP makes a clear split between the two different kinds of update rather than adding a separate "LTS" tag to an otherwise ordinary release number. Tagging a particular version as being special is great for communicating with version control systems and associated automated tools, but it's a lousy way to communicate information to other humans.

Python developer, deciding their minimum version dependency

Status quo: look for "version added" or "version changed" markers in the documentation, check against sys.version_info

This PEP: look for "version added" or "version changed" markers in the documentation. If written as a bare Python version, such as "3.3", check against sys.version_info. If qualified with a standard library version, such as "3.3 (33.1)", check against sys.stdlib_info.

PEP 407: same as status quo

Verdict: the scheme in this PEP actually allows third party libraries to be more explicit about their rate of adoption of standard library features. More conservative projects will likely pin their dependency to the language version and avoid features added in the standard library releases. Faster moving projects could instead declare their dependency on a particular standard library version. However, since PEP 407 does have the advantage of preserving the status quo, I'm calling this one for PEP 407 (albeit with a slim margin).
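The dependency check described in this scenario could be written defensively so that it also behaves sensibly on interpreters that predate the proposal. The following is a sketch: sys.stdlib_info is the proposed, not-yet-existing attribute, and the helper name is invented for illustration.

```python
import sys

# Sketch of a minimum-version check under the proposed scheme.  The
# sys.stdlib_info attribute is the one proposed by this PEP; it does
# not exist in any released interpreter, so the code probes for it.
def has_stdlib_feature(min_python, min_stdlib=None):
    """Report whether the running interpreter satisfies a dependency.

    min_python:  (major, minor) language version, e.g. (3, 3)
    min_stdlib:  optional (python, version) pair matching the proposed
                 sys.stdlib_info, e.g. (33, 1) for "3.3 (33.1)"
    """
    if sys.version_info[:2] < min_python:
        return False
    if min_stdlib is None:
        return True
    info = getattr(sys, "stdlib_info", None)
    if info is None:
        # Interpreter predates the proposal, so no standard library
        # release ever shipped; the feature is unavailable here.
        return False
    return (info.python, info.version) >= min_stdlib
```

A conservative project would only ever call `has_stdlib_feature((3, 3))`; a faster-moving project might additionally pass `(33, 1)` to opt in to a standard library release.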

Python developers, attempting to reproduce a tracker issue

Status quo: if not already provided, ask the reporter which version of Python they're using. This is often done by asking for the first two lines displayed by the interactive prompt or the value of sys.version.

This PEP: same as the status quo (as sys.version will be updated to also include the standard library version), but may be needed on additional occasions (where the user knew enough to state their Python version, but that proved to be insufficient to reproduce the fault).

PEP 407: same as the status quo

Verdict: another marginal win for PEP 407. The new standard library version is an extra piece of information that users may need to pass back to developers when reporting issues with Python libraries (or Python itself, on our own tracker). However, by including it in sys.version, many fault reports will already include it, and it is easy to request if needed.

CPython release managers, handling a security fix

Status quo: create a new maintenance release incorporating the security fix and any other bug fixes under source control. Also create source releases for any branches open solely for security fixes.

This PEP: same as the status quo for maintenance branches. Also create a new standard library release (potentially incorporating new features along with the security fix). For security branches, create source releases for both the former maintenance branch and the standard library update branch.

PEP 407: same as the status quo for maintenance and security branches, but handling security fixes for non-LTS releases is currently an open question.

Verdict: until PEP 407 is updated to actually address this scenario, a clear win for this PEP.

Effects

Effect on development cycle

Similar to PEP 407, this PEP will break up the delivery of new features into more discrete chunks. Instead of a whole raft of changes landing all at once in a language release, each language release will be limited to 6 months' worth of standard library changes, as well as any changes associated with new syntax.

Effect on workflow

This PEP proposes the creation of a single additional branch for use in the normal workflow. After the release of 3.3, the following branches would be in use:

2.7         # Maintenance branch, no change
3.3         # Maintenance branch, as for 3.2
3.3-compat  # New branch, backwards compatible changes
default     # Language changes, standard library updates that depend on them

When working on a new feature, developers will need to decide whether or not it is an acceptable change for a standard library release. If so, then it should be checked in on 3.3-compat and then merged to default. Otherwise it should be checked in directly to default.

The "version added" and "version changed" markers for any changes made on the 3.3-compat branch would need to be flagged with both the language version and the standard library version. For example: "3.3 (33.1)".

Any changes made directly on the default branch would just be flagged with "3.4" as usual.

The 3.3-compat branch would be closed to normal development at the same time as the 3.3 maintenance branch. The 3.3-compat branch would remain open for security fixes for the same period of time as the 3.3 maintenance branch.

Effect on bugfix cycle

The effect on the bug fix workflow is essentially the same as that on the workflow for new features - there is one additional branch to pass through before the change reaches the default branch.

If critical bugs are found in a maintenance release, then new maintenance and standard library releases will be created to resolve the problem. The final part of the version number will be incremented for both the language version and the standard library version.

If critical bugs are found in a standard library release that do not affect the associated maintenance release, then only a new standard library release will be created and only the standard library's version number will be incremented.

Note that in these circumstances, the standard library release may include additional features, rather than just containing the bug fix. It is assumed that anyone who cares about receiving only bug fixes, without any new features mixed in, will already be relying strictly on the maintenance releases rather than using the new standard library releases.

Effect on the community

PEP 407 has this to say about the effects on the community:

People who value stability can just synchronize on the LTS releases which, with the proposed figures, would give a similar support cycle (both in duration and in stability).

I believe this statement is just plain wrong. Life isn't that simple. Instead, developers of third party modules and frameworks will come under pressure to support the full pace of the new release cycle with binary updates, teachers and book authors will receive complaints that they're only covering an "old" version of Python ("You're only using 3.3, the latest is 3.5!"), etc.

As the minor version number starts climbing three times faster than it has in the past, I believe perceptions of language stability would also fall (whether such opinions were justified or not).

I believe isolating the increased pace of change to the standard library, and clearly delineating it with a separate version number will greatly reassure the rest of the community that no, we're not suddenly asking them to triple their own rate of development. Instead, we're merely going to ship standard library updates for the next language release in 6-monthly installments rather than delaying them all until the next language definition update, even those changes that are backwards compatible with the previously released version of Python.

The community benefits listed in PEP 407 are equally applicable to this PEP, at least as far as the standard library is concerned:

People who value reactivity and access to new features (without taking the risk to install alpha versions or Mercurial snapshots) would get much more value from the new release cycle than currently.

People who want to contribute new features or improvements would be more motivated to do so, knowing that their contributions will be more quickly available to normal users.

If the faster release cycle encourages more people to focus on contributing to the standard library rather than proposing changes to the language definition, I don't see that as a bad thing.

Handling News Updates

What's New?

The "What's New" documents would be split out into separate documents for standard library releases and language releases. So, during the 3.3 release cycle, we would see:

  • What's New in Python 3.3?
  • What's New in the Python Standard Library 33.1?
  • What's New in the Python Standard Library 33.2?
  • What's New in the Python Standard Library 33.3?

And then finally, we would see the next language release:

  • What's New in Python 3.4?

For the benefit of users that ignore standard library releases, the 3.4 What's New would link back to the What's New documents for each of the standard library releases in the 3.3 series.

NEWS

Merge conflicts on the NEWS file are already a hassle, and since this PEP proposes the introduction of an additional branch into the normal workflow, resolving them would become even more critical. While Mercurial phases may help to some degree, it would be good to eliminate the problem entirely.

One suggestion from Barry Warsaw is to adopt a non-conflicting separate-files-per-change approach, similar to that used by Twisted [2].

Given that the current manually updated NEWS file will be used for the 3.3.0 release, one possible layout for such an approach might look like:

Misc/
  NEWS  # Now autogenerated from news_entries
  news_entries/
    3.3/
      NEWS # Original 3.3 NEWS file
      maint.1/ # Maintenance branch changes
        core/
          <news entries>
        builtins/
          <news entries>
        extensions/
          <news entries>
        library/
          <news entries>
        documentation/
          <news entries>
        tests/
          <news entries>
      compat.1/ # Compatibility branch changes
        builtins/
          <news entries>
        extensions/
          <news entries>
        library/
          <news entries>
        documentation/
          <news entries>
        tests/
          <news entries>
      # Add maint.2, compat.2 etc as releases are made
    3.4/
      core/
        <news entries>
      builtins/
        <news entries>
      extensions/
        <news entries>
      library/
        <news entries>
      documentation/
        <news entries>
      tests/
        <news entries>
      # Add maint.1, compat.1 etc as releases are made

Putting the version information in the directory hierarchy isn't strictly necessary (since the NEWS file generator could figure it out from the version history), but it does make it easier for humans to keep the different versions in order.
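No such generator exists yet, but assembling NEWS from a tree like the one above is straightforward. The following sketch (invented function name, category names taken from the layout) merges one release directory's entries into a single text section:

```python
import os

# Sketch of a NEWS generator for the layout above (no such tool exists
# yet; the category names follow the proposed hierarchy).
CATEGORIES = ["core", "builtins", "extensions", "library",
              "documentation", "tests"]

def build_news(entries_dir):
    """Merge per-change entry files under entries_dir into NEWS text."""
    lines = []
    for category in CATEGORIES:
        cat_dir = os.path.join(entries_dir, category)
        if not os.path.isdir(cat_dir):
            continue
        lines.append(category.title())
        lines.append("-" * len(category))
        # One file per change is what avoids merge conflicts: each
        # entry is added as a new file rather than edited into NEWS.
        for name in sorted(os.listdir(cat_dir)):
            with open(os.path.join(cat_dir, name)) as f:
                lines.append("- " + f.read().strip())
        lines.append("")
    return "\n".join(lines)
```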

Other benefits of reduced version coupling

Slowing down the language release cycle

The current release cycle is a compromise between the desire for stability in the core language definition and C extension ABI, and the desire to get new features (most notably standard library updates) into users' hands more quickly.

Decoupling the standard library release cycle (to some degree) from that of the core language definition also provides an opportunity to slow down the rate of change in the language definition. The language moratorium for Python 3.2 effectively slowed that cycle down to more than 3 years (3.1: June 2009, 3.3: August 2012) without causing any major problems or complaints.

The NEWS file management scheme described above is actually designed to allow us the flexibility to slow down language releases at the same time as standard library releases become more frequent.

As a simple example, if a full two years was allowed between 3.3 and 3.4, the 3.3 release cycle would end up looking like:

3.2.4       # Maintenance release
3.3.0       # Language release

3.3.1       # Maintenance release
3.3 (33.1)  # Standard library release

3.3.2       # Maintenance release
3.3 (33.2)  # Standard library release

3.3.3       # Maintenance release
3.3 (33.3)  # Standard library release

3.3.4       # Maintenance release
3.3 (33.4)  # Standard library release
3.4.0       # Language release

The elegance of the proposed branch structure and NEWS entry layout is that this decision wouldn't really need to be made until shortly before the planned 3.4 release date. At that point, the decision could be made to postpone the 3.4 release and keep the 3.3 and 3.3-compat branches open after the 3.3.3 maintenance release and the 3.3 (33.3) standard library release, thus adding another standard library release to the cycle. The choice between another standard library release or a full language release would then be available every 6 months after that.

Further increasing the pace of standard library development

As noted in the previous section, one benefit of the scheme proposed in this PEP is that it largely decouples the language release cycle from the standard library release cycle. The standard library could be updated every 3 months, or even once a month, without having any flow-on effects on the language version numbering or the perceived stability of the core language.

While that pace of development isn't practical as long as binary installer creation for Windows and Mac OS X involves several manual steps (including manual testing), and as long as we don't have separate "<branch>-release" trees that only receive versions marked as good by the stable buildbots, it's still a useful criterion to keep in mind when considering proposed versioning schemes: what if we eventually want to make standard library releases even faster than every 6 months?

If the practical issues were ever resolved, then the separate standard library versioning scheme in this PEP could handle it. The tagged version number approach proposed in PEP 407 could not (at least, not without a lot of user confusion and uncertainty).

Other Questions

Why not use the major version number?

The simplest and most logical solution would actually be to map the major.minor.micro version numbers to the language version, stdlib version and maintenance release version respectively.

Instead of releasing Python 3.3.0, we would instead release Python 4.0.0 and the release cycle would look like:

4.0.0  # Language release

4.0.1  # Maintenance release
4.1.0  # Standard library release

4.0.2  # Maintenance release
4.2.0  # Standard library release

4.0.3  # Maintenance release
4.3.0  # Standard library release
5.0.0  # Language release

However, the ongoing pain of the Python 2 -> Python 3 transition (and associated workarounds like the python3 and python2 symlinks to refer directly to the desired release series) means that this simple option isn't viable for historical reasons.

One way that this simple approach could be made to work is to merge the current major and minor version numbers directly into a 2-digit major version number:

33.0.0  # Language release

33.0.1  # Maintenance release
33.1.0  # Standard library release

33.0.2  # Maintenance release
33.2.0  # Standard library release

33.0.3  # Maintenance release
33.3.0  # Standard library release
34.0.0  # Language release

Why not use a four part version number?

Another simple versioning scheme would just add a "standard library" version into the existing versioning scheme:

3.3.0.0  # Language release

3.3.0.1  # Maintenance release
3.3.1.0  # Standard library release

3.3.0.2  # Maintenance release
3.3.2.0  # Standard library release

3.3.0.3  # Maintenance release
3.3.3.0  # Standard library release
3.4.0.0  # Language release

However, this scheme isn't viable due to backwards compatibility constraints on the sys.version_info structure.

Why not use a date-based versioning scheme?

Earlier versions of this PEP proposed a date-based versioning scheme for the standard library. However, such a scheme made it very difficult to handle out-of-cycle releases to fix security issues and other critical bugs in standard library releases, as it required the following steps:

  1. Change the release version number to the date of the current month.
  2. Update the What's New, NEWS and documentation to refer to the new release number.
  3. Make the new release.

With the sequential scheme now proposed, such releases should at most require a little tidying up of the What's New document before making the release.

Why isn't PEP 384 enough?

PEP 384 introduced the notion of a "Stable ABI" for CPython, a limited subset of the full C ABI that is guaranteed to remain stable. Extensions built against the stable ABI should be able to support all subsequent Python versions with the same binary.

This will help new projects to avoid coupling their C extension modules too closely to a specific version of CPython. For existing modules, however, migrating to the stable ABI can involve quite a lot of work (especially for extension modules that define a lot of classes). With limited development resources available, any time spent on such a change is time that could otherwise have been spent working on features that offer more direct benefits to end users.

There are also other benefits to separate versioning (as described above) that are not directly related to the question of binary compatibility with third party C extensions.

Why no binary compatible additions to the C ABI in standard library releases?

There's a case to be made that additions to the CPython C ABI could reasonably be permitted in standard library releases. This would give C extension authors the same freedom as any other package or module author to depend either on a particular language version or on a standard library version.

The PEP currently associates the interpreter version with the language version, and therefore limits major interpreter changes (including C ABI additions) to the language releases.

An alternative, internally consistent, approach would be to link the interpreter version with the standard library version, with only changes that may affect backwards compatibility limited to language releases.

Under such a scheme, the following changes would be acceptable in standard library releases:

  • Standard library updates
    • new features in pure Python modules
    • new features in C extension modules (subject to PEP 399 compatibility requirements)
    • new features in language builtins
  • Interpreter implementation updates
    • binary compatible additions to the C ABI
    • changes to the compilation toolchain that do not affect the AST or alter the bytecode magic number
    • changes to the core interpreter eval loop
  • bug fixes from the corresponding maintenance release

And the following changes would be acceptable in language releases:

  • new language syntax
  • any updates acceptable in a standard library release
  • new deprecation warnings
  • removal of previously deprecated features
  • changes to the AST
  • changes to the emitted bytecode that require altering the magic number
  • binary incompatible changes to the C ABI (although the PEP 384 stable ABI must still be preserved)

While such an approach could probably be made to work, there does not appear to be a compelling justification for it, and the approach currently described in the PEP is simpler and easier to explain.

Why not separate out the standard library entirely?

A concept that is occasionally discussed is the idea of making the standard library truly independent from the CPython reference implementation.

My personal opinion is that actually making such a change would involve a lot of work for next to no pay-off. CPython without the standard library is useless (the build chain won't even run, let alone the test suite). You also can't create a standalone pure Python standard library either, because too many "standard library modules" are actually tightly linked in to the internal details of their respective interpreters (for example, the builtins, weakref, gc, sys, inspect, ast).

Creating a separate CPython development branch that is kept compatible with the previous language release, and making releases from that branch that are identified with a separate standard library version number should provide most of the benefits of a separate standard library repository with only a fraction of the pain.

Acknowledgements

Thanks go to the PEP 407 authors for starting this discussion, as well as to those authors and Larry Hastings for initial discussions of the proposal made in this PEP.

References

[1]PEP 407: New release cycle and introducing long-term support versions http://www.python.org/dev/peps/pep-0407/
[2]Twisted's "topfiles" approach to NEWS generation http://twistedmatrix.com/trac/wiki/ReviewProcess#Newsfiles

pep-0414 Explicit Unicode Literal for Python 3.3

PEP:414
Title:Explicit Unicode Literal for Python 3.3
Version:$Revision$
Last-Modified:$Date$
Author:Armin Ronacher <armin.ronacher at active-4.com>, Nick Coghlan <ncoghlan at gmail.com>
Status:Final
Type:Standards Track
Content-Type:text/x-rst
Created:15-Feb-2012
Post-History:28-Feb-2012, 04-Mar-2012
Resolution:http://mail.python.org/pipermail/python-dev/2012-February/116995.html

Abstract

This document proposes the reintegration of an explicit unicode literal from Python 2.x to the Python 3.x language specification, in order to reduce the volume of changes needed when porting Unicode-aware Python 2 applications to Python 3.

BDFL Pronouncement

This PEP has been formally accepted for Python 3.3:

I'm accepting the PEP. It's about as harmless as they come. Make it so.

Proposal

This PEP proposes that Python 3.3 restore support for Python 2's Unicode literal syntax, substantially increasing the number of lines of existing Python 2 code in Unicode aware applications that will run without modification on Python 3.

Specifically, the Python 3 definition for string literal prefixes will be expanded to allow:

"u" | "U"

in addition to the currently supported:

"r" | "R"

The following will all denote ordinary Python 3 strings:

'text'
"text"
'''text'''
"""text"""
u'text'
u"text"
u'''text'''
u"""text"""
U'text'
U"text"
U'''text'''
U"""text"""

No changes are proposed to Python 3's actual Unicode handling, only to the acceptable forms for string literals.
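Under this proposal the prefix is purely cosmetic in Python 3.3+; the prefixed and unprefixed forms produce identical str objects:

```python
# The u/U prefix is accepted but has no effect on the resulting object:
assert u'text' == 'text'
assert type(u'text') is str
assert u'''text''' == U"text"
```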

Exclusion of "Raw" Unicode Literals

Python 2 supports a concept of "raw" Unicode literals that don't meet the conventional definition of a raw string: \uXXXX and \UXXXXXXXX escape sequences are still processed by the compiler and converted to the appropriate Unicode code points when creating the associated Unicode objects.

Python 3 has no corresponding concept - the compiler performs no preprocessing of the contents of raw string literals. This matches the behaviour of 8-bit raw string literals in Python 2.

Since such strings are rarely used and would be interpreted differently in Python 3 if permitted, it was decided that leaving them out entirely was a better choice. Code which uses them will thus still fail immediately on Python 3 (with a Syntax Error), rather than potentially producing different output.

To get equivalent behaviour that will run on both Python 2 and Python 3, either an ordinary Unicode literal can be used (with appropriate additional escaping within the string), or else string concatenation or string formatting can be used to combine the raw portions of the string with those that require the use of Unicode escape sequences.

Note that when using from __future__ import unicode_literals in Python 2, the nominally "raw" Unicode string literals will process \uXXXX and \UXXXXXXXX escape sequences, just like Python 2 strings explicitly marked with the "raw Unicode" prefix.
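For example, the effect of a Python 2 "raw" Unicode literal such as ur'\u03B1 \d+' (where the \u escape is processed but \d is not) can be reproduced in the common subset by concatenating an ordinary Unicode literal with a raw literal:

```python
# Cross-version equivalent of Python 2's ur'\u03B1 \d+':
# the \u escape goes in an ordinary literal, the raw part stays raw.
pattern = u'\u03B1' + r' \d+'
```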

Author's Note

This PEP was originally written by Armin Ronacher, and Guido's approval was given based on that version.

The currently published version has been rewritten by Nick Coghlan to include additional historical details and rationale that were taken into account when Guido made his decision, but were not explicitly documented in Armin's version of the PEP.

Readers should be aware that many of the arguments in this PEP are not technical ones. Instead, they relate heavily to the social and personal aspects of software development.

Rationale

With the release of a Python 3 compatible version of the Web Services Gateway Interface (WSGI) specification (PEP 3333) for Python 3.2, many parts of the Python web ecosystem have been making a concerted effort to support Python 3 without adversely affecting their existing developer and user communities.

One major item of feedback from key developers in those communities, including Chris McDonough (WebOb, Pyramid), Armin Ronacher (Flask, Werkzeug), Jacob Kaplan-Moss (Django) and Kenneth Reitz (requests) is that the requirement to change the spelling of every Unicode literal in an application (regardless of how that is accomplished) is a key stumbling block for porting efforts.

In particular, unlike many of the other Python 3 changes, it isn't one that framework and library authors can easily handle on behalf of their users. Most of those users couldn't care less about the "purity" of the Python language specification, they just want their websites and applications to work as well as possible.

While it is the Python web community that has been most vocal in highlighting this concern, it is expected that other highly Unicode aware domains (such as GUI development) may run into similar issues as they (and their communities) start making concerted efforts to support Python 3.

Common Objections

Complaint: This PEP may harm adoption of Python 3.2

This complaint is interesting, as it carries within it a tacit admission that this PEP will make it easier to port Unicode aware Python 2 applications to Python 3.

There are many existing Python communities that are prepared to put up with the constraints imposed by the existing suite of porting tools, or to update their Python 2 code bases sufficiently that the problems are minimised.

This PEP is not for those communities. Instead, it is designed specifically to help people that don't want to put up with those difficulties.

However, since the proposal is for a comparatively small tweak to the language syntax with no semantic changes, it is feasible to support it as a third party import hook. While such an import hook imposes some import time overhead, and requires additional steps from each application that needs it to get the hook in place, it allows applications that target Python 3.2 to use libraries and frameworks that would otherwise only run on Python 3.3+ due to their use of unicode literal prefixes.

One such import hook project is Vinay Sajip's uprefix [4].

For those that prefer to translate their code in advance rather than converting on the fly at import time, Armin Ronacher is working on a hook that runs at install time rather than during import [5].

Combining the two approaches is of course also possible. For example, the import hook could be used for rapid edit-test cycles during local development, but the install hook for continuous integration tasks and deployment on Python 3.2.

The approaches described in this section may prove useful, for example, for applications that wish to target Python 3 on the Ubuntu 12.04 LTS release, which will ship with Python 2.7 and 3.2 as officially supported Python versions.

Complaint: Python 3 shouldn't be made worse just to support porting from Python 2

This is indeed one of the key design principles of Python 3. However, one of the key design principles of Python as a whole is that "practicality beats purity". If we're going to impose a significant burden on third party developers, we should have a solid rationale for doing so.

In most cases, the rationale for backwards incompatible Python 3 changes are either to improve code correctness (for example, stricter default separation of binary and text data and integer division upgrading to floats when necessary), reduce typical memory usage (for example, increased usage of iterators and views over concrete lists), or to remove distracting nuisances that make Python code harder to read without increasing its expressiveness (for example, the comma based syntax for naming caught exceptions). Changes backed by such reasoning are not going to be reverted, regardless of objections from Python 2 developers attempting to make the transition to Python 3.

In many cases, Python 2 offered two ways of doing things for historical reasons. For example, inequality could be tested with both != and <> and integer literals could be specified with an optional L suffix. Such redundancies have been eliminated in Python 3, which reduces the overall size of the language and improves consistency across developers.

In the original Python 3 design (up to and including Python 3.2), the explicit prefix syntax for unicode literals was deemed to fall into this category, as it is completely unnecessary in Python 3. However, the difference between those other cases and unicode literals is that the unicode literal prefix is not redundant in Python 2 code: it is a programmatically significant distinction that needs to be preserved in some fashion to avoid losing information.

While porting tools were created to help with the transition (see next section) it still creates an additional burden on heavy users of unicode strings in Python 2, solely so that future developers learning Python 3 don't need to be told "For historical reasons, string literals may have an optional u or U prefix. Never use this yourselves, it's just there to help with porting from an earlier version of the language."

Plenty of students learning Python 2 received similar warnings regarding string exceptions without being confused or irreparably stunted in their growth as Python developers. It will be the same with this feature.

This point is further reinforced by the fact that Python 3 still allows the uppercase variants of the B and R prefixes for bytes literals and raw bytes and string literals. If the potential for confusion due to string prefix variants is that significant, where was the outcry asking that these redundant prefixes be removed along with all the other redundancies that were eliminated in Python 3?

Just as support for string exceptions was eliminated from Python 2 using the normal deprecation process, support for redundant string prefix characters (specifically, B, R, u, U) may eventually be eliminated from Python 3, regardless of the current acceptance of this PEP. However, such a change will likely only occur once third party libraries supporting Python 2.7 are about as common as libraries supporting Python 2.2 or 2.3 are today.

Complaint: The WSGI "native strings" concept is an ugly hack

One reason the removal of unicode literals has provoked such concern amongst the web development community is that the updated WSGI specification had to make a few compromises to minimise the disruption for existing web servers that provide a WSGI-compatible interface (this was deemed necessary in order to make the updated standard a viable target for web application authors and web framework developers).

One of those compromises is the concept of a "native string". WSGI defines three different kinds of string:

  • text strings: handled as unicode in Python 2 and str in Python 3
  • native strings: handled as str in both Python 2 and Python 3
  • binary data: handled as str in Python 2 and bytes in Python 3

Some developers consider WSGI's "native strings" to be an ugly hack, as they are explicitly documented as being used solely for latin-1 decoded "text", regardless of the actual encoding of the underlying data. Using this approach bypasses many of the updates to Python 3's data model that are designed to encourage correct handling of text encodings. However, it generally works due to the specific details of the problem domain - web server and web framework developers are some of the individuals most aware of how blurry the line can get between binary data and text when working with HTTP and related protocols, and how important it is to understand the implications of the encodings in use when manipulating encoded text data. At the application level most of these details are hidden from the developer by the web frameworks and support libraries (both in Python 2 and in Python 3).

In practice, native strings are a useful concept because there are some APIs (both in the standard library and in third party frameworks and packages) and some internal interpreter details that are designed primarily to work with str. These components often don't support unicode in Python 2 or bytes in Python 3, or, if they do, require additional encoding details and/or impose constraints that don't apply to the str variants.

Some example of interfaces that are best handled by using actual str instances are:

  • Python identifiers (as attributes, dict keys, class names, module names, import references, etc)
  • URLs for the most part as well as HTTP headers in urllib/http servers
  • WSGI environment keys and CGI-inherited values
  • Python source code for dynamic compilation and AST hacks
  • Exception messages
  • __repr__ return value
  • preferred filesystem paths
  • preferred OS environment

In Python 2.6 and 2.7, these distinctions are most naturally expressed as follows:

  • u"": text string (unicode)
  • "": native string (str)
  • b"": binary data (str, also aliased as bytes)

In Python 3, the latin-1 decoded native strings are not distinguished from any other text strings:

  • "": text string (str)
  • "": native string (str)
  • b"": binary data (bytes)

If from __future__ import unicode_literals is used to modify the behaviour of Python 2, then, along with an appropriate definition of n(), the distinction can be expressed as:

  • "": text string
  • n(""): native string
  • b"": binary data

(While n=str works for simple cases, it can sometimes have problems due to non-ASCII source encodings)

In the common subset of Python 2 and Python 3 (with appropriate specification of a source encoding and definitions of the u() and b() helper functions), they can be expressed as:

  • u(""): text string
  • "": native string
  • b(""): binary data

That last approach is the only variant that supports Python 2.5 and earlier.
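A minimal sketch of such u() and b() helpers, similar in spirit to those offered by the six compatibility library (the Python 2 branch is shown only for illustration and is not exercised on Python 3):

```python
import sys

if sys.version_info[0] >= 3:
    def u(s):
        # already a text string in Python 3
        return s

    def b(s):
        # encode an ASCII/latin-1 source literal to binary data
        return s.encode('latin-1')
else:
    def u(s):
        # decode escape sequences to build a unicode object in Python 2
        return unicode(s.replace(r'\\', r'\\\\'), 'unicode_escape')

    def b(s):
        # a native str is already binary data in Python 2
        return s
```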

Of all the alternatives, the format currently supported in Python 2.6 and 2.7 is by far the cleanest approach that clearly distinguishes the three desired kinds of behaviour. With this PEP, that format will also be supported in Python 3.3+. It will also be supported in Python 3.1 and 3.2 through the use of import and install hooks. While it is significantly less likely, it is also conceivable that the hooks could be adapted to allow the use of the b prefix on Python 2.5.

Complaint: The existing tools should be good enough for everyone

A commonly expressed sentiment from developers that have already successfully ported applications to Python 3 is along the lines of "if you think it's hard, you're doing it wrong" or "it's not that hard, just try it!". While it is no doubt unintentional, these responses all have the effect of telling the people that are pointing out inadequacies in the current porting toolset "there's nothing wrong with the porting tools, you just suck and don't know how to use them properly".

These responses are a case of completely missing the point of what people are complaining about. The feedback that resulted in this PEP isn't due to people complaining that ports aren't possible. Instead, the feedback is coming from people that have successfully completed ports and are objecting that they found the experience thoroughly unpleasant for the class of application that they needed to port (specifically, Unicode aware web frameworks and support libraries).

This is a subjective appraisal, and it's the reason why the Python 3 porting tools ecosystem is a case where the "one obvious way to do it" philosophy emphatically does not apply. While it was originally intended that "develop in Python 2, convert with 2to3, test both" would be the standard way to develop for both versions in parallel, in practice, the needs of different projects and developer communities have proven to be sufficiently diverse that a variety of approaches have been devised, allowing each group to select an approach that best fits their needs.

Lennart Regebro has produced an excellent overview of the available migration strategies [2], and a similar review is provided in the official porting guide [3]. (Note that the official guidance has softened to "it depends on your specific situation" since Lennart wrote his overview).

However, both of those guides are written from the founding assumption that all of the developers involved are already committed to the idea of supporting Python 3. They make no allowance for the social aspects of such a change when you're interacting with a user base that may not be especially tolerant of disruptions without a clear benefit, or are trying to persuade Python 2 focused upstream developers to accept patches that are solely about improving Python 3 forward compatibility.

With the current porting toolset, every migration strategy will result in changes to every Unicode literal in a project. No exceptions. They will be converted to either an unprefixed string literal (if the project decides to adopt the unicode_literals import) or else to a converter call like u("text").

If the unicode_literals import approach is employed, but is not adopted across the entire project at the same time, then the meaning of a bare string literal may become annoyingly ambiguous. This problem can be particularly pernicious for aggregated software, like a Django site - in such a situation, some files may end up using the unicode_literals import and others may not, creating definite potential for confusion.

While these problems are clearly solvable at a technical level, they're a completely unnecessary distraction at the social level. Developer energy should be reserved for addressing real technical difficulties associated with the Python 3 transition (like distinguishing their 8-bit text strings from their binary data). They shouldn't be punished with additional code changes (even automated ones) solely due to the fact that they have already explicitly identified their Unicode strings in Python 2.

Armin Ronacher has created an experimental extension to 2to3 which only modernizes Python code to the extent that it runs on Python 2.7 or later with support from the cross-version compatibility six library. This tool is available as python-modernize [1]. Currently, the deltas generated by this tool will affect every Unicode literal in the converted source. This will create legitimate concerns amongst upstream developers asked to accept such changes, and amongst framework users being asked to change their applications.

However, by eliminating the noise from changes to the Unicode literal syntax, many projects could be cleanly and (comparatively) non-controversially made forward compatible with Python 3.3+ just by running python-modernize and applying the recommended changes.

References

[1]Python-Modernize (http://github.com/mitsuhiko/python-modernize)
[2]Porting to Python 3: Migration Strategies (http://python3porting.com/strategies.html)
[3]Porting Python 2 Code to Python 3 (http://docs.python.org/howto/pyporting.html)
[4]uprefix import hook project (https://bitbucket.org/vinay.sajip/uprefix)
[5]install hook to remove unicode string prefix characters (https://github.com/mitsuhiko/unicode-literals-pep/tree/master/install-hook)

pep-0415 Implement context suppression with exception attributes

PEP:415
Title:Implement context suppression with exception attributes
Version:$Revision$
Last-Modified:$Date$
Author:Benjamin Peterson <benjamin at python.org>
BDFL-Delegate:Nick Coghlan
Status:Final
Type:Standards Track
Content-Type:text/x-rst
Created:26-Feb-2012
Python-Version:3.3
Post-History:26-Feb-2012
Replaces:409
Resolution:http://mail.python.org/pipermail/python-dev/2012-May/119467.html

Abstract

PEP 409 introduced support for the raise exc from None construct to allow the display of the exception context to be explicitly suppressed. This PEP retains the language level changes already implemented in PEP 409, but replaces the underlying implementation mechanism with a simpler approach based on a new __suppress_context__ attribute on all BaseException instances.

PEP Acceptance

This PEP was accepted by Nick Coghlan on the 14th of May, 2012.

Rationale

PEP 409 changes __cause__ to be Ellipsis by default. Then if __cause__ is set to None by raise exc from None, no context or cause will be printed should the exception be uncaught.

The main problem with this scheme is that it complicates the role of __cause__. __cause__ should indicate the cause of the exception, not whether __context__ should be printed. This use of __cause__ is also not easily extended in the future. For example, we may someday want to allow the programmer to select which of __context__ and __cause__ will be printed. The PEP 409 implementation is not amenable to this.

The use of Ellipsis is a hack. Before PEP 409, Ellipsis was used exclusively in extended slicing. Extended slicing has nothing to do with exceptions, so it's not clear to someone inspecting an exception object why __cause__ should be set to Ellipsis. Using Ellipsis by default for __cause__ makes it asymmetrical with __context__.

Proposal

A new attribute on BaseException, __suppress_context__, will be introduced. Whenever __cause__ is set, __suppress_context__ will be set to True. In particular, raise exc from cause syntax will set exc.__suppress_context__ to True. Exception printing code will check for that attribute to determine whether context and cause will be printed. __cause__ will return to its original purpose and values.

There is precedent for __suppress_context__ in the print_line_and_file exception attribute.

To summarize, raise exc from cause will be equivalent to:

exc.__cause__ = cause
raise exc

where exc.__cause__ = cause implicitly sets exc.__suppress_context__.
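The resulting attribute values can be observed directly on Python 3.3+:

```python
# Demonstration of the PEP 415 semantics:
caught = None
try:
    try:
        1 / 0
    except ZeroDivisionError:
        # "from None" sets __suppress_context__ to True, hiding the
        # ZeroDivisionError from the default traceback display
        raise KeyError('lookup failed') from None
except KeyError as err:
    caught = err

assert caught.__suppress_context__ is True
assert caught.__cause__ is None
# the context is still recorded, just not printed
assert isinstance(caught.__context__, ZeroDivisionError)
```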

Patches

There is a patch on Issue 14133 [1].

pep-0416 Add a frozendict builtin type

PEP:416
Title:Add a frozendict builtin type
Version:$Revision$
Last-Modified:$Date$
Author:Victor Stinner <victor.stinner at gmail.com>
Status:Rejected
Type:Standards Track
Content-Type:text/x-rst
Created:29-February-2012
Python-Version:3.3

Rejection Notice

I'm rejecting this PEP. A number of reasons (not exhaustive):

  • According to Raymond Hettinger, use of frozendict is low. Those that do use it tend to use it as a hint only, such as declaring global or class-level "constants": they aren't really immutable, since anyone can still assign to the name.
  • There are existing idioms for avoiding mutable default values.
  • The potential of optimizing code using frozendict in PyPy is unsure; a lot of other things would have to change first. The same holds for compile-time lookups in general.
  • Multiple threads can agree by convention not to mutate a shared dict, there's no great need for enforcement. Multiple processes can't share dicts.
  • Adding a security sandbox written in Python, even with a limited scope, is frowned upon by many, due to the inherent difficulty with ever proving that the sandbox is actually secure. Because of this we won't be adding one to the stdlib any time soon, so this use case falls outside the scope of a PEP.

On the other hand, exposing the existing read-only dict proxy as a built-in type sounds good to me. (It would need to be changed to allow calling the constructor.) GvR.

Update (2012-04-15): A new MappingProxyType type was added to the types module of Python 3.3.
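The added type behaves as the rejection notice describes: it is a read-only view of the underlying dict, not an independent immutable copy:

```python
from types import MappingProxyType

config = {'debug': False}
proxy = MappingProxyType(config)

# reads work; writes raise TypeError
assert proxy['debug'] is False
try:
    proxy['debug'] = True
except TypeError:
    pass

# the proxy reflects changes made to the underlying dict
config['debug'] = True
assert proxy['debug'] is True
```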

Abstract

Add a new frozendict builtin type.

Rationale

A frozendict is a read-only mapping: a key cannot be added nor removed, and a key is always mapped to the same value. However, frozendict values may be unhashable. A frozendict is hashable if and only if all of its values are hashable.

Use cases:

  • Immutable global variable like a default configuration.
  • Default value of a function parameter. Avoid the issue of mutable default arguments.
  • Implement a cache: frozendict can be used to store function keywords. frozendict can be used as a key of a mapping or as a member of set.
  • frozendict avoids the need for a lock when the frozendict is shared by multiple threads or processes, especially a hashable frozendict. It would also help prohibit coroutines (generators + greenlets) from modifying the global state.
  • frozendict lookups can be done at compile time instead of runtime because the mapping is read-only. frozendict can be used instead of a preprocessor to remove conditional code at compilation time, like code specific to a debug build.
  • frozendict helps to implement read-only object proxies for security modules. For example, it would be possible to use the frozendict type for the __builtins__ mapping or type.__dict__. This is possible because frozendict is compatible with the PyDict C API.
  • frozendict avoids the need for a read-only proxy in some cases. frozendict is faster than a proxy because getting an item from a frozendict is a fast lookup, whereas a proxy requires a function call.

Constraints

  • frozendict has to implement the Mapping abstract base class
  • frozendict keys and values can be unorderable
  • a frozendict is hashable if all keys and values are hashable
  • frozendict hash does not depend on the items creation order

Implementation

  • Add a PyFrozenDictObject structure based on PyDictObject with an extra "Py_hash_t hash;" field
  • frozendict.__hash__() is implemented using hash(frozenset(self.items())) and caches the result in its private hash attribute
  • Register frozendict as a collections.abc.Mapping
  • frozendict can be used with PyDict_GetItem(), but PyDict_SetItem() and PyDict_DelItem() raise a TypeError

Recipe: hashable dict

To ensure that a frozendict is hashable, values can be checked before creating the frozendict:

import itertools

def hashabledict(*args, **kw):
    # validate the values the same way dict() would receive them
    items = itertools.chain(dict(*args).items(), kw.items())
    for key, value in items:
        if isinstance(value, (int, str, bytes, float, frozenset, complex)):
            # skip computing the hash (which may be slow) for builtin
            # types known to be hashable for any value
            continue
        # raises TypeError if the value is unhashable
        hash(value)
        # the keys are not checked: frozendict already checks the keys
    return frozendict(*args, **kw)

Objections

namedtuple may fit the requirements of a frozendict.

A namedtuple is not a mapping: it does not implement the Mapping abstract base class.

"frozendict can be implemented in Python using descriptors" and "frozendict just needs to be practically constant".

If frozendict is used to harden Python (security purpose), it must be implemented in C. A type implemented in C is also faster.

PEP 351 was rejected.

PEP 351 tried to freeze an object, and so may convert a mutable object to an immutable object (using a different type). frozendict doesn't convert anything: hash(frozendict) raises a TypeError if a value is not hashable. Freezing an object is not the purpose of this PEP.

Alternative: dictproxy

Python has a builtin dictproxy type used by the type.__dict__ getter descriptor. This type is not public. dictproxy is a read-only view of a dictionary, but it is not a read-only mapping: if the underlying dictionary is modified, the dictproxy is modified as well.

dictproxy can be accessed using ctypes and the Python C API; see for example the "make dictproxy object via ctypes.pythonapi and type()" recipe (Python recipe 576540) [1] by Ikkei Shimomura. The recipe contains a test checking that a dictproxy is "mutable" (modifying the dictionary linked to the dictproxy).

However, dictproxy can be useful in some cases where its mutability is not an issue, to avoid copying the dictionary.

Existing implementations

Whitelist approach.

  • Implementing an Immutable Dictionary (Python recipe 498072) by Aristotelis Mikropoulos. Similar to frozendict except that it is not truly read-only: it is possible to access its private internal dict. It does not implement __hash__, and has an implementation issue: it is possible to call __init__() again to modify the mapping.
  • PyWebmail contains an ImmutableDict type: webmail.utils.ImmutableDict. It is hashable if keys and values are hashable. It is not truly read-only: its internal dict is a public attribute.
  • remember project: remember.dicts.FrozenDict. It is used to implement a cache: FrozenDict is used to store function callbacks. FrozenDict may be hashable. It has an extra supply_dict() class method to create a FrozenDict from a dict without copying the dict: the dict is stored as the internal dict. Implementation issues: __init__() can be called again to modify the mapping, and the hash may differ depending on item creation order. The mapping is not truly read-only: the internal dict is accessible in Python.

Blacklist approach: inherit from dict and override write methods to raise an exception. It is not truly read-only: it is still possible to call dict methods on such a "frozen dictionary" to modify it.

Hashable dict: inherit from dict and just add a __hash__ method.
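A minimal sketch combining the last two approaches (the class name is illustrative). As the text notes, the blacklist variant is not truly read-only, because dict.__setitem__() can still be called directly on the instance:

```python
class FrozenDict(dict):
    """Blacklist approach plus a hash method; illustrative only."""

    def _readonly(self, *args, **kwargs):
        raise TypeError('FrozenDict is read-only')

    __setitem__ = __delitem__ = _readonly
    clear = pop = popitem = setdefault = update = _readonly

    def __hash__(self):
        # hashable-dict approach: order-independent hash over the items
        return hash(frozenset(self.items()))


fd = FrozenDict(a=1, b=2)
try:
    fd['a'] = 3               # the overridden method raises TypeError...
except TypeError:
    pass
dict.__setitem__(fd, 'a', 3)  # ...but the base class method still mutates it
assert fd['a'] == 3
```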

pep-0417 Including mock in the Standard Library

PEP:417
Title:Including mock in the Standard Library
Version:$Revision$
Last-Modified:$Date$
Author:Michael Foord <michael at python.org>
Status:Final
Type:Standards Track
Content-Type:text/x-rst
Created:12-Mar-2012
Python-Version:3.3
Post-History:12-Mar-2012
Resolution:http://mail.python.org/pipermail/python-dev/2012-March/117507.html

Abstract

This PEP proposes adding the mock [1] testing library to the Python standard library as unittest.mock.

Rationale

Creating mock objects for testing is a common need in Python. Many developers create ad-hoc mocks, as needed, in their test suites. This is currently what we do in the Python test suite, where a standardised mock object library would be helpful.

There are many mock object libraries available for Python [2]. Of these, mock is overwhelmingly the most popular, with as many downloads on PyPI as the other mocking libraries combined.

An advantage of mock is that it is a mocking library and not a framework. It provides a configurable and flexible mock object, without being opinionated about how you write your tests. The mock api is now well battle-tested and stable.

mock also handles safely monkeypatching and unmonkeypatching objects for the duration of a test. This is hard to do safely, and many developers/projects mimic this functionality (often incorrectly). A standardised way to do this, handling the complexity of patching in the presence of the descriptor protocol (etc.), is useful. People are asking for a "patch" [3] feature in unittest. Doing this via mock.patch is preferable to re-implementing part of this functionality in unittest.
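For illustration, a minimal use of mock.patch as a context manager (the patched target here is arbitrary):

```python
from unittest import mock
import os

original = os.getcwd

# patch replaces the attribute only for the duration of the block...
with mock.patch('os.getcwd', return_value='/fake/dir'):
    assert os.getcwd() == '/fake/dir'

# ...and restores it afterwards, even if the body raises
assert os.getcwd is original
```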

Background

Addition of mock to the Python standard library was discussed and agreed to at the Python Language Summit 2012.

Open Issues

As of release 0.8, which is current at the time of writing, mock is compatible with Python 2.4-3.2. Moving into the Python standard library will allow for the removal of some Python 2 specific "compatibility hacks".

mock 0.8 introduced a new feature, "auto-speccing", which obsoletes an older mock feature called "mocksignature". The "mocksignature" functionality can be removed from mock altogether prior to inclusion.

pep-0418 Add monotonic time, performance counter, and process time functions

PEP:418
Title:Add monotonic time, performance counter, and process time functions
Version:$Revision$
Last-Modified:$Date$
Author:Cameron Simpson <cs at zip.com.au>, Jim Jewett <jimjjewett at gmail.com>, Stephen J. Turnbull <stephen at xemacs.org>, Victor Stinner <victor.stinner at gmail.com>
Status:Final
Type:Standards Track
Content-Type:text/x-rst
Created:26-March-2012
Python-Version:3.3

Abstract

This PEP proposes to add time.get_clock_info(name), time.monotonic(), time.perf_counter() and time.process_time() functions to Python 3.3.

Rationale

If a program uses the system time to schedule events or to implement a timeout, it may fail to run events at the right moment, or stop the timeout too early or too late, when the system time is changed manually or adjusted automatically by NTP. A monotonic clock, which is not affected by system time updates, should be used instead: time.monotonic().

To measure the performance of a function, time.clock() can be used, but it behaves very differently on Windows and on Unix. On Windows, time.clock() includes time elapsed during sleep, whereas it does not on Unix. time.clock() resolution is very good on Windows, but very poor on Unix. The new time.perf_counter() function should be used instead to always get the most precise performance counter with portable behaviour (for example, it includes time spent during sleep).

Until now, Python did not directly provide a portable function to measure CPU time. time.clock() can be used on Unix, but it has poor resolution. resource.getrusage() or os.times() can also be used on Unix, but they require computing the sum of time spent in kernel space and user space. The new time.process_time() function acts as a portable counter that always measures CPU time (excluding time elapsed during sleep) and has the best available resolution.

Each operating system implements clocks and performance counters differently, and it is useful to know exactly which function is used and some properties of the clock like its resolution. The new time.get_clock_info() function gives access to all available information about each Python time function.

New functions:

  • time.monotonic(): timeout and scheduling, not affected by system clock updates
  • time.perf_counter(): benchmarking, most precise clock for short period
  • time.process_time(): profiling, CPU time of the process

Users of new functions:

  • time.monotonic(): concurrent.futures, multiprocessing, queue, subprocess, telnet and threading modules to implement timeout
  • time.perf_counter(): trace and timeit modules, pybench program
  • time.process_time(): profile module
  • time.get_clock_info(): pybench program to display information about the timer like the resolution
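Typical uses of the three new clock functions can be sketched as follows (assumes Python 3.3 or later):

```python
import time

# Benchmarking: use the most precise performance counter available.
start = time.perf_counter()
sum(range(10**5))
elapsed = time.perf_counter() - start

# Profiling: measure CPU time consumed by the process.
cpu_start = time.process_time()
sum(range(10**5))
cpu_elapsed = time.process_time() - cpu_start

# Timeout: a deadline immune to system clock updates.
deadline = time.monotonic() + 0.01
while time.monotonic() < deadline:
    pass  # poll until the deadline expires

print("elapsed: %.6f s, CPU: %.6f s" % (elapsed, cpu_elapsed))
```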

The time.clock() function is deprecated because it is not portable: it behaves differently depending on the operating system. time.perf_counter() or time.process_time() should be used instead, depending on your requirements. time.clock() is marked as deprecated but is not planned for removal.

Limitations:

  • The behaviour of clocks after a system suspend is not defined in the documentation of the new functions. The behaviour depends on the operating system: see the Monotonic Clocks section below. Some recent operating systems provide two clocks, one including time elapsed during system suspend, one not including this time. Most operating systems only provide one kind of clock.
  • time.monotonic() and time.perf_counter() may or may not be adjusted. For example, CLOCK_MONOTONIC is slewed on Linux, whereas GetTickCount() is not adjusted on Windows. time.get_clock_info('monotonic')['adjustable'] can be used to check if the monotonic clock is adjustable or not.
  • No time.thread_time() function is proposed by this PEP because it is not needed by the Python standard library, nor is it a commonly requested feature. Such a function would only be available on Windows and Linux. On Linux, it is possible to use time.clock_gettime(CLOCK_THREAD_CPUTIME_ID). On Windows, ctypes or another module can be used to call the GetThreadTimes() function.
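The thread CPU time mentioned above can already be read on Linux without any new function; a small sketch, guarded because the clock is not available on every platform:

```python
import time

if hasattr(time, "clock_gettime") and hasattr(time, "CLOCK_THREAD_CPUTIME_ID"):
    # CPU time consumed by the current thread, in seconds
    print(time.clock_gettime(time.CLOCK_THREAD_CPUTIME_ID))
else:
    print("thread CPU time is not available on this platform")
```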

Python functions

New Functions

time.get_clock_info(name)

Get information on the specified clock. Supported clock names:

  • "clock": time.clock()
  • "monotonic": time.monotonic()
  • "perf_counter": time.perf_counter()
  • "process_time": time.process_time()
  • "time": time.time()

Return a time.clock_info object which has the following attributes:

  • implementation (str): name of the underlying operating system function. Examples: "QueryPerformanceCounter()", "clock_gettime(CLOCK_REALTIME)".
  • monotonic (bool): True if the clock cannot go backward.
  • adjustable (bool): True if the clock can be changed automatically (e.g. by a NTP daemon) or manually by the system administrator, False otherwise
  • resolution (float): resolution in seconds of the clock.
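For example, the metadata of the monotonic clock can be inspected like this (the implementation string differs by platform):

```python
import time

info = time.get_clock_info('monotonic')
print(info.implementation)  # e.g. "clock_gettime(CLOCK_MONOTONIC)" on Linux
print(info.monotonic)       # True: the clock cannot go backward
print(info.adjustable)
print(info.resolution)      # resolution in seconds
```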

time.monotonic()

Monotonic clock, i.e. cannot go backward. It is not affected by system clock updates. The reference point of the returned value is undefined, so that only the difference between the results of consecutive calls is valid and is a number of seconds.

On Windows versions older than Vista, time.monotonic() detects GetTickCount() integer overflow (32 bits, roll-over after 49.7 days). It increases an internal epoch (reference time) by 2**32 each time an overflow is detected. The epoch is stored in process-local state, so the value of time.monotonic() may differ in two Python processes running for more than 49 days. On more recent versions of Windows and on other operating systems, time.monotonic() is system-wide.

Availability: Windows, Mac OS X, Linux, FreeBSD, OpenBSD, Solaris. Not available on GNU/Hurd.

Pseudo-code [2]:

if os.name == 'nt':
    # GetTickCount64() requires Windows Vista, Server 2008 or later
    if hasattr(_time, 'GetTickCount64'):
        def monotonic():
            return _time.GetTickCount64() * 1e-3
    else:
        def monotonic():
            ticks = _time.GetTickCount()
            if ticks < monotonic.last:
                # Integer overflow detected
                monotonic.delta += 2**32
            monotonic.last = ticks
            return (ticks + monotonic.delta) * 1e-3
        monotonic.last = 0
        monotonic.delta = 0

elif sys.platform == 'darwin':
    def monotonic():
        if monotonic.factor is None:
            timebase = _time.mach_timebase_info()
            monotonic.factor = timebase[0] / timebase[1] * 1e-9
        return _time.mach_absolute_time() * monotonic.factor
    monotonic.factor = None

elif hasattr(time, "clock_gettime") and hasattr(time, "CLOCK_HIGHRES"):
    def monotonic():
        return time.clock_gettime(time.CLOCK_HIGHRES)

elif hasattr(time, "clock_gettime") and hasattr(time, "CLOCK_MONOTONIC"):
    def monotonic():
        return time.clock_gettime(time.CLOCK_MONOTONIC)

On Windows, QueryPerformanceCounter() is not used even though it has a better resolution than GetTickCount(). It is not reliable and has too many issues.

time.perf_counter()

Performance counter with the highest available resolution to measure a short duration. It does include time elapsed during sleep and is system-wide. The reference point of the returned value is undefined, so that only the difference between the results of consecutive calls is valid and is a number of seconds.

It is available on all platforms.

Pseudo-code:

if os.name == 'nt':
    def _win_perf_counter():
        if _win_perf_counter.frequency is None:
            _win_perf_counter.frequency = _time.QueryPerformanceFrequency()
        return _time.QueryPerformanceCounter() / _win_perf_counter.frequency
    _win_perf_counter.frequency = None

def perf_counter():
    if perf_counter.use_performance_counter:
        try:
            return _win_perf_counter()
        except OSError:
            # QueryPerformanceFrequency() fails if the installed
            # hardware does not support a high-resolution performance
            # counter
            perf_counter.use_performance_counter = False
    if perf_counter.use_monotonic:
        # The monotonic clock is preferred over the system time
        try:
            return time.monotonic()
        except OSError:
            perf_counter.use_monotonic = False
    return time.time()
perf_counter.use_performance_counter = (os.name == 'nt')
perf_counter.use_monotonic = hasattr(time, 'monotonic')

time.process_time()

Sum of the system and user CPU time of the current process. It does not include time elapsed during sleep. It is process-wide by definition. The reference point of the returned value is undefined, so that only the difference between the results of consecutive calls is valid.

It is available on all platforms.

Pseudo-code [2]:

if os.name == 'nt':
    def process_time():
        handle = _time.GetCurrentProcess()
        process_times = _time.GetProcessTimes(handle)
        return (process_times['UserTime'] + process_times['KernelTime']) * 1e-7
else:
    try:
        import resource
    except ImportError:
        has_resource = False
    else:
        has_resource = True

    def process_time():
        if process_time.clock_id is not None:
            try:
                return time.clock_gettime(process_time.clock_id)
            except OSError:
                process_time.clock_id = None
        if process_time.use_getrusage:
            try:
                usage = resource.getrusage(resource.RUSAGE_SELF)
                return usage[0] + usage[1]
            except OSError:
                process_time.use_getrusage = False
        if process_time.use_times:
            try:
                times = _time.times()
                cpu_time = times.tms_utime + times.tms_stime
                return cpu_time / process_time.ticks_per_seconds
            except OSError:
                process_time.use_times = False
        return _time.clock()
    if (hasattr(time, 'clock_gettime')
        and hasattr(time, 'CLOCK_PROF')):
        process_time.clock_id = time.CLOCK_PROF
    elif (hasattr(time, 'clock_gettime')
          and hasattr(time, 'CLOCK_PROCESS_CPUTIME_ID')):
        process_time.clock_id = time.CLOCK_PROCESS_CPUTIME_ID
    else:
        process_time.clock_id = None
    process_time.use_getrusage = has_resource
    process_time.use_times = hasattr(_time, 'times')
    if process_time.use_times:
        # sysconf("SC_CLK_TCK"), or the HZ constant, or 60
        process_time.ticks_per_seconds = _time.ticks_per_seconds
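The contrast between the two new counters can be seen across a sleep: time.perf_counter() includes time elapsed during sleep, time.process_time() does not. A small sketch:

```python
import time

wall_start = time.perf_counter()
cpu_start = time.process_time()
time.sleep(0.1)
wall = time.perf_counter() - wall_start
cpu = time.process_time() - cpu_start

# wall is close to 0.1 s; cpu is close to 0 s
print("wall: %.3f s, cpu: %.3f s" % (wall, cpu))
```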

Existing Functions

time.time()

The system time which is usually the civil time. It is system-wide by definition. It can be set manually by the system administrator or automatically by a NTP daemon.

It is available on all platforms and cannot fail.

Pseudo-code [2]:

if os.name == "nt":
    def time():
        return _time.GetSystemTimeAsFileTime()
else:
    def time():
        if hasattr(time, "clock_gettime"):
            try:
                return time.clock_gettime(time.CLOCK_REALTIME)
            except OSError:
                # CLOCK_REALTIME is not supported (unlikely)
                pass
        if hasattr(_time, "gettimeofday"):
            try:
                return _time.gettimeofday()
            except OSError:
                # gettimeofday() should not fail
                pass
        if hasattr(_time, "ftime"):
            return _time.ftime()
        else:
            return _time.time()

time.sleep()

Suspend execution for the given number of seconds. The actual suspension time may be less than that requested because any caught signal will terminate the time.sleep() following execution of that signal's catching routine. Also, the suspension time may be longer than requested by an arbitrary amount because of the scheduling of other activity in the system.

Pseudo-code [2]:

try:
    import select
except ImportError:
    has_select = False
else:
    has_select = hasattr(select, "select")

if has_select:
    def sleep(seconds):
        return select.select([], [], [], seconds)

elif hasattr(_time, "delay"):
    def sleep(seconds):
        milliseconds = int(seconds * 1000)
        _time.delay(milliseconds)

elif os.name == "nt":
    def sleep(seconds):
        milliseconds = int(seconds * 1000)
        win32api.ResetEvent(sleep.sigint_event)
        win32api.WaitForSingleObject(sleep.sigint_event, milliseconds)

    sleep.sigint_event = win32api.CreateEvent(NULL, TRUE, FALSE, FALSE)
    # SetEvent(sleep.sigint_event) will be called by the signal handler of SIGINT

elif os.name == "os2":
    def sleep(seconds):
        milliseconds = int(seconds * 1000)
        DosSleep(milliseconds)

else:
    def sleep(seconds):
        seconds = int(seconds)
        _time.sleep(seconds)

Deprecated Function

time.clock()

On Unix, return the current processor time as a floating point number expressed in seconds. It is process-wide by definition. The resolution, and in fact the very definition of the meaning of "processor time", depends on that of the C function of the same name, but in any case, this is the function to use for benchmarking Python or timing algorithms.

On Windows, this function returns wall-clock seconds elapsed since the first call to this function, as a floating point number, based on the Win32 function QueryPerformanceCounter(). The resolution is typically better than one microsecond. It is system-wide.

Pseudo-code [2]:

if os.name == 'nt':
    def clock():
        try:
            return _win_perf_counter()
        except OSError:
            # QueryPerformanceFrequency() fails if the installed
            # hardware does not support a high-resolution performance
            # counter
            pass
        return _time.clock()
else:
    clock = _time.clock

Alternatives: API design

Other names for time.monotonic()

  • time.counter()
  • time.metronomic()
  • time.seconds()
  • time.steady(): "steady" is ambiguous: it means different things to different people. For example, on Linux, CLOCK_MONOTONIC is adjusted. If we use real time as the reference clock, we may say that CLOCK_MONOTONIC is steady. But CLOCK_MONOTONIC gets suspended on system suspend, whereas real time includes any time spent in suspend.
  • time.timeout_clock()
  • time.wallclock(): time.monotonic() is not the system time aka the "wall clock", but a monotonic clock with an unspecified starting point.

The name "time.try_monotonic()" was also proposed for an older version of time.monotonic() which would fall back to the system time when no monotonic clock was available.

Other names for time.perf_counter()

  • time.high_precision()
  • time.highres()
  • time.hires()
  • time.performance_counter()
  • time.timer()

Only expose operating system clocks

Defining high-level clocks is a difficult task. A simpler approach is to only expose operating system clocks; for example, time.clock_gettime() and the related clock identifiers were already added to Python 3.3.

time.monotonic(): Fallback to system time

If no monotonic clock is available, time.monotonic() falls back to the system time.

Issues:

  • It is hard to define such a function correctly in the documentation: is it monotonic? Is it steady? Is it adjusted?
  • Some users want to decide what to do when no monotonic clock is available: use another clock, display an error, or do something else.

Different APIs were proposed to define such a function.

One function with a flag: time.monotonic(fallback=True)

  • time.monotonic(fallback=True) falls back to the system time if no monotonic clock is available or if the monotonic clock failed.
  • time.monotonic(fallback=False) raises OSError if monotonic clock fails and NotImplementedError if the system does not provide a monotonic clock

A keyword argument that is passed as a constant by the caller is usually a poor API.

Raising NotImplementedError for a function is something uncommon in Python and should be avoided.

One time.monotonic() function, no flag

time.monotonic() returns (time: float, is_monotonic: bool).

An alternative is to use a function attribute: time.monotonic.is_monotonic. The attribute value would be None before the first call to time.monotonic().
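A rough sketch of this rejected function-attribute variant (the function name is hypothetical):

```python
import time

def monotonic_or_time():
    """Return a clock value; record which clock was used in .is_monotonic."""
    try:
        value = time.monotonic()
        monotonic_or_time.is_monotonic = True
    except (AttributeError, OSError):
        value = time.time()
        monotonic_or_time.is_monotonic = False
    return value

# None until the first call, as described above
monotonic_or_time.is_monotonic = None

monotonic_or_time()
print(monotonic_or_time.is_monotonic)
```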

Choosing the clock from a list of constraints

The PEP as proposed offers a few new clocks, but their guarantees are deliberately loose in order to offer useful clocks on different platforms. This inherently embeds policy in the calls, and the caller must thus choose a policy.

The "choose a clock" approach suggests an additional API to let callers implement their own policy if necessary by making most platform clocks available and letting the caller pick amongst them. The PEP's suggested clocks are still expected to be available for the common simple use cases.

To do this two facilities are needed: an enumeration of clocks, and metadata on the clocks to enable the user to evaluate their suitability.

The primary interface is a function to make simple choices easy: the caller can use time.get_clock(*flags) with some combination of flags. These include at least:

  • time.MONOTONIC: clock cannot go backward
  • time.STEADY: clock rate is steady
  • time.ADJUSTED: clock may be adjusted, for example by NTP
  • time.HIGHRES: clock with the highest resolution

It returns a clock object with a .now() method returning the current time. The clock object is annotated with metadata describing the clock feature set; its .flags field will contain at least all the requested flags.

time.get_clock() returns None if no matching clock is found and so calls can be chained using the or operator. Example of a simple policy decision:

T = get_clock(MONOTONIC) or get_clock(STEADY) or get_clock()
t = T.now()

The available clocks always at least include a wrapper for time.time(), so a final call with no flags can always be used to obtain a working clock.

Examples of flags of system clocks:

  • QueryPerformanceCounter: MONOTONIC | HIGHRES
  • GetTickCount: MONOTONIC | STEADY
  • CLOCK_MONOTONIC: MONOTONIC | STEADY (or only MONOTONIC on Linux)
  • CLOCK_MONOTONIC_RAW: MONOTONIC | STEADY
  • gettimeofday(): (no flag)

The clock objects contain other metadata, including the clock flags (with additional feature flags beyond those listed above), the name of the underlying OS facility, and clock precisions.

time.get_clock() still chooses a single clock; an enumeration facility is also required. The most obvious method is to offer time.get_clocks() with the same signature as time.get_clock(), but returning a sequence of all clocks matching the requested flags. Requesting no flags would thus enumerate all available clocks, allowing the caller to make an arbitrary choice amongst them based on their metadata.

Example partial implementation: clockutils.py.
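A minimal sketch of such an API, wrapping only time.monotonic() and time.time() (the flag values and the Clock class are hypothetical, loosely modelled on clockutils.py):

```python
import time

# Hypothetical feature flags
MONOTONIC = 1 << 0
STEADY = 1 << 1
HIGHRES = 1 << 2

class Clock:
    def __init__(self, flags, now):
        self.flags = flags   # metadata describing the clock feature set
        self.now = now       # function returning the current time

_CLOCKS = [
    Clock(MONOTONIC | STEADY, time.monotonic),
    Clock(0, time.time),     # plain system time: no guarantees
]

def get_clock(*flags):
    """Return the first clock providing all requested flags, or None."""
    wanted = 0
    for flag in flags:
        wanted |= flag
    for clock in _CLOCKS:
        if clock.flags & wanted == wanted:
            return clock
    return None

def get_clocks(*flags):
    """Enumerate all clocks providing the requested flags."""
    wanted = 0
    for flag in flags:
        wanted |= flag
    return [c for c in _CLOCKS if c.flags & wanted == wanted]

# Chained policy decision, as in the example above
T = get_clock(MONOTONIC) or get_clock(STEADY) or get_clock()
t = T.now()
```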

Working around operating system bugs?

Should Python ensure that a monotonic clock is truly monotonic by computing the maximum of the current clock value and the previous value?

Since it's relatively straightforward to cache the last value returned using a static variable, it might be interesting to use this to make sure that the values returned are indeed monotonic.

  • Virtual machines provide less reliable clocks.
  • QueryPerformanceCounter() has known bugs (only one is not fixed yet)

Python may only work around a specific known operating system bug: KB274323 [4] contains a code example to work around the bug (using GetTickCount() to detect QueryPerformanceCounter() leaps).

Issues with "correcting" non-monotonicities:

  • if the clock is accidentally set forward by an hour and then back again, you wouldn't have a useful clock for an hour
  • the cache is not shared between processes so different processes wouldn't see the same clock value
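The caching workaround discussed above can be sketched as follows; as noted, the cached value is per process, and in the set-forward-then-back scenario this wrapper would freeze the clock for an hour:

```python
import time

_last = 0.0

def forced_monotonic():
    """time.time() clamped so that it never goes backward in this process."""
    global _last
    _last = max(_last, time.time())
    return _last

a = forced_monotonic()
b = forced_monotonic()
print(b >= a)
```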

Glossary

Accuracy:The amount of deviation of measurements by a given instrument from true values. See also Accuracy and precision. Inaccuracy in clocks may be caused by lack of precision, drift, or an incorrect initial setting of the clock (e.g., timing of threads is inherently inaccurate because perfect synchronization in resetting counters is quite difficult).
Adjusted:Resetting a clock to the correct time. This may be done either with a <Step> or by <Slewing>.
Civil Time:Time of day; external to the system. 10:45:13am is a Civil time; 45 seconds is not. Provided by existing function time.localtime() and time.gmtime(). Not changed by this PEP.
Clock:An instrument for measuring time. Different clocks have different characteristics; for example, a clock with nanosecond <precision> may start to <drift> after a few minutes, while a less precise clock remains accurate for days. This PEP is primarily concerned with clocks which use a unit of seconds.
Counter:A clock which increments each time a certain event occurs. A counter is strictly monotonic, but not a monotonic clock. It can be used to generate a unique (and ordered) timestamp, but these timestamps cannot be mapped to <civil time>; tick creation may well be bursty, with several advances in the same millisecond followed by several days without any advance.
CPU Time:A measure of how much CPU effort has been spent on a certain task. CPU seconds are often normalized (so that a variable number can occur in the same actual second). CPU seconds can be important when profiling, but they do not map directly to user response time, nor are they directly comparable to (real time) seconds.
Drift:The accumulated error against "true" time, as defined externally to the system. Drift may be due to imprecision, or to a difference between the average rate at which clock time advances and that of real time.
Epoch:The reference point of a clock. For clocks providing <civil time>, this is often midnight as the day (and year) rolled over to January 1, 1970. For a <monotonic> clock, the epoch may be undefined (represented as None).
Latency:Delay. By the time a clock call returns, the <real time> has advanced, possibly by more than the precision of the clock.
Monotonic:The characteristics expected of a monotonic clock in practice. Moving in at most one direction; for clocks, that direction is forward. The <clock> should also be <steady>, and should be convertible to a unit of seconds. The tradeoffs often include lack of a defined <epoch> or mapping to <Civil Time>.
Precision:The amount of deviation among measurements of the same physical value by a single instrument. Imprecision in clocks may be caused by a fluctuation of the rate at which clock time advances relative to real time, including clock adjustment by slewing.
Process Time:Time elapsed since the process began. It is typically measured in <CPU time> rather than <real time>, and typically does not advance while the process is suspended.
Real Time:Time in the real world. This differs from <Civil time> in that it is not <adjusted>, but they should otherwise advance in lockstep. It is not related to the "real time" of "Real Time [Operating] Systems". It is sometimes called "wall clock time" to avoid that ambiguity; unfortunately, that introduces different ambiguities.
Resolution:The smallest difference between two physical values that results in a different measurement by a given instrument.
Slew:A slight change to a clock's speed, usually intended to correct <drift> with respect to an external authority.
Stability:Persistence of accuracy. A measure of expected <drift>.
Steady:A clock with high <stability> and relatively high <accuracy> and <precision>. In practice, it is often used to indicate a <monotonic> clock, but places greater emphasis on the consistency of the duration between subsequent ticks.
Step:An instantaneous change in the represented time. Instead of speeding or slowing the clock (<slew>), a single offset is permanently added.
System Time:Time as represented by the Operating System.
Thread Time:Time elapsed since the thread began. It is typically measured in <CPU time> rather than <real time>, and typically does not advance while the thread is idle.
Wallclock:What the clock on the wall says. This is typically used as a synonym for <real time>; unfortunately, wall time is itself ambiguous.

Hardware clocks

List of hardware clocks

  • HPET: An High Precision Event Timer (HPET) chip consists of a 64-bit up-counter (main counter) counting at least at 10 MHz and a set of up to 256 comparators (at least 3). Each HPET can have up to 32 timers. HPET can cause around 3 seconds of drift per day.
  • TSC (Time Stamp Counter): Historically, the TSC increased with every internal processor clock cycle, but now the rate is usually constant (even if the processor changes frequency) and usually equals the maximum processor frequency. Multiple cores have different TSC values. Hibernating the system resets the TSC value. The RDTSC instruction can be used to read this counter. On older processors, the TSC rate varied with CPU frequency scaling for power saving.
  • ACPI Power Management Timer: ACPI 24-bit timer with a frequency of 3.579545 MHz (3,579,545 Hz).
  • Cyclone: The Cyclone timer uses a 32-bit counter on IBM Extended X-Architecture (EXA) chipsets which include computers that use the IBM "Summit" series chipsets (ex: x440). This is available in IA32 and IA64 architectures.
  • PIT (programmable interrupt timer): Intel 8253/8254 chipsets with a configurable frequency in range 18.2 Hz - 1.2 MHz. It uses a 16-bit counter.
  • RTC (Real-time clock). Most RTCs use a crystal oscillator with a frequency of 32,768 Hz.

Linux clocksource

There were 4 implementations of the time in the Linux kernel: UTIME (1996), timer wheel (1997), HRT (2001) and hrtimers (2007). The latter is the result of the "high-res-timers" project started by George Anzinger in 2001, with contributions by Thomas Gleixner and Douglas Niehaus. The hrtimers implementation was merged into Linux 2.6.21, released in 2007.

hrtimers supports various clock sources. It sets a priority to each source to decide which one will be used. Linux supports the following clock sources:

  • tsc
  • hpet
  • pit
  • pmtmr: ACPI Power Management Timer
  • cyclone

High-resolution timers are not supported on all hardware architectures. They are at least provided on x86/x86_64, ARM and PowerPC.

clock_getres() returns 1 nanosecond for CLOCK_REALTIME and CLOCK_MONOTONIC regardless of underlying clock source. Read Re: clock_getres() and real resolution from Thomas Gleixner (9 Feb 2012) for an explanation.

The /sys/devices/system/clocksource/clocksource0 directory contains two useful files:

  • available_clocksource: list of available clock sources
  • current_clocksource: clock source currently used. It is possible to change the current clocksource by writing the name of a clocksource into this file.

/proc/timer_list contains the list of all hardware timers.

Read also the time(7) manual page: "overview of time and timers".

FreeBSD timecounter

kern.timecounter.choice lists available hardware clocks with their priority. The sysctl program can be used to change the timecounter. Example:

# dmesg | grep Timecounter
Timecounter "i8254" frequency 1193182 Hz quality 0
Timecounter "ACPI-safe" frequency 3579545 Hz quality 850
Timecounter "HPET" frequency 100000000 Hz quality 900
Timecounter "TSC" frequency 3411154800 Hz quality 800
Timecounters tick every 10.000 msec
# sysctl kern.timecounter.choice
kern.timecounter.choice: TSC(800) HPET(900) ACPI-safe(850) i8254(0) dummy(-1000000)
# sysctl kern.timecounter.hardware="ACPI-fast"
kern.timecounter.hardware: HPET -> ACPI-fast

Available clocks:

  • "TSC": Time Stamp Counter of the processor
  • "HPET": High Precision Event Timer
  • "ACPI-fast": ACPI Power Management timer (fast mode)
  • "ACPI-safe": ACPI Power Management timer (safe mode)
  • "i8254": PIT with Intel 8254 chipset

The commit 222222 (May 2011) decreased ACPI-fast timecounter quality to 900 and increased HPET timecounter quality to 950: "HPET on modern platforms usually have better resolution and lower latency than ACPI timer".

Read Timecounters: Efficient and precise timekeeping in SMP kernels by Poul-Henning Kamp (2002) for the FreeBSD Project.

Performance

Reading a hardware clock has a cost. The following table compares the performance of different hardware clocks on Linux 3.3 with Intel Core i7-2600 at 3.40GHz (8 cores). The bench_time.c program was used to fill these tables.

Function                 | TSC    | ACPI PM | HPET
time()                   | 2 ns   | 2 ns    | 2 ns
CLOCK_REALTIME_COARSE    | 10 ns  | 10 ns   | 10 ns
CLOCK_MONOTONIC_COARSE   | 12 ns  | 13 ns   | 12 ns
CLOCK_THREAD_CPUTIME_ID  | 134 ns | 135 ns  | 135 ns
CLOCK_PROCESS_CPUTIME_ID | 127 ns | 129 ns  | 129 ns
clock()                  | 146 ns | 146 ns  | 143 ns
gettimeofday()           | 23 ns  | 726 ns  | 637 ns
CLOCK_MONOTONIC_RAW      | 31 ns  | 716 ns  | 607 ns
CLOCK_REALTIME           | 27 ns  | 707 ns  | 629 ns
CLOCK_MONOTONIC          | 27 ns  | 723 ns  | 635 ns

FreeBSD 8.0 in kvm with hardware virtualization:

Function                 | TSC    | ACPI-Safe | HPET    | i8254
time()                   | 191 ns | 188 ns    | 189 ns  | 188 ns
CLOCK_SECOND             | 187 ns | 184 ns    | 187 ns  | 183 ns
CLOCK_REALTIME_FAST      | 189 ns | 180 ns    | 187 ns  | 190 ns
CLOCK_UPTIME_FAST        | 191 ns | 185 ns    | 186 ns  | 196 ns
CLOCK_MONOTONIC_FAST     | 188 ns | 187 ns    | 188 ns  | 189 ns
CLOCK_THREAD_CPUTIME_ID  | 208 ns | 206 ns    | 207 ns  | 220 ns
CLOCK_VIRTUAL            | 280 ns | 279 ns    | 283 ns  | 296 ns
CLOCK_PROF               | 289 ns | 280 ns    | 282 ns  | 286 ns
clock()                  | 342 ns | 340 ns    | 337 ns  | 344 ns
CLOCK_UPTIME_PRECISE     | 197 ns | 10380 ns  | 4402 ns | 4097 ns
CLOCK_REALTIME           | 196 ns | 10376 ns  | 4337 ns | 4054 ns
CLOCK_MONOTONIC_PRECISE  | 198 ns | 10493 ns  | 4413 ns | 3958 ns
CLOCK_UPTIME             | 197 ns | 10523 ns  | 4458 ns | 4058 ns
gettimeofday()           | 202 ns | 10524 ns  | 4186 ns | 3962 ns
CLOCK_REALTIME_PRECISE   | 197 ns | 10599 ns  | 4394 ns | 4060 ns
CLOCK_MONOTONIC          | 201 ns | 10766 ns  | 4498 ns | 3943 ns

Each function was called 100,000 times and CLOCK_MONOTONIC was used to get the time before and after. The benchmark was run 5 times, keeping the minimum time.

NTP adjustment

NTP has different methods to adjust a clock:

  • "slewing": change the clock frequency to be slightly faster or slower (which is done with adjtime()). Since the slew rate is limited to 0.5 millisecond per second, each second of adjustment requires an amortization interval of 2000 seconds. Thus, an adjustment of many seconds can take hours or days to amortize.
  • "stepping": jump by a large amount in a single discrete step (which is done with settimeofday())

By default, the time is slewed if the offset is less than 128 ms, but stepped otherwise.
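The amortization interval follows directly from the slew-rate limit: at 0.5 ms of correction per second, each second of offset needs 1 / 0.0005 = 2000 seconds of slewing. For example:

```python
offset = 5                    # seconds of clock error to correct
amortization = offset * 2000  # 2000 s of slewing per second of offset
print(amortization)           # 10000 seconds, i.e. nearly 3 hours
```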

Slewing is generally desirable (i.e. we should use CLOCK_MONOTONIC, not CLOCK_MONOTONIC_RAW) if one wishes to measure "real" time (and not a time-like object like CPU cycles). This is because the clock on the other end of the NTP connection from you is probably better at keeping time: hopefully that thirty-five thousand dollars of Cesium timekeeping goodness is doing something better than your PC's $3 quartz crystal, after all.

Get more detail in the documentation of the NTP daemon.

Operating system time functions

Monotonic Clocks

Name                      | C Resolution | Adjusted        | Include Sleep | Include Suspend
gethrtime()               | 1 ns         | No              | Yes           | Yes
CLOCK_HIGHRES             | 1 ns         | No              | Yes           | Yes
CLOCK_MONOTONIC           | 1 ns         | Slewed on Linux | Yes           | No
CLOCK_MONOTONIC_COARSE    | 1 ns         | Slewed on Linux | Yes           | No
CLOCK_MONOTONIC_RAW       | 1 ns         | No              | Yes           | No
CLOCK_BOOTTIME            | 1 ns         | ?               | Yes           | Yes
CLOCK_UPTIME              | 1 ns         | No              | Yes           | ?
mach_absolute_time()      | 1 ns         | No              | Yes           | No
QueryPerformanceCounter() | -            | No              | Yes           | ?
GetTickCount[64]()        | 1 ms         | No              | Yes           | Yes
timeGetTime()             | 1 ms         | No              | Yes           | ?

The "C Resolution" column is the resolution of the underlying C structure.

Examples of clock resolution on x86_64:

Name                    | Operating system | OS Resolution | Python Resolution
QueryPerformanceCounter | Windows Seven    | 10 ns         | 10 ns
CLOCK_HIGHRES           | SunOS 5.11       | 2 ns          | 265 ns
CLOCK_MONOTONIC         | Linux 3.0        | 1 ns          | 322 ns
CLOCK_MONOTONIC_RAW     | Linux 3.3        | 1 ns          | 628 ns
CLOCK_BOOTTIME          | Linux 3.3        | 1 ns          | 628 ns
mach_absolute_time()    | Mac OS 10.6      | 1 ns          | 3 µs
CLOCK_MONOTONIC         | FreeBSD 8.2      | 11 ns         | 5 µs
CLOCK_MONOTONIC         | OpenBSD 5.0      | 10 ms         | 5 µs
CLOCK_UPTIME            | FreeBSD 8.2      | 11 ns         | 6 µs
CLOCK_MONOTONIC_COARSE  | Linux 3.3        | 1 ms          | 1 ms
CLOCK_MONOTONIC_COARSE  | Linux 3.0        | 4 ms          | 4 ms
GetTickCount64()        | Windows Seven    | 16 ms         | 15 ms

The "OS Resolution" is the resolution announced by the operating system. The "Python Resolution" is the smallest difference between two calls to the time function computed in Python using the clock_resolution.py program.
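The "Python Resolution" measurement can be reproduced with a loop of this shape (a sketch of the approach, not the exact clock_resolution.py program):

```python
import time

def python_resolution(clock, runs=10**5):
    """Smallest non-zero difference seen between two consecutive clock reads."""
    best = None
    for _ in range(runs):
        t1 = clock()
        t2 = clock()
        diff = t2 - t1
        if diff > 0 and (best is None or diff < best):
            best = diff
    return best

print(python_resolution(time.monotonic))
```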

mach_absolute_time

Mac OS X provides a monotonic clock: mach_absolute_time(). It is based on absolute elapsed time since system boot. It is not adjusted and cannot be set.

mach_timebase_info() gives a fraction to convert the clock value to a number of nanoseconds. See also the Technical Q&A QA1398.

mach_absolute_time() stops during a sleep on a PowerPC CPU, but not on an Intel CPU: Different behaviour of mach_absolute_time() on i386/ppc.

CLOCK_MONOTONIC, CLOCK_MONOTONIC_RAW, CLOCK_BOOTTIME

CLOCK_MONOTONIC and CLOCK_MONOTONIC_RAW represent monotonic time since some unspecified starting point. They cannot be set. The resolution can be read using clock_getres().

Documentation: refer to the manual page of your operating system, e.g. clock_gettime(2).

CLOCK_MONOTONIC is available at least on the following operating systems:

  • DragonFly BSD, FreeBSD >= 5.0, OpenBSD, NetBSD
  • Linux
  • Solaris

The following operating systems don't support CLOCK_MONOTONIC: GNU/Hurd, Mac OS X and Windows.

On Linux, NTP may adjust the CLOCK_MONOTONIC rate (slewed), but it cannot jump backward.

CLOCK_MONOTONIC_RAW is specific to Linux. It is similar to CLOCK_MONOTONIC, but provides access to a raw hardware-based time that is not subject to NTP adjustments. CLOCK_MONOTONIC_RAW requires Linux 2.6.28 or later.

Linux 2.6.39 and glibc 2.14 introduce a new clock: CLOCK_BOOTTIME. CLOCK_BOOTTIME is identical to CLOCK_MONOTONIC, except that it also includes any time spent in suspend. Read also Waking systems from suspend (March 2011).

CLOCK_MONOTONIC stops while the machine is suspended.

Linux also provides CLOCK_MONOTONIC_COARSE since Linux 2.6.32. It is similar to CLOCK_MONOTONIC, but less precise and faster.

clock_gettime() fails if the system does not support the specified clock, even if the standard C library supports it. For example, CLOCK_MONOTONIC_RAW requires a kernel version 2.6.28 or later.

Windows: QueryPerformanceCounter

High-resolution performance counter. It is monotonic. The frequency of the counter can be read using QueryPerformanceFrequency(). The resolution is 1 / QueryPerformanceFrequency().

It has a much higher resolution than the GetTickCount() and timeGetTime() clocks, but lower long-term precision: for example, it drifts compared to those low-precision clocks.

Documentation:

Hardware clocks used by QueryPerformanceCounter:

  • Windows XP: RDTSC instruction of Intel processors, the clock frequency is the frequency of the processor (between 200 MHz and 3 GHz, usually greater than 1 GHz nowadays).
  • Windows 2000: ACPI power management timer, frequency = 3,549,545 Hz. It can be forced through the "/usepmtimer" flag in boot.ini.

QueryPerformanceFrequency() should only be called once: the frequency will not change while the system is running. It fails if the installed hardware does not support a high-resolution performance counter.

QueryPerformanceCounter() cannot be adjusted: SetSystemTimeAdjustment() only adjusts the system time.

Bugs:

  • The performance counter value may unexpectedly leap forward because of a hardware bug, see KB274323 [4].
  • On VirtualBox, QueryPerformanceCounter() does not increment the high part every time the low part overflows, see Monotonic timers (2009).
  • VirtualBox had a bug in its HPET virtualized device: QueryPerformanceCounter() did jump forward by approx. 42 seconds (issue #8707).
  • Windows XP had a bug (see KB896256 [3]): on a multiprocessor computer, QueryPerformanceCounter() returned a different value for each processor. The bug was fixed in Windows XP SP2.
  • Issues with processors with variable frequency: the frequency is changed depending on the workload to reduce power consumption.
  • Chromium doesn't use QueryPerformanceCounter() on Athlon X2 CPUs (model 15) because "QueryPerformanceCounter is unreliable" (see base/time_win.cc in the Chromium source code).

Windows: GetTickCount(), GetTickCount64()

GetTickCount() and GetTickCount64() are monotonic, cannot fail and are not adjusted by SetSystemTimeAdjustment(). MSDN documentation: GetTickCount(), GetTickCount64(). The resolution can be read using GetSystemTimeAdjustment().

The elapsed time retrieved by GetTickCount() or GetTickCount64() includes time the system spends in sleep or hibernation.

GetTickCount64() was added to Windows Vista and Windows Server 2008.

It is possible to improve the precision using the undocumented NtSetTimerResolution() function. Some applications use this undocumented function; one example is Timer Resolution.

WaitForSingleObject() uses the same timer as GetTickCount() with the same precision.

Windows: timeGetTime

The timeGetTime function retrieves the system time, in milliseconds. The system time is the time elapsed since Windows was started. Read the timeGetTime() documentation.

The return type of timeGetTime() is a 32-bit unsigned integer. Like GetTickCount(), timeGetTime() rolls over after 2^32 milliseconds (49.7 days).
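
The 49.7-day figure follows directly from the 32-bit range:

```python
# 2**32 milliseconds expressed in days
ms = 2 ** 32
days = ms / 1000.0 / 60 / 60 / 24
assert 49.7 < days < 49.72  # ~49.71 days until rollover
```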

The elapsed time retrieved by timeGetTime() includes time the system spends in sleep.

The default precision of the timeGetTime function can be five milliseconds or more, depending on the machine.

timeBeginPeriod() can be used to increase the precision of timeGetTime() up to 1 millisecond, but it negatively affects power consumption. Calling timeBeginPeriod() also affects the granularity of some other timing calls, such as CreateWaitableTimer(), WaitForSingleObject() and Sleep().

Note

timeGetTime() and timeBeginPeriod() are part of the Windows multimedia library, so a program using them must link against winmm or load the library dynamically.

Solaris: CLOCK_HIGHRES

The Solaris OS has a CLOCK_HIGHRES timer: a nonadjustable, high-resolution clock. For timers created with a clockid_t value of CLOCK_HIGHRES, the system attempts to use an optimal hardware source, which may give close to nanosecond resolution.

The resolution of CLOCK_HIGHRES can be read using clock_getres().

Solaris: gethrtime

The gethrtime() function returns the current high-resolution real time. Time is expressed as nanoseconds since some arbitrary time in the past; it is not correlated in any way to the time of day, and thus is not subject to resetting or drifting by way of adjtime() or settimeofday(). The hires timer is ideally suited to performance measurement tasks, where cheap, accurate interval timing is required.

The linearity of gethrtime() is not preserved across a suspend-resume cycle (Bug 4272663).

Read the gethrtime() manual page of Solaris 11.

On Solaris, gethrtime() is the same as clock_gettime(CLOCK_MONOTONIC).

System Time

Name                     C Resolution  Include Sleep  Include Suspend
CLOCK_REALTIME           1 ns          Yes            Yes
CLOCK_REALTIME_COARSE    1 ns          Yes            Yes
GetSystemTimeAsFileTime  100 ns        Yes            Yes
gettimeofday()           1 µs          Yes            Yes
ftime()                  1 ms          Yes            Yes
time()                   1 sec         Yes            Yes

The "C Resolution" column is the resolution of the underlying C structure.

Examples of clock resolution on x86_64:

Name                       Operating system  OS Resolution  Python Resolution
CLOCK_REALTIME             SunOS 5.11        10 ms          238 ns
CLOCK_REALTIME             Linux 3.0         1 ns           238 ns
gettimeofday()             Mac OS 10.6       1 µs           4 µs
CLOCK_REALTIME             FreeBSD 8.2       11 ns          6 µs
CLOCK_REALTIME             OpenBSD 5.0       10 ms          5 µs
CLOCK_REALTIME_COARSE      Linux 3.3         1 ms           1 ms
CLOCK_REALTIME_COARSE      Linux 3.0         4 ms           4 ms
GetSystemTimeAsFileTime()  Windows 7         16 ms          1 ms
ftime()                    Windows 7         -              1 ms

The "OS Resolution" is the resolution announced by the operating system. The "Python Resolution" is the smallest difference between two calls to the time function computed in Python using the clock_resolution.py program.
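
The measurement technique can be sketched as follows. This is a simplified re-implementation of the idea, not the actual clock_resolution.py:

```python
import time

def measured_resolution(clock, samples=1000):
    """Smallest nonzero difference observed between two successive calls."""
    best = None
    for _ in range(samples):
        t1 = clock()
        t2 = clock()
        while t2 == t1:      # spin until the clock actually ticks
            t2 = clock()
        diff = t2 - t1
        if best is None or diff < best:
            best = diff
    return best
```

For example, measured_resolution(time.time) approximates the "Python Resolution" column for the system clock.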

Windows: GetSystemTimeAsFileTime

The system time can be read using GetSystemTimeAsFileTime(), ftime() and time(). The resolution of the system time can be read using GetSystemTimeAdjustment().

Read the GetSystemTimeAsFileTime() documentation.

The system time can be set using SetSystemTime().

System time on UNIX

gettimeofday(), ftime(), time() and clock_gettime(CLOCK_REALTIME) return the system time. The resolution of CLOCK_REALTIME can be read using clock_getres().
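
On Unix with Python 3.3+, these functions all read the same underlying system clock, which can be verified from Python (a sketch):

```python
import time

wall = time.time()
if hasattr(time, "clock_gettime"):
    rt = time.clock_gettime(time.CLOCK_REALTIME)
    # time() and clock_gettime(CLOCK_REALTIME) read the same system clock
    assert abs(rt - wall) < 1.0
```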

The system time can be set using settimeofday() or clock_settime(CLOCK_REALTIME).

Linux also provides CLOCK_REALTIME_COARSE since Linux 2.6.32. It is similar to CLOCK_REALTIME but less precise and faster.

Alexander Shishkin proposed an API for Linux to be notified when the system clock is changed: timerfd: add TFD_NOTIFY_CLOCK_SET to watch for clock changes (4th version of the API, March 2011). The API is not accepted yet, but CLOCK_BOOTTIME provides a similar feature.

Process Time

The process time cannot be set. It is not monotonic: the clocks stop while the process is idle.

Name                      C Resolution  Include Sleep                 Include Suspend
GetProcessTimes()         100 ns        No                            No
CLOCK_PROCESS_CPUTIME_ID  1 ns          No                            No
getrusage(RUSAGE_SELF)    1 µs          No                            No
times()                   -             No                            No
clock()                   -             Yes on Windows, No otherwise  No

The "C Resolution" column is the resolution of the underlying C structure.

Examples of clock resolution on x86_64:

Name                      Operating system  OS Resolution  Python Resolution
CLOCK_PROCESS_CPUTIME_ID  Linux 3.3         1 ns           1 ns
CLOCK_PROF                FreeBSD 8.2       10 ms          1 µs
getrusage(RUSAGE_SELF)    FreeBSD 8.2       -              1 µs
getrusage(RUSAGE_SELF)    SunOS 5.11        -              1 µs
CLOCK_PROCESS_CPUTIME_ID  Linux 3.0         1 ns           1 µs
getrusage(RUSAGE_SELF)    Mac OS 10.6       -              5 µs
clock()                   Mac OS 10.6       1 µs           5 µs
CLOCK_PROF                OpenBSD 5.0       -              5 µs
getrusage(RUSAGE_SELF)    Linux 3.0         -              4 ms
getrusage(RUSAGE_SELF)    OpenBSD 5.0       -              8 ms
clock()                   FreeBSD 8.2       8 ms           8 ms
clock()                   Linux 3.0         1 µs           10 ms
times()                   Linux 3.0         10 ms          10 ms
clock()                   OpenBSD 5.0       10 ms          10 ms
times()                   OpenBSD 5.0       10 ms          10 ms
times()                   Mac OS 10.6       10 ms          10 ms
clock()                   SunOS 5.11        1 µs           10 ms
times()                   SunOS 5.11        1 µs           10 ms
GetProcessTimes()         Windows 7         16 ms          16 ms
clock()                   Windows 7         1 ms           1 ms

The "OS Resolution" is the resolution announced by the operating system. The "Python Resolution" is the smallest difference between two calls to the time function computed in Python using the clock_resolution.py program.

Functions

  • Windows: GetProcessTimes(). The resolution can be read using GetSystemTimeAdjustment().
  • clock_gettime(CLOCK_PROCESS_CPUTIME_ID): High-resolution per-process timer from the CPU. The resolution can be read using clock_getres().
  • clock(). The resolution is 1 / CLOCKS_PER_SEC.
    • Windows: The elapsed wall-clock time since the start of the process (elapsed time in seconds times CLOCKS_PER_SEC). It includes time elapsed during sleep. It can fail.
    • UNIX: returns an approximation of processor time used by the program.
  • getrusage(RUSAGE_SELF) returns a structure describing the resource usage of the current process. ru_utime is the user CPU time and ru_stime is the system CPU time.
  • times(): structure of process times. The resolution is 1 / ticks_per_second, where ticks_per_second is sysconf(_SC_CLK_TCK) or the HZ constant.

Python source code includes a portable library to get the process time (CPU time): Tools/pybench/systimes.py.
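
The claim that process clocks stop while the process is idle can be demonstrated with os.times(), which exposes the user and system CPU time:

```python
import os
import time

def process_cpu():
    # user + system CPU time consumed by the current process
    t = os.times()
    return t.user + t.system

before = process_cpu()
time.sleep(0.1)              # wall-clock time passes...
after = process_cpu()
assert after - before < 0.1  # ...but almost no CPU time is consumed
```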

See also the QueryProcessCycleTime() function (sum of the cycle time of all threads) and clock_getcpuclockid().

Thread Time

The thread time cannot be set. It is not monotonic: the clocks stop while the thread is idle.

Name                     C Resolution  Include Sleep  Include Suspend
CLOCK_THREAD_CPUTIME_ID  1 ns          Yes            Epoch changes
GetThreadTimes()         100 ns        No             ?

The "C Resolution" column is the resolution of the underlying C structure.

Examples of clock resolution on x86_64:

Name                     Operating system  OS Resolution  Python Resolution
CLOCK_THREAD_CPUTIME_ID  FreeBSD 8.2       1 µs           1 µs
CLOCK_THREAD_CPUTIME_ID  Linux 3.3         1 ns           649 ns
GetThreadTimes()         Windows 7         16 ms          16 ms

The "OS Resolution" is the resolution announced by the operating system. The "Python Resolution" is the smallest difference between two calls to the time function computed in Python using the clock_resolution.py program.

Functions

  • Windows: GetThreadTimes(). The resolution can be read using GetSystemTimeAdjustment().
  • clock_gettime(CLOCK_THREAD_CPUTIME_ID): Thread-specific CPU-time clock. It counts CPU cycles rather than seconds. The resolution can be read using clock_getres().

See also the QueryThreadCycleTime() function (cycle time for the specified thread) and pthread_getcpuclockid().
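
Where Python 3.3+ exposes CLOCK_THREAD_CPUTIME_ID (e.g. on Linux), the thread clock can be read directly (a sketch, guarded because the constant is platform-dependent):

```python
import time

if hasattr(time, "CLOCK_THREAD_CPUTIME_ID"):
    t1 = time.clock_gettime(time.CLOCK_THREAD_CPUTIME_ID)
    sum(range(100000))  # burn a little CPU in this thread
    t2 = time.clock_gettime(time.CLOCK_THREAD_CPUTIME_ID)
    assert t2 >= t1     # thread CPU time advanced (or at least didn't regress)
```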

Windows: QueryUnbiasedInterruptTime

Gets the current unbiased interrupt time from the biased interrupt time and the current sleep bias amount. This time is not affected by power management sleep transitions.

The elapsed time retrieved by the QueryUnbiasedInterruptTime function includes only time that the system spends in the working state. QueryUnbiasedInterruptTime() is not monotonic.

QueryUnbiasedInterruptTime() was introduced in Windows 7.

See also QueryIdleProcessorCycleTime() function (cycle time for the idle thread of each processor)

Sleep

Suspend execution of the process for the given number of seconds. Sleep is not affected by system time updates, and sleep is paused during system suspend. For example, if a process sleeps for 60 seconds and the system is suspended for 30 seconds in the middle of the sleep, the sleep lasts 90 seconds of real time.

Sleep can be interrupted by a signal: the function fails with EINTR.

Name               C Resolution
nanosleep()        1 ns
clock_nanosleep()  1 ns
usleep()           1 µs
delay()            1 µs
sleep()            1 sec

Other functions:

Name                      C Resolution
sigtimedwait()            1 ns
pthread_cond_timedwait()  1 ns
sem_timedwait()           1 ns
select()                  1 µs
epoll()                   1 ms
poll()                    1 ms
WaitForSingleObject()     1 ms

The "C Resolution" column is the resolution of the underlying C structure.

Functions

clock_nanosleep

clock_nanosleep(clock_id, flags, nanoseconds, remaining): Linux manpage of clock_nanosleep().

If flags is TIMER_ABSTIME, then request is interpreted as an absolute time as measured by the clock, clock_id. If request is less than or equal to the current value of the clock, then clock_nanosleep() returns immediately without suspending the calling thread.

POSIX.1 specifies that changing the value of the CLOCK_REALTIME clock via clock_settime(2) shall have no effect on a thread that is blocked on a relative clock_nanosleep().

select()

select(nfds, readfds, writefds, exceptfds, timeout).

Since Linux 2.6.28, select() uses high-resolution timers to handle the timeout. A process has a "slack" attribute to configure the precision of the timeout, the default slack is 50 microseconds. Before Linux 2.6.28, timeouts for select() were handled by the main timing subsystem at a jiffy-level resolution. Read also High- (but not too high-) resolution timeouts and Timer slack.
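
A select() timeout can be exercised from Python, here with a socket pair that never becomes readable (socket.socketpair() is Unix-only before Python 3.5):

```python
import select
import socket
import time

r, w = socket.socketpair()
start = time.time()
ready = select.select([r], [], [], 0.1)  # 100 ms timeout
elapsed = time.time() - start
assert ready == ([], [], [])             # nothing was ready
assert elapsed >= 0.09                   # the full timeout elapsed
r.close()
w.close()
```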

Other functions

  • poll(), epoll()
  • sigtimedwait(). POSIX: "If the Monotonic Clock option is supported, the CLOCK_MONOTONIC clock shall be used to measure the time interval specified by the timeout argument."
  • pthread_cond_timedwait(), pthread_condattr_setclock(). "The default value of the clock attribute shall refer to the system time."
  • sem_timedwait(): "If the Timers option is supported, the timeout shall be based on the CLOCK_REALTIME clock. If the Timers option is not supported, the timeout shall be based on the system time as returned by the time() function. The precision of the timeout shall be the precision of the clock on which it is based."
  • WaitForSingleObject(): uses the same timer as GetTickCount(), with the same precision.

System Standby

The ACPI power state "S3" is a system standby mode, also called "Suspend to RAM". RAM remains powered.

On Windows, the WM_POWERBROADCAST message is sent to Windows applications to notify them of power-management events (for example, a change in power status).

For Mac OS X, read Registering and unregistering for sleep and wake notifications (Technical Q&A QA1340).

Footnotes

[2] "_time" is a hypothetical module used only for the example. The time module is implemented in C, so there is no need for such a module.

Acceptance

The PEP was accepted on 2012-04-28 by Guido van Rossum [1]. The PEP implementation has since been committed to the repository.

pep-0419 Protecting cleanup statements from interruptions

PEP:419
Title:Protecting cleanup statements from interruptions
Version:$Revision$
Last-Modified:$Date$
Author:Paul Colomiets <paul at colomiets.name>
Status:Deferred
Type:Standards Track
Content-Type:text/x-rst
Created:06-Apr-2012
Python-Version:3.3

Abstract

This PEP proposes a way to protect Python code from being interrupted inside a finally clause or during context manager cleanup.

PEP Deferral

Further exploration of the concepts covered in this PEP has been deferred for lack of a current champion interested in promoting the goals of the PEP and collecting and incorporating feedback, and with sufficient available time to do so effectively.

Rationale

Python has two nice ways to do cleanup. One is a finally statement and the other is a context manager (usually called using a with statement). However, neither is protected from interruption by KeyboardInterrupt or GeneratorExit caused by generator.throw(). For example:

lock.acquire()
try:
    print('starting')
    do_something()
finally:
    print('finished')
    lock.release()

If KeyboardInterrupt occurs just after the second print() call, the lock will not be released. Similarly, the following code using the with statement is affected:

from threading import Lock

class MyLock:

    def __init__(self):
        self._lock_impl = Lock()

    def __enter__(self):
        self._lock_impl.acquire()
        print("LOCKED")

    def __exit__(self, exc_type, exc_value, traceback):
        print("UNLOCKING")
        self._lock_impl.release()

lock = MyLock()
with lock:
    do_something()

If KeyboardInterrupt occurs near any of the print() calls, the lock will never be released.

Coroutine Use Case

A similar case occurs with coroutines. Usually coroutine libraries want to interrupt the coroutine with a timeout. The generator.throw() method works for this use case, but there is no way of knowing if the coroutine is currently suspended from inside a finally clause.

An example that uses yield-based coroutines follows. The code looks similar using any of the popular coroutine libraries Monocle [1], Bluelet [2], or Twisted [3].

def run_locked():
    yield connection.sendall('LOCK')
    try:
        yield do_something()
        yield do_something_else()
    finally:
        yield connection.sendall('UNLOCK')

with timeout(5):
    yield run_locked()

In the example above, yield something means to pause executing the current coroutine and to execute coroutine something until it finishes execution. Therefore the coroutine library itself needs to maintain a stack of generators. The connection.sendall() call waits until the socket is writable and does a similar thing to what socket.sendall() does.

The with statement ensures that all code is executed within 5 seconds timeout. It does so by registering a callback in the main loop, which calls generator.throw() on the top-most frame in the coroutine stack when a timeout happens.

The greenlets extension works in a similar way, except that it doesn't need yield to enter a new stack frame. Otherwise, the considerations are similar.

Specification

Frame Flag 'f_in_cleanup'

A new flag on the frame object is proposed. It is set to True if this frame is currently executing a finally clause. Internally, the flag must be implemented as a counter of nested finally statements currently being executed.

The internal counter also needs to be incremented during execution of the SETUP_WITH and WITH_CLEANUP bytecodes, and decremented when execution of these bytecodes finishes. This also allows protecting the __enter__() and __exit__() methods.

Function 'sys.setcleanuphook'

A new function for the sys module is proposed. This function sets a callback which is executed every time f_in_cleanup becomes false. Callbacks get a frame object as their sole argument, so that they can figure out where they are called from.

The setting is thread local and must be stored in the PyThreadState structure.

Inspect Module Enhancements

Two new functions are proposed for the inspect module: isframeincleanup() and getcleanupframe().

isframeincleanup(), given a frame or generator object as its sole argument, returns the value of the f_in_cleanup attribute of the frame itself, or of the frame referenced by a generator's gi_frame attribute.

getcleanupframe(), given a frame object as its sole argument, returns the innermost frame which has a true value of f_in_cleanup, or None if no frames in the stack have a nonzero value for that attribute. It starts to inspect from the specified frame and walks to outer frames using f_back pointers, just like getouterframes() does.
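
The walk getcleanupframe() would perform can be sketched in pure Python. Note that f_in_cleanup was never added to CPython, so the attribute lookup below is purely illustrative:

```python
import inspect

def getcleanupframe(frame):
    # Walk outward via f_back, as inspect.getouterframes() does,
    # returning the innermost frame with a true f_in_cleanup.
    while frame is not None:
        if getattr(frame, "f_in_cleanup", 0):
            return frame
        frame = frame.f_back
    return None
```

On a current interpreter this always returns None, since no frame carries the proposed attribute.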

Example

An example implementation of a SIGINT handler that interrupts safely might look like:

import inspect, sys, functools

def sigint_handler(sig, frame):
    if inspect.getcleanupframe(frame) is None:
        raise KeyboardInterrupt()
    sys.setcleanuphook(functools.partial(sigint_handler, 0))

A coroutine example is out of the scope of this document, because its implementation depends very much on the trampoline (or main loop) used by the coroutine library.

Unresolved Issues

Interruption Inside With Statement Expression

Given the statement

with open(filename):
    do_something()

Python can be interrupted after open() is called, but before the SETUP_WITH bytecode is executed. There are two possible decisions:

  • Protect with expressions. This would require another bytecode, since currently there is no way of recognizing the start of the with expression.

  • Let the user write a wrapper if they consider it important for their use case. A safe wrapper might look like this:

    class FileWrapper(object):
    
        def __init__(self, filename, mode):
            self.filename = filename
            self.mode = mode
    
        def __enter__(self):
            self.file = open(self.filename, self.mode)
    
        def __exit__(self, exc_type, exc_value, traceback):
            self.file.close()
    

    Alternatively it can be written using the contextmanager() decorator:

    from contextlib import contextmanager

    @contextmanager
    def open_wrapper(filename, mode):
        file = open(filename, mode)
        try:
            yield file
        finally:
            file.close()
    

    This code is safe, as the first part of the generator (before yield) is executed inside the SETUP_WITH bytecode of the caller.

Exception Propagation

Sometimes a finally clause or an __enter__()/__exit__() method can raise an exception. Usually this is not a problem, since more important exceptions like KeyboardInterrupt or SystemExit should be raised instead. But it may be nice to be able to keep the original exception inside a __context__ attribute. So the cleanup hook signature may grow an exception argument:

def sigint_handler(sig, frame):
    if inspect.getcleanupframe(frame) is None:
        raise KeyboardInterrupt()
    sys.setcleanuphook(retry_sigint)

def retry_sigint(frame, exception=None):
    if inspect.getcleanupframe(frame) is None:
        raise KeyboardInterrupt() from exception

Note

There is no need to have three arguments as in the __exit__ method, since in Python 3 the exception carries a __traceback__ attribute.

However, this will set the __cause__ for the exception, which is not exactly what's intended. So some hidden interpreter logic may be used to put a __context__ attribute on every exception raised in a cleanup hook.

Interruption Between Acquiring Resource and Try Block

The example from the first section is not totally safe. Let's take a closer look:

lock.acquire()
try:
    do_something()
finally:
    lock.release()

The problem might occur if the code is interrupted just after lock.acquire() is executed but before the try block is entered.

The code cannot be fixed as written; the actual fix depends very much on the use case. Usually the code can be fixed using a with statement:

with lock:
    do_something()

However, for coroutines one usually can't use the with statement because you need to yield for both the acquire and release operations. So the code might be rewritten like this:

try:
    yield lock.acquire()
    do_something()
finally:
    yield lock.release()

The actual locking code might need more code to support this use case, but the implementation is usually trivial: check whether the lock has been acquired, and unlock only if it has.
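
The "unlock only if acquired" pattern described above can be sketched with a hypothetical lock class (not any particular library's API):

```python
class SafeLock:
    """Hypothetical lock whose release() is safe even if acquire() never ran."""

    def __init__(self):
        self.locked = False

    def acquire(self):
        self.locked = True

    def release(self):
        if self.locked:      # only unlock if actually acquired
            self.locked = False

lock = SafeLock()
lock.release()               # interrupted before acquire(): harmless no-op
assert lock.locked is False
lock.acquire()
lock.release()
assert lock.locked is False
```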

Handling EINTR Inside a Finally

Even if a signal handler is prepared to check the f_in_cleanup flag, InterruptedError might be raised in the cleanup handler, because the respective system call returned an EINTR error. The primary use cases are prepared to handle this:

  • Posix mutexes never return EINTR
  • Networking libraries are always prepared to handle EINTR
  • Coroutine libraries are usually interrupted with the throw() method, not with a signal

The platform-specific function siginterrupt() might be used to remove the need to handle EINTR. However, it may have hardly predictable consequences; for example, a SIGINT handler is never called if the main thread is stuck inside an I/O routine.

A better approach would be to have the code, which is usually used in cleanup handlers, be prepared to handle InterruptedError explicitly. An example of such code might be a file-based lock implementation.

signal.pthread_sigmask() can be used to block signals inside cleanup handlers that could otherwise be interrupted with EINTR.
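
A sketch of blocking SIGINT around a cleanup section using signal.pthread_sigmask() (Unix, Python 3.3+, hence the guard):

```python
import signal

if hasattr(signal, "pthread_sigmask"):
    # Block SIGINT, run the cleanup code, then restore the previous mask.
    old_mask = signal.pthread_sigmask(signal.SIG_BLOCK, {signal.SIGINT})
    try:
        # cleanup code that must not be interrupted by SIGINT goes here
        blocked = signal.pthread_sigmask(signal.SIG_BLOCK, set())
        assert signal.SIGINT in blocked
    finally:
        signal.pthread_sigmask(signal.SIG_SETMASK, old_mask)
```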

Setting Interruption Context Inside Finally Itself

Some coroutine libraries may need to set a timeout for the finally clause itself. For example:

try:
    do_something()
finally:
    with timeout(0.5):
        try:
            yield do_slow_cleanup()
        finally:
            yield do_fast_cleanup()

With current semantics, timeout will either protect the whole with block or nothing at all, depending on the implementation of each library. What the author intended is to treat do_slow_cleanup as ordinary code, and do_fast_cleanup as a cleanup (a non-interruptible one).

A similar case might occur when using greenlets or tasklets.

This case can be fixed by exposing f_in_cleanup as a counter, and by calling a cleanup hook on each decrement. A coroutine library may then remember the value at timeout start, and compare it on each hook execution.

But in practice, the example is considered to be too obscure to take into account.

Modifying KeyboardInterrupt

It should be decided whether the default SIGINT handler should be modified to use the described mechanism. The initial proposal is to keep the old behavior, for two reasons:

  • Most applications do not care about cleanup on exit (either they have no external state, or they modify it in a crash-safe way).
  • Cleanup may take too much time, not giving the user a chance to interrupt an application.

The latter case can be fixed by allowing an unsafe break if a SIGINT handler is called twice, but it seems not worth the complexity.

Alternative Python Implementations Support

We consider f_in_cleanup an implementation detail. The actual implementation may pass some fake frame-like object to the signal handler and cleanup hook, and return it from getcleanupframe(). The only requirement is that the inspect module functions work as expected on these objects. For this reason, we also allow passing a generator object to the isframeincleanup() function, which removes the need to use the gi_frame attribute.

It might be necessary to specify that getcleanupframe() must return the same object that will be passed to cleanup hook at the next invocation.

Alternative Names

The original proposal had a f_in_finally frame attribute, as the original intention was to protect finally clauses. But as it grew up to protecting __enter__ and __exit__ methods too, the f_in_cleanup name seems better. Although the __enter__ method is not a cleanup routine, it at least relates to cleanup done by context managers.

The names setcleanuphook, isframeincleanup and getcleanupframe could be made clearer as set_cleanup_hook, is_frame_in_cleanup and get_cleanup_frame, although the current spellings follow the naming conventions of their respective modules.

Alternative Proposals

Propagating 'f_in_cleanup' Flag Automatically

This can make getcleanupframe() unnecessary. But for yield-based coroutines you need to propagate it yourself. Making it writable leads to somewhat unpredictable behavior of setcleanuphook().

Add Bytecodes 'INCR_CLEANUP', 'DECR_CLEANUP'

These bytecodes can be used to protect the expression inside the with statement, as well as making counter increments more explicit and easier to debug (visible inside a disassembly). Some middle ground might be chosen, like END_FINALLY and SETUP_WITH implicitly decrementing the counter (END_FINALLY is present at the end of every with suite).

However, adding new bytecodes must be considered very carefully.

Expose 'f_in_cleanup' as a Counter

The original intention was to expose a minimum of needed functionality. However, as we consider the frame flag f_in_cleanup an implementation detail, we may expose it as a counter.

Similarly, if we have a counter we may need to have the cleanup hook called on every counter decrement. It's unlikely to have much performance impact as nested finally clauses are an uncommon case.

Add code object flag 'CO_CLEANUP'

As an alternative to setting the flag inside the SETUP_WITH and WITH_CLEANUP bytecodes, we can introduce a code object flag, CO_CLEANUP. When the interpreter starts to execute a code object with CO_CLEANUP set, it sets f_in_cleanup for the whole function body. This flag would be set on the code objects of the __enter__ and __exit__ special methods; technically, it might be set on any function named __enter__ or __exit__.

This seems to be a less clear solution. It also covers the case where __enter__ and __exit__ are called manually, which may be accepted either as a feature or as an unnecessary side effect (or, though unlikely, as a bug).

It may also pose a problem when __enter__ or __exit__ is implemented in C, as there is no code object on which to check the CO_CLEANUP flag.

Have Cleanup Callback on Frame Object Itself

The frame object may be extended with an f_cleanup_callback member, which is called when f_in_cleanup is reset to 0. This would make it possible to register different callbacks for different coroutines.

Despite its apparent beauty, this solution doesn't add anything, as the two primary use cases are:

  • Setting the callback in a signal handler. The callback is inherently a single one for this case.
  • Use a single callback per loop for the coroutine use case. Here, in almost all cases, there is only one loop per thread.

No Cleanup Hook

The original proposal included no cleanup hook specification, as there are a few ways to achieve the same using current tools:

  • Using sys.settrace() and the f_trace callback. This may interfere with debugging and has a big performance impact (although interruptions don't happen very often).
  • Sleeping a bit longer and trying again. For a coroutine library this is easy. For signals it may be achieved using signal.alarm().

Both methods are considered too impractical, so a way to catch exits from finally clauses is proposed instead.

pep-0420 Implicit Namespace Packages

PEP:420
Title:Implicit Namespace Packages
Version:$Revision$
Last-Modified:$Date$
Author:Eric V. Smith <eric at trueblade.com>
Status:Final
Type:Standards Track
Content-Type:text/x-rst
Created:19-Apr-2012
Python-Version:3.3
Post-History:
Resolution:http://mail.python.org/pipermail/python-dev/2012-May/119651.html

Abstract

Namespace packages are a mechanism for splitting a single Python package across multiple directories on disk. In current Python versions, an algorithm to compute the package's __path__ must be formulated. With the enhancement proposed here, the import machinery itself will construct the list of directories that make up the package. This PEP builds upon previous work, documented in PEP 382 and PEP 402. Those PEPs have since been rejected in favor of this one. An implementation of this PEP is at [1].

Terminology

Within this PEP:

  • "package" refers to Python packages as defined by Python's import statement.
  • "distribution" refers to separately installable sets of Python modules as stored in the Python package index, and installed by distutils or setuptools.
  • "vendor package" refers to groups of files installed by an operating system's packaging mechanism (e.g. Debian or Red Hat packages installed on Linux systems).
  • "regular package" refers to packages as they are implemented in Python 3.2 and earlier.
  • "portion" refers to a set of files in a single directory (possibly stored in a zip file) that contribute to a namespace package.
  • "legacy portion" refers to a portion that uses __path__ manipulation in order to implement namespace packages.

This PEP defines a new type of package, the "namespace package".

Namespace packages today

Python currently provides pkgutil.extend_path to denote a package as a namespace package. The recommended way of using it is to put:

from pkgutil import extend_path
__path__ = extend_path(__path__, __name__)

in the package's __init__.py. Every distribution needs to provide the same contents in its __init__.py, so that extend_path is invoked independently of which portion of the package gets imported first. As a consequence, the package's __init__.py cannot practically define any names, as which portion is imported first depends on the order of the package fragments on sys.path. As a special feature, extend_path reads files named <packagename>.pkg, which allow declaration of additional portions.

setuptools provides a similar function named pkg_resources.declare_namespace that is used in the form:

import pkg_resources
pkg_resources.declare_namespace(__name__)

In the portion's __init__.py, no assignment to __path__ is necessary, as declare_namespace modifies the package __path__ through sys.modules. As a special feature, declare_namespace also supports zip files, and registers the package name internally so that future additions to sys.path by setuptools can properly add additional portions to each package.

setuptools allows declaring namespace packages in a distribution's setup.py, so that distribution developers don't need to put the magic __path__ modification into __init__.py themselves.

See PEP 402's "The Problem" section [2] for additional motivations for namespace packages. Note that PEP 402 has been rejected, but the motivating use cases are still valid.

Rationale

The current imperative approach to namespace packages has led to multiple slightly-incompatible mechanisms for providing namespace packages. For example, pkgutil supports *.pkg files; setuptools doesn't. Likewise, setuptools supports inspecting zip files, and supports adding portions to its _namespace_packages variable, whereas pkgutil doesn't.

Namespace packages are designed to support being split across multiple directories (and hence found via multiple sys.path entries). In this configuration, it doesn't matter if multiple portions all provide an __init__.py file, so long as each portion correctly initializes the namespace package. However, Linux distribution vendors (amongst others) prefer to combine the separate portions and install them all into the same file system directory. This creates a potential for conflict, as the portions are now attempting to provide the same file on the target system - something that is not allowed by many package managers. Allowing implicit namespace packages means that the requirement to provide an __init__.py file can be dropped completely, and affected portions can be installed into a common directory or split across multiple directories as distributions see fit.

A namespace package will not be constrained by a fixed __path__, computed from the parent path at namespace package creation time. Consider the standard library encodings package:

  1. Suppose that encodings becomes a namespace package.
  2. It sometimes gets imported during interpreter startup to initialize the standard io streams.
  3. An application modifies sys.path after startup and wants to contribute additional encodings from new path entries.
  4. An attempt is made to import an encoding from an encodings portion that is found on a path entry added in step 3.

If the import system were restricted to only finding portions along the value of sys.path that existed at the time the encodings namespace package was created, the additional paths added in step 3 would never be searched for the additional portions imported in step 4. In addition, if step 2 were sometimes skipped (due to some runtime flag or other condition), then the path items added in step 3 would indeed be used the first time a portion was imported. Thus this PEP requires that the list of path entries be dynamically computed when each portion is loaded. It is expected that the import machinery will do this efficiently by caching __path__ values and only refreshing them when it detects that the parent path has changed. In the case of a top-level package like encodings, this parent path would be sys.path.

Specification

Regular packages will continue to have an __init__.py and will reside in a single directory.

Namespace packages cannot contain an __init__.py. As a consequence, pkgutil.extend_path and pkg_resources.declare_namespace become obsolete for purposes of namespace package creation. There will be no marker file or directory for specifying a namespace package.

During import processing, the import machinery will continue to iterate over each directory in the parent path as it does in Python 3.2. While looking for a module or package named "foo", for each directory in the parent path:

  • If <directory>/foo/__init__.py is found, a regular package is imported and returned.
  • If not, but <directory>/foo.{py,pyc,so,pyd} is found, a module is imported and returned. The exact list of extensions varies by platform and whether the -O flag is specified. The list here is representative.
  • If not, but <directory>/foo is found and is a directory, it is recorded and the scan continues with the next directory in the parent path.
  • Otherwise the scan continues with the next directory in the parent path.

If the scan completes without returning a module or package, and at least one directory was recorded, then a namespace package is created. The new namespace package:

  • Has a __path__ attribute set to an iterable of the path strings that were found and recorded during the scan.
  • Does not have a __file__ attribute.

Note that if "import foo" is executed and "foo" is found as a namespace package (using the above rules), then "foo" is immediately created as a package. The creation of the namespace package is not deferred until a sub-level import occurs.

A namespace package is not fundamentally different from a regular package. It is just a different way of creating packages. Once a namespace package is created, there is no functional difference between it and a regular package.
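The scan described above can be sketched in plain Python. This is a simplified, hypothetical helper (the real logic lives in the import machinery and handles loaders, zip files, and caching), but it follows the same rules: a regular package or module wins immediately, while bare directories are recorded and only become a namespace package if the scan finds nothing else:

```python
import os

def find_package(name, parent_path):
    """Sketch of the PEP 420 scan; not the actual importlib code."""
    namespace_portions = []
    for directory in parent_path:
        pkg_dir = os.path.join(directory, name)
        # A regular package is imported and returned immediately.
        if os.path.isfile(os.path.join(pkg_dir, '__init__.py')):
            return ('regular-package', pkg_dir)
        # A plain module also wins; this extension list is representative.
        for ext in ('.py', '.pyc', '.so', '.pyd'):
            candidate = os.path.join(directory, name + ext)
            if os.path.isfile(candidate):
                return ('module', candidate)
        # A bare directory is recorded; the scan continues.
        if os.path.isdir(pkg_dir):
            namespace_portions.append(pkg_dir)
    if namespace_portions:
        # Scan completed with recorded directories: create a namespace
        # package whose __path__ holds all recorded portions.
        return ('namespace-package', namespace_portions)
    return None
```

Note how a single __init__.py anywhere on the parent path short-circuits the scan, which is why namespace packages cannot contain one.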

Dynamic path computation

The import machinery will behave as if a namespace package's __path__ is recomputed before each portion is loaded.

For performance reasons, it is expected that this will be achieved by detecting that the parent path has changed. If no change has taken place, then no __path__ recomputation is required. The implementation must ensure that changes to the contents of the parent path are detected, as well as detecting the replacement of the parent path with a new path entry list object.
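One way to realize this caching strategy is sketched below. This is a simplified stand-in for the import machinery's internal path object; the get_parent_path and recompute callables are hypothetical names used for illustration:

```python
class NamespacePath:
    """Simplified sketch of a self-updating __path__ (not the real class)."""

    def __init__(self, get_parent_path, recompute):
        self._get_parent_path = get_parent_path  # e.g. lambda: sys.path
        self._recompute = recompute              # recomputes portion dirs
        self._last_seen = tuple(get_parent_path())
        self._path = recompute()

    def _maybe_refresh(self):
        parent = tuple(self._get_parent_path())
        # Re-fetching the parent path on every access detects both
        # in-place mutation and wholesale replacement of the list.
        if parent != self._last_seen:
            self._last_seen = parent
            self._path = self._recompute()

    def __iter__(self):
        self._maybe_refresh()
        return iter(self._path)

    def __len__(self):
        self._maybe_refresh()
        return len(self._path)
```

The cheap tuple comparison means no directory scanning happens unless the parent path actually changed.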

Impact on import finders and loaders

PEP 302 defines "finders" that are called to search path elements. These finders' find_module methods return either a "loader" object or None.

For a finder to contribute to namespace packages, it must implement a new find_loader(fullname) method. fullname has the same meaning as for find_module. find_loader always returns a 2-tuple of (loader, <iterable-of-path-entries>). loader may be None, in which case <iterable-of-path-entries> (which may be empty) is added to the list of recorded path entries and path searching continues. If loader is not None, it is immediately used to load a module or regular package.

Even if loader is returned and is not None, <iterable-of-path-entries> must still contain the path entries for the package. This allows code such as pkgutil.extend_path() to compute path entries for packages that it does not load.

Note that multiple path entries per finder are allowed. This is to support the case where a finder discovers multiple namespace portions for a given fullname. Many finders will support only a single namespace package portion per find_loader call, in which case this iterable will contain only a single string.

The import machinery will call find_loader if it exists, else fall back to find_module. Legacy finders which implement find_module but not find_loader will be unable to contribute portions to a namespace package.
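A path entry finder that wants to contribute namespace portions might implement the new method along these lines (SketchPathFinder is a hypothetical example for illustration, not part of any real API):

```python
import os

class SketchPathFinder:
    """Hypothetical PEP 302-style finder extended with find_loader()."""

    def __init__(self, path_entry):
        self.path_entry = path_entry

    def find_loader(self, fullname):
        # fullname has the same meaning as for find_module()
        name = fullname.rpartition('.')[2]
        pkg_dir = os.path.join(self.path_entry, name)
        if os.path.isfile(os.path.join(pkg_dir, '__init__.py')):
            loader = self._make_loader(pkg_dir)  # load as a regular package
            # Even with a loader, the path entries must still be returned,
            # so code like pkgutil.extend_path() can use them.
            return loader, [pkg_dir]
        if os.path.isdir(pkg_dir):
            # No loader: contribute this directory as a namespace portion.
            return None, [pkg_dir]
        return None, []

    def _make_loader(self, pkg_dir):
        raise NotImplementedError  # a real finder would build a loader here
```

A finder that can discover several portions for one fullname would simply return more than one entry in the iterable.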

The specification expands PEP 302 loaders to include an optional method called module_repr() which, if present, is used to generate module object reprs. See the section below for further details.

Differences between namespace packages and regular packages

Namespace packages and regular packages are very similar. The differences are:

  • Portions of namespace packages need not all come from the same directory structure, or even from the same loader. Regular packages are self-contained: all parts live in the same directory hierarchy.
  • Namespace packages have no __file__ attribute.
  • Namespace packages' __path__ attribute is a read-only iterable of strings, which is automatically updated when the parent path is modified.
  • Namespace packages have no __init__.py module.
  • Namespace packages have a different type of object for their __loader__ attribute.

Namespace packages in the standard library

It is possible, and this PEP explicitly allows, that parts of the standard library be implemented as namespace packages. When and if any standard library packages become namespace packages is outside the scope of this PEP.

Migrating from legacy namespace packages

As described above, prior to this PEP pkgutil.extend_path() was used by legacy portions to create namespace packages. Because it is likely not practical for all existing portions of a namespace package to be migrated to this PEP at once, extend_path() will be modified to also recognize PEP 420 namespace packages. This will allow some portions of a namespace to be legacy portions while others are migrated to PEP 420. These hybrid namespace packages will not have the dynamic path computation that normal namespace packages have, since extend_path() never provided this functionality in the past.

Packaging Implications

Multiple portions of a namespace package can be installed into the same directory, or into separate directories. For this section, suppose there are two portions which define "foo.bar" and "foo.baz". "foo" itself is a namespace package.

If these are installed in the same location, a single directory "foo" would be in a directory that is on sys.path. Inside "foo" would be two directories, "bar" and "baz". If "foo.bar" is removed (perhaps by an OS package manager), care must be taken not to remove the "foo/baz" or "foo" directories. Note that in this case "foo" will be a namespace package (because it lacks an __init__.py), even though all of its portions are in the same directory.

Note that "foo.bar" and "foo.baz" can be installed into the same "foo" directory because they will not have any files in common.

If the portions are installed in different locations, two different "foo" directories would be in directories that are on sys.path. "foo/bar" would be in one of these sys.path entries, and "foo/baz" would be in the other. Upon removal of "foo.bar", the "foo/bar" and corresponding "foo" directories can be completely removed. But "foo/baz" and its corresponding "foo" directory cannot be removed.

It is also possible to have the "foo.bar" portion installed in a directory on sys.path, and have the "foo.baz" portion provided in a zip file, also on sys.path.

Examples

Nested namespace packages

This example uses the following directory structure:

Lib/test/namespace_pkgs
    project1
        parent
            child
                one.py
    project2
        parent
            child
                two.py

Here, both parent and child are namespace packages: Portions of them exist in different directories, and they do not have __init__.py files.

Here we add the parent directories to sys.path, and show that the portions are correctly found:

>>> import sys
>>> sys.path += ['Lib/test/namespace_pkgs/project1', 'Lib/test/namespace_pkgs/project2']
>>> import parent.child.one
>>> parent.__path__
_NamespacePath(['Lib/test/namespace_pkgs/project1/parent', 'Lib/test/namespace_pkgs/project2/parent'])
>>> parent.child.__path__
_NamespacePath(['Lib/test/namespace_pkgs/project1/parent/child', 'Lib/test/namespace_pkgs/project2/parent/child'])
>>> import parent.child.two
>>>

Dynamic path computation

This example uses a similar directory structure, but adds a third portion:

Lib/test/namespace_pkgs
    project1
        parent
            child
                one.py
    project2
        parent
            child
                two.py
    project3
        parent
            child
                three.py

We add project1 and project2 to sys.path, then import parent.child.one and parent.child.two. Then we add project3 to sys.path, and when parent.child.three is imported, project3/parent is automatically added to parent.__path__:

# add the first two parent paths to sys.path
>>> import sys
>>> sys.path += ['Lib/test/namespace_pkgs/project1', 'Lib/test/namespace_pkgs/project2']

# parent.child.one can be imported, because project1 was added to sys.path:
>>> import parent.child.one
>>> parent.__path__
_NamespacePath(['Lib/test/namespace_pkgs/project1/parent', 'Lib/test/namespace_pkgs/project2/parent'])

# parent.child.__path__ contains project1/parent/child and project2/parent/child, but not project3/parent/child:
>>> parent.child.__path__
_NamespacePath(['Lib/test/namespace_pkgs/project1/parent/child', 'Lib/test/namespace_pkgs/project2/parent/child'])

# parent.child.two can be imported, because project2 was added to sys.path:
>>> import parent.child.two

# we cannot import parent.child.three, because project3 is not in the path:
>>> import parent.child.three
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<frozen importlib._bootstrap>", line 1286, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1250, in _find_and_load_unlocked
ImportError: No module named 'parent.child.three'

# now add project3 to sys.path:
>>> sys.path.append('Lib/test/namespace_pkgs/project3')

# and now parent.child.three can be imported:
>>> import parent.child.three

# project3/parent has been added to parent.__path__:
>>> parent.__path__
_NamespacePath(['Lib/test/namespace_pkgs/project1/parent', 'Lib/test/namespace_pkgs/project2/parent', 'Lib/test/namespace_pkgs/project3/parent'])

# and project3/parent/child has been added to parent.child.__path__
>>> parent.child.__path__
_NamespacePath(['Lib/test/namespace_pkgs/project1/parent/child', 'Lib/test/namespace_pkgs/project2/parent/child', 'Lib/test/namespace_pkgs/project3/parent/child'])
>>>

Discussion

At PyCon 2012, we had a discussion about namespace packages at which PEP 382 and PEP 402 were rejected, to be replaced by this PEP [3].

There is no intention to remove support of regular packages. If a developer knows that her package will never be a portion of a namespace package, then there is a performance advantage to it being a regular package (with an __init__.py). Creation and loading of a regular package can take place immediately when it is located along the path. With namespace packages, all entries in the path must be scanned before the package is created.

Note that an ImportWarning will no longer be raised for a directory lacking an __init__.py file. Such a directory will now be imported as a namespace package, whereas in prior Python versions an ImportWarning would be raised.

Nick Coghlan presented a list of his objections to this proposal [4]. They are:

  1. Implicit package directories go against the Zen of Python.
  2. Implicit package directories pose awkward backwards compatibility challenges.
  3. Implicit package directories introduce ambiguity into file system layouts.
  4. Implicit package directories will permanently entrench current newbie-hostile behavior in __main__.

Nick later gave a detailed response to his own objections [5], which is summarized here:

  1. The practicality of this PEP wins over other proposals and the status quo.
  2. Minor backward compatibility issues are okay, as long as they are properly documented.
  3. This will be addressed in PEP 395.
  4. This will also be addressed in PEP 395.

The inclusion of namespace packages in the standard library was motivated by Martin v. Löwis, who wanted the encodings package to become a namespace package [6]. While this PEP allows for standard library packages to become namespaces, it defers a decision on encodings.

find_module versus find_loader

An early draft of this PEP specified a change to the find_module method in order to support namespace packages. It would be modified to return a string in the case where a namespace package portion was discovered.

However, this caused a problem with existing code outside of the standard library which calls find_module. Because this code would not be upgraded in concert with changes required by this PEP, it would fail when it would receive unexpected return values from find_module. Because of this incompatibility, this PEP now specifies that finders that want to provide namespace portions must implement the find_loader method, described above.

The use case for supporting multiple portions per find_loader call is given in [7].

Dynamic path computation

Guido raised a concern that automatic dynamic path computation was an unnecessary feature [8]. Later in that thread, PJ Eby and Nick Coghlan presented arguments as to why dynamic computation would minimize surprise to Python users. The conclusion of that discussion has been included in this PEP's Rationale section.

An earlier version of this PEP required that dynamic path computation could only take effect if the parent path object were modified in-place. That is, this would work:

sys.path.append('new-dir')

But this would not:

sys.path = sys.path + ['new-dir']

In the same thread [8], it was pointed out that this restriction is not required. If the parent path is looked up by name instead of by holding a reference to it, then there is no restriction on how the parent path is modified or replaced. For a top-level namespace package, the lookup would be the module named "sys" then its attribute "path". For a namespace package nested inside a package foo, the lookup would be for the module named "foo" then its attribute "__path__".
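A name-based lookup of the parent path, as described, might look like this (parent_path_for is a hypothetical helper written for illustration):

```python
import sys

def parent_path_for(package_name):
    """Sketch: fetch the current parent path by name, so replacement of
    sys.path (or of a parent package's __path__) is always observed."""
    parent, _, _ = package_name.rpartition('.')
    if not parent:
        # Top-level package: the module named "sys", attribute "path".
        return sys.path
    # Nested package: the module named after the parent, attribute "__path__".
    return sys.modules[parent].__path__
```

Because nothing holds a stale reference, sys.path = sys.path + ['new-dir'] works just as well as sys.path.append('new-dir').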

Module reprs

Previously, module reprs were hard coded based on assumptions about a module's __file__ attribute. If this attribute existed and was a string, it was assumed to be a file system path, and the module object's repr would include this in its value. The only exception was that PEP 302 reserved missing __file__ attributes to built-in modules, and in CPython, this assumption was baked into the module object's implementation. Because of this restriction, some modules contained contrived __file__ values that did not reflect file system paths, and which could cause unexpected problems later (e.g. os.path.join() on a non-path __file__ would return gibberish).

This PEP relaxes this constraint, and leaves the setting of __file__ to the purview of the loader producing the module. Loaders may opt to leave __file__ unset if no file system path is appropriate. Loaders may also set additional reserved attributes on the module if useful. This means that the definitive way to determine the origin of a module is to check its __loader__ attribute.

For example, namespace packages as described in this PEP will have no __file__ attribute because no corresponding file exists. In order to provide flexibility and descriptiveness in the reprs of such modules, a new optional protocol is added to PEP 302 loaders. Loaders can implement a module_repr() method which takes a single argument, the module object. This method should return the string to be used verbatim as the repr of the module. The rules for producing a module repr are now standardized as:

  • If the module has an __loader__ and that loader has a module_repr() method, call it with a single argument, which is the module object. The value returned is used as the module's repr.
  • If an exception occurs in module_repr(), the exception is caught and discarded, and the calculation of the module's repr continues as if module_repr() did not exist.
  • If the module has an __file__ attribute, this is used as part of the module's repr.
  • If the module has no __file__ but does have an __loader__, then the loader's repr is used as part of the module's repr.
  • Otherwise, just use the module's __name__ in the repr.
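Taken together, the rules can be sketched as a standalone function (simplified; the real computation is done inside the import machinery):

```python
def module_repr(module):
    """Sketch of the standardized module repr rules, applied in order."""
    loader = getattr(module, '__loader__', None)
    if loader is not None and hasattr(loader, 'module_repr'):
        try:
            return loader.module_repr(module)
        except Exception:
            pass  # discard and continue as if module_repr() did not exist
    name = module.__name__
    if hasattr(module, '__file__'):
        return "<module {!r} from {!r}>".format(name, module.__file__)
    if loader is not None:
        return "<module {!r} ({!r})>".format(name, loader)
    return "<module {!r}>".format(name)
```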

Here is a snippet showing how namespace module reprs are calculated from their loaders:

class NamespaceLoader:
    @classmethod
    def module_repr(cls, module):
        return "<module '{}' (namespace)>".format(module.__name__)

Built-in module reprs would no longer need to be hard-coded, but instead would come from their loader as well:

class BuiltinImporter:
    @classmethod
    def module_repr(cls, module):
        return "<module '{}' (built-in)>".format(module.__name__)

Here are some example reprs of different types of modules with different sets of the related attributes:

>>> import email
>>> email
<module 'email' from '/home/barry/projects/python/pep-420/Lib/email/__init__.py'>
>>> m = type(email)('foo')
>>> m
<module 'foo'>
>>> m.__file__ = 'zippy:/de/do/dah'
>>> m
<module 'foo' from 'zippy:/de/do/dah'>
>>> class Loader: pass
...
>>> m.__loader__ = Loader
>>> del m.__file__
>>> m
<module 'foo' (<class '__main__.Loader'>)>
>>> class NewLoader:
...   @classmethod
...   def module_repr(cls, module):
...      return '<mystery module!>'
...
>>> m.__loader__ = NewLoader
>>> m
<mystery module!>
>>>

References

[1]PEP 420 branch (http://hg.python.org/features/pep-420)
[2]PEP 402's description of use cases for namespace packages (http://www.python.org/dev/peps/pep-0402/#the-problem)
[3]PyCon 2012 Namespace Package discussion outcome (http://mail.python.org/pipermail/import-sig/2012-March/000421.html)
[4]Nick Coghlan's objection to the lack of marker files or directories (http://mail.python.org/pipermail/import-sig/2012-March/000423.html)
[5]Nick Coghlan's response to his initial objections (http://mail.python.org/pipermail/import-sig/2012-April/000464.html)
[6]Martin v. Löwis's suggestion to make encodings a namespace package (http://mail.python.org/pipermail/import-sig/2012-May/000540.html)
[7]Use case for multiple portions per find_loader call (http://mail.python.org/pipermail/import-sig/2012-May/000585.html)
[8](1, 2) Discussion about dynamic path computation (http://mail.python.org/pipermail/python-dev/2012-May/119560.html)

pep-0421 Adding sys.implementation

PEP:421
Title:Adding sys.implementation
Version:$Revision$
Last-Modified:$Date$
Author:Eric Snow <ericsnowcurrently at gmail.com>
BDFL-Delegate:Barry Warsaw
Status:Final
Type:Standards Track
Content-Type:text/x-rst
Created:26-April-2012
Post-History:26-April-2012
Resolution:http://mail.python.org/pipermail/python-dev/2012-May/119683.html

Abstract

This PEP introduces a new attribute for the sys module: sys.implementation. The attribute holds consolidated information about the implementation of the running interpreter. Thus sys.implementation is the source to which the standard library may look for implementation-specific information.

The proposal in this PEP is in line with a broader emphasis on making Python friendlier to alternate implementations. It describes the new variable and the constraints on what that variable contains. The PEP also explains some immediate use cases for sys.implementation.

Motivation

For a number of years now, the distinction between Python-the-language and CPython (the reference implementation) has been growing. Most of this change is due to the emergence of Jython, IronPython, and PyPy as viable alternate implementations of Python.

Consider, however, the nearly two decades of CPython-centric Python (i.e. most of its existence). That focus has understandably contributed to quite a few CPython-specific artifacts both in the standard library and exposed in the interpreter. Though the core developers have made an effort in recent years to address this, quite a few of the artifacts remain.

Part of the solution is presented in this PEP: a single namespace in which to consolidate implementation specifics. This will help focus efforts to differentiate the implementation specifics from the language. Additionally, it will foster a multiple-implementation mindset.

Proposal

We will add a new attribute to the sys module, called sys.implementation, as an object with attribute-access (as opposed to a mapping). It will contain implementation-specific information.

The attributes of this object will remain fixed during interpreter execution and through the course of an implementation version. This ensures that behaviors which depend on attributes of sys.implementation don't change between versions.

The object has each of the attributes described in the Required Attributes section below. Those attribute names will never start with an underscore. The standard library and the language definition will rely only on those required attributes.

This proposal takes a conservative approach in requiring only a small number of attributes. As more become appropriate, they may be added with discretion, as described in Adding New Required Attributes.

While this PEP places no other constraints on sys.implementation, it also recommends that no one rely on capabilities outside those described here. The only exception to that recommendation is for attributes starting with an underscore. Implementers may use those as appropriate to store per-implementation data.

Required Attributes

These are attributes in sys.implementation on which the standard library and language definition will rely, meaning implementers must define them:

name
A lower-case identifier representing the implementation. Examples include 'pypy', 'jython', 'ironpython', and 'cpython'.
version
The version of the implementation, as opposed to the version of the language it implements. This value conforms to the format described in Version Format.
hexversion
The version of the implementation in the same hexadecimal format as sys.hexversion.
cache_tag
A string used for the PEP 3147 cache tag [12]. It would normally be a composite of the name and version (e.g. 'cpython-33' for CPython 3.3). However, an implementation may explicitly use a different cache tag. If cache_tag is set to None, it indicates that module caching should be disabled.
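On Python 3.3 and later, where this PEP is implemented, the required attributes can be inspected directly (the printed values shown in comments are illustrative for CPython):

```python
import sys

impl = sys.implementation
print(impl.name)           # a lower-case identifier, e.g. 'cpython'
print(impl.version[:2])    # implementation version, same shape as sys.version_info
print(hex(impl.hexversion))
print(impl.cache_tag)      # e.g. 'cpython-33'; None means caching is disabled
```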

Adding New Required Attributes

In time more required attributes will be added to sys.implementation. However, each must have a meaningful use case across all Python implementations in order to be considered. This is made most clear by a use case in the standard library or language specification.

All proposals for new required attributes will go through the normal PEP process. Such a PEP need not be long, just long enough. It will need to sufficiently spell out the rationale for the new attribute, its use cases, and the impact it will have on the various Python implementations.

Version Format

A main point of sys.implementation is to contain information that will be used internally in the standard library. In order to facilitate the usefulness of the version attribute, its value should be in a consistent format across implementations.

As such, the format of sys.implementation.version will follow that of sys.version_info, which is effectively a named tuple. It is a familiar format and generally consistent with normal version format conventions.

Rationale

The status quo presents implementation-specific information in a fragile, harder-to-maintain way: it is spread out over different modules or inferred from other information, as we see with platform.python_implementation().

This PEP is the main alternative to that approach. It consolidates the implementation-specific information into a single namespace and makes explicit that which was implicit.

Type Considerations

It's very easy to get bogged down in discussions about the type of sys.implementation. However, its purpose is to support the standard library and language definition. As such, there isn't much that really matters regarding its type, as opposed to a feature that would be more generally used. Thus characteristics like immutability and sequence-ness have been disregarded.

The only real choice has been between an object with attribute access and a mapping with item access. This PEP espouses dotted access to reflect the relatively fixed nature of the namespace.

Non-Required Attributes

Earlier versions of this PEP included a required attribute called metadata that held any non-required, per-implementation data [17]. However, this proved to be an unnecessary addition considering the purpose of sys.implementation.

Ultimately, non-required attributes are virtually ignored in this PEP. They have no impact other than that careless use may collide with future required attributes. That, however, is but a marginal concern for sys.implementation.

Why a Part of sys?

The sys module holds the new namespace because sys is the depot for interpreter-centric variables and functions. Many implementation-specific attributes are already found in sys.

Why Strict Constraints on Any of the Values?

As already noted in Version Format, values in sys.implementation are intended for use by the standard library. Constraining those values, essentially specifying an API for them, allows them to be used consistently, regardless of how they are otherwise implemented. However, care should be taken not to over-specify the constraints.

Discussion

The topic of sys.implementation came up on the python-ideas list in 2009, where the reception was broadly positive [1]. I revived the discussion recently while working on a pure-python imp.get_tag() [2]. Discussion has been ongoing [3]. The messages in issue #14673 [19] are also relevant.

A good part of the recent discussion centered on the type to use for sys.implementation.

Use-cases

platform.python_implementation()

"explicit is better than implicit"

The platform module determines the Python implementation by looking for clues in a couple of different sys variables [11]. However, this approach is fragile, requiring changes to the standard library each time an implementation changes. Beyond that, support in platform is limited to those implementations that core developers have blessed by special-casing them in the platform module.

With sys.implementation the various implementations would explicitly set the values in their own version of the sys module.

Another concern is that the platform module is part of the stdlib, which ideally should contain as few implementation details as possible; such details belong in sys.implementation.

Any overlap between sys.implementation and the platform module would simply defer to sys.implementation (with the same interface in platform wrapping it).
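The overlap is easy to see on an implementation that provides both (output values in the comments are illustrative):

```python
import platform
import sys

# platform guesses the implementation from clues; sys.implementation is
# set explicitly by each implementation in its own sys module.
print(platform.python_implementation())  # e.g. 'CPython'
print(sys.implementation.name)           # e.g. 'cpython'
```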

Cache Tag Generation in Frozen Importlib

PEP 3147 defined the use of a module cache and cache tags for file names. The importlib bootstrap code, frozen into the Python binary as of 3.3, uses the cache tags during the import process. Part of the project to bootstrap importlib has been to clean code out of Python/import.c [21] that did not need to be there any longer.

The cache tag defined in Python/import.c was hard-coded to "cpython" MAJOR MINOR [12]. For importlib the options are either hard-coding it in the same way, or guessing the implementation in the same way as does platform.python_implementation().

As long as the hard-coded tag is limited to CPython-specific code, it is livable. However, inasmuch as other Python implementations use the importlib code to work with the module cache, a hard-coded tag would become a problem.

Directly using the platform module in this case is a non-starter. Any module used in the importlib bootstrap must be built-in or frozen, neither of which apply to the platform module. This is the point that led to the recent interest in sys.implementation.

Regardless of the outcome for the implementation name used, another problem relates to the version used in the cache tag. That version is likely to be the implementation version rather than the language version. However, the implementation version is not readily identified anywhere in the standard library.
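With sys.implementation.cache_tag available, the PEP 3147 file name computation no longer needs a hard-coded implementation name. A minimal sketch (cached_name is a hypothetical helper written for illustration, not the importlib API):

```python
import os
import sys

def cached_name(source_path):
    """Sketch: map pkg/foo.py to pkg/__pycache__/foo.<cache_tag>.pyc."""
    tag = sys.implementation.cache_tag
    if tag is None:
        raise NotImplementedError('module caching is disabled')
    directory, filename = os.path.split(source_path)
    base = os.path.splitext(filename)[0]
    return os.path.join(directory, '__pycache__',
                        '{}.{}.pyc'.format(base, tag))
```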

Implementation-Specific Tests

Currently there are a number of implementation-specific tests in the test suite under Lib/test. The test support module (Lib/test/support.py [20]) provides some functionality for dealing with these tests. However, like the platform module, test.support must do some guessing that sys.implementation would render unnecessary.

Jython's os.name Hack

In Jython, os.name is set to 'java' to accommodate special treatment of the java environment in the standard library [15] [16]. Unfortunately it masks the os name that would otherwise go there. sys.implementation would help obviate the need for this special case. Currently Jython sets os._name for the normal os.name value.

The Problem With sys.(version|version_info|hexversion)

Earlier versions of this PEP made the mistake of calling sys.version_info (and friends) the version of the Python language, in contrast to the implementation. However, this is not the case. Instead, it is the version of the CPython implementation. Incidentally, the first two components of sys.version_info (major and minor) also reflect the version of the language definition.

As Barry Warsaw noted, the "semantics of sys.version_info have been sufficiently squishy in the past" [18]. With sys.implementation we have the opportunity to improve this situation by first establishing an explicit location for the version of the implementation.

This PEP makes no other effort to directly clarify the semantics of sys.version_info. Regardless, having an explicit version for the implementation will definitely help to clarify the distinction from the language version.
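The distinction can be made concrete once sys.implementation is in place; on CPython the two values happen to coincide, which is exactly the squishiness described above:

```python
import sys

language_version = sys.version_info[:2]                 # e.g. (3, 3)
implementation_version = sys.implementation.version[:2]

# On CPython these are equal; on PyPy, for example,
# sys.implementation.version would track the PyPy release number instead.
print(language_version, implementation_version)
```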

Feedback From Other Python Implementers

IronPython

Jeff Hardy responded to a request for feedback [4]. He said, "I'll probably add it the day after it's approved" [6]. He also gave useful feedback on both the type of sys.implementation and on the metadata attribute (which has since been removed from the PEP).

Jython

In 2009 Frank Wierzbicki said this (relative to Jython implementing the required attributes) [8]:

Speaking for Jython, so far it looks like something we would adopt
soonish after it was accepted (it looks pretty useful to me).

PyPy

Some of the PyPy developers have responded to a request for feedback [9]. Armin Rigo said the following [10]:

For myself, I can only say that it looks like a good idea, which we
will happily adhere to when we migrate to Python 3.3.

He also expressed support for keeping the required list small. Both Armin and Laura Creighton indicated that an effort to better catalog Python's implementation would be welcome. Such an effort, for which this PEP is a small start, will be considered separately.

Past Efforts

PEP 3139

PEP 3139, from 2008, recommended a clean-up of the sys module in part by extracting implementation-specific variables and functions into a separate module. PEP 421 is a less ambitious version of that idea. While PEP 3139 was rejected, its goals are reflected in PEP 421 to a large extent, though with a much lighter approach.

PEP 399

PEP 399 dictates policy regarding the standard library, helping to make it friendlier to alternate implementations. PEP 421 is proposed in that same spirit.

The Bigger Picture

It's worth noting again that this PEP is a small part of a larger ongoing effort to identify the implementation-specific parts of Python and mitigate their impact on alternate implementations.

sys.implementation is a focal point for implementation-specific data, acting as a nexus for cooperation between the language, the standard library, and the different implementations. As time goes by, it is feasible that sys.implementation will assume current attributes of sys and other builtin/stdlib modules, where appropriate. In this way, it is a PEP 3139-lite, but starting as small as possible.

However, as already noted, many other efforts predate sys.implementation. Neither is it necessarily a major part of the effort. Rather, consider it as part of the infrastructure of the effort to make Python friendlier to alternate implementations.

Alternatives

Since the single-namespace-under-sys approach is relatively straightforward, no alternatives have been considered for this PEP.

Examples of Other Attributes

These are examples only and not part of the proposal. Most of them were suggested during previous discussions, but did not fit into the goals of this PEP. (See Adding New Required Attributes if they get you excited.)

common_name
The case-sensitive name by which the implementation is known.
vcs_url
A URL for the main VCS repository for the implementation project.
vcs_revision_id
A value that identifies the VCS revision of the implementation.
build_toolchain
The tools used to build the interpreter.
build_date
The timestamp of when the interpreter was built.
homepage
The URL of the implementation's website.
site_prefix
The preferred site prefix for the implementation.
runtime
The run-time environment in which the interpreter is running, as in "Common Language Runtime" (.NET CLR) or "Java Runtime Environment".
gc_type
The type of garbage collection used, like "reference counting" or "mark and sweep".

Open Issues

Currently none.

Implementation

The implementation of this PEP is covered in issue #14673 [19].

References

[1] The 2009 sys.implementation discussion: http://mail.python.org/pipermail/python-dev/2009-October/092893.html
[2] The initial 2012 discussion: http://mail.python.org/pipermail/python-ideas/2012-March/014555.html (and http://mail.python.org/pipermail/python-ideas/2012-April/014878.html)
[3] Feedback on the PEP: http://mail.python.org/pipermail/python-ideas/2012-April/014954.html
[4] Feedback from the IronPython developers: http://mail.python.org/pipermail/ironpython-users/2012-May/015980.html
[5] (2009) Dino Viehland offers his opinion: http://mail.python.org/pipermail/python-dev/2009-October/092894.html
[6] (2012) Jeff Hardy offers his opinion: http://mail.python.org/pipermail/ironpython-users/2012-May/015981.html
[7] Feedback from the Jython developers: ???
[8] (2009) Frank Wierzbicki offers his opinion: http://mail.python.org/pipermail/python-dev/2009-October/092974.html
[9] Feedback from the PyPy developers: http://mail.python.org/pipermail/pypy-dev/2012-May/009883.html
[10] (2012) Armin Rigo offers his opinion: http://mail.python.org/pipermail/pypy-dev/2012-May/009884.html
[11] The platform code which divines the implementation name: http://hg.python.org/cpython/file/2f563908ebc5/Lib/platform.py#l1247
[12] The definition for cache tags in PEP 3147: http://www.python.org/dev/peps/pep-3147/#id53
[13] The original implementation of the cache tag in CPython: http://hg.python.org/cpython/file/2f563908ebc5/Python/import.c#l121
[14] Examples of implementation-specific handling in test.support:
  * http://hg.python.org/cpython/file/2f563908ebc5/Lib/test/support.py#l509
  * http://hg.python.org/cpython/file/2f563908ebc5/Lib/test/support.py#l1246
  * http://hg.python.org/cpython/file/2f563908ebc5/Lib/test/support.py#l1252
  * http://hg.python.org/cpython/file/2f563908ebc5/Lib/test/support.py#l1275
[15] The standard library entry for os.name: http://docs.python.org/3.3/library/os.html#os.name
[16] The use of os.name as 'java' in the stdlib test suite: http://hg.python.org/cpython/file/2f563908ebc5/Lib/test/support.py#l512
[17] Nick Coghlan's proposal for sys.implementation.metadata: http://mail.python.org/pipermail/python-ideas/2012-May/014984.html
[18] Feedback from Barry Warsaw: http://mail.python.org/pipermail/python-dev/2012-May/119374.html
[19] The implementation tracker issue: http://bugs.python.org/issue14673
[20] The test.support module: http://hg.python.org/cpython/file/2f563908ebc5/Lib/test/support.py
[21] The CPython import machinery: http://hg.python.org/cpython/file/2f563908ebc5/Python/import.c

pep-0422 Simpler customisation of class creation

PEP:422
Title:Simpler customisation of class creation
Version:$Revision$
Last-Modified:$Date$
Author:Nick Coghlan <ncoghlan at gmail.com>, Daniel Urban <urban.dani+py at gmail.com>
Status:Withdrawn
Type:Standards Track
Content-Type:text/x-rst
Created:5-Jun-2012
Python-Version:3.5
Post-History:5-Jun-2012, 10-Feb-2013

Abstract

Currently, customising class creation requires the use of a custom metaclass. This custom metaclass then persists for the entire lifecycle of the class, creating the potential for spurious metaclass conflicts.

This PEP proposes to instead support a wide range of customisation scenarios through a new namespace parameter in the class header, and a new __autodecorate__ hook in the class body.

The new mechanism should be easier to understand and use than implementing a custom metaclass, and thus should provide a gentler introduction to the full power of Python's metaclass machinery.

PEP Withdrawal

This proposal has been withdrawn in favour of Martin Teichmann's proposal in PEP 487, which achieves the same goals through a simpler, easier to use __init_subclass__ hook that simply isn't invoked for the base class that defines the hook.

Background

For an already created class cls, the term "metaclass" has a clear meaning: it is the value of type(cls).

During class creation, it has another meaning: it is also used to refer to the metaclass hint that may be provided as part of the class definition. While in many cases these two meanings end up referring to one and the same object, there are two situations where that is not the case:

  • If the metaclass hint refers to an instance of type, then it is considered as a candidate metaclass along with the metaclasses of all of the parents of the class being defined. If a more appropriate metaclass is found amongst the candidates, then it will be used instead of the one given in the metaclass hint.
  • Otherwise, an explicit metaclass hint is assumed to be a factory function and is called directly to create the class object. In this case, the final metaclass will be determined by the factory function definition. In the typical case (where the factory function just calls type or, in Python 3.3 or later, types.new_class), the actual metaclass is then determined based on the parent classes.

It is notable that only the actual metaclass is inherited - a factory function used as a metaclass hook sees only the class currently being defined, and is not invoked for any subclasses.
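That behaviour can be sketched as follows: a plain function given as the metaclass hint is called once to create the class, while the class's actual metaclass remains type, so subclasses never invoke the factory:

```python
created = []

def factory(name, bases, namespace):
    # called directly with the components of the class definition
    created.append(name)
    return type(name, bases, namespace)

class Base(metaclass=factory):
    pass

class Child(Base):   # the factory is *not* invoked again here
    pass

# Only the actual metaclass (type) is inherited, not the factory.
assert type(Base) is type and type(Child) is type
assert created == ['Base']
```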

In Python 3, the metaclass hint is provided using the metaclass=Meta keyword syntax in the class header. This allows the __prepare__ method on the metaclass to be used to create the locals() namespace used during execution of the class body (for example, specifying the use of collections.OrderedDict instead of a regular dict).
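For example, a metaclass can use __prepare__ to run the class body in a collections.OrderedDict; the OrderedMeta metaclass below is an illustrative sketch, not part of this proposal:

```python
import collections

class OrderedMeta(type):
    @classmethod
    def __prepare__(meta, name, bases, **kwds):
        # the mapping returned here becomes the namespace the
        # class body executes in
        return collections.OrderedDict()

    def __new__(meta, name, bases, ns, **kwds):
        cls = super().__new__(meta, name, bases, dict(ns), **kwds)
        # record the order in which names were bound in the class body
        cls.definition_order = list(ns)
        return cls

class Example(metaclass=OrderedMeta):
    b = 2
    a = 1
```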

In Python 2, there was no __prepare__ method (that API was added for Python 3 by PEP 3115). Instead, a class body could set the __metaclass__ attribute, and the class creation process would extract that value from the class namespace to use as the metaclass hint. There is published code [1] that makes use of this feature.

Another new feature in Python 3 is the zero-argument form of the super() builtin, introduced by PEP 3135. This feature uses an implicit __class__ reference to the class being defined to replace the "by name" references required in Python 2. Just as code invoked during execution of a Python 2 metaclass could not call methods that referenced the class by name (as the name had not yet been bound in the containing scope), similarly, Python 3 metaclasses cannot call methods that rely on the implicit __class__ reference (as it is not populated until after the metaclass has returned control to the class creation machinery).

Finally, when a class uses a custom metaclass, it can pose additional challenges to the use of multiple inheritance, as a new class cannot inherit from parent classes with unrelated metaclasses. This means that it is impossible to add a metaclass to an already published class: such an addition is a backwards incompatible change due to the risk of metaclass conflicts.

Proposal

This PEP proposes that a new mechanism to customise class creation be added to Python 3.4 that meets the following criteria:

  1. Integrates nicely with class inheritance structures (including mixins and multiple inheritance)
  2. Integrates nicely with the implicit __class__ reference and zero-argument super() syntax introduced by PEP 3135
  3. Can be added to an existing base class without a significant risk of introducing backwards compatibility problems
  4. Restores the ability for class namespaces to have some influence on the class creation process (above and beyond populating the namespace itself), but potentially without the full flexibility of the Python 2 style __metaclass__ hook

One mechanism that can achieve this goal is to add a new implicit class decoration hook, modelled directly on the existing explicit class decorators, but defined in the class body or in a parent class, rather than being part of the class definition header.

Specifically, it is proposed that class definitions be able to provide a class initialisation hook as follows:

class Example:
    def __autodecorate__(cls):
        # This is invoked after the class is created, but before any
        # explicit decorators are called
        # The usual super() mechanisms are used to correctly support
        # multiple inheritance. The class decorator style signature helps
        # ensure that invoking the parent class is as simple as possible.
        cls = super().__autodecorate__()
        return cls

To simplify the cooperative multiple inheritance case, object will gain a default implementation of the hook that returns the class unmodified:

class object:
    def __autodecorate__(cls):
        return cls

If a metaclass wishes to block implicit class decoration for some reason, it must arrange for cls.__autodecorate__ to trigger AttributeError.

If present on the created object, this new hook will be called by the class creation machinery after the __class__ reference has been initialised. For types.new_class(), it will be called as the last step before returning the created class object. __autodecorate__ is implicitly converted to a class method when the class is created (prior to the hook being invoked).

Note that when __autodecorate__ is called, the name of the class is not yet bound to the new class object. As a consequence, the two-argument form of super() cannot be used to call methods (e.g., super(Example, cls) wouldn't work in the example above). However, the zero-argument form of super() works as expected, since the __class__ reference is already initialised.

This general proposal is not a new idea (it was first suggested for inclusion in the language definition more than 10 years ago [2], and a similar mechanism has long been supported by Zope's ExtensionClass [3]), but the situation has changed sufficiently in recent years that the idea is worth reconsidering for inclusion as a native language feature.

In addition, the introduction of the metaclass __prepare__ method in PEP 3115 allows a further enhancement that was not possible in Python 2: this PEP also proposes that type.__prepare__ be updated to accept a factory function as a namespace keyword-only argument. If present, the value provided as the namespace argument will be called without arguments to create the result of type.__prepare__ instead of using a freshly created dictionary instance. For example, the following will use an ordered dictionary as the class namespace:

class OrderedExample(namespace=collections.OrderedDict):
    def __autodecorate__(cls):
        # cls.__dict__ is still a read-only proxy to the class namespace,
        # but the underlying storage is an OrderedDict instance
        ...

Note

This PEP, along with the existing ability to use __prepare__ to share a single namespace amongst multiple class objects, highlights a possible issue with the attribute lookup caching: when the underlying mapping is updated by other means, the attribute lookup cache is not invalidated correctly (this is a key part of the reason class __dict__ attributes produce a read-only view of the underlying storage).

Since the optimisation provided by that cache is highly desirable, the use of a preexisting namespace as the class namespace may need to be declared as officially unsupported (since the observed behaviour is rather strange when the caches get out of sync).

Key Benefits

Easier use of custom namespaces for a class

Currently, to use a different type (such as collections.OrderedDict) for a class namespace, or to use a pre-populated namespace, it is necessary to write and use a custom metaclass. With this PEP, using a custom namespace becomes as simple as specifying an appropriate factory function in the class header.

Easier inheritance of definition time behaviour

Understanding Python's metaclasses requires a deep understanding of the type system and the class construction process. This is legitimately seen as challenging, due to the need to keep multiple moving parts (the code, the metaclass hint, the actual metaclass, the class object, instances of the class object) clearly distinct in your mind. Even when you know the rules, it's still easy to make a mistake if you're not being extremely careful. An earlier version of this PEP actually included such a mistake: it stated "subclass of type" for a constraint that is actually "instance of type".

Understanding the proposed implicit class decoration hook only requires understanding decorators and ordinary method inheritance, which isn't quite as daunting a task. The new hook provides a more gradual path towards understanding all of the phases involved in the class definition process.

Reduced chance of metaclass conflicts

One of the big issues that makes library authors reluctant to use metaclasses (even when they would be appropriate) is the risk of metaclass conflicts. These occur whenever two unrelated metaclasses are used by the desired parents of a class definition. This risk also makes it very difficult to add a metaclass to a class that has previously been published without one.

By contrast, adding an __autodecorate__ method to an existing type poses a similar level of risk to adding an __init__ method: technically, there is a risk of breaking poorly implemented subclasses, but when that occurs, it is recognised as a bug in the subclass rather than the library author breaching backwards compatibility guarantees. In fact, due to the constrained signature of __autodecorate__, the risk in this case is actually even lower than in the case of __init__.

Integrates cleanly with PEP 3135

Unlike code that runs as part of the metaclass, code that runs as part of the new hook will be able to freely invoke class methods that rely on the implicit __class__ reference introduced by PEP 3135, including methods that use the zero argument form of super().

Replaces many use cases for dynamic setting of __metaclass__

For use cases that don't involve completely replacing the defined class, Python 2 code that dynamically set __metaclass__ can now dynamically set __autodecorate__ instead. For more advanced use cases, introduction of an explicit metaclass (possibly made available as a required base class) will still be necessary in order to support Python 3.

Design Notes

Determining if the class being decorated is the base class

In the body of an __autodecorate__ method, as in any other class method, __class__ will be bound to the class declaring the method, while the value passed in may be a subclass.

This makes it relatively straightforward to skip processing the base class if necessary:

class Example:
    def __autodecorate__(cls):
        cls = super().__autodecorate__()
        # Don't process the base class; return it unmodified
        if cls is __class__:
            return cls
        # Process subclasses here
        ...

Replacing a class with a different kind of object

As an implicit decorator, __autodecorate__ is able to relatively easily replace the defined class with a different kind of object. Technically custom metaclasses and even __new__ methods can already do this implicitly, but the decorator model makes such code much easier to understand and implement.

class BuildDict:
    def __autodecorate__(cls):
        cls = super().__autodecorate__()
        # Don't process the base class; return it unmodified
        if cls is __class__:
            return cls
        # Convert subclasses to ordinary dictionaries
        return cls.__dict__.copy()

It's not clear why anyone would ever do this implicitly based on inheritance rather than just using an explicit decorator, but the possibility seems worth noting.

Open Questions

Is the namespace concept worth the extra complexity?

Unlike the new __autodecorate__ hook, the proposed namespace keyword argument is not automatically inherited by subclasses. Given the way this proposal is currently written, the only way to get a special namespace used consistently in subclasses is still to write a custom metaclass with a suitable __prepare__ implementation.

Changing the custom namespace factory to also be inherited would significantly increase the complexity of this proposal, and introduce a number of the same potential base class conflict issues as arise with the use of custom metaclasses.

Eric Snow has put forward a separate proposal to instead make the execution namespace for class bodies an ordered dictionary by default, and capture the class attribute definition order for future reference as an attribute (e.g. __definition_order__) on the class object.

Eric's suggested approach may be a better choice for a new default behaviour for type that combines well with the proposed __autodecorate__ hook, leaving the more complex configurable namespace factory idea to a custom metaclass like the one shown below.

New Ways of Using Classes

The new namespace keyword in the class header enables a number of interesting options for controlling the way a class is initialised, including some aspects of the object models of both JavaScript and Ruby.

All of the examples below are actually possible today through the use of a custom metaclass:

class CustomNamespace(type):
    @classmethod
    def __prepare__(meta, name, bases, *, namespace=None, **kwds):
        parent_namespace = super().__prepare__(name, bases, **kwds)
        return namespace() if namespace is not None else parent_namespace
    def __new__(meta, name, bases, ns, *, namespace=None, **kwds):
        return super().__new__(meta, name, bases, ns, **kwds)
    def __init__(cls, name, bases, ns, *, namespace=None, **kwds):
        return super().__init__(name, bases, ns, **kwds)

The advantage of implementing the new keyword directly in type.__prepare__ is that the only persistent effect is then the change in the underlying storage of the class attributes. The metaclass of the class remains unchanged, eliminating many of the drawbacks typically associated with these kinds of customisations.

Order preserving classes

class OrderedClass(namespace=collections.OrderedDict):
    a = 1
    b = 2
    c = 3

Prepopulated namespaces

seed_data = dict(a=1, b=2, c=3)
class PrepopulatedClass(namespace=seed_data.copy):
    pass

Cloning a prototype class

class NewClass(namespace=Prototype.__dict__.copy):
    pass

Extending a class

Note

Just because the PEP makes it possible to do this relatively cleanly doesn't mean anyone should do this!

from collections.abc import MutableMapping

# The MutableMapping + dict combination should give something that
# generally behaves correctly as a mapping, while still being accepted
# as a class namespace
class ClassNamespace(MutableMapping, dict):
    def __init__(self, cls):
        self._cls = cls
    def __len__(self):
        return len(dir(self._cls))
    def __iter__(self):
        for attr in dir(self._cls):
            yield attr
    def __contains__(self, attr):
        return hasattr(self._cls, attr)
    def __getitem__(self, attr):
        return getattr(self._cls, attr)
    def __setitem__(self, attr, value):
        setattr(self._cls, attr, value)
    def __delitem__(self, attr):
        delattr(self._cls, attr)

def extend(cls):
    return lambda: ClassNamespace(cls)

class Example:
    pass

class ExtendedExample(namespace=extend(Example)):
    a = 1
    b = 2
    c = 3

>>> Example.a, Example.b, Example.c
(1, 2, 3)

Rejected Design Options

Calling __autodecorate__ from type.__init__

Calling the new hook automatically from type.__init__ would achieve most of the goals of this PEP. However, using that approach would mean that __autodecorate__ implementations would be unable to call any methods that relied on the __class__ reference (or used the zero-argument form of super()), and could not make use of those features themselves.

The current design instead ensures that the implicit decorator hook is able to do anything an explicit decorator can do by running it after the initial class creation is already complete.

Calling the automatic decoration hook __init_class__

Earlier versions of the PEP used the name __init_class__ for the name of the new hook. There were three significant problems with this name:

  • it was hard to remember if the correct spelling was __init_class__ or __class_init__
  • the use of "init" in the name suggested the signature should match that of type.__init__, which is not the case
  • the use of "init" in the name suggested the method would be run as part of initial class object creation, which is not the case

The new name __autodecorate__ was chosen to make it clear that the new initialisation hook is most usefully thought of as an implicitly invoked class decorator, rather than as being like an __init__ method.

Requiring an explicit decorator on __autodecorate__

Originally, this PEP required the explicit use of @classmethod on the __autodecorate__ method. It was made implicit since there's no sensible interpretation for leaving it out, and that case would need to be detected anyway in order to give a useful error message.

This decision was reinforced after noticing that the user experience of defining __prepare__ and forgetting the @classmethod method decorator is singularly incomprehensible (particularly since PEP 3115 documents it as an ordinary method, and the current documentation doesn't explicitly say anything one way or the other).
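That squishiness is easy to reproduce: because __prepare__ is looked up on the metaclass object itself, a definition without @classmethod still "works", but its first parameter silently receives the class name instead of the metaclass:

```python
seen = {}

class Meta(type):
    def __prepare__(name, bases, **kwds):   # note: no @classmethod
        # The first positional argument here is the *class name*,
        # not Meta, because the function is retrieved from the
        # metaclass as a plain function rather than a bound method.
        seen['first_arg'] = name
        return {}

class Example(metaclass=Meta):
    pass

assert seen['first_arg'] == 'Example'
```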

Making __autodecorate__ implicitly static, like __new__

While it accepts the class to be instantiated as the first argument, __new__ is actually implicitly treated as a static method rather than as a class method. This allows it to be readily extracted from its defining class and called directly on a subclass, rather than being coupled to the class object it is retrieved from.

Such behaviour initially appears to be potentially useful for the new __autodecorate__ hook, as it would allow __autodecorate__ methods to readily be used as explicit decorators on other classes.

However, that apparent support would be an illusion as it would only work correctly if invoked on a subclass, in which case the method can just as readily be retrieved from the subclass and called that way. Unlike __new__, there's no issue with potentially changing method signatures at different points in the inheritance chain.

Passing in the namespace directly rather than a factory function

At one point, this PEP proposed that the class namespace be passed directly as a keyword argument, rather than passing a factory function. However, this encourages an unsupported behaviour (that is, passing the same namespace to multiple classes, or retaining direct write access to a mapping used as a class namespace), so the API was switched to the factory function version.

Reference Implementation

A reference implementation for __autodecorate__ has been posted to the issue tracker [4]. It uses the original __init_class__ naming, does not yet allow the implicit decorator to replace the class with a different object, and does not implement the suggested namespace parameter for type.__prepare__.

pep-0423 Naming conventions and recipes related to packaging

PEP:423
Title:Naming conventions and recipes related to packaging
Version:$Revision$
Last-Modified:$Date$
Author:Benoit Bryon <benoit at marmelune.net>
Discussions-To:<distutils-sig at python.org>
Status:Deferred
Type:Informational
Content-Type:text/x-rst
Created:24-May-2012
Post-History:

Abstract

This document deals with:

  • names of Python projects,
  • names of Python packages or modules being distributed,
  • namespace packages.

It provides guidelines and recipes for distribution authors.

PEP Deferral

Further consideration of this PEP has been deferred at least until after PEP 426 (package metadata 2.0) and related updates have been resolved.

Relationship with other PEPs

  • PEP 8 [2] deals with the code style guide, including names of Python packages and modules. It covers the syntax of package/module names.
  • PEP 345 [3] deals with packaging metadata, and defines the name argument of the packaging.core.setup() function.
  • PEP 420 [4] deals with namespace packages. It brings support for namespace packages to the Python core. Previously, namespace packages were implemented by external libraries.
  • PEP 3108 [5] deals with the transition between Python 2.x and Python 3.x as applied to the standard library: some modules are to be deleted, some to be renamed. It points out that naming conventions matter, and is an example of a transition plan.

Overview

Here is a summarized list of guidelines you should follow to choose names:

If in doubt, ask

If you feel unsure after reading this document, ask the Python community [6] on IRC or on a mailing list.

Top-level namespace relates to code ownership

This helps avoid clashes between project names.

Ownership could be:

  • an individual. Example: gp.fileupload [7] is owned and maintained by Gael Pasgrimaud.
  • an organization. Examples:
    • zest.releaser [8] is owned and maintained by Zest Software.
    • Django [9] is owned and maintained by the Django Software Foundation.
  • a group or community. Example: sphinx [10] is maintained by developers of the Sphinx project, not only by its author, Georg Brandl.
  • a group or community related to another package. Example: collective.recaptcha [12] is owned by its author, David Glick (Groundwire), but the "collective" namespace is owned by the Plone community.

Respect ownership

Understand the purpose of namespace before you use it.

Don't plug into a namespace you don't own, unless explicitly authorized.

If in doubt, ask.

As an example, don't plug into the "django.contrib" namespace, because it is managed by Django's core contributors.

Exceptions can be defined by project authors. See Organize community contributions below.

Also, this rule applies to non-Python projects.

As an example, don't use "apache" as top-level namespace: "Apache" is the name of an existing project (in the case of "Apache", it is also a trademark).

Private (including closed-source) projects use a namespace

... because private projects are owned by somebody. So apply the ownership rule.

For internal/customer projects, use your company name as the namespace.

This rule applies to closed-source projects.

As an example, if you are creating a "climbing" project for the "Python Sport" company, use the name "pythonsport.climbing", even if it is closed source.

Individual projects use a namespace

... because they are owned by individuals. So apply the ownership rule.

There is no shame in releasing a project as open source even if it has an "internal" or "individual" name.

If the project comes to a point where the author wants to change ownership (i.e. the project no longer belongs to an individual), keep in mind it is easy to rename the project.

Community-owned projects can avoid namespace packages

If your project is generic enough (i.e. it is not a contrib to another product or framework), you can avoid namespace packages. The base condition is generally that your project is owned by a group (i.e. the development team) which is dedicated to this project.

Only use a "shared" namespace if you really intend the code to be community owned.

As an example, the sphinx [10] project belongs to the Sphinx development team. There is no need to have some "sphinx" namespace package with only one "sphinx.sphinx" project inside.

If in doubt, use an individual/organization namespace

If your project is really experimental, the best choice is to use an individual or organization namespace:

  • it allows projects to be released early.
  • it won't block a name if the project is abandoned.
  • it doesn't block future changes. When a project becomes mature and there is no reason to keep individual ownership, it remains possible to rename the project.

Use a single name

Distribute only one package (or only one module) per project, and use the package (or module) name as the project name.

  • It avoids possible confusion between project name and distributed package or module name.

  • It makes the name consistent.

  • It is explicit: when one sees the project name, one can guess the package/module name, and vice versa.

  • It also limits implicit clashes between package/module names. By using a single name, when you register a project name to PyPI [11], you also perform a basic package/module name availability verification.

    As an example, pipeline [13], python-pipeline [14] and django-pipeline [15] all distribute a package or module called "pipeline". So installing two of them leads to errors. This issue wouldn't have occurred if these distributions used a single name.

Yes:

  • Package name: "kheops.pyramid", i.e. import kheops.pyramid
  • Project name: "kheops.pyramid", i.e. pip install kheops.pyramid

No:

  • Package name: "kheops"
  • Project name: "KheopsPyramid"

Note

For historical reasons, PyPI [11] contains many distributions where project and distributed package/module names differ.

Multiple packages/modules should be rare

Technically, Python distributions can provide multiple packages and/or modules. See setup script reference [16] for details.

Some distributions actually do. As an example, setuptools [17] and distribute [18] both declare "pkg_resources", "easy_install" and "site" modules in addition to their respective "setuptools" and "distribute" packages.

Consider this use case as exceptional. In most cases, you don't need this feature. So a distribution should provide only one package or module at a time.

Distinct names should be rare

A notable exception to the Use a single name rule is when you explicitly need distinct names.

As an example, the Pillow [19] project provides an alternative to the original PIL [20] distribution. Both projects distribute a "PIL" package.

Consider this use case as exceptional. In most cases, you don't need this feature. So a distributed package name should be equal to the project name.

Follow PEP 8 for syntax of package and module names

PEP 8 [2] applies to names of Python packages and modules.

If you Use a single name, PEP 8 [2] also applies to project names. The exception is namespace packages, where dots are required in the project name.
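The PEP 8 convention (short, all-lowercase names, underscores usable in module names) can be sketched as a simple check. The regular expression below is an illustrative approximation, not an official rule, and the function name is hypothetical:

```python
import re

# Rough approximation of PEP 8's package/module naming style:
# lowercase letters, digits, and underscores, starting with a letter.
PEP8_NAME = re.compile(r"^[a-z][a-z0-9_]*$")

def is_pep8_name(name):
    """Return True if each dot-separated part looks PEP 8 compliant."""
    return all(bool(PEP8_NAME.match(part)) for part in name.split("."))

print(is_pep8_name("kheops.pyramid"))  # True
print(is_pep8_name("KheopsPyramid"))   # False
```

Dots are accepted between parts so that namespace-package project names such as "kheops.pyramid" pass the check.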

Pick memorable names

One important thing about a project name is that it be memorable.

As an example, celery [21] is not a meaningful name. At first, it is not obvious that it deals with message queuing. But it is memorable, partly because it can be used to feed a RabbitMQ [22] server.

Pick meaningful names

Ask yourself "how would I describe in one sentence what this name is for?", and then "could anyone have guessed that by looking at the name?".

As an example, DateUtils [23] is a meaningful name. It is obvious that it deals with utilities for dates.

When you are using namespaces, try to make each part meaningful.

Use packaging metadata

Consider project names as unique identifiers on PyPI:

  • it is important that these identifiers remain human-readable.
  • it is even better when these identifiers are meaningful.
  • but the primary purpose of identifiers is not to classify or describe projects.

Classifiers and keywords metadata are made for categorization of distributions. Summary and description metadata are meant to describe the project.

As an example, there is a "Framework :: Twisted [24]" classifier. Even though the names of the classified projects are quite heterogeneous (they don't follow a particular pattern), the classifier still lets us retrieve the full list.

In order to Organize community contributions, conventions about names and namespaces matter, but conventions about metadata are even more important.

As an example, we can find Plone portlets in many places:

  • plone.portlet.*
  • collective.portlet.*
  • collective.portlets.*
  • collective.*.portlets
  • some vendor-related projects such as "quintagroup.portlet.cumulus"
  • and even projects where "portlet" pattern doesn't appear in the name.

Even though the Plone community has conventions, using the name to categorize distributions is inappropriate. It's impossible to get the full list of distributions that provide portlets for Plone by filtering on names. But it would be possible if all these distributions used the "Framework :: Plone" classifier and the "portlet" keyword.

Avoid deep nesting

The Zen of Python [25] says "Flat is better than nested".

Two levels is almost always enough

Don't define everything in deeply nested hierarchies: you will end up with projects and packages like "pythonsport.common.maps.forest". This type of name is both verbose and cumbersome (e.g. if you have many imports from the package).

Furthermore, big hierarchies tend to break down over time as the boundaries between different packages blur.

The consensus is that two levels of nesting are preferred.

For example, we have plone.principalsource instead of plone.source.principal or something like that. The name is shorter, the package structure is simpler, and there would be very little to gain from having three levels of nesting here. It would be impractical to try to put all "core Plone" sources (a source is kind of vocabulary) into the plone.source.* namespace, in part because some sources are part of other packages, and in part because sources already exist in other places. Had we made a new namespace, it would be inconsistently used from the start.

Yes: "pyranha"

Yes: "pythonsport.climbing"

Yes: "pythonsport.forestmap"

No: "pythonsport.maps.forest"

Use only one level for ownership

Don't use 3 levels to set individual/organization ownership in a community namespace.

As an example, let's consider:

  • you are plugging into a community namespace, such as "collective".
  • and you want to add a more restrictive "ownership" level, to avoid clashes inside the community.

In such a case, you'd better use the most restrictive ownership level as the first level.

As an example, where "collective" is a major community namespace that "gergovie" belongs to, and "vercingetorix" is the name of "gergovie"'s author:

No: "collective.vercingetorix.gergovie"

Yes: "vercingetorix.gergovie"

Don't use more than 3 levels

Technically, you can create deeply nested hierarchies. However, in most cases, you shouldn't need it.

Note

Even communities where namespaces are standard don't use more than 3 levels.

Register names with PyPI

PyPI [11] is the central place for distributions in the Python community. So it is also the place to register project and package names.

See Registering with the Package Index [27] for details.

Recipes

The following recipes will help you follow the guidelines and conventions above.

How to check for name availability?

Before you choose a project name, make sure it hasn't already been registered in the following locations:

  • PyPI [11]
  • that's all. PyPI is the only official place.

You could also check various other locations, such as popular code hosting services, but keep in mind that PyPI is the only place where you can register names in the Python community.

That's why it is important that you register names with PyPI.

Also make sure the names of distributed packages or modules haven't already been registered:

The use a single name rule also helps you avoid clashes with package names: if a project name is available, then the package name has a good chance of being available too.

How to rename a project?

Renaming a project is possible, but keep in mind that it will cause some confusion. So pay particular attention to the README and documentation, so that users understand what happened.

  1. First of all, do not remove legacy distributions from PyPI, because some users may still be using them.
  2. Copy the legacy project, then change names (project and package/module). Pay attention to, at least:
    • packaging files,
    • folder name that contains source files,
    • documentation, including README,
    • import statements in code.
  3. Assign Obsoletes-Dist metadata to the new distribution in the setup.cfg file. See PEP 345 about Obsoletes-Dist [29] and the setup.cfg specification [30].
  4. Release a new version of the renamed project, then publish it.
  5. Edit legacy project:
    • add dependency to new project,
    • drop everything except packaging stuff,
    • add the Development Status :: 7 - Inactive classifier in setup script,
    • publish a new release.

So, users of the legacy package:

  • can continue using the legacy distributions at a deprecated version,
  • can upgrade to the latest version of the legacy distribution, which is empty...
  • ... and automatically download new distribution as a dependency of the legacy one.

Users who discover the legacy project see it is inactive.
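Step 3 of the recipe above might look like the following setup.cfg fragment. The project names and version pins here are purely hypothetical; the section and field names follow the distutils2 setup.cfg specification [30]:

```ini
# Hypothetical setup.cfg for the renamed project "gergovie",
# declaring that it obsoletes the legacy "collective.gergovie" project.
[metadata]
name = gergovie
version = 2.0
obsoletes_dist = collective.gergovie (<2.0)
```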

Improved handling of renamed projects on PyPI

If many projects follow Renaming howto recipe, then many legacy distributions will have the following characteristics:

  • Development Status :: 7 - Inactive classifier.
  • latest version is empty, except packaging stuff.
  • latest version "redirects" to another distribution. E.g. it has a single dependency on the renamed project.
  • referenced as Obsoletes-Dist in a newer distribution.

This would make it possible to detect renamed projects and improve readability on PyPI, so that users can focus on active distributions. But this feature is not required now; there is no urgency, and it won't be covered in this document.

How to apply naming guidelines on existing projects?

There is no obligation for existing projects to be renamed. The choice is left to project authors and maintainers for obvious reasons.

However, project authors are invited to:

State about current naming

The important thing, at first, is that you state your current naming choices:

  • Ask yourself "why did I choose the current name?", then document it.
  • If there are differences with the guidelines provided in this document, you should tell your users.
  • If possible, create issues in the project's bugtracker, at least for record. Then you are free to resolve them later, or maybe mark them as "wontfix".

Projects that are meant to receive contributions from the community should also organize community contributions.

Promote migrations

Every Python developer should migrate whenever possible, or promote the migrations in their respective communities.

Apply these guidelines to your projects, and the community will see it is safe.

In particular, "leaders" such as authors of popular projects are influential; they have power and, thus, responsibility within their communities.

Apply these guidelines to popular projects, and communities will adopt the conventions too.

Projects should promote migrations when they release a new (major) version, particularly if this version introduces support for Python 3.x, the new standard library packaging, or namespace packages.

Opportunity

As Python 3.3 is being developed:

  • many projects are not yet Python 3.x compatible, including "big" products and frameworks. This means that many projects will have to migrate to support Python 3.x.
  • packaging (aka distutils2) is on the starting blocks. When it is released, projects will be invited to migrate and use new packaging.
  • PEP 420 [4] brings official support of namespace packages to Python.

This means that most active projects should be migrating over the next year(s) to support Python 3.x, the new packaging, or the new namespace packages.

Such an opportunity is unique and won't come again soon! So let's introduce and promote naming conventions as soon as possible (i.e. now).

References

Additional background:

References and footnotes:

[1]http://docs.python.org/dev/packaging/introduction.html#general-python-terminology
[2](1, 2, 3) http://www.python.org/dev/peps/pep-0008/#package-and-module-names
[3]http://www.python.org/dev/peps/pep-0345/
[4](1, 2) http://www.python.org/dev/peps/pep-0420/
[5]http://www.python.org/dev/peps/pep-3108/
[6]http://www.python.org/community/
[7]http://pypi.python.org/pypi/gp.fileupload/
[8]http://pypi.python.org/pypi/zest.releaser/
[9]http://djangoproject.com/
[10](1, 2) http://sphinx.pocoo.org
[11](1, 2, 3, 4) http://pypi.python.org
[12]http://pypi.python.org/pypi/collective.recaptcha/
[13]http://pypi.python.org/pypi/pipeline/
[14]http://pypi.python.org/pypi/python-pipeline/
[15]http://pypi.python.org/pypi/django-pipeline/
[16]http://docs.python.org/dev/packaging/setupscript.html
[17]http://pypi.python.org/pypi/setuptools
[18]http://packages.python.org/distribute/
[19]http://pypi.python.org/pypi/Pillow/
[20]http://pypi.python.org/pypi/PIL/
[21]http://pypi.python.org/pypi/celery/
[22]http://www.rabbitmq.com
[23]http://pypi.python.org/pypi/DateUtils/
[24]http://pypi.python.org/pypi?:action=browse&show=all&c=525
[25]http://www.python.org/dev/peps/pep-0020/
[26]http://plone.org/community/develop
[27]http://docs.python.org/dev/packaging/packageindex.html
[28]http://docs.python.org/library/index.html
[29]http://www.python.org/dev/peps/pep-0345/#obsoletes-dist-multiple-use
[30]http://docs.python.org/dev/packaging/setupcfg.html
[31]http://www.martinaspeli.net/articles/the-naming-of-things-package-names-and-namespaces
[32]http://docs.python.org/dev/packaging/
[33]http://guide.python-distribute.org/specification.html#naming-specification

pep-0424 A method for exposing a length hint

PEP:424
Title:A method for exposing a length hint
Version:$Revision$
Last-Modified:$Date$
Author:Alex Gaynor <alex.gaynor at gmail.com>
Status:Final
Type:Standards Track
Content-Type:text/x-rst
Created:14-July-2012
Python-Version:3.4
Post-History:http://mail.python.org/pipermail/python-dev/2012-July/120920.html

Abstract

CPython currently defines a __length_hint__ method on several types, such as various iterators. This method is then used by various other functions (such as list) to presize lists based on the estimate returned by __length_hint__. Types which are not sized, and thus should not define __len__, can then define __length_hint__, to allow estimating or computing a size (such as many iterators).

Specification

This PEP formally documents __length_hint__ for other interpreters and non-standard-library Python modules to implement.

__length_hint__ must return an integer (else a TypeError is raised) or NotImplemented, and is not required to be accurate. It may return a value that is either larger or smaller than the actual size of the container. A return value of NotImplemented indicates that there is no finite length estimate. It may not return a negative value (else a ValueError is raised).

In addition, a new function operator.length_hint is added, with the following semantics (which define how __length_hint__ should be used):

def length_hint(obj, default=0):
    """Return an estimate of the number of items in obj.

    This is useful for presizing containers when building from an
    iterable.

    If the object supports len(), the result will be
    exact. Otherwise, it may over- or under-estimate by an
    arbitrary amount. The result will be an integer >= 0.
    """
    try:
        return len(obj)
    except TypeError:
        try:
            get_hint = type(obj).__length_hint__
        except AttributeError:
            return default
        try:
            hint = get_hint(obj)
        except TypeError:
            return default
        if hint is NotImplemented:
            return default
        if not isinstance(hint, int):
            raise TypeError("Length hint must be an integer, not %r" %
                            type(hint))
        if hint < 0:
            raise ValueError("__length_hint__() should return >= 0")
        return hint
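The protocol can be exercised with operator.length_hint (available since Python 3.4). The Squares class below is a hypothetical example, not part of the PEP: an unsized iterator that advertises its remaining length via __length_hint__:

```python
import operator

class Squares:
    """Unsized iterator over the first n squares; since it has no
    __len__, consumers fall back to __length_hint__."""
    def __init__(self, n):
        self.i, self.n = 0, n

    def __iter__(self):
        return self

    def __next__(self):
        if self.i >= self.n:
            raise StopIteration
        self.i += 1
        return (self.i - 1) ** 2

    def __length_hint__(self):
        # Number of remaining items; hints are allowed to be inexact.
        return self.n - self.i

sq = Squares(4)
print(operator.length_hint(sq))  # 4
print(list(sq))                  # [0, 1, 4, 9]
print(operator.length_hint(sq))  # 0 (iterator exhausted)
```

This is exactly the situation list() exploits: it can presize its backing storage to 4 slots before consuming the iterator.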

Rationale

Being able to pre-allocate lists based on the expected size, as estimated by __length_hint__, can be a significant optimization. CPython has been observed to run some code faster than PyPy, purely because of this optimization being present.

pep-0425 Compatibility Tags for Built Distributions

PEP:425
Title:Compatibility Tags for Built Distributions
Version:$Revision$
Last-Modified:07-Aug-2012
Author:Daniel Holth <dholth at gmail.com>
BDFL-Delegate:Nick Coghlan <ncoghlan@gmail.com>
Status:Accepted
Type:Standards Track
Content-Type:text/x-rst
Created:27-Jul-2012
Python-Version:3.4
Post-History:8-Aug-2012, 18-Oct-2012, 15-Feb-2013
Resolution:http://mail.python.org/pipermail/python-dev/2013-February/124116.html

Abstract

This PEP specifies a tagging system to indicate with which versions of Python a built or binary distribution is compatible. A set of three tags indicate which Python implementation and language version, ABI, and platform a built distribution requires. The tags are terse because they will be included in filenames.

PEP Acceptance

This PEP was accepted by Nick Coghlan on 17th February, 2013.

Rationale

Today "python setup.py bdist" generates the same filename on PyPy and CPython, but an incompatible archive, making it inconvenient to share built distributions in the same folder or index. Instead, built distributions should have a file naming convention that includes enough information to decide whether or not a particular archive is compatible with a particular implementation.

Previous efforts come from a time where CPython was the only important implementation and the ABI was the same as the Python language release. This specification improves upon the older schemes by including the Python implementation, language version, ABI, and platform as a set of tags.

By comparing the tags it supports with the tags listed by the distribution, an installer can make an educated decision about whether to download a particular built distribution without having to read its full metadata.

Overview

The tag format is {python tag}-{abi tag}-{platform tag}

python tag
‘py27’, ‘cp33’
abi tag
‘cp32dmu’, ‘none’
platform tag
‘linux_x86_64’, ‘any’

For example, the tag py27-none-any indicates compatibility with Python 2.7 (any Python 2.7 implementation) with no ABI requirement, on any platform.

Use

The wheel built package format includes these tags in its filenames, of the form {distribution}-{version}(-{build tag})?-{python tag}-{abi tag}-{platform tag}.whl. Other package formats may have their own conventions.
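A minimal sketch of pulling the three tags out of a wheel filename (this is an illustration, not the official wheel implementation; it assumes no build tag is present and that the distribution name contains no hyphens):

```python
def parse_wheel_filename(filename):
    """Split '{distribution}-{version}-{python}-{abi}-{platform}.whl'
    into its components. Assumes no optional build tag."""
    stem = filename[:-len(".whl")]
    distribution, version, pytag, abitag, platformtag = stem.split("-")
    return {"distribution": distribution, "version": version,
            "python": pytag, "abi": abitag, "platform": platformtag}

print(parse_wheel_filename("beaglevote-1.2.0-py2.py3-none-any.whl"))
```

Note that compressed tag sets such as py2.py3 survive the split intact, because '.' is not a component separator at this level.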

Details

Python Tag

The Python tag indicates the implementation and version required by a distribution. Major implementations have abbreviated codes, initially:

  • py: Generic Python (does not require implementation-specific features)
  • cp: CPython
  • ip: IronPython
  • pp: PyPy
  • jy: Jython

Other Python implementations should use sys.implementation.name.

The version is py_version_nodot. CPython gets away with no dot, but if one is needed, the underscore _ is used instead. PyPy should probably use its own versions here, e.g. pp18, pp19.

The version can be just the major version, 2 or 3 (py2, py3), for many pure-Python distributions.

Importantly, major-version-only tags like py2 and py3 are not shorthand for py20 and py30. Instead, these tags mean the packager intentionally released a cross-version-compatible distribution.

A single-source Python 2/3 compatible distribution can use the compound tag py2.py3. See Compressed Tag Sets, below.

ABI Tag

The ABI tag indicates which Python ABI is required by any included extension modules. For implementation-specific ABIs, the implementation is abbreviated in the same way as the Python Tag, e.g. cp33d would be the CPython 3.3 ABI with debugging.

The CPython stable ABI is abi3 as in the shared library suffix.

Implementations with a very unstable ABI may use the first 6 bytes (as 8 base64-encoded characters) of the SHA-256 hash of their source code revision and compiler flags, etc., but will probably not have a great need to distribute binary distributions. Each implementation's community may decide how best to use the ABI tag.

Platform Tag

The platform tag is simply distutils.util.get_platform() with all hyphens - and periods . replaced with underscore _.

  • win32
  • linux_i386
  • linux_x86_64
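The normalization rule above can be sketched in a few lines. This uses sysconfig.get_platform(), which returns the same platform string as distutils.util.get_platform() (the function name platform_tag is an illustrative choice, not an API from this PEP):

```python
import sysconfig

def platform_tag():
    """Normalize the platform string into a tag: replace all
    hyphens and periods with underscores."""
    return sysconfig.get_platform().replace("-", "_").replace(".", "_")

print(platform_tag())  # e.g. 'linux_x86_64' or 'win32'
```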

Use

The tags are used by installers to decide which built distribution (if any) to download from a list of potential built distributions. The installer maintains a list of (pyver, abi, arch) tuples that it will support. If the built distribution's tag is in the list, then it can be installed.

It is recommended that installers try to choose the most feature complete built distribution available (the one most specific to the installation environment) by default before falling back to pure Python versions published for older Python releases. Installers are also recommended to provide a way to configure and re-order the list of allowed compatibility tags; for example, a user might accept only the *-none-any tags to only download built packages that advertise themselves as being pure Python.

Another desirable installer feature might be to include "re-compile from source if possible" as more preferable than some of the compatible but legacy pre-built options.

This example list is for an installer running under CPython 3.3 on a linux_x86_64 system. It is in order from most-preferred (a distribution with a compiled extension module, built for the current version of Python) to least-preferred (a pure-Python distribution built with an older version of Python):

  1. cp33-cp33m-linux_x86_64
  2. cp33-abi3-linux_x86_64
  3. cp3-abi3-linux_x86_64
  4. cp33-none-linux_x86_64*
  5. cp3-none-linux_x86_64*
  6. py33-none-linux_x86_64*
  7. py3-none-linux_x86_64*
  8. cp33-none-any
  9. cp3-none-any
  10. py33-none-any
  11. py3-none-any
  12. py32-none-any
  13. py31-none-any
  14. py30-none-any
  (*) Built distributions may be platform specific for reasons other than C extensions, such as by including a native executable invoked as a subprocess.

Sometimes there will be more than one supported built distribution for a particular version of a package. For example, a packager could release a package tagged cp33-abi3-linux_x86_64 that contains an optional C extension and the same distribution tagged py3-none-any that does not. The index of the tag in the supported tags list breaks the tie, and the package with the C extension is installed in preference to the package without because that tag appears first in the list.
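The tie-breaking rule described above can be sketched as follows. Among compatible candidates, the one whose tag appears earliest in the installer's supported-tags list wins (pick_build is an illustrative name, not a real installer API):

```python
def pick_build(supported_tags, candidate_tags):
    """Return the candidate tag with the lowest index in the
    installer's ordered supported-tags list, or None if nothing
    is compatible."""
    ranked = [tag for tag in candidate_tags if tag in supported_tags]
    if not ranked:
        return None
    return min(ranked, key=supported_tags.index)

supported = ["cp33-cp33m-linux_x86_64", "cp33-abi3-linux_x86_64",
             "py3-none-any"]
# The C-extension build wins over the pure-Python one:
print(pick_build(supported, ["py3-none-any", "cp33-abi3-linux_x86_64"]))
```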

Compressed Tag Sets

To allow for compact filenames of bdists that work with more than one compatibility tag triple, each tag in a filename can instead be a '.'-separated, sorted, set of tags. For example, pip, a pure-Python package that is written to run under Python 2 and 3 with the same source code, could distribute a bdist with the tag py2.py3-none-any. The full list of simple tags is:

for x in pytag.split('.'):
    for y in abitag.split('.'):
        for z in archtag.split('.'):
            yield '-'.join((x, y, z))

A bdist format that implements this scheme should include the expanded tags in bdist-specific metadata. This compression scheme can generate large numbers of unsupported tags and "impossible" tags that are supported by no Python implementation e.g. "cp33-cp31u-win64", so use it sparingly.
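The expansion loop above can be packaged into a self-contained function that takes a whole compressed tag and yields the simple tags it covers:

```python
def expand(compressed_tag):
    """Expand a compressed tag triple into its simple tags, using
    the same triple loop as the specification above."""
    pytag, abitag, archtag = compressed_tag.split("-")
    for x in pytag.split("."):
        for y in abitag.split("."):
            for z in archtag.split("."):
                yield "-".join((x, y, z))

print(list(expand("py2.py3-none-any")))
# ['py2-none-any', 'py3-none-any']
```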

FAQ

What tags are used by default?
Tools should use the most-preferred architecture dependent tag e.g. cp33-cp33m-win32 or the most-preferred pure python tag e.g. py33-none-any by default. If the packager overrides the default it indicates that they intended to provide cross-Python compatibility.
What tag do I use if my distribution uses a feature exclusive to the newest version of Python?
Compatibility tags aid installers in selecting the most compatible build of a single version of a distribution. For example, when there is no Python 3.3 compatible build of beaglevote-1.2.0 (it uses a Python 3.4 exclusive feature) it may still use the py3-none-any tag instead of the py34-none-any tag. A Python 3.3 user must combine other qualifiers, such as a requirement for the older release beaglevote-1.1.0 that does not use the new feature, to get a compatible build.
Why isn't there a . in the Python version number?
CPython has lasted 20+ years without a 3-digit major release. This should continue for some time. Other implementations may use _ as a delimiter, since both - and . delimit the surrounding filename.
Why normalise hyphens and other non-alphanumeric characters to underscores?
To avoid conflicting with the "." and "-" characters that separate components of the filename, and for better compatibility with the widest range of filesystem limitations for filenames (including being usable in URL paths without quoting).
Why not use special character <X> rather than "." or "-"?
Either because that character is inconvenient or potentially confusing in some contexts (for example, "+" must be quoted in URLs, "~" is used to denote the user's home directory in POSIX), or because the advantages weren't sufficiently compelling to justify changing the existing reference implementation for the wheel format defined in PEP 427 (for example, using "," rather than "." to separate components in a compressed tag).
Who will maintain the registry of abbreviated implementations?
New two-letter abbreviations can be requested on the python-dev mailing list. As a rule of thumb, abbreviations are reserved for the current 4 most prominent implementations.
Does the compatibility tag go into METADATA or PKG-INFO?
No. The compatibility tag is part of the built distribution's metadata. METADATA / PKG-INFO should be valid for an entire distribution, not a single build of that distribution.
Why didn't you mention my favorite Python implementation?
The abbreviated tags facilitate sharing compiled Python code in a public index. Your Python implementation can use this specification too, but with longer tags. Recall that all "pure Python" built distributions just use 'py'.
Why is the ABI tag (the second tag) sometimes "none" in the reference implementation?
Since Python 2 does not have an easy way to get the SOABI (the concept comes from newer versions of Python 3), the reference implementation at the time of writing guesses "none". Ideally it would detect "py27(d|m|u)", analogous to newer versions of Python, but in the meantime "none" is a good enough way to say "don't know".

Acknowledgements

The author thanks Paul Moore, Nick Coghlan, Mark Abramowitz, and Mr. Michele Lacchia for their valuable help and advice.

pep-0426 Metadata for Python Software Packages 2.0

PEP:426
Title:Metadata for Python Software Packages 2.0
Version:$Revision$
Last-Modified:$Date$
Author:Nick Coghlan <ncoghlan at gmail.com>, Daniel Holth <dholth at gmail.com>, Donald Stufft <donald at stufft.io>
BDFL-Delegate:Nick Coghlan <ncoghlan@gmail.com>
Discussions-To:Distutils SIG <distutils-sig at python.org>
Status:Draft
Type:Standards Track
Content-Type:text/x-rst
Requires:440
Created:30 Aug 2012
Post-History:14 Nov 2012, 5 Feb 2013, 7 Feb 2013, 9 Feb 2013, 27 May 2013, 20 Jun 2013, 23 Jun 2013, 14 Jul 2013, 21 Dec 2013
Replaces:345

Contents

Abstract

This PEP describes a mechanism for publishing and exchanging metadata related to Python distributions. It includes specifics of the field names, and their semantics and usage.

This document specifies version 2.0 of the metadata format. Version 1.0 is specified in PEP 241. Version 1.1 is specified in PEP 314. Version 1.2 is specified in PEP 345.

Version 2.0 of the metadata format migrates from a custom key-value format to a JSON-compatible in-memory representation.

This version also adds fields designed to make third-party packaging of Python software easier, defines a formal extension mechanism, and adds support for optional dependencies. Finally, this version addresses several issues with the previous iteration of the standard version identification scheme.

Note

"I" in this doc refers to Nick Coghlan. Daniel and Donald either wrote or contributed to earlier versions, and have been providing feedback as this JSON-based rewrite has taken shape. Daniel and Donald have also been vetting the proposal as we go to ensure it is practical to implement for both clients and index servers.

Metadata 2.0 represents a major upgrade to the Python packaging ecosystem, and attempts to incorporate experience gained over the 15 years(!) since distutils was first added to the standard library. Some of that is just incorporating existing practices from setuptools/pip/etc, some of it is copying from other distribution systems (like Linux distros or other development language communities) and some of it is attempting to solve problems which haven't yet been well solved by anyone (like supporting clean conversion of Python source packages to distro policy compliant source packages for at least Debian and Fedora, and perhaps other platform specific distribution systems).

There will eventually be a suite of PEPs covering various aspects of the metadata 2.0 format and related systems:

  • this PEP, covering the core metadata format
  • PEP 440, covering the versioning identification and selection scheme
  • PEP 459, covering several standard extensions
  • a yet-to-be-written PEP to define v2.0 of the sdist format
  • an updated wheel PEP (v1.1) to add pydist.json (and possibly convert the wheel metadata file from Key:Value to JSON)
  • an updated installation database PEP to add pydist.json
  • a PEP to standardise the expected command line interface for setup.py as an interface to an application's build system (rather than requiring that the build system support the distutils command system)

It's going to take a while to work through all of these and make them a reality. The main change from our last attempt at this is that we're trying to design the different pieces so we can implement them independently of each other, without requiring users to switch to a whole new tool chain (although they may have to upgrade their existing ones to start enjoying the benefits in their own work).

Many of the inline notes in this version of the PEP are there to aid reviewers that are familiar with the old metadata standards. Before this version is finalised, most of that content will be moved down to the "rationale" section at the end of the document, as it would otherwise be an irrelevant distraction for future readers.

Purpose

The purpose of this PEP is to define a common metadata interchange format for communication between software publication tools and software integration tools in the Python ecosystem. One key aim is to support full dependency analysis in that ecosystem without requiring the execution of arbitrary Python code by those doing the analysis. Another aim is to encourage good software distribution practices by default, while continuing to support the current practices of almost all existing users of the Python Package Index (both publishers and integrators). Finally, the aim is to support an upgrade path from the existing setuptools defined dependency and entry point metadata formats that is transparent to end users.

The design draws on the Python community's 15 years of experience with distutils based software distribution, and incorporates ideas and concepts from other distribution systems, including Python's setuptools, pip and other projects, Ruby's gems, Perl's CPAN, Node.js's npm, PHP's composer and Linux packaging systems such as RPM and APT.

While the specifics of this format are aimed at the Python ecosystem, some of the ideas may also be useful in the future evolution of other dependency management ecosystems.

A Note on Time Frames

There's a lot of work going on in the Python packaging space at the moment. In the near term (up until the release of Python 3.4), those efforts are focused on the existing metadata standards, both those defined in Python Enhancement Proposals, and the de facto standards defined by the setuptools project.

This PEP is about setting out a longer term goal for the ecosystem that captures those existing capabilities in a format that is easier to work with. There are still a number of key open questions (mostly related to source based distribution), and those won't be able to receive proper attention from the development community until the other near term concerns have been resolved.

At this point in time, the PEP is quite possibly still overengineered, as we're still trying to make sure we have all the use cases covered. The "transparent upgrade path from setuptools" goal brings in a lot of required functionality though, and then the aim of supporting automated creation of policy compliant downstream packages for Linux distributions adds more. However, we've at least reached the point where we're taking a critical look at the core metadata, and are pushing as much functionality out to standard metadata extensions as we can.

Development, Distribution and Deployment of Python Software

The metadata design in this PEP is based on a particular conceptual model of the software development and distribution process. This model consists of the following phases:

  • Software development: this phase involves working with a source checkout for a particular application to add features and fix bugs. It is expected that developers in this phase will need to be able to build the software, run the software's automated test suite, run project specific utility scripts and publish the software.
  • Software publication: this phase involves taking the developed software and making it available for use by software integrators. This includes creating the descriptive metadata defined in this PEP, as well as making the software available (typically by uploading it to an index server).
  • Software integration: this phase involves taking published software components and combining them into a coherent, integrated system. This may be done directly using Python specific cross-platform tools, or it may be handled through conversion to development language neutral platform specific packaging systems.
  • Software deployment: this phase involves taking integrated software components and deploying them on to the target system where the software will actually execute.

The publication and integration phases are collectively referred to as the distribution phase, and the individual software components distributed in that phase are formally referred to as "distributions", but are more colloquially known as "packages" (relying on context to disambiguate them from the "module with submodules" kind of Python package).

The exact details of these phases will vary greatly for particular use cases. Deploying a web application to a public Platform-as-a-Service provider, publishing a new release of a web framework or scientific library, creating an integrated Linux distribution or upgrading a custom application running in a secure enclave are all situations this metadata design should be able to handle.

The complexity of the metadata described in this PEP thus arises directly from the actual complexities associated with software development, distribution and deployment in a wide range of scenarios.

Supporting definitions

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119.

"Projects" are software components that are made available for integration. Projects include Python libraries, frameworks, scripts, plugins, applications, collections of data or other resources, and various combinations thereof. Public Python projects are typically registered on the Python Package Index [3].

"Releases" are uniquely identified snapshots of a project.

"Distributions" are the packaged files which are used to publish and distribute a release.

Depending on context, "package" may refer to either a distribution, or to an importable Python module that has a __path__ attribute and hence may also have importable submodules.

"Source archive" and "VCS checkout" both refer to the raw source code for a release, prior to creation of an sdist or binary archive.

An "sdist" is a publication format providing the distribution metadata and any source files that are essential to creating a binary archive for the distribution. Creating a binary archive from an sdist requires that the appropriate build tools be available on the system.

"Binary archives" only require that prebuilt files be moved to the correct location on the target system. As Python is a dynamically bound cross-platform language, many so-called "binary" archives will contain only pure Python source code.

"Contributors" are individuals and organizations that work together to develop a software component.

"Publishers" are individuals and organizations that make software components available for integration (typically by uploading distributions to an index server).

"Integrators" are individuals and organizations that incorporate published distributions as components of an application or larger system.

"Build tools" are automated tools intended to run on development systems, producing source and binary distribution archives. Build tools may also be invoked by integration tools in order to build software distributed as sdists rather than prebuilt binary archives.

"Index servers" are active distribution registries which publish version and dependency metadata and place constraints on the permitted metadata.

"Public index servers" are index servers which allow distribution uploads from untrusted third parties. The Python Package Index [3] is a public index server.

"Publication tools" are automated tools intended to run on development systems and upload source and binary distribution archives to index servers.

"Integration tools" are automated tools that consume the metadata and distribution archives published by an index server or other designated source, and make use of them in some fashion, such as installing them or converting them to a platform specific packaging format.

"Installation tools" are integration tools specifically intended to run on deployment targets, consuming source and binary distribution archives from an index server or other designated location and deploying them to the target system.

"Automated tools" is a collective term covering build tools, index servers, publication tools, integration tools and any other software that produces or consumes distribution version and dependency metadata.

"Legacy metadata" refers to earlier versions of this metadata specification, along with the supporting metadata file formats defined by the setuptools project.

"Distro" is used as the preferred term for Linux distributions, to help avoid confusion with the Python-specific meaning of the term "distribution".

"Dist" is the preferred abbreviation for "distributions" in the sense defined in this PEP.

"Qualified name" is a dotted Python identifier. For imported modules and packages, the qualified name is available as the __name__ attribute, while for functions and classes it is available as the __qualname__ attribute.

A "fully qualified name" uniquely locates an object in the Python module namespace. For imported modules and packages, it is the same as the qualified name. For other Python objects, the fully qualified name consists of the qualified name of the containing module or package, a colon (:) and the qualified name of the object relative to the containing module or package.

A "prefixed name" starts with a qualified name, but is not necessarily a qualified name - it may contain additional dot separated segments which are not valid identifiers.
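The relationship between these kinds of names can be illustrated with a short sketch (the fully_qualified_name helper is a hypothetical illustration, not part of the specification):

```python
class ComfyChair:
    def warm_up(self):
        pass

# The qualified name of a class or function is its __qualname__ attribute
assert ComfyChair.warm_up.__qualname__ == "ComfyChair.warm_up"

# A fully qualified name joins the containing module's qualified name
# and the object's qualified name with a colon
def fully_qualified_name(obj):
    return f"{obj.__module__}:{obj.__qualname__}"

# e.g. "comfy.chairs:ComfyChair.warm_up" when defined in a
# comfy.chairs module
print(fully_qualified_name(ComfyChair.warm_up))
```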

Integration and deployment of distributions

The primary purpose of the distribution metadata is to support integration and deployment of distributions as part of larger applications and systems.

Integration and deployment can in turn be broken down into further substeps.

  • Build: the build step is the process of turning a VCS checkout, source archive or sdist into a binary archive. Dependencies must be available in order to build and create a binary archive of the distribution (including any documentation that is installed on target systems).
  • Installation: the installation step involves getting the distribution and all of its runtime dependencies onto the target system. In this step, the distribution may already be on the system (when upgrading or reinstalling) or else it may be a completely new installation.
  • Runtime: this is normal usage of a distribution after it has been installed on the target system.

These three steps may all occur directly on the target system. Alternatively the build step may be separated out by using binary archives provided by the publisher of the distribution, or by creating the binary archives on a separate system prior to deployment. The advantage of the latter approach is that it minimizes the dependencies that need to be installed on deployment targets (as the build dependencies will be needed only on the build systems).

The published metadata for distributions SHOULD allow integrators, with the aid of build and integration tools, to:

  • obtain the original source code that was used to create a distribution
  • identify and retrieve the dependencies (if any) required to use a distribution
  • identify and retrieve the dependencies (if any) required to build a distribution from source
  • identify and retrieve the dependencies (if any) required to run a distribution's test suite
  • find resources on using and contributing to the project
  • access sufficiently rich metadata to support contacting distribution publishers through appropriate channels, as well as finding distributions that are relevant to particular problems

Development and publication of distributions

The secondary purpose of the distribution metadata is to support effective collaboration amongst software contributors and publishers during the development phase.

The published metadata for distributions SHOULD allow contributors and publishers, with the aid of build and publication tools, to:

  • perform all the same activities needed to effectively integrate and deploy the distribution
  • identify and retrieve the additional dependencies needed to develop and publish the distribution
  • specify the dependencies (if any) required to use the distribution
  • specify the dependencies (if any) required to build the distribution from source
  • specify the dependencies (if any) required to run the distribution's test suite
  • specify the additional dependencies (if any) required to develop and publish the distribution

Standard build system

Note

The standard build system currently described in the PEP is a draft based on existing practices for projects using distutils or setuptools as their build system (or other projects, like d2to1, that expose a setup.py file for backwards compatibility with existing tools).

The specification doesn't currently cover expected argument support for the commands, which is a limitation that needs to be addressed before the PEP can be considered ready for acceptance.

It is also possible that the "meta build system" will be separated out into a distinct PEP in the coming months (similar to the separation of the versioning and requirement specification standard out to PEP 440).

If a suitable API can be worked out, then it may even be possible to switch to a more declarative API for build system specification.

Both development and integration of distributions rely on the ability to build extension modules and perform other operations in a distribution independent manner.

The current iteration of the metadata relies on the distutils/setuptools commands system to support these necessary development and integration activities:

  • python setup.py dist_info: generate distribution metadata in place given a source archive or VCS checkout
  • python setup.py sdist: create an sdist from a source archive or VCS checkout
  • python setup.py build_ext --inplace: build extension modules in place given an sdist, source archive or VCS checkout
  • python setup.py test: run the distribution's test suite in place given an sdist, source archive or VCS checkout
  • python setup.py bdist_wheel: create a binary archive from an sdist, source archive or VCS checkout

Metadata format

The format defined in this PEP is an in-memory representation of Python distribution metadata as a string-keyed dictionary. Permitted values for individual entries are strings, lists of strings, and additional nested string-keyed dictionaries.

Except where otherwise noted, dictionary keys in distribution metadata MUST be valid Python identifiers in order to support attribute based metadata access APIs.
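The identifier requirement can be checked mechanically. A minimal sketch (the helper name is hypothetical):

```python
def validate_keys(metadata):
    """Recursively reject metadata keys that are not valid Python identifiers."""
    for key, value in metadata.items():
        if not key.isidentifier():
            raise ValueError(f"metadata key is not a valid identifier: {key!r}")
        if isinstance(value, dict):
            validate_keys(value)

# Passes silently for compliant metadata
validate_keys({"metadata_version": "2.0", "name": "ComfyChair"})
```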

The individual field descriptions show examples of the key name and value as they would be serialised as part of a JSON mapping.

The fields identified as core metadata are required. Automated tools MUST NOT accept distributions with missing core metadata as valid Python distributions.

All other fields are optional. Automated tools MUST operate correctly if a distribution does not provide them, except for those operations which specifically require the omitted fields.

Automated tools MUST NOT insert dummy data for missing fields. If a valid value is not provided for a required field then the metadata and the associated distribution MUST be rejected as invalid. If a valid value is not provided for an optional field, that field MUST be omitted entirely. Automated tools MAY automatically derive valid values from other information sources (such as a version control system).

Automated tools, especially public index servers, MAY impose additional length restrictions on metadata beyond those enumerated in this PEP. Such limits SHOULD be imposed where necessary to protect the integrity of a service, based on the available resources and the service provider's judgment of reasonable metadata capacity requirements.

Metadata files

The information defined in this PEP is serialised to pydist.json files for some use cases. These are files containing UTF-8 encoded JSON metadata.

Each metadata file consists of a single serialised mapping, with fields as described in this PEP. When serialising metadata, automated tools SHOULD lexically sort any keys and list elements in order to simplify reviews of any changes.
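For example, Python's standard json module can produce the recommended lexically sorted serialisation directly (sorting of list elements, where appropriate, is left to the generating tool):

```python
import json

metadata = {
    "version": "1.0a2",
    "name": "ComfyChair",
    "metadata_version": "2.0",
}

# sort_keys=True emits keys in lexical order, keeping diffs between
# metadata revisions small and easy to review
serialised = json.dumps(metadata, sort_keys=True, indent=2)
print(serialised)
```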

There are three standard locations for these metadata files:

  • as a {distribution}-{version}.dist-info/pydist.json file in an sdist source distribution archive
  • as a {distribution}-{version}.dist-info/pydist.json file in a wheel binary distribution archive
  • as a {distribution}-{version}.dist-info/pydist.json file in a local Python installation database

Note

These locations are to be confirmed, since they depend on the definition of sdist 2.0 and the revised installation database standard. There will also be a wheel 1.1 format update after this PEP is approved that mandates provision of 2.0+ metadata.

Note that these metadata files SHOULD NOT be processed if the version of the containing location is too low to indicate that they are valid. Specifically, unversioned sdist archives, unversioned installation database directories and version 1.0 of the wheel specification do not cover pydist.json files.

Other tools involved in Python distribution MAY also use this format.

As JSON files are generally awkward to edit by hand, it is RECOMMENDED that these metadata files be generated by build tools based on other input formats (such as setup.py) rather than being used directly as a data input format. Generating the metadata as part of the publication process also helps to deal with version specific fields (including the source URL and the version field itself).

For backwards compatibility with older installation tools, metadata 2.0 files MAY be distributed alongside legacy metadata.

Index servers MAY allow distributions to be uploaded and installation tools MAY allow distributions to be installed with only legacy metadata.

Automated tools MAY attempt to automatically translate legacy metadata to the format described in this PEP. Advice for doing so effectively is given in Appendix A.

Metadata validation

A jsonschema description of the distribution metadata is available.

This schema does NOT currently handle validation of some of the more complex string fields (instead treating them as opaque strings).

Except where otherwise noted, all URL fields in the metadata MUST comply with RFC 3986.

Note

The current version of the schema file covers the previous draft of the PEP, and has not yet been updated for the split into the essential dependency resolution metadata and multiple standard extensions.

Core metadata

This section specifies the core metadata fields that are required for every Python distribution.

Publication tools MUST ensure at least these fields are present when publishing a distribution.

Index servers MUST ensure at least these fields are present in the metadata when distributions are uploaded.

Installation tools MUST refuse to install distributions with one or more of these fields missing by default, but MAY allow users to force such an installation to occur.

Metadata version

Version of the file format; "2.0" is the only legal value.

Automated tools consuming metadata SHOULD warn if metadata_version is greater than the highest version they support, and MUST fail if metadata_version has a greater major version than the highest version they support (as described in PEP 440, the major version is the value before the first dot).
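A consuming tool might implement that rule along these lines (a sketch; the function names are hypothetical):

```python
def parse_version(version):
    return tuple(int(part) for part in version.split("."))

def check_metadata_version(metadata_version, supported="2.0"):
    """Return 'ok', 'warn' or 'fail' per the compatibility rule above."""
    found = parse_version(metadata_version)
    known = parse_version(supported)
    if found[0] > known[0]:
        return "fail"   # higher major version: MUST fail
    if found > known:
        return "warn"   # higher version, same major: SHOULD warn
    return "ok"
```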

For broader compatibility, build tools MAY choose to produce distribution metadata using the lowest metadata version that includes all of the needed fields.

Example:

"metadata_version": "2.0"

Generator

Name (and optional version) of the program that generated the file, if any. A manually produced file would omit this field.

Example:

"generator": "setuptools (0.9)"

Name

The name of the distribution.

As distribution names are used in URLs, filenames, and command line parameters, and must also interoperate with other packaging systems, the permitted characters are constrained to:

  • ASCII letters ([a-zA-Z])
  • ASCII digits ([0-9])
  • underscores (_)
  • hyphens (-)
  • periods (.)

Distribution names MUST start and end with an ASCII letter or digit.

Automated tools MUST reject non-compliant names.

All comparisons of distribution names MUST be case insensitive, and MUST consider hyphens and underscores to be equivalent.
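In code, that comparison rule might look like the following sketch (the helper names are hypothetical; note that under this rule hyphens and underscores are equivalent to each other, but periods are not):

```python
def normalise_name(name):
    """Case fold and treat hyphens and underscores as equivalent."""
    return name.lower().replace("-", "_")

def names_match(a, b):
    return normalise_name(a) == normalise_name(b)
```

With this normalisation, "Comfy-Chair" and "comfy_chair" compare equal, while "ComfyChair" (no separator) does not match either of them.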

Index servers MAY consider "confusable" characters (as defined by the Unicode Consortium in TR39: Unicode Security Mechanisms) to be equivalent.

Index servers that permit arbitrary distribution name registrations from untrusted sources SHOULD consider confusable characters to be equivalent when registering new distributions (and hence reject them as duplicates).

Integration tools MUST NOT silently accept a confusable alternate spelling as matching a requested distribution name.

At time of writing, the characters in the ASCII subset designated as confusables by the Unicode Consortium are:

  • 1 (DIGIT ONE), l (LATIN SMALL LETTER L), and I (LATIN CAPITAL LETTER I)
  • 0 (DIGIT ZERO), and O (LATIN CAPITAL LETTER O)

Example:

"name": "ComfyChair"

Version

The distribution's public or local version identifier, as defined in PEP 440. Version identifiers are designed for consumption by automated tools and support a variety of flexible version specification mechanisms (see PEP 440 for details).

Version identifiers MUST comply with the format defined in PEP 440.

Version identifiers MUST be unique within each project.

Index servers MAY place restrictions on the use of local version identifiers as described in PEP 440.

Example:

"version": "1.0a2"

Summary

A short summary of what the distribution does.

This field SHOULD contain fewer than 512 characters and MUST contain fewer than 2048.

This field SHOULD NOT contain any line breaks.

A more complete description SHOULD be included as a separate file in the sdist for the distribution. Refer to the python-details extension in PEP 459 for more information.

Example:

"summary": "A module that is more fiendish than soft cushions."

Source code metadata

This section specifies fields that provide identifying details for the source code used to produce this distribution.

All of these fields are optional. Automated tools MUST operate correctly if a distribution does not provide them, including failing cleanly when an operation depending on one of these fields is requested.

Source labels

Source labels are text strings with minimal defined semantics. They are intended to allow the original source code to be unambiguously identified, even if an integrator has applied additional local modifications to a particular distribution.

To ensure source labels can be readily incorporated as part of file names and URLs, and to avoid formatting inconsistencies in hexadecimal hash representations, they MUST be limited to the following set of permitted characters:

  • Lowercase ASCII letters ([a-z])
  • ASCII digits ([0-9])
  • underscores (_)
  • hyphens (-)
  • periods (.)
  • plus signs (+)

Source labels MUST start and end with an ASCII letter or digit.

A source label for a project MUST NOT match any defined version for that project. This restriction ensures that there is no ambiguity between version identifiers and source labels.
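These constraints are straightforward to express as a validation sketch (the regular expression and helper name are illustrative, not normative):

```python
import re

# Permitted characters, with the start/end restriction from above
_SOURCE_LABEL = re.compile(r"^[a-z0-9]([a-z0-9._+-]*[a-z0-9])?$")

def is_valid_source_label(label, defined_versions=()):
    if not _SOURCE_LABEL.match(label):
        return False
    # A source label must not collide with any defined version
    return label not in defined_versions
```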

Examples:

"source_label": "1.0.0-alpha.1"

"source_label": "1.3.7+build.11.e0f985a"

"source_label": "v1.8.1.301.ga0df26f"

"source_label": "2013.02.17.dev123"

Source URL

A string containing a full URL where the source for this specific version of the distribution can be downloaded.

Source URLs MUST be unique within each project. This means that the URL can't be something like "https://github.com/pypa/pip/archive/master.zip", but instead must be "https://github.com/pypa/pip/archive/1.3.1.zip".

The source URL MUST reference either a source archive or a tag or specific commit in an online version control system that permits creation of a suitable VCS checkout. It is intended primarily for integrators that wish to recreate the distribution from the original source form.

All source URL references SHOULD specify a secure transport mechanism (such as https) AND include an expected hash value in the URL for verification purposes. If a source URL is specified without any hash information, with hash information that the tool doesn't understand, or with a selected hash algorithm that the tool considers too weak to trust, automated tools SHOULD at least emit a warning and MAY refuse to rely on the URL. If such a source URL also uses an insecure transport, automated tools SHOULD NOT rely on the URL.

It is RECOMMENDED that only hashes which are unconditionally provided by the latest version of the standard library's hashlib module be used for source archive hashes. At time of writing, that list consists of 'md5', 'sha1', 'sha224', 'sha256', 'sha384', and 'sha512'.

For source archive references, an expected hash value may be specified by including a <hash-algorithm>=<expected-hash> entry as part of the URL fragment.
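A consuming tool can recover and check such a fragment with the standard library alone. A sketch (the function names are hypothetical):

```python
import hashlib
from urllib.parse import urlsplit

def expected_hash(source_url):
    """Return (algorithm, hexdigest) from the URL fragment, or None."""
    fragment = urlsplit(source_url).fragment
    if "=" not in fragment:
        return None
    algorithm, _, digest = fragment.partition("=")
    return algorithm, digest

def verify_archive(data, source_url):
    entry = expected_hash(source_url)
    if entry is None:
        return False   # callers SHOULD at least emit a warning here
    algorithm, digest = entry
    return hashlib.new(algorithm, data).hexdigest() == digest
```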

For version control references, the VCS+protocol scheme SHOULD be used to identify both the version control system and the secure transport, and a version control system with hash based commit identifiers SHOULD be used. Automated tools MAY omit warnings about missing hashes for version control systems that do not provide hash based commit identifiers.

To handle version control systems that do not support including commit or tag references directly in the URL, that information may be appended to the end of the URL using the @<commit-hash> or the @<tag>#<commit-hash> notation.

Note

This isn't quite the same as the existing VCS reference notation supported by pip. Firstly, the distribution name is moved in front rather than embedded as part of the URL. Secondly, the commit hash is included even when retrieving based on a tag, in order to meet the requirement above that every link should include a hash to make things harder to forge (creating a malicious repo with a particular tag is easy, creating one with a specific hash, less so).

Example:

"source_url": "https://github.com/pypa/pip/archive/1.3.1.zip#sha1=da9234ee9982d4bbb3c72346a6de940a148ea686"
"source_url": "git+https://github.com/pypa/pip.git@1.3.1#7921be1537eac1e97bc40179a57f0349c2aee67d"
"source_url": "git+https://github.com/pypa/pip.git@7921be1537eac1e97bc40179a57f0349c2aee67d"

Semantic dependencies

Dependency metadata allows distributions to make use of functionality provided by other distributions, without needing to bundle copies of those distributions.

Semantic dependencies allow publishers to indicate not only which other distributions are needed, but also why they're needed. This additional information allows integrators to install just the dependencies they need for specific activities, making it easier to minimise installation footprints in constrained environments (regardless of the reasons for those constraints).

Distributions may declare five different kinds of dependency:

  • Runtime dependencies: other distributions that are needed to actually use this distribution (but are not considered subdistributions).
  • "Meta" dependencies: subdistributions that are grouped together into a single larger metadistribution for ease of reference and installation.
  • Test dependencies: other distributions that are needed to run the automated test suite for this distribution (but are not needed just to use it).
  • Build dependencies: other distributions that are needed to build this distribution.
  • Development dependencies: other distributions that are needed when working on this distribution (but do not fit into one of the other dependency categories).

Within each of these categories, distributions may also declare "Extras". Extras are dependencies that may be needed for some optional functionality, or which are otherwise complementary to the distribution.

Dependency management is heavily dependent on the version identification and specification scheme defined in PEP 440.

All of these fields are optional. Automated tools MUST operate correctly if a distribution does not provide them, by assuming that a missing field indicates "Not applicable for this distribution".

Dependency specifiers

While many dependencies will be needed to use a distribution at all, others are needed only on particular platforms or only when particular optional features of the distribution are needed. To handle this, dependency specifiers are mappings with the following subfields:

  • requires: a list of requirement specifiers
  • extra: the name of a set of optional dependencies that are requested and installed together
  • environment: an environment marker describing the environments that need these dependencies

requires is the only required subfield. When it is the only subfield, the dependencies are said to be unconditional. If extra or environment is specified, then the dependencies are conditional.

All three fields may be supplied, indicating that the dependencies are needed only when the named extra is requested in a particular environment.

Automated tools MUST combine related dependency specifiers (those with common values for extra and environment) into a single specifier listing multiple requirements when serialising metadata or passing it to an install hook.

Despite this required normalisation, the same extra name or environment marker MAY appear in multiple conditional dependencies. This may happen, for example, if an extra itself only needs some of its dependencies in specific environments. It is only the combination of extras and environment markers that is required to be unique in a list of dependency specifiers.

Any extras referenced from a dependency specifier MUST be named in the Extras field for this distribution. This helps avoid typographical errors and also makes it straightforward to identify the available extras without scanning the full set of dependencies.
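The required normalisation can be sketched as a grouping operation on the (extra, environment) pair (the function name is hypothetical):

```python
def combine_specifiers(specifiers):
    """Merge dependency specifiers that share an extra/environment pair."""
    combined = {}
    for specifier in specifiers:
        key = (specifier.get("extra"), specifier.get("environment"))
        if key not in combined:
            # Copy the specifier, starting with an empty requirements list
            combined[key] = dict(specifier, requires=[])
        combined[key]["requires"].extend(specifier["requires"])
    return list(combined.values())
```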

Requirement specifiers

Individual requirements are defined as strings containing a distribution name (as found in the name field). The distribution name may be followed by an extras specifier (enclosed in square brackets) and by a version specifier or direct reference.

Whitespace is permitted between the distribution name and an opening square bracket or parenthesis. Whitespace is also permitted between a closing square bracket and the version specifier.

See Extras (optional dependencies) for details on extras and PEP 440 for details on version specifiers and direct references.

The distribution names should correspond to names as found on the Python Package Index [3]; while these names are often the same as the module names as accessed with import x, this is not always the case (especially for distributions that provide multiple top level modules or packages).

Example requirement specifiers:

"Flask"
"Django"
"Pyramid"
"SciPy ~= 0.12"
"ComfyChair[warmup]"
"ComfyChair[warmup] > 0.1"
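A simplified parser for these forms is sketched below (illustrative only; real tools should use a full PEP 440 aware implementation):

```python
import re

_REQUIREMENT = re.compile(r"""
    ^\s* (?P<name>[A-Za-z0-9._-]+)      # distribution name
    \s* (?:\[(?P<extras>[^\]]+)\])?     # optional [extras]
    \s* (?P<version>.*?) \s*$           # optional version specifier
""", re.VERBOSE)

def parse_requirement(text):
    match = _REQUIREMENT.match(text)
    extras = match.group("extras")
    return (
        match.group("name"),
        [extra.strip() for extra in extras.split(",")] if extras else [],
        match.group("version"),
    )
```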

Mapping dependencies to development and distribution activities

The different categories of dependency are based on the various distribution and development activities identified above, and govern which dependencies should be installed for the specified activities:

  • Implied runtime dependencies:

    • run_requires
    • meta_requires
  • Implied build dependencies:

    • build_requires
    • If running the distribution's test suite as part of the build process, request the :run:, :meta:, and :test: extras to also install:
      • run_requires
      • meta_requires
      • test_requires
  • Implied development and publication dependencies:

    • run_requires
    • meta_requires
    • build_requires
    • test_requires
    • dev_requires

The notation described in Extras (optional dependencies) SHOULD be used to determine exactly what gets installed for various operations.
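The mapping above can be captured in a small table-driven sketch (the names are hypothetical) that collects only the unconditional dependencies implied by a given activity:

```python
# Which dependency fields are implied by each activity, per the list above
IMPLIED_DEPENDENCIES = {
    "runtime": ("run_requires", "meta_requires"),
    "build": ("build_requires",),
    "development": ("run_requires", "meta_requires", "build_requires",
                    "test_requires", "dev_requires"),
}

def unconditional_requirements(metadata, activity):
    requirements = []
    for field in IMPLIED_DEPENDENCIES[activity]:
        for specifier in metadata.get(field, []):
            # Skip conditional specifiers (extras and environment markers)
            if "extra" not in specifier and "environment" not in specifier:
                requirements.extend(specifier["requires"])
    return requirements
```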

Installation tools SHOULD report an error if dependencies cannot be satisfied, MUST at least emit a warning, and MAY allow the user to force the installation to proceed regardless.

See Appendix B for an overview of mapping these dependencies to an RPM spec file.

Extras

A list of optional sets of dependencies that may be used to define conditional dependencies in dependency fields. See Extras (optional dependencies) for details.

The names of extras MUST abide by the same restrictions as those for distribution names.

Example:

"extras": ["warmup"]

Run requires

A list of other distributions needed to actually run this distribution.

Automated tools MUST NOT allow strict version matching clauses or direct references in this field - if permitted at all, such clauses should appear in meta_requires instead.

Example:

"run_requires": [
  {
    "requires": ["SciPy", "PasteDeploy", "zope.interface > 3.5.0"]
  },
  {
    "requires": ["pywin32 > 1.0"],
    "environment": "sys_platform == 'win32'"
  },
  {
    "requires": ["SoftCushions"],
    "extra": "warmup"
  }
]

Meta requires

An abbreviation of "metadistribution requires". This is a list of subdistributions that can easily be installed and used together by depending on this metadistribution.

In this field, automated tools:

  • MUST allow strict version matching
  • MUST NOT allow more permissive version specifiers
  • MAY allow direct references

Public index servers SHOULD NOT allow the use of direct references in uploaded distributions. Direct references are intended primarily as a tool for software integrators rather than publishers.

Distributions that rely on direct references to platform specific binary archives SHOULD define appropriate constraints in their supports_environments field.

Example:

"meta_requires": [
  {
    "requires": ["ComfyUpholstery == 1.0a2",
                 "ComfySeatCushion == 1.0a2"]
  },
  {
    "requires": ["CupOfTeaAtEleven == 1.0a2"],
    "environment": "'linux' in sys_platform"
  }
]

Test requires

A list of other distributions needed in order to run the automated tests for this distribution.

Automated tools MAY disallow strict version matching clauses and direct references in this field and SHOULD at least emit a warning for such clauses.

Public index servers SHOULD NOT allow strict version matching clauses or direct references in this field.

Example:

"test_requires": [
  {
    "requires": ["unittest2"]
  },
  {
    "requires": ["pywin32 > 1.0"],
    "environment": "sys_platform == 'win32'"
  },
  {
    "requires": ["CompressPadding"],
    "extra": "warmup"
  }
]

Build requires

A list of other distributions needed when this distribution is being built (creating a binary archive from an sdist, source archive or VCS checkout).

Note that while these are build dependencies for the distribution being built, the installation is a deployment scenario for the dependencies.

Automated tools MAY disallow strict version matching clauses and direct references in this field and SHOULD at least emit a warning for such clauses.

Public index servers SHOULD NOT allow strict version matching clauses or direct references in this field.

Example:

"build_requires": [
  {
    "requires": ["setuptools >= 0.7"]
  },
  {
    "requires": ["pywin32 > 1.0"],
    "environment": "sys_platform == 'win32'"
  },
  {
    "requires": ["cython"],
    "extra": "c-accelerators"
  }
]

Dev requires

A list of any additional distributions needed during development of this distribution that aren't already covered by the deployment and build dependencies.

Additional dependencies that may be listed in this field include:

  • tools needed to create an sdist from a source archive or VCS checkout
  • tools needed to generate project documentation that is published online rather than distributed along with the rest of the software

Automated tools MAY disallow strict version matching clauses and direct references in this field and SHOULD at least emit a warning for such clauses.

Public index servers SHOULD NOT allow strict version matching clauses or direct references in this field.

Example:

"dev_requires": [
  {
    "requires": ["hgtools", "sphinx >= 1.0"]
  },
  {
    "requires": ["pywin32 > 1.0"],
    "environment": "sys_platform == 'win32'"
  }
]

Provides

A list of strings naming additional dependency requirements that are satisfied by installing this distribution. These strings must be of the form Name or Name (Version), as for the requires field.

While dependencies are usually resolved based on distribution names and versions, a distribution may provide additional names explicitly in the provides field.

For example, this may be used to indicate that multiple projects have been merged into and replaced by a single distribution or to indicate that this project is a substitute for another.

For instance, with distribute merged back into setuptools, the merged project is able to include a "provides": ["distribute"] entry to satisfy any projects that require the now obsolete distribution's name.

To avoid malicious hijacking of names, when interpreting metadata retrieved from a public index server, automated tools MUST NOT pay any attention to "provides" entries that do not correspond to a published distribution.

However, to appropriately handle project forks and mergers, automated tools MUST accept "provides" entries that name other distributions when the entry is retrieved from a local installation database or when there is a corresponding "obsoleted_by" entry in the metadata for the named distribution.
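The two acceptance rules above can be sketched as a single check. All parameter names here are assumptions for illustration: `published_names` is the set of distributions published on the index, and `obsoleted_by` maps a distribution name to its declared replacement string (if any):

```python
def trust_provides_entry(provider, provided_name, source,
                         published_names, obsoleted_by):
    """Decide whether a 'provides' entry naming another distribution
    should be honoured (sketch of the rules above)."""
    if source == "local_installation_db":
        # Entries from the local installation database are trusted.
        return True
    # Metadata from a public index: ignore entries that do not
    # correspond to a published distribution ...
    if provided_name not in published_names:
        return False
    # ... and otherwise accept only when the named distribution declares
    # the providing project as its replacement via obsoleted_by.
    replacement = obsoleted_by.get(provided_name)
    return bool(replacement) and replacement.split()[0] == provider
```

Under this sketch, setuptools claiming `"provides": ["distribute"]` is honoured because distribute's own metadata names setuptools in its obsoleted_by field.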

A distribution may wish to depend on a "virtual" project name, which does not correspond to any separately distributed project: such a name might be used to indicate an abstract capability which could be supplied by one of multiple projects. For example, multiple projects might supply PostgreSQL bindings for use with SQL Alchemy: each project might declare that it provides sqlalchemy-postgresql-bindings, allowing other projects to depend only on having at least one of them installed.

To handle this case in a way that doesn't allow for name hijacking, the authors of the distribution that first defines the virtual dependency should create a project on the public index server with the corresponding name, and depend on the specific distribution that should be used if no other provider is already installed. This also has the benefit of publishing the default provider in a way that automated tools will understand.

A version declaration may be supplied as part of an entry in the provides field and must follow the rules described in PEP 440. The distribution's version identifier will be implied if none is specified.

Example:

"provides": ["AnotherProject (3.4)", "virtual-package"]
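The implied-version rule can be sketched as follows (the helper name is an assumption):

```python
import re

def normalize_provides(entries, own_version):
    """Expand 'Name' / 'Name (Version)' strings into (name, version)
    pairs, implying the distribution's own version when none is given."""
    pairs = []
    for entry in entries:
        match = re.match(r'\s*([A-Za-z0-9._-]+)\s*(?:\((.+)\))?\s*$', entry)
        name, version = match.group(1), match.group(2) or own_version
        pairs.append((name, version))
    return pairs
```

Applied to the example above for a distribution at version 1.0a2, `virtual-package` picks up the implied version 1.0a2 while `AnotherProject` keeps its explicit 3.4.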

Obsoleted by

A string that indicates that this project is no longer being developed. The named project provides a substitute or replacement.

A version declaration may be supplied and must follow the rules described in PEP 440.

An inactive project may be explicitly indicated by setting this field to None (which is serialised as null in JSON as usual).

Automated tools SHOULD report a warning when installing an obsolete project.

Possible uses for this field include handling project name changes and project mergers.

For instance, with distribute merging back into setuptools, a new version of distribute may be released that depends on the new version of setuptools, and also explicitly indicates that distribute itself is now obsolete.

Note that without a corresponding provides, there is no expectation that the replacement project will be a "drop-in" replacement for the obsolete project - at the very least, upgrading to the new distribution is likely to require changes to import statements.

Examples:

"name": "BadName",
"obsoleted_by": "AcceptableName"

"name": "distribute",
"obsoleted_by": "setuptools >= 0.7"

Metadata Extensions

Extensions to the metadata MAY be present in a mapping under the extensions key. The keys MUST be valid prefixed names, while the values MUST themselves be nested mappings.

Two key names are reserved and MUST NOT be used by extensions, except as described below:

  • extension_version
  • installer_must_handle

The following example shows the python.details and python.commands standard extensions from PEP 459:

"extensions" : {
  "python.details": {
    "license": "GPL version 3, excluding DRM provisions",
    "keywords": [
      "comfy", "chair", "cushions", "too silly", "monty python"
    ],
    "classifiers": [
      "Development Status :: 4 - Beta",
      "Environment :: Console (Text Based)",
      "License :: OSI Approved :: GNU General Public License v3 (GPLv3)"
    ],
    "document_names": {
        "description": "README.rst",
        "license": "LICENSE.rst",
        "changelog": "NEWS"
    }
  },
  "python.commands": {
    "wrap_console": [{"chair": "chair:run_cli"}],
    "wrap_gui": [{"chair-gui": "chair:run_gui"}],
    "prebuilt": ["reduniforms"]
  },
}

Extension names are defined by distributions that will then make use of the additional published metadata in some way.

To reduce the chance of name conflicts, extension names SHOULD use a prefix that corresponds to a module name in the distribution that defines the meaning of the extension. This practice will also make it easier to find authoritative documentation for metadata extensions.

Metadata extensions allow development tools to record information in the metadata that may be useful during later phases of distribution, but is not essential for dependency resolution or building the software.

Extension versioning

Extensions MUST be versioned, using the extension_version key. However, if this key is omitted, then the implied version is 1.0.

Automated tools consuming extension metadata SHOULD warn if extension_version is greater than the highest version they support, and MUST fail if extension_version has a greater major version than the highest version they support (as described in PEP 440, the major version is the value before the first dot).

For broader compatibility, build tools MAY choose to produce extension metadata using the lowest metadata version that includes all of the needed fields.
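The consumer-side rules above can be sketched as a single check (function and parameter names are assumptions):

```python
import warnings

def check_extension_version(ext_version, highest_supported):
    """Fail on a newer major version of extension metadata, warn on
    any newer version (sketch of the SHOULD/MUST behaviour above)."""
    as_tuple = lambda v: tuple(int(part) for part in v.split('.'))
    if as_tuple(ext_version)[0] > as_tuple(highest_supported)[0]:
        raise ValueError(
            "cannot process extension_version %s (highest supported: %s)"
            % (ext_version, highest_supported))
    if as_tuple(ext_version) > as_tuple(highest_supported):
        warnings.warn(
            "extension_version %s is newer than the highest supported "
            "version %s" % (ext_version, highest_supported))
```

For example, a tool supporting up to 1.2 would process 1.3 with a warning but refuse 2.0 outright.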

Required extension handling

A project may consider correct handling of some extensions to be essential to correct installation of the software. This is indicated by setting the installer_must_handle field to true. Setting it to false or omitting it altogether indicates that processing the extension when installing the distribution is not considered mandatory by the developers.

Installation tools MUST fail if installer_must_handle is set to true for an extension and the tool does not have any ability to process that particular extension (whether directly or through a tool-specific plugin system).

If an installation tool encounters a required extension it doesn't understand when attempting to install from a wheel archive, it MAY fall back on attempting to install from source rather than failing entirely.
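The mandatory-extension check above amounts to a simple filter over the extensions mapping (the names here are assumptions for illustration):

```python
def unsupported_required_extensions(extensions, handled):
    """Return the names of extensions the metadata marks as mandatory
    (installer_must_handle set to true) that this tool cannot process,
    whether directly or through a plugin. Installation MUST fail when
    this list is non-empty."""
    return sorted(
        name for name, body in extensions.items()
        if body.get("installer_must_handle", False) and name not in handled
    )
```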

Extras (optional dependencies)

Extras are additional dependencies that enable an optional aspect of the distribution, often corresponding to a try: import optional_dependency ... block in the code. To support the use of the distribution with or without the optional dependencies they are listed separately from the distribution's core dependencies and must be requested explicitly, either in the dependency specifications of another distribution, or else when issuing a command to an installation tool.

Note that installation of extras is not tracked directly by installation tools: extras are merely a convenient way to indicate a set of dependencies that is needed to provide some optional functionality of the distribution. If selective installation of components is desired, then multiple distributions must be defined rather than relying on the extras system.

The names of extras MUST abide by the same restrictions as those for distribution names.

Example of a distribution with optional dependencies:

"name": "ComfyChair",
"extras": ["warmup", "c-accelerators"],
"run_requires": [
  {
    "requires": ["SoftCushions"],
    "extra": "warmup"
  }
],
"build_requires": [
  {
    "requires": ["cython"],
    "extra": "c-accelerators"
  }
]

Other distributions require the additional dependencies by placing the relevant extra names inside square brackets after the distribution name when specifying the dependency.

Extra specifications MUST allow the following additional syntax:

  • Multiple extras can be requested by separating them with a comma within the brackets.
  • The following special extras request processing of the corresponding lists of dependencies:
    • :meta: -> meta_requires
    • :run: -> run_requires
    • :test: -> test_requires
    • :build: -> build_requires
    • :dev: -> dev_requires
    • :*: -> process all dependency lists
  • The * character as an extra is a wild card that enables all of the entries defined in the distribution's extras field.
  • Extras may be explicitly excluded by prefixing their name with a - character (this is useful in conjunction with * to exclude only particular extras that are definitely not wanted, while enabling all others).
  • The - character as an extra specification indicates that the distribution itself should NOT be installed, and also disables the normally implied processing of :meta: and :run: dependencies (those may still be requested explicitly using the appropriate extra specifications).

Command line based installation tools SHOULD support this same syntax to allow extras to be requested explicitly.

The full set of dependency requirements is then based on the top level dependencies, along with those of any requested extras.
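The extras syntax above can be illustrated with a small parser sketch. The return layout and function name are assumptions, not part of the specification:

```python
def parse_extras(spec):
    """Split a dependency specification like 'Name[extra,:build:,*]'
    into the distribution name and the requested processing."""
    info = {"extras": set(), "excluded": set(),
            "lists": {"meta", "run"}, "install_dist": True}
    name, bracket, rest = spec.partition('[')
    if not bracket:
        return name, info
    for item in (part.strip() for part in rest.rstrip(']').split(',')):
        if item == '-':
            # Don't install the distribution itself; drop the implied
            # :meta: and :run: processing (they may be re-requested).
            info["install_dist"] = False
            info["lists"] -= {"meta", "run"}
        elif item == ':*:':
            info["lists"] |= {"meta", "run", "test", "build", "dev"}
        elif item.startswith(':') and item.endswith(':') and len(item) > 2:
            info["lists"].add(item.strip(':'))
        elif item.startswith('-'):
            info["excluded"].add(item[1:])
        else:
            info["extras"].add(item)   # a named extra, or the '*' wild card
    return name, info
```

For example, `ComfyChair[-,:build:,*]` resolves to: skip installing ComfyChair itself, process only the build_requires list, and enable all declared extras.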

Dependency examples (showing just the requires subfield):

"requires": ["ComfyChair[warmup]"]
    -> requires ``ComfyChair`` and ``SoftCushions``

"requires": ["ComfyChair[*]"]
    -> requires ``ComfyChair`` and ``SoftCushions``, but will also
       pick up any new extras defined in later versions

Command line examples:

pip install ComfyChair
    -> installs ComfyChair with applicable :meta: and :run: dependencies

pip install ComfyChair[*]
    -> as above, but also installs all extra dependencies

pip install ComfyChair[-,:build:,*]
    -> installs just the build dependencies with all extras

pip install ComfyChair[-,:build:,:run:,:meta:,:test:,*]
    -> as above, but also installs dependencies needed to run the tests

pip install ComfyChair[-,:*:,*]
    -> installs the full set of development dependencies, but avoids
       installing ComfyChair itself

Environment markers

An environment marker describes a condition about the current execution environment. They are used to indicate when certain dependencies are only required in particular environments, and to indicate supported platforms for distributions with additional constraints beyond the availability of a Python runtime.

Here are some examples of such markers:

"sys_platform == 'win32'"
"platform_machine == 'i386'"
"python_version == '2.4' or python_version == '2.5'"
"'linux' in sys_platform"

And here's an example of some conditional metadata for a distribution that requires PyWin32 both at runtime and buildtime when using Windows:

"name": "ComfyChair",
"run_requires": [
  {
    "requires": ["pywin32 > 1.0"],
    "environment": "sys_platform == 'win32'"
  }
],
"build_requires": [
  {
    "requires": ["pywin32 > 1.0"],
    "environment": "sys_platform == 'win32'"
  }
]

The micro-language behind this is a simple subset of Python: it compares only strings, with the == and in operators (and their opposites), and with the ability to combine expressions. Parentheses are supported for grouping.

The pseudo-grammar is

MARKER: EXPR [(and|or) EXPR]*
EXPR: ("(" MARKER ")") | (SUBEXPR [CMPOP SUBEXPR])
CMPOP: (==|!=|<|>|<=|>=|in|not in)

where SUBEXPR is either a Python string (such as '2.4', or 'win32') or one of the following marker variables:

  • python_version: '{0.major}.{0.minor}'.format(sys.version_info)
  • python_full_version: see definition below
  • os_name: os.name
  • sys_platform: sys.platform
  • platform_release: platform.release()
  • platform_version: platform.version()
  • platform_machine: platform.machine()
  • platform_python_implementation: platform.python_implementation()
  • implementation_name: sys.implementation.name
  • implementation_version: see definition below

If a particular value is not available (such as the sys.implementation subattributes in versions of Python prior to 3.3), the corresponding marker variable MUST be considered equivalent to the empty string.

Note that all subexpressions are restricted to strings or one of the marker variable names (which refer to string values), meaning that it is not possible to use other sequences like tuples or lists on the right side of the in and not in operators.

Chaining of comparison operations is permitted using the normal Python semantics of an implied and.
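Because the marker language is a strict subset of Python expression syntax, a trusted marker can be evaluated directly against a namespace of marker variables. A minimal sketch (eval() is only acceptable here because markers come from metadata the tool has chosen to trust; a production tool should implement the pseudo-grammar itself):

```python
import os
import platform
import sys

def default_environment():
    """Collect the marker variables defined by the specification
    (python_full_version and implementation_version omitted for brevity)."""
    return {
        "python_version": "{0.major}.{0.minor}".format(sys.version_info),
        "os_name": os.name,
        "sys_platform": sys.platform,
        "platform_release": platform.release(),
        "platform_version": platform.version(),
        "platform_machine": platform.machine(),
        "platform_python_implementation": platform.python_implementation(),
    }

def evaluate_marker(marker, environment=None):
    """Evaluate a marker against the given (or current) environment."""
    env = dict(environment or default_environment())
    # No builtins: only string comparisons against the marker variables.
    return bool(eval(marker, {"__builtins__": {}}, env))
```

For example, `evaluate_marker("'linux' in sys_platform", {"sys_platform": "linux2"})` is true, while the same marker is false on Windows.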

The python_full_version and implementation_version marker variables are derived from sys.version_info and sys.implementation.version respectively, in accordance with the following algorithm:

def format_full_version(info):
    version = '{0.major}.{0.minor}.{0.micro}'.format(info)
    kind = info.releaselevel
    if kind != 'final':
        version += kind[0] + str(info.serial)
    return version

python_full_version = format_full_version(sys.version_info)
implementation_version = format_full_version(sys.implementation.version)

python_full_version will typically correspond to the leading segment of sys.version.
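The algorithm can be checked against a synthetic version_info-style value. The namedtuple below is a stand-in for sys.version_info or sys.implementation.version, not a real interpreter object:

```python
from collections import namedtuple

# Stand-in for sys.version_info / sys.implementation.version
VersionInfo = namedtuple("VersionInfo",
                         "major minor micro releaselevel serial")

def format_full_version(info):
    version = '{0.major}.{0.minor}.{0.micro}'.format(info)
    kind = info.releaselevel
    if kind != 'final':
        version += kind[0] + str(info.serial)
    return version

assert format_full_version(VersionInfo(3, 3, 0, 'final', 0)) == '3.3.0'
assert format_full_version(VersionInfo(3, 4, 0, 'alpha', 2)) == '3.4.0a2'
```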

Updating the metadata specification

The metadata specification may be updated with clarifications without requiring a new PEP or a change to the metadata version.

Changing the meaning of existing fields or adding new features (other than through the extension mechanism) requires a new metadata version defined in a new PEP.

Appendix A: Conversion notes for legacy metadata

The reference implementations for converting from legacy metadata to metadata 2.0 are:

  • the wheel project, which adds the bdist_wheel command to setuptools
  • the Warehouse project, which will eventually be migrated to the Python Packaging Authority as the next generation Python Package Index implementation
  • the distlib project which is derived from the core packaging infrastructure created for the distutils2 project

Note

These tools have yet to be updated for the switch to standard extensions for several fields.

While it is expected that there may be some edge cases where manual intervention is needed for clean conversion, the specification has been designed to allow fully automated conversion of almost all projects on PyPI.

Metadata conversion (especially on the part of the index server) is a necessary step to allow installation and analysis tools to start benefiting from the new metadata format, without having to wait for developers to upgrade to newer build systems.

Appendix B: Mapping dependency declarations to an RPM SPEC file

As an example of mapping this PEP to Linux distro packages, assume an example project without any extras defined is split into 2 RPMs in a SPEC file: example and example-devel.

The meta_requires and run_requires dependencies would be mapped to the Requires dependencies for the "example" RPM (a mapping from environment markers relevant to Linux to SPEC file conditions would also allow those to be handled correctly).

The build_requires dependencies would be mapped to the BuildRequires dependencies for the "example" RPM.

All defined dependencies relevant to Linux, including those in dev_requires and test_requires would become Requires dependencies for the "example-devel" RPM.

A documentation toolchain dependency like Sphinx would either go in build_requires (for example, if man pages were included in the built distribution) or in dev_requires (for example, if the documentation is published solely through ReadTheDocs or the project website). This would be enough to allow an automated converter to map it to an appropriate dependency in the spec file.

If the project did define any extras, those could be mapped to additional virtual RPMs with appropriate BuildRequires and Requires entries based on the details of the dependency specifications. Alternatively, they could be mapped to other system package manager features (such as package lists in yum).

Other system package managers may have other options for dealing with extras (Debian packagers, for example, would have the option to map them to "Recommended" or "Suggested" package entries).

The metadata extension format should also allow distribution specific hints to be included in the upstream project metadata without needing to manually duplicate any of the upstream metadata in a distribution specific format.

Appendix C: Summary of differences from PEP 345

  • Metadata-Version is now 2.0, with semantics specified for handling version changes
  • The increasingly complex ad hoc "Key: Value" format has been replaced by a more structured JSON compatible format that is easily represented as Python dictionaries, strings, and lists.
  • Most fields are now optional and filling in dummy data for omitted fields is explicitly disallowed
  • Explicit permission for in-place clarifications without releasing a new version of the specification
  • The PEP now attempts to provide more of an explanation of why the fields exist and how they are intended to be used, rather than being a simple description of the permitted contents
  • Changed the version scheme to be based on PEP 440 rather than PEP 386
  • Added the source label mechanism as described in PEP 440
  • Support for different kinds of dependencies
  • The "Extras" optional dependency mechanism
  • A well-defined metadata extension mechanism, and migration of any fields not needed for dependency resolution to standard extensions.
  • Clarify and simplify various aspects of environment markers:
    • allow use of parentheses for grouping in the pseudo-grammar
    • consistently use underscores instead of periods in the variable names
    • allow ordered string comparisons and chained comparisons
  • New constraint mechanism to define supported environments and ensure compatibility between independently built binary components at installation time
  • Updated obsolescence mechanism
  • More flexible system for defining contact points and contributors
  • Defined a recommended set of project URLs
  • Identification of supporting documents in the dist-info directory:
    • Allows markup formats to be indicated through file extensions
    • Standardises the common practice of taking the description from README
    • Also supports inclusion of license files and changelogs
  • With all due respect to Charles Schulz and Peanuts, many of the examples have been updated to be more thematically appropriate [4] for Python ;)

The rationale for major changes is given in the following sections.

Metadata-Version semantics

The semantics of major and minor version increments are now specified, and follow the same model as the format version semantics specified for the wheel format in PEP 427: minor version increments must behave reasonably when processed by a tool that only understands earlier metadata versions with the same major version, while major version increments may include changes that are not compatible with existing tools.

The major version number of the specification has been incremented accordingly, as PEP 426 metadata obviously cannot be interpreted in accordance with earlier metadata specifications.

Whenever the major version number of the specification is incremented, it is expected that deployment will take some time, as either metadata consuming tools must be updated before other tools can safely start producing the new format, or else the sdist and wheel formats, along with the installation database definition, will need to be updated to support provision of multiple versions of the metadata in parallel.

Existing tools won't abide by this guideline until they're updated to support the new metadata standard, so the new semantics will first take effect for a hypothetical 2.x -> 3.0 transition. For the 1.x -> 2.0 transition, we will use the approach where tools continue to produce the existing supplementary files (such as entry_points.txt) in addition to any equivalents specified using the new features of the standard metadata format (including the formal extension mechanism).

Switching to a JSON compatible format

The old "Key:Value" format was becoming increasingly limiting, with various complexities like parsers needing to know which fields were permitted to occur more than once, which fields supported the environment marker syntax (with an optional ";" to separate the value from the marker) and eventually even the option to embed arbitrary JSON inside particular subfields.

The old serialisation format also wasn't amenable to easy conversion to standard Python data structures for use in the new install hook APIs, or in future extensions to the importer APIs to allow them to provide information for inclusion in the installation database.

Accordingly, we've taken the step of switching to a JSON-compatible metadata format. This works better for APIs and is much easier for tools to parse and generate correctly. Changing the name of the metadata file also makes it easy to distribute 1.x and 2.x metadata in parallel, greatly simplifying several aspects of the migration to the new metadata format.

The specific choice of pydist.json as the preferred file name relates to the fact that the metadata described in these files applies to the distribution as a whole, rather than to any particular build. Additional metadata formats may be defined in the future to hold information that can only be determined after building a binary distribution for a particular target environment.

Changing the version scheme

See PEP 440 for a detailed rationale for the various changes made to the versioning scheme.

Source labels

The new source label support is intended to make it clearer that the constraints on public version identifiers are there primarily to aid in the creation of reliable automated dependency analysis tools. Projects are free to use whatever versioning scheme they like internally, so long as they are able to translate it to something the dependency analysis tools will understand.

Source labels also make it straightforward to record specific details of a version, like a hash or tag name that allows the release to be reconstructed from the project version control system.

Support for different kinds of dependencies

The separation of the five different kinds of dependency allows a distribution to indicate whether a dependency is needed specifically to develop, build, test or use the distribution.

To allow for metadistributions like PyObjC, while still actively discouraging overly strict dependency specifications, the separate meta dependency fields are used to separate out those dependencies where exact version specifications are appropriate.

The advantage of having these distinctions supported in the upstream Python specific metadata is that even if a project's developers don't care about these distinctions themselves, they may be more amenable to patches from downstream redistributors that separate the fields appropriately. Over time, this should allow much greater control over where and when particular dependencies end up being installed.

The names for the dependency fields have been deliberately chosen to avoid conflicting with the existing terminology in setuptools and previous versions of the metadata standard. Specifically, the names requires, install_requires and setup_requires are not used, which will hopefully reduce confusion when converting legacy metadata to the new standard.

Support for optional dependencies for distributions

The new extras system allows distributions to declare optional behaviour, and to use the dependency fields to indicate when particular dependencies are needed only to support that behaviour. It is derived from the equivalent system that is already in widespread use as part of setuptools and allows that aspect of the legacy setuptools metadata to be accurately represented in the new metadata format.

The additions to the extras syntax relative to setuptools are defined to make it easier to express the various possible combinations of dependencies, in particular those associated with build systems (with optional support for running the test suite) and development systems.

Support for metadata extensions

The new extension mechanism effectively allows sections of the metadata namespace to be delegated to other distributions, while preserving a standard overall metadata format for ease of processing by distribution tools that do not support a particular extension.

It also works well in combination with the new build_requires field to allow a distribution to depend on tools which do know how to handle the chosen extension, and the new extras mechanism, allowing support for particular extensions to be provided as optional features.

Possible future uses for extensions include declaration of plugins for other distributions and hints for automatic conversion to Linux system packages.

The ability to declare an extension as required is included primarily to allow the definition of the metadata hooks extension to be deferred until some time after the initial adoption of the metadata 2.0 specification. If a distribution needs a postinstall hook to run in order to complete the installation successfully, then earlier versions of tools should fall back to installing from source rather than installing from a wheel file and then failing to run the expected postinstall hook.

Changes to environment markers

There are three substantive changes to environment markers in this version:

  • platform_release was added, as it provides more useful information than platform_version on at least Linux and Mac OS X (specifically, it provides details of the running kernel version)
  • ordered comparison of strings is allowed, as this is more useful for setting minimum and maximum versions where conditional dependencies are needed or where a platform is supported
  • comparison chaining is explicitly allowed, as this becomes useful in the presence of ordered comparisons

The other changes to environment markers are just clarifications and simplifications to make them easier to use.

The arbitrariness of the choice of . and _ in the different variables was addressed by standardising on _ (as these are all predefined variables rather than live references into the Python module namespace).

The use of parentheses for grouping was explicitly noted to address some underspecified behaviour in the previous version of the specification.

Updated contact information

This feature is provided by the python.project and python.integrator extensions in PEP 459.

The switch to JSON made it possible to provide a more flexible system for defining multiple contact points for a project, as well as listing other contributors.

The type concept allows for preservation of the distinction between the original author of a project, and a lead maintainer that takes over at a later date.

Changes to project URLs

This feature is provided by the python.project and python.integrator extensions in PEP 459.

In addition to allowing arbitrary strings as project URL labels, the new metadata standard also defines a recommended set of four URL labels for a distribution's home page, documentation, source control and issue tracker.

Changes to platform support

This feature is provided by the python.constraints extension in PEP 459.

The new environment marker system makes it possible to define supported platforms in a way that is actually amenable to automated processing. This has been used to replace several older fields with poorly defined semantics.

The constraints mechanism also allows additional information to be conveyed through metadata extensions and then checked for consistency at install time.

For the moment, the old Requires-External field has been removed entirely. The metadata extension mechanism will hopefully prove to be a more useful replacement.

Updated obsolescence mechanism

The marker to indicate when a project is obsolete and should be replaced has been moved to the obsolete project (the new obsoleted_by field), replacing the previous marker on the replacement project (the removed Obsoletes-Dist field).

This should allow distribution tools to more easily warn users of obsolete projects and their suggested replacements.

The Obsoletes-Dist header is removed rather than deprecated as it is not widely supported, and so removing it does not present any significant barrier to tools and projects adopting the new metadata format.

Included text documents

This feature is provided by the python.details extension in PEP 459.

Currently, PyPI attempts to determine the description's markup format by rendering it as reStructuredText, and if that fails, treating it as plain text.

Furthermore, many projects simply read their long description in from an existing README file in setup.py. The popularity of this practice is only expected to increase, as many online version control systems (including both GitHub and BitBucket) automatically display such files on the landing page for the project.

Standardising on the inclusion of the long description as a separate file in the dist-info directory allows this to be simplified:

  • An existing file can just be copied into the dist-info directory as part of creating the sdist
  • The expected markup format can be determined by inspecting the file extension of the specified path

Allowing the intended format to be stated explicitly in the path allows the format guessing to be removed and more informative error reports to be provided to users when a rendering error occurs.
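The extension-based lookup amounts to a simple mapping. The particular format names and extensions below are illustrative, not part of the specification:

```python
import os.path

def markup_format(document_name):
    """Map a document's file extension to a markup format, falling
    back to plain text for unrecognised or missing extensions."""
    formats = {".rst": "reStructuredText", ".md": "Markdown",
               ".html": "HTML", ".txt": "plain text"}
    ext = os.path.splitext(document_name)[1].lower()
    return formats.get(ext, "plain text")
```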

This is especially helpful since PyPI applies additional restrictions to the rendering process for security reasons, thus a description that renders correctly on a developer's system may still fail to render on the server.

The document naming system used to achieve this then makes it relatively straightforward to allow declaration of alternative markup formats like HTML, Markdown and AsciiDoc through the use of appropriate file extensions, as well as to define similar included documents for the project's license and changelog.

Grouping the included document names into a single top level field gives automated tools the option of treating them as arbitrary documents without worrying about their contents.

Requiring that the included documents be added to the dist-info metadata directory means that the complete metadata for the distribution can be extracted from an sdist or binary archive simply by extracting that directory, without needing to check for references to other files in the sdist.

Appendix D: Deferred features

Several potentially useful features have been deliberately deferred in order to better prioritise our efforts in migrating to the new metadata standard. These all reflect information that may be nice to have in the new metadata, but which can be readily added in metadata 2.1 without breaking any use cases already supported by metadata 2.0.

Once the pypi, setuptools, pip, wheel and distlib projects support creation and consumption of metadata 2.0, then we may revisit the creation of metadata 2.1 with some or all of these additional features.

MIME type registration

At some point after acceptance of the PEP, we may submit the following MIME type registration requests to IANA:

  • Full metadata: application/vnd.python.pydist+json
  • Essential dependency resolution metadata: application/vnd.python.pydist-dependencies+json

It's even possible we may be able to just register the vnd.python namespace under the banner of the PSF rather than having to register the individual subformats.

String methods in environment markers

Supporting at least the ".startswith" and ".endswith" string methods in environment markers would allow some conditions to be written more naturally. For example, "sys.platform.startswith('win')" is a more intuitive way to mark Windows specific dependencies: "'win' in sys.platform" is incorrect thanks to cygwin, and the fact that 64-bit Windows still shows up as win32 is more than a little strange.
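The pitfall is easy to demonstrate with plain Python (a hypothetical illustration, not environment marker syntax; note that 'darwin' also contains the substring 'win'):

```python
# Illustration: substring tests on sys.platform match more platforms than
# intended, while a prefix test matches only the Windows values.
platforms = ["win32", "cygwin", "linux", "darwin"]

substring_match = [p for p in platforms if "win" in p]
prefix_match = [p for p in platforms if p.startswith("win")]

print(substring_match)  # ['win32', 'cygwin', 'darwin']
print(prefix_match)     # ['win32']
```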

Support for metadata hooks

While a draft proposal for a metadata hook system has been created, that proposal is not part of the initial set of standard metadata extensions in PEP 459.

A metadata hook system would allow the wheel format to fully replace direct installation on deployment targets, by allowing projects to explicitly define code that should be executed following installation from a wheel file.

This may range from something relatively simple, like the two line refresh of the Twisted plugin caches that the Twisted developers recommend for any project that provides Twisted plugins, to more complex platform dependent behaviour, potentially in conjunction with appropriate metadata extensions and supports_environments entries.

For example, upstream declaration of external dependencies for various Linux distributions in a distribution neutral format may be supported by defining an appropriate metadata extension that is read by a postinstall hook and converted into an appropriate invocation of the system package manager. Other operations (such as registering COM DLLs on Windows, registering services for automatic startup on any platform, or altering firewall settings) may need to be undertaken with elevated privileges, meaning they cannot be deferred to implicit execution on first use of the distribution.

For the time being, any such system is being left to the realm of tool specific metadata extensions. This does mean that affected projects may choose not to publish wheel files, instead continuing to rely on source distributions until the relevant extension is well defined and widely supported.

Metabuild system

This version of the metadata specification continues to use setup.py and the distutils command syntax to invoke build and test related operations on a source archive or VCS checkout.

It may be desirable to replace these in the future with tool independent entry points that support:

  • Generating the metadata file on a development system
  • Generating an sdist on a development system
  • Generating a binary archive on a build system
  • Running the test suite on a built (but not installed) distribution

Metadata 2.0 deliberately focuses on wheel based installation, leaving sdist, source archive, and VCS checkout based installation to use the existing setup.py based distutils command interface.

In the meantime, the above operations will be handled through the distutils/setuptools command system:

  • python setup.py dist_info
  • python setup.py sdist
  • python setup.py build_ext --inplace
  • python setup.py test
  • python setup.py bdist_wheel

The following metabuild hooks may be defined in metadata 2.1 to cover these operations without relying on setup.py:

  • make_dist_info: generate the sdist's dist_info directory
  • make_sdist: create the contents of an sdist
  • build_dist: create the contents of a binary wheel archive from an unpacked sdist
  • test_built_dist: run the test suite for a built distribution

Tentative signatures have been designed for those hooks, but in order to better focus initial development efforts on the integration and installation use cases, they will not be pursued further until metadata 2.1:

def make_dist_info(source_dir, info_dir):
    """Generate the contents of dist_info for an sdist archive

    *source_dir* points to a source checkout or unpacked tarball
    *info_dir* is the destination where the sdist metadata files should
    be written

    Returns the distribution metadata as a dictionary.
    """

def make_sdist(source_dir, contents_dir, info_dir):
    """Generate the contents of an sdist archive

    *source_dir* points to a source checkout or unpacked tarball
    *contents_dir* is the destination where the sdist contents should be
    written (note that archiving the contents is the responsibility of
    the metabuild tool rather than the hook function)
    *info_dir* is the destination where the sdist metadata files should
    be written

    Returns the distribution metadata as a dictionary.
    """

def build_dist(sdist_dir, built_dir, info_dir, compatibility=None):
    """Generate the contents of a binary wheel archive

    *sdist_dir* points to an unpacked sdist
    *built_dir* is the destination where the wheel contents should be
    written (note that archiving the contents is the responsibility of
    the metabuild tool rather than the hook function)
    *info_dir* is the destination where the wheel metadata files should
    be written
    *compatibility* is an optional PEP 425 compatibility tag indicating
    the desired target compatibility for the build. If the tag cannot
    be satisfied, the hook should throw ``ValueError``.

    Returns the actual compatibility tag for the build
    """

def test_built_dist(sdist_dir, built_dir, info_dir):
    """Check a built (but not installed) distribution works as expected

    *sdist_dir* points to an unpacked sdist
    *built_dir* points to a platform appropriate unpacked wheel archive
    (which may be missing the wheel metadata directory)
    *info_dir* points to the appropriate wheel metadata directory

    Requires that the distribution's test dependencies be installed
    (indicated by the ``:test:`` extra).

    Returns ``True`` if the check passes, ``False`` otherwise.
    """

As with the existing install hooks, checking for extras would be done using the same import based checks as are used for runtime extras. That way it doesn't matter if the additional dependencies were requested explicitly or just happen to be available on the system.

There are still a number of open questions with this design, such as whether a single build hook is sufficient to cover both "build for testing" and "prep for deployment", as well as various complexities like support for cross-compilation of binaries, specification of target platforms and Python versions when creating wheel files, etc.

Opting to retain the status quo for now allows us to make progress on improved metadata publication and binary installation support, rather than having to delay that awaiting the creation of a viable metabuild framework.

Appendix E: Rejected features

The following features have been explicitly considered and rejected as introducing too much additional complexity for too small a gain in expressiveness.

Separate lists for conditional and unconditional dependencies

Earlier versions of this PEP used separate lists for conditional and unconditional dependencies. This turned out to be annoying for automated tools to handle, and removing it also made the PEP and metadata schema substantially shorter, suggesting the separate lists were actually harder to explain as well.

Disallowing underscores in distribution names

Debian doesn't actually permit underscores in names, but that seems unduly restrictive for this spec given the common practice of using valid Python identifiers as Python distribution names. A Debian side policy of converting underscores to hyphens seems easy enough to implement (and the requirement to consider hyphens and underscores as equivalent ensures that doing so won't introduce any conflicts).
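A hypothetical sketch of such a policy (the helper names are illustrative): names are converted on the Debian side, and compared with hyphens and underscores treated as equivalent.

```python
def to_debian_name(name):
    # Hypothetical Debian-side conversion: underscores become hyphens.
    return name.replace("_", "-")

def names_equivalent(a, b):
    # The spec requires treating '-' and '_' as equivalent when comparing.
    return a.replace("_", "-") == b.replace("_", "-")

print(to_debian_name("my_distribution"))       # my-distribution
print(names_equivalent("my_dist", "my-dist"))  # True
```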

Allowing the use of Unicode in distribution names

This PEP deliberately avoids following Python 3 down the path of arbitrary Unicode identifiers, as the security implications of doing so are substantially worse in the software distribution use case (it opens up far more interesting attack vectors than mere code obfuscation).

In addition, the existing tools really only work properly if you restrict names to ASCII and changing that would require a lot of work for all the automated tools in the chain.

It may be reasonable to revisit this question at some point in the (distant) future, but setting up a more reliable software distribution system is challenging enough without adding more general Unicode identifier support into the mix.

Single list for conditional and unconditional dependencies

It's technically possible to store the conditional and unconditional dependencies of each kind in a single list and switch the handling based on the entry type (string or mapping).

However, the current *requires vs *may-require two list design seems easier to understand and work with, since it's only the conditional dependencies that need to be checked against the requested extras list and the target installation environment.

Depending on source labels

There is no mechanism to express a dependency on a source label - they are included in the metadata for internal project reference only. Instead, dependencies must be expressed in terms of either public versions or else direct URL references.

Alternative dependencies

An earlier draft of this PEP considered allowing lists in place of the usual strings in dependency specifications to indicate that there are multiple ways to satisfy a dependency.

If at least one of the individual dependencies was already available, then the entire dependency would be considered satisfied, otherwise the first entry would be added to the dependency set.

Alternative dependency specification example:

["Pillow", "PIL"]
["mysql", "psycopg2 >= 4", "sqlite3"]

However, neither of the given examples is particularly compelling. Pillow/PIL style forks aren't common, and the database driver use case would arguably be better served by an SQLAlchemy-defined "supported database driver" metadata extension, where a project depends on SQLAlchemy and then declares in the extension which database drivers are checked for compatibility by the upstream project (similar to the advisory supports_environments field in the main metadata).

We're also getting better support for "virtual provides" in this version of the metadata standard, so this may end up being an installer and index server problem to better track and publish those.
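For reference, the rejected resolution rule ("satisfied if at least one alternative is available, otherwise add the first entry") could have been sketched as follows, treating each entry as a bare distribution name and ignoring version specifiers:

```python
# Hypothetical sketch of the rejected alternative-dependency rule.
def resolve_alternatives(alternatives, available):
    """Return the entry to install, or None if the list is already satisfied."""
    for candidate in alternatives:
        if candidate in available:
            return None  # at least one alternative is present
    return alternatives[0]  # otherwise fall back to the first entry

print(resolve_alternatives(["Pillow", "PIL"], {"PIL"}))  # None
print(resolve_alternatives(["Pillow", "PIL"], set()))    # Pillow
```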

Compatible release comparisons in environment markers

PEP 440 defines a rich syntax for version comparisons that could potentially be useful with python_version and python_full_version in environment markers. However, allowing the full syntax would mean environment markers are no longer a Python subset, while allowing only some of the comparisons would introduce yet another special case to handle.

Given that environment markers are only used in cases where a higher level "or" is implied by the metadata structure, it seems easier to require the use of multiple comparisons against specific Python versions for the rare cases where this would be useful.

Conditional provides

Under the revised metadata design, conditional "provides" based on runtime features or the environment would go in a separate "may_provide" field. However, it isn't clear there's any use case for doing that, so the idea is rejected unless someone can present a compelling use case (and even then the idea won't be reconsidered until metadata 2.1 at the earliest).

References

This document specifies version 2.0 of the metadata format. Version 1.0 is specified in PEP 241. Version 1.1 is specified in PEP 314. Version 1.2 is specified in PEP 345.

The initial attempt at a standardised version scheme, along with the justifications for needing such a standard can be found in PEP 386.

[1]reStructuredText markup: http://docutils.sourceforge.net/
[2]PEP 301: http://www.python.org/dev/peps/pep-0301/
[3](1, 2, 3) http://pypi.python.org/pypi/
[4]https://www.youtube.com/watch?v=CSe38dzJYkY

pep-0427 The Wheel Binary Package Format 1.0

PEP:427
Title:The Wheel Binary Package Format 1.0
Version:$Revision$
Last-Modified:$Date$
Author:Daniel Holth <dholth at gmail.com>
BDFL-Delegate:Nick Coghlan <ncoghlan@gmail.com>
Discussions-To:<distutils-sig at python.org>
Status:Accepted
Type:Standards Track
Content-Type:text/x-rst
Created:20-Sep-2012
Post-History:18-Oct-2012, 15-Feb-2013
Resolution:http://mail.python.org/pipermail/python-dev/2013-February/124103.html

Abstract

This PEP describes a built-package format for Python called "wheel".

A wheel is a ZIP-format archive with a specially formatted file name and the .whl extension. It contains a single distribution nearly as it would be installed according to PEP 376 with a particular installation scheme. Although a specialized installer is recommended, a wheel file may be installed by simply unpacking into site-packages with the standard 'unzip' tool while preserving enough information to spread its contents out onto their final paths at any later time.

PEP Acceptance

This PEP was accepted, and the defined wheel version updated to 1.0, by Nick Coghlan on 16th February, 2013 [1].

Rationale

Python needs a package format that is easier to install than sdist. Python's sdist packages are defined by and require the distutils and setuptools build systems, and installing them means running arbitrary code to build, install, and re-compile the package just so it can be used in a new virtualenv. This conflation of build and install is slow, hard to maintain, and hinders innovation in both build systems and installers.

Wheel attempts to remedy these problems by providing a simpler interface between the build system and the installer. The wheel binary package format frees installers from having to know about the build system, saves time by amortizing compile time over many installations, and removes the need to install a build system in the target environment.

Details

Installing a wheel 'distribution-1.0-py32-none-any.whl'

Wheel installation notionally consists of two phases:

  • Unpack.
    1. Parse distribution-1.0.dist-info/WHEEL.
    2. Check that installer is compatible with Wheel-Version. Warn if minor version is greater, abort if major version is greater.
    3. If Root-Is-Purelib == 'true', unpack archive into purelib (site-packages).
    4. Else unpack archive into platlib (site-packages).
  • Spread.
    1. Unpacked archive includes distribution-1.0.dist-info/ and (if there is data) distribution-1.0.data/.
    2. Move each subtree of distribution-1.0.data/ onto its destination path. Each subdirectory of distribution-1.0.data/ is a key into a dict of destination directories, such as distribution-1.0.data/(purelib|platlib|headers|scripts|data). The initially supported paths are taken from distutils.command.install.
    3. If applicable, update scripts starting with #!python to point to the correct interpreter.
    4. Update distribution-1.0.dist-info/RECORD with the installed paths.
    5. Remove empty distribution-1.0.data directory.
    6. Compile any installed .py to .pyc. (Uninstallers should be smart enough to remove .pyc even if it is not mentioned in RECORD.)
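The Wheel-Version check in step 2 of the unpack phase can be sketched as follows (a hypothetical helper, assuming the installer supports wheel 1.0):

```python
# Warn if the wheel's minor version is newer than supported; abort if the
# major version is newer (step 2 of the unpack phase).
SUPPORTED = (1, 0)

def check_wheel_version(wheel_version):
    major, minor = (int(part) for part in wheel_version.split("."))
    if major > SUPPORTED[0]:
        raise RuntimeError("incompatible Wheel-Version: %s" % wheel_version)
    if (major, minor) > SUPPORTED:
        print("warning: Wheel-Version %s is newer than supported" % wheel_version)

check_wheel_version("1.0")  # ok, silent
check_wheel_version("1.9")  # warns but proceeds
```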

File Format

File name convention

The wheel filename is {distribution}-{version}(-{build tag})?-{python tag}-{abi tag}-{platform tag}.whl.

distribution
Distribution name, e.g. 'django', 'pyramid'.
version
Distribution version, e.g. 1.0.
build tag
Optional build number. Must start with a digit. A tie breaker if two wheels have the same version. Sort as the empty string if unspecified, else sort the initial digits as a number, and the remainder lexicographically.
language implementation and version tag
E.g. 'py27', 'py2', 'py3'.
abi tag
E.g. 'cp33m', 'abi3', 'none'.
platform tag
E.g. 'linux_x86_64', 'any'.
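The build tag ordering rule can be expressed as a sort key (an illustrative sketch; valid non-empty build tags always start with a digit):

```python
import re

def build_tag_key(tag):
    # Empty sorts first; otherwise leading digits numerically, rest lexically.
    if not tag:
        return (0, 0, "")
    digits, rest = re.match(r"(\d+)(.*)", tag).groups()
    return (1, int(digits), rest)

tags = ["10", "", "2b", "2", "2a"]
print(sorted(tags, key=build_tag_key))  # ['', '2', '2a', '2b', '10']
```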

For example, distribution-1.0-1-py27-none-any.whl is the first build of a package called 'distribution', and is compatible with Python 2.7 (any Python 2.7 implementation), with no ABI (pure Python), on any CPU architecture.

The last three components of the filename before the extension are called "compatibility tags." The compatibility tags express the package's basic interpreter requirements and are detailed in PEP 425.
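A sketch of parsing this naming convention with a regular expression (illustrative only; real tools should also apply the escaping rules described below):

```python
import re

# Hypothetical parser for
# {distribution}-{version}(-{build})?-{python}-{abi}-{platform}.whl
WHEEL_NAME = re.compile(
    r"^(?P<distribution>.+?)-(?P<version>[^-]+)"
    r"(?:-(?P<build>\d[^-]*))?"
    r"-(?P<python>[^-]+)-(?P<abi>[^-]+)-(?P<platform>[^-]+)\.whl$"
)

m = WHEEL_NAME.match("distribution-1.0-1-py27-none-any.whl")
print(m.group("distribution", "version", "build"))  # ('distribution', '1.0', '1')
print(m.group("python", "abi", "platform"))         # ('py27', 'none', 'any')
```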

Escaping and Unicode

Each component of the filename is escaped by replacing runs of non-alphanumeric characters with an underscore _:

re.sub(r"[^\w\d.]+", "_", distribution, flags=re.UNICODE)

The archive filename is Unicode. It will be some time before the tools are updated to support non-ASCII filenames, but they are supported in this specification.

The filenames inside the archive are encoded as UTF-8. Although some ZIP clients in common use do not properly display UTF-8 filenames, the encoding is supported by both the ZIP specification and Python's zipfile.

File contents

The contents of a wheel file, where {distribution} is replaced with the name of the package, e.g. beaglevote and {version} is replaced with its version, e.g. 1.0.0, consist of:

  1. /, the root of the archive, contains all files to be installed in purelib or platlib as specified in WHEEL. purelib and platlib are usually both site-packages.

  2. {distribution}-{version}.dist-info/ contains metadata.

  3. {distribution}-{version}.data/ contains one subdirectory for each non-empty install scheme key not already covered, where the subdirectory name is an index into a dictionary of install paths (e.g. data, scripts, include, purelib, platlib).

  4. Python scripts must appear in scripts and begin with exactly b'#!python' in order to enjoy script wrapper generation and #!python rewriting at install time. They may have any or no extension.

  5. {distribution}-{version}.dist-info/METADATA is Metadata version 1.1 or greater format metadata.

  6. {distribution}-{version}.dist-info/WHEEL is metadata about the archive itself in the same basic key: value format:

    Wheel-Version: 1.0
    Generator: bdist_wheel 1.0
    Root-Is-Purelib: true
    Tag: py2-none-any
    Tag: py3-none-any
    Build: 1
    
  7. Wheel-Version is the version number of the Wheel specification.

  8. Generator is the name and optionally the version of the software that produced the archive.

  9. Root-Is-Purelib is true if the top level directory of the archive should be installed into purelib; otherwise the root should be installed into platlib.

  10. Tag is the wheel's expanded compatibility tags; in the example the filename would contain py2.py3-none-any.

  11. Build is the build number and is omitted if there is no build number.

  12. A wheel installer should warn if Wheel-Version is greater than the version it supports, and must fail if Wheel-Version has a greater major version than the version it supports.

  13. Wheel, being an installation format that is intended to work across multiple versions of Python, does not generally include .pyc files.

  14. Wheel does not contain setup.py or setup.cfg.

This version of the wheel specification is based on the distutils install schemes and does not define how to install files to other locations. The layout offers a superset of the functionality provided by the existing wininst and egg binary formats.

The .dist-info directory

  1. Wheel .dist-info directories include at a minimum METADATA, WHEEL, and RECORD.
  2. METADATA is the package metadata, the same format as PKG-INFO as found at the root of sdists.
  3. WHEEL is the wheel metadata specific to a build of the package.
  4. RECORD is a list of (almost) all the files in the wheel and their secure hashes. Unlike PEP 376, every file except RECORD, which cannot contain a hash of itself, must include its hash. The hash algorithm must be sha256 or better; specifically, md5 and sha1 are not permitted, as signed wheel files rely on the strong hashes in RECORD to validate the integrity of the archive.
  5. PEP 376's INSTALLER and REQUESTED are not included in the archive.
  6. RECORD.jws is used for digital signatures. It is not mentioned in RECORD.
  7. RECORD.p7s is allowed as a courtesy to anyone who would prefer to use S/MIME signatures to secure their wheel files. It is not mentioned in RECORD.
  8. During extraction, wheel installers verify all the hashes in RECORD against the file contents. Apart from RECORD and its signatures, installation will fail if any file in the archive is not both mentioned and correctly hashed in RECORD.
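The hashing rule in item 4 can be sketched as follows, using the urlsafe base64 encoding without trailing padding that is described under "Signed wheel files" (the helper names are hypothetical):

```python
import base64
import hashlib

def hash_entry(data):
    # "sha256=" followed by the urlsafe base64 digest, trailing '=' stripped.
    digest = hashlib.sha256(data).digest()
    return "sha256=" + base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")

def verify_entry(entry, data):
    # RECORD lines have the form "path,hash,size".
    path, hash_spec, size = entry.rsplit(",", 2)
    if not hash_spec:          # RECORD itself carries no hash of itself
        return True
    return hash_spec == hash_entry(data) and int(size) == len(data)

data = b"x = 1\n"
entry = "mod.py,%s,%d" % (hash_entry(data), len(data))
print(verify_entry(entry, data))         # True
print(verify_entry(entry, b"tampered"))  # False
```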

The .data directory

Any file that is not normally installed inside site-packages goes into the .data directory, named as the .dist-info directory but with the .data/ extension:

distribution-1.0.dist-info/

distribution-1.0.data/

The .data directory contains subdirectories with the scripts, headers, documentation and so forth from the distribution. During installation the contents of these subdirectories are moved onto their destination paths.

Signed wheel files

Wheel files include an extended RECORD that enables digital signatures. PEP 376's RECORD is altered to include a secure hash in the form digestname=urlsafe_b64encode_nopad(digest) (urlsafe base64 encoding with no trailing = characters) as the second column instead of an md5sum. All possible entries are hashed, including any generated files such as .pyc files, but not RECORD, which cannot contain its own hash. For example:

file.py,sha256=AVTFPZpEKzuHr7OvQZmhaU3LvwKz06AJw8mT_pNh2yI,3144
distribution-1.0.dist-info/RECORD,,

The signature file(s) RECORD.jws and RECORD.p7s are not mentioned in RECORD at all since they can only be added after RECORD is generated. Every other file in the archive must have a correct hash in RECORD or the installation will fail.

If JSON web signatures are used, one or more JSON Web Signature JSON Serialization (JWS-JS) signatures are stored in a file RECORD.jws adjacent to RECORD. JWS is used to sign RECORD by including the SHA-256 hash of RECORD as the signature's JSON payload:

{ "hash": "sha256=ADD-r2urObZHcxBW3Cr-vDCu5RJwT4CaRTHiFmbcIYY" }

(The hash value is the same format used in RECORD.)

If RECORD.p7s is used, it must contain a detached S/MIME format signature of RECORD.

A wheel installer is not required to understand digital signatures but MUST verify the hashes in RECORD against the extracted file contents. When the installer checks file hashes against RECORD, a separate signature checker only needs to establish that RECORD matches the signature.


Comparison to .egg

  1. Wheel is an installation format; egg is importable. Wheel archives do not need to include .pyc and are less tied to a specific Python version or implementation. Wheel can install (pure Python) packages built with previous versions of Python so you don't always have to wait for the packager to catch up.
  2. Wheel uses .dist-info directories; egg uses .egg-info. Wheel is compatible with the new world of Python packaging and the new concepts it brings.
  3. Wheel has a richer file naming convention for today's multi-implementation world. A single wheel archive can indicate its compatibility with a number of Python language versions and implementations, ABIs, and system architectures. Historically the ABI has been specific to a CPython release; wheel is ready for the stable ABI.
  4. Wheel is lossless. The first wheel implementation bdist_wheel always generates egg-info, and then converts it to a .whl. It is also possible to convert existing eggs and bdist_wininst distributions.
  5. Wheel is versioned. Every wheel file contains the version of the wheel specification and the implementation that packaged it. Hopefully the next migration can simply be to Wheel 2.0.
  6. Wheel is a reference to the other Python.

FAQ

Wheel defines a .data directory. Should I put all my data there?

This specification does not have an opinion on how you should organize your code. The .data directory is just a place for any files that are not normally installed inside site-packages or on the PYTHONPATH. In other words, you may continue to use pkgutil.get_data(package, resource) even though those files will usually not be distributed in wheel's .data directory.

Why does wheel include attached signatures?

Attached signatures are more convenient than detached signatures because they travel with the archive. Since only the individual files are signed, the archive can be recompressed without invalidating the signature, and individual files can be verified without having to download the whole archive.

Why does wheel allow JWS signatures?

The JOSE specifications of which JWS is a part are designed to be easy to implement, a feature that is also one of wheel's primary design goals. JWS yields a useful, concise pure-Python implementation.

Why does wheel also allow S/MIME signatures?

S/MIME signatures are allowed for users who need or want to use existing public key infrastructure with wheel.

Signed packages are only a basic building block in a secure package update system. Wheel only provides the building block.

What's the deal with "purelib" vs. "platlib"?

Wheel preserves the "purelib" vs. "platlib" distinction, which is significant on some platforms. For example, Fedora installs pure Python packages to '/usr/lib/pythonX.Y/site-packages' and platform dependent packages to '/usr/lib64/pythonX.Y/site-packages'.

A wheel with "Root-Is-Purelib: false" with all its files in {name}-{version}.data/purelib is equivalent to a wheel with "Root-Is-Purelib: true" with those same files in the root, and it is legal to have files in both the "purelib" and "platlib" categories.

In practice a wheel should have only one of "purelib" or "platlib" depending on whether it is pure Python or not, and those files should be at the root with the appropriate setting given for "Root-Is-Purelib".

Is it possible to import Python code directly from a wheel file?

Technically, due to the combination of supporting installation via simple extraction and using an archive format that is compatible with zipimport, a subset of wheel files do support being placed directly on sys.path. However, while this behaviour is a natural consequence of the format design, actually relying on it is generally discouraged.

Firstly, wheel is designed primarily as a distribution format, so skipping the installation step also means deliberately avoiding any reliance on features that assume full installation (such as being able to use standard tools like pip and virtualenv to capture and manage dependencies in a way that can be properly tracked for auditing and security update purposes, or integrating fully with the standard build machinery for C extensions by publishing header files in the appropriate place).

Secondly, while some Python software is written to support running directly from a zip archive, it is still common for code to be written assuming it has been fully installed. When that assumption is broken by trying to run the software from a zip archive, the failures can often be obscure and hard to diagnose (especially when they occur in third party libraries). The two most common sources of problems with this are the fact that importing C extensions from a zip archive is not supported by CPython (since doing so is not supported directly by the dynamic loading machinery on any platform) and that when running from a zip archive the __file__ attribute no longer refers to an ordinary filesystem path, but to a combination path that includes both the location of the zip archive on the filesystem and the relative path to the module inside the archive. Even when software correctly uses the abstract resource APIs internally, interfacing with external components may still require the availability of an actual on-disk file.

Like metaclasses, monkeypatching and metapath importers, if you're not already sure you need to take advantage of this feature, you almost certainly don't need it. If you do decide to use it anyway, be aware that many projects will require a failure to be reproduced with a fully installed package before accepting it as a genuine bug.

Appendix

Example urlsafe-base64-nopad implementation:

# urlsafe-base64-nopad for Python 3
import base64

def urlsafe_b64encode_nopad(data):
    return base64.urlsafe_b64encode(data).rstrip(b'=')

def urlsafe_b64decode_nopad(data):
    # Restore the '=' padding stripped by the encoder (none when len % 4 == 0)
    pad = b'=' * (-len(data) % 4)
    return base64.urlsafe_b64decode(data + pad)

pep-0428 The pathlib module -- object-oriented filesystem paths

PEP:428
Title:The pathlib module -- object-oriented filesystem paths
Version:$Revision$
Last-Modified:$Date$
Author:Antoine Pitrou <solipsis at pitrou.net>
Status:Final
Type:Standards Track
Content-Type:text/x-rst
Created:30-July-2012
Python-Version:3.4
Post-History:http://mail.python.org/pipermail/python-ideas/2012-October/016338.html
Resolution:https://mail.python.org/pipermail/python-dev/2013-November/130424.html

Abstract

This PEP proposes the inclusion of a third-party module, pathlib [1], in the standard library. The inclusion is proposed under the provisional label, as described in PEP 411. Therefore, API changes can be done, either as part of the PEP process, or after acceptance in the standard library (and until the provisional label is removed).

The aim of this library is to provide a simple hierarchy of classes to handle filesystem paths and the common operations users do over them.

Implementation

The implementation of this proposal is tracked in the pep428 branch of pathlib's Mercurial repository [6].

Why an object-oriented API

The rationale to represent filesystem paths using dedicated classes is the same as for other kinds of stateless objects, such as dates, times or IP addresses. Python has been slowly moving away from strictly replicating the C language's APIs to providing better, more helpful abstractions around all kinds of common functionality. Even if this PEP isn't accepted, it is likely that another form of filesystem handling abstraction will be adopted one day into the standard library.

Indeed, many people will prefer handling dates and times using the high-level objects provided by the datetime module, rather than using numeric timestamps and the time module API. Moreover, using a dedicated class makes it possible to enable desirable behaviours by default, for example the case insensitivity of Windows paths.

Proposal

Class hierarchy

The pathlib [1] module implements a simple hierarchy of classes:

                +----------+
                |          |
       ---------| PurePath |--------
       |        |          |       |
       |        +----------+       |
       |             |             |
       |             |             |
       v             |             v
+---------------+    |    +-----------------+
|               |    |    |                 |
| PurePosixPath |    |    | PureWindowsPath |
|               |    |    |                 |
+---------------+    |    +-----------------+
       |             v             |
       |          +------+         |
       |          |      |         |
       |   -------| Path |------   |
       |   |      |      |     |   |
       |   |      +------+     |   |
       |   |                   |   |
       |   |                   |   |
       v   v                   v   v
  +-----------+           +-------------+
  |           |           |             |
  | PosixPath |           | WindowsPath |
  |           |           |             |
  +-----------+           +-------------+

This hierarchy divides path classes along two dimensions:

  • a path class can be either pure or concrete: pure classes support only operations that don't need to do any actual I/O, which are most path manipulation operations; concrete classes support all the operations of pure classes, plus operations that do I/O.
  • a path class is of a given flavour according to the kind of operating system paths it represents. pathlib [1] implements two flavours: Windows paths for the filesystem semantics embodied in Windows systems, and POSIX paths for other systems.

Any pure class can be instantiated on any system: for example, you can manipulate PurePosixPath objects under Windows, PureWindowsPath objects under Unix, and so on. However, concrete classes can only be instantiated on a matching system: indeed, it would be error-prone to start doing I/O with WindowsPath objects under Unix, or vice-versa.

Furthermore, there are two base classes which also act as system-dependent factories: PurePath will instantiate either a PurePosixPath or a PureWindowsPath depending on the operating system. Similarly, Path will instantiate either a PosixPath or a WindowsPath.
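
Since pathlib now ships with the standard library, the factory behaviour can be observed directly; on a POSIX system the base classes dispatch as sketched below (the concrete class names differ on Windows):

```python
import os
from pathlib import Path, PurePath

# PurePath() and Path() are factories: instantiating them returns a
# flavour-specific subclass chosen from the running operating system.
pure = PurePath('setup.py')
concrete = Path('setup.py')

if os.name == 'posix':
    expected = ('PurePosixPath', 'PosixPath')
else:
    expected = ('PureWindowsPath', 'WindowsPath')

print(type(pure).__name__, type(concrete).__name__)
```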

It is expected that, in most uses, using the Path class is adequate, which is why it has the shortest name of all.

No confusion with builtins

In this proposal, the path classes do not derive from a builtin type. This contrasts with some other Path class proposals which were derived from str. They also do not pretend to implement the sequence protocol: if you want a path to act as a sequence, you have to look up a dedicated attribute (the parts attribute).

Not behaving like one of the basic builtin types also minimizes the potential for confusion if a path is combined by accident with genuine builtin types.

Immutability

Path objects are immutable, which makes them hashable and also prevents a class of programming errors.
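
Hashability means path objects can serve directly as dictionary keys or set members; a short sketch using the stdlib pathlib:

```python
from pathlib import PurePosixPath

# Because path objects are immutable and hashable, they work as dict
# keys and set members; equal paths hash equally.
sizes = {PurePosixPath('setup.py'): 1024}
assert sizes[PurePosixPath('setup.py')] == 1024

# Normalization happens at construction time, so different spellings
# of the same path collapse to a single set member.
unique = {PurePosixPath('a/b'), PurePosixPath('a//b/.')}
assert len(unique) == 1
```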

Sane behaviour

Little of the functionality from os.path is reused. Many os.path functions are tied by backwards compatibility to confusing or plain wrong behaviour (for example, the fact that os.path.abspath() simplifies ".." path components without resolving symlinks first).

Comparisons

Paths of the same flavour are comparable and orderable, whether pure or not:

>>> PurePosixPath('a') == PurePosixPath('b')
False
>>> PurePosixPath('a') < PurePosixPath('b')
True
>>> PurePosixPath('a') == PosixPath('a')
True

Comparing and ordering Windows path objects is case-insensitive:

>>> PureWindowsPath('a') == PureWindowsPath('A')
True

Paths of different flavours always compare unequal, and cannot be ordered:

>>> PurePosixPath('a') == PureWindowsPath('a')
False
>>> PurePosixPath('a') < PureWindowsPath('a')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unorderable types: PurePosixPath() < PureWindowsPath()

Paths compare unequal to, and are not orderable with instances of builtin types (such as str) and any other types.
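
This can be checked against the stdlib implementation: equality with a non-path type quietly yields False, while ordering raises:

```python
from pathlib import PurePosixPath

# Equality with non-path types is always False rather than an error...
assert (PurePosixPath('a') == 'a') is False

# ...but ordering against them raises TypeError.
try:
    PurePosixPath('a') < 'b'
    ordered = True
except TypeError:
    ordered = False
```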

Useful notations

The API tries to provide useful notations all the while avoiding magic. Some examples:

>>> p = Path('/home/antoine/pathlib/setup.py')
>>> p.name
'setup.py'
>>> p.suffix
'.py'
>>> p.root
'/'
>>> p.parts
('/', 'home', 'antoine', 'pathlib', 'setup.py')
>>> p.relative_to('/home/antoine')
PosixPath('pathlib/setup.py')
>>> p.exists()
True

Pure paths API

The philosophy of the PurePath API is to provide a consistent array of useful path manipulation operations, without exposing a hodge-podge of functions like os.path does.

Definitions

First a couple of conventions:

  • All paths can have a drive and a root. For POSIX paths, the drive is always empty.
  • A relative path has neither drive nor root.
  • A POSIX path is absolute if it has a root. A Windows path is absolute if it has both a drive and a root. A Windows UNC path (e.g. \\host\share\myfile.txt) always has a drive and a root (here, \\host\share and \, respectively).
  • A path which has either a drive or a root is said to be anchored. Its anchor is the concatenation of the drive and root. Under POSIX, "anchored" is the same as "absolute".
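
These definitions can be exercised with the pure classes, which are instantiable on any system; the UNC example mirrors the one in the text:

```python
from pathlib import PurePosixPath, PureWindowsPath

# A Windows UNC path always has a drive and a root; the anchor is
# their concatenation.
unc = PureWindowsPath(r'\\host\share\myfile.txt')
assert unc.drive == '\\\\host\\share'
assert unc.root == '\\'
assert unc.anchor == '\\\\host\\share\\'

# For POSIX paths the drive is always empty, so "anchored"
# coincides with "absolute".
p = PurePosixPath('/etc/passwd')
assert p.drive == ''
assert p.anchor == '/'
```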

Construction

We will present construction and joining together since they expose similar semantics.

The simplest way to construct a path is to pass it its string representation:

>>> PurePath('setup.py')
PurePosixPath('setup.py')

Extraneous path separators and "." components are eliminated:

>>> PurePath('a///b/c/./d/')
PurePosixPath('a/b/c/d')

If you pass several arguments, they will be automatically joined:

>>> PurePath('docs', 'Makefile')
PurePosixPath('docs/Makefile')

Joining semantics are similar to os.path.join, in that anchored paths ignore the information from the previously joined components:

>>> PurePath('/etc', '/usr', 'bin')
PurePosixPath('/usr/bin')

However, with Windows paths, the drive is retained as necessary:

>>> PureWindowsPath('c:/foo', '/Windows')
PureWindowsPath('c:/Windows')
>>> PureWindowsPath('c:/foo', 'd:')
PureWindowsPath('d:')

Also, path separators are normalized to the platform default:

>>> PureWindowsPath('a/b') == PureWindowsPath('a\\b')
True

Extraneous path separators and "." components are eliminated, but not ".." components:

>>> PurePosixPath('a//b/./c/')
PurePosixPath('a/b/c')
>>> PurePosixPath('a/../b')
PurePosixPath('a/../b')

Multiple leading slashes are treated differently depending on the path flavour. They are always retained on Windows paths (because of the UNC notation):

>>> PureWindowsPath('//some/path')
PureWindowsPath('//some/path/')

On POSIX, they are collapsed except if there are exactly two leading slashes, which is a special case in the POSIX specification on pathname resolution [7] (this is also necessary for Cygwin compatibility):

>>> PurePosixPath('///some/path')
PurePosixPath('/some/path')
>>> PurePosixPath('//some/path')
PurePosixPath('//some/path')

Calling the constructor without any argument creates a path object pointing to the logical "current directory" (without looking up its absolute path, which is the job of the cwd() classmethod on concrete paths):

>>> PurePosixPath()
PurePosixPath('.')

Representing

To represent a path (e.g. to pass it to third-party libraries), just call str() on it:

>>> p = PurePath('/home/antoine/pathlib/setup.py')
>>> str(p)
'/home/antoine/pathlib/setup.py'
>>> p = PureWindowsPath('c:/windows')
>>> str(p)
'c:\\windows'

To force the string representation with forward slashes, use the as_posix() method:

>>> p.as_posix()
'c:/windows'

To get the bytes representation (which might be useful under Unix systems), call bytes() on it, which internally uses os.fsencode():

>>> p = PurePosixPath('/home/antoine/pathlib/setup.py')
>>> bytes(p)
b'/home/antoine/pathlib/setup.py'

To represent the path as a file: URI, call the as_uri() method:

>>> p = PurePosixPath('/etc/passwd')
>>> p.as_uri()
'file:///etc/passwd'
>>> p = PureWindowsPath('c:/Windows')
>>> p.as_uri()
'file:///c:/Windows'

The repr() of a path always uses forward slashes, even under Windows, for readability and to remind users that forward slashes are ok:

>>> p = PureWindowsPath('c:/Windows')
>>> p
PureWindowsPath('c:/Windows')

Properties

Several simple properties are provided on every path (each can be empty):

>>> p = PureWindowsPath('c:/Downloads/pathlib.tar.gz')
>>> p.drive
'c:'
>>> p.root
'\\'
>>> p.anchor
'c:\\'
>>> p.name
'pathlib.tar.gz'
>>> p.stem
'pathlib.tar'
>>> p.suffix
'.gz'
>>> p.suffixes
['.tar', '.gz']

Deriving new paths

Joining

A path can be joined with another using the / operator:

>>> p = PurePosixPath('foo')
>>> p / 'bar'
PurePosixPath('foo/bar')
>>> p / PurePosixPath('bar')
PurePosixPath('foo/bar')
>>> 'bar' / p
PurePosixPath('bar/foo')

As with the constructor, multiple path components can be specified, either collapsed or separately:

>>> p / 'bar/xyzzy'
PurePosixPath('foo/bar/xyzzy')
>>> p / 'bar' / 'xyzzy'
PurePosixPath('foo/bar/xyzzy')

A joinpath() method is also provided, with the same behaviour:

>>> p.joinpath('Python')
PurePosixPath('foo/Python')

Changing the path's final component

The with_name() method returns a new path, with the name changed:

>>> p = PureWindowsPath('c:/Downloads/pathlib.tar.gz')
>>> p.with_name('setup.py')
PureWindowsPath('c:/Downloads/setup.py')

It fails with a ValueError if the path doesn't have an actual name:

>>> p = PureWindowsPath('c:/')
>>> p.with_name('setup.py')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pathlib.py", line 875, in with_name
    raise ValueError("%r has an empty name" % (self,))
ValueError: PureWindowsPath('c:/') has an empty name
>>> p.name
''

The with_suffix() method returns a new path with the suffix changed. However, if the path has no suffix, the new suffix is added:

>>> p = PureWindowsPath('c:/Downloads/pathlib.tar.gz')
>>> p.with_suffix('.bz2')
PureWindowsPath('c:/Downloads/pathlib.tar.bz2')
>>> p = PureWindowsPath('README')
>>> p.with_suffix('.bz2')
PureWindowsPath('README.bz2')

Making the path relative

The relative_to() method computes the relative difference of a path to another:

>>> PurePosixPath('/usr/bin/python').relative_to('/usr')
PurePosixPath('bin/python')

ValueError is raised if the method cannot return a meaningful value:

>>> PurePosixPath('/usr/bin/python').relative_to('/etc')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pathlib.py", line 926, in relative_to
    .format(str(self), str(formatted)))
ValueError: '/usr/bin/python' does not start with '/etc'

Sequence-like access

The parts property returns a tuple providing read-only sequence access to a path's components:

>>> p = PurePosixPath('/etc/init.d')
>>> p.parts
('/', 'etc', 'init.d')

Windows paths handle the drive and the root as a single path component:

>>> p = PureWindowsPath('c:/setup.py')
>>> p.parts
('c:\\', 'setup.py')

(separating them would be wrong, since C: is not the parent of C:\).

The parent property returns the logical parent of the path:

>>> p = PureWindowsPath('c:/python33/bin/python.exe')
>>> p.parent
PureWindowsPath('c:/python33/bin')

The parents property returns an immutable sequence of the path's logical ancestors:

>>> p = PureWindowsPath('c:/python33/bin/python.exe')
>>> len(p.parents)
3
>>> p.parents[0]
PureWindowsPath('c:/python33/bin')
>>> p.parents[1]
PureWindowsPath('c:/python33')
>>> p.parents[2]
PureWindowsPath('c:/')

Querying

is_absolute() returns True if the path is absolute (see definition above), False otherwise.
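
The flavour-specific definitions of "absolute" show up directly in the stdlib method (named is_absolute() in the pathlib module):

```python
from pathlib import PurePosixPath, PureWindowsPath

# POSIX: absolute means "has a root".
posix_abs = PurePosixPath('/usr/bin').is_absolute()
posix_rel = PurePosixPath('usr/bin').is_absolute()

# Windows: absolute needs both a drive and a root, so a rooted
# path without a drive is still not absolute.
win_abs = PureWindowsPath('c:/Windows').is_absolute()
win_rooted_only = PureWindowsPath('/Windows').is_absolute()
```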

is_reserved() returns True if a Windows path is a reserved path such as CON or NUL. It always returns False for POSIX paths.

match() matches the path against a glob pattern. It operates on individual parts and matches from the right:

>>> p = PurePosixPath('/usr/bin')
>>> p.match('/usr/b*')
True
>>> p.match('usr/b*')
True
>>> p.match('b*')
True
>>> p.match('/u*')
False

This behaviour respects the following expectations:

  • A simple pattern such as "*.py" matches arbitrarily long paths as long as the last part matches, e.g. "/usr/foo/bar.py".
  • Longer patterns can be used as well for more complex matching, e.g. "/usr/foo/*.py" matches "/usr/foo/bar.py".

Concrete paths API

In addition to the operations of the pure API, concrete paths provide additional methods which actually access the filesystem to query or mutate information.

Constructing

The classmethod cwd() creates a path object pointing to the current working directory in absolute form:

>>> Path.cwd()
PosixPath('/home/antoine/pathlib')

File metadata

The stat() method returns the file's stat() result; similarly, lstat() returns the file's lstat() result (which differs only when the file is a symbolic link):

>>> p.stat()
posix.stat_result(st_mode=33277, st_ino=7483155, st_dev=2053, st_nlink=1, st_uid=500, st_gid=500, st_size=928, st_atime=1343597970, st_mtime=1328287308, st_ctime=1343597964)

Higher-level methods help examine the kind of the file:

>>> p.exists()
True
>>> p.is_file()
True
>>> p.is_dir()
False
>>> p.is_symlink()
False
>>> p.is_socket()
False
>>> p.is_fifo()
False
>>> p.is_block_device()
False
>>> p.is_char_device()
False

The file owner and group names (rather than numeric ids) are queried through corresponding methods:

>>> p = Path('/etc/shadow')
>>> p.owner()
'root'
>>> p.group()
'shadow'

Path resolution

The resolve() method makes a path absolute, resolving any symlink on the way (like the POSIX realpath() call). It is the only operation which will remove ".." path components. On Windows, this method will also take care to return the canonical path (with the right casing).
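
A minimal sketch of the symlink-following behaviour, using a throwaway directory (assumes a POSIX system, or one where creating symlinks is permitted):

```python
import tempfile
from pathlib import Path

# resolve() follows symlinks before eliminating ".." components,
# unlike os.path.abspath().
with tempfile.TemporaryDirectory() as tmp:
    real = Path(tmp) / 'real'
    real.mkdir()
    alias = Path(tmp) / 'alias'
    alias.symlink_to(real)          # alias -> real
    resolved = alias.resolve()      # follows the symlink
    expected = real.resolve()       # canonical form of the target
```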

Directory walking

Simple (non-recursive) directory access is done by calling the iterdir() method, which returns an iterator over the child paths:

>>> p = Path('docs')
>>> for child in p.iterdir(): child
...
PosixPath('docs/conf.py')
PosixPath('docs/_templates')
PosixPath('docs/make.bat')
PosixPath('docs/index.rst')
PosixPath('docs/_build')
PosixPath('docs/_static')
PosixPath('docs/Makefile')

This allows simple filtering through list comprehensions:

>>> p = Path('.')
>>> [child for child in p.iterdir() if child.is_dir()]
[PosixPath('.hg'), PosixPath('docs'), PosixPath('dist'), PosixPath('__pycache__'), PosixPath('build')]

Simple and recursive globbing is also provided:

>>> for child in p.glob('**/*.py'): child
...
PosixPath('test_pathlib.py')
PosixPath('setup.py')
PosixPath('pathlib.py')
PosixPath('docs/conf.py')
PosixPath('build/lib/pathlib.py')

File opening

The open() method provides a file opening API similar to the builtin open() function:

>>> p = Path('setup.py')
>>> with p.open() as f: f.readline()
...
'#!/usr/bin/env python3\n'

Filesystem modification

Several common filesystem operations are provided as methods: touch(), mkdir(), rename(), replace(), unlink(), rmdir(), chmod(), lchmod(), symlink_to(). More operations could be provided, for example some of the functionality of the shutil module.
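
A quick tour of the mutating methods, run against a throwaway temporary directory:

```python
import tempfile
from pathlib import Path

# Exercise several of the filesystem-modification methods in order.
with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    (base / 'subdir').mkdir()            # create a directory
    draft = base / 'subdir' / 'draft.txt'
    draft.touch()                        # create an empty file
    draft.rename(base / 'final.txt')     # move/rename it
    moved = (base / 'final.txt').exists()
    original_gone = not draft.exists()
    (base / 'final.txt').unlink()        # remove the file
    (base / 'subdir').rmdir()            # remove the now-empty dir
    leftover = list(base.iterdir())
```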

Detailed documentation of the proposed API can be found at the pathlib docs [8].

Discussion

Division operator

The division operator came out first in a poll [9] about the path joining operator. Initial versions of pathlib [1] used square brackets (i.e. __getitem__) instead.

joinpath()

The joinpath() method was initially called join(), but several people objected that it could be confused with str.join() which has different semantics. Therefore it was renamed to joinpath().

Case-sensitivity

Windows users consider filesystem paths to be case-insensitive and expect path objects to observe that characteristic, even though in some rare situations some foreign filesystem mounts may be case-sensitive under Windows.

In the words of one commenter,

"If glob("*.py") failed to find SETUP.PY on Windows, that would be a usability disaster".

—Paul Moore in https://mail.python.org/pipermail/python-dev/2013-April/125254.html

pep-0429 Python 3.4 Release Schedule

PEP:429
Title:Python 3.4 Release Schedule
Version:$Revision$
Last-Modified:$Date$
Author:Larry Hastings <larry at hastings.org>
Status:Active
Type:Informational
Content-Type:text/x-rst
Created:17-Oct-2012
Python-Version:3.4

Abstract

This document describes the development and release schedule for Python 3.4. The schedule primarily concerns itself with PEP-sized items.

Release Manager and Crew

  • 3.4 Release Manager: Larry Hastings
  • Windows installers: Martin v. Löwis
  • Mac installers: Ned Deily
  • Documentation: Georg Brandl

Release Schedule

The releases:

  • 3.4.0 alpha 1: August 3, 2013
  • 3.4.0 alpha 2: September 9, 2013
  • 3.4.0 alpha 3: September 29, 2013
  • 3.4.0 alpha 4: October 20, 2013
  • 3.4.0 beta 1: November 24, 2013
  • 3.4.0 beta 2: January 5, 2014
  • 3.4.0 beta 3: January 26, 2014
  • 3.4.0 candidate 1: February 10, 2014
  • 3.4.0 candidate 2: February 23, 2014
  • 3.4.0 candidate 3: March 9, 2014
  • 3.4.0 final: March 16, 2014

(Beta 1 was also "feature freeze"--no new features beyond this point.)

3.4.1 schedule

  • 3.4.1 candidate 1: May 5, 2014
  • 3.4.1 final: May 18, 2014

3.4.2 schedule

  • 3.4.2 candidate 1: September 22, 2014
  • 3.4.2 final: October 6, 2014

3.4.3 schedule

  • 3.4.3 candidate 1: February 8, 2015
  • 3.4.3 final: February 25, 2015

Features for 3.4

Implemented / Final PEPs:

  • PEP 428, a "pathlib" module providing object-oriented filesystem paths
  • PEP 435, a standardized "enum" module
  • PEP 436, a build enhancement that will help generate introspection information for builtins
  • PEP 442, improved semantics for object finalization
  • PEP 443, adding single-dispatch generic functions to the standard library
  • PEP 445, a new C API for implementing custom memory allocators
  • PEP 446, changing file descriptors to not be inherited by default in subprocesses
  • PEP 450, a new "statistics" module
  • PEP 451, standardizing module metadata for Python's module import system
  • PEP 453, a bundled installer for the pip package manager
  • PEP 454, a new "tracemalloc" module for tracing Python memory allocations
  • PEP 456, a new hash algorithm for Python strings and binary data
  • PEP 3154, a new and improved protocol for pickled objects
  • PEP 3156, a new "asyncio" module, a new framework for asynchronous I/O

Deferred to post-3.4:

  • PEP 431, improved support for time zone databases
  • PEP 441, improved Python zip application support
  • PEP 447, support for __locallookup__ metaclass method
  • PEP 448, additional unpacking generalizations
  • PEP 455, key transforming dictionary

pep-0430 Migrating to Python 3 as the default online documentation

PEP:430
Title:Migrating to Python 3 as the default online documentation
Version:$Revision$
Last-Modified:$Date$
Author:Nick Coghlan <ncoghlan at gmail.com>
BDFL-Delegate:Georg Brandl
Status:Final
Type:Informational
Content-Type:text/x-rst
Created:27-Oct-2012

Abstract

This PEP proposes a strategy for migrating the default version of the Python documentation presented to users of Python when accessing docs.python.org from 2.7 to Python 3.3.

It proposes a backwards compatible scheme that preserves the meaning of existing deep links into the Python 2 documentation, while still presenting the Python 3 documentation by default, and presenting the Python 2 and 3 documentation in a way that avoids making the Python 3 documentation look like a second-class citizen.

Background

With the transition of the overall Python ecosystem from Python 2 to Python 3 still in progress, one question which arises periodically [1, 2] is when and how to handle the change from providing the Python 2 documentation as the default version displayed at the docs.python.org root URL to providing the Python 3 documentation.

Key Concerns

There are a couple of key concerns that any migration proposal needs to address.

Don't Confuse Beginners

Many beginners learn Python through third party resources. These resources, not all of which are online resources, may link into the python.org online documentation for additional background and details.

Importantly, even when the online documentation is updated, the "version added" and "version changed" tags usually provide enough information for users to adjust appropriately for the specific version they are using.

While deep links into the python.org documentation may occasionally break within the Python 2 series, this is very rare.

Migrating to Python 3 is a very different matter. Many links would break due to renames and removals, and the "version added" and "version changed" information for the Python 2 series is completely absent.

Don't Break Useful Resources

There are many useful Python resources out there, such as the mailing list archives on python.org and question-and-answer sites like Stack Overflow, where links are highly unlikely to be updated, no matter how much notice is provided.

Old posts and answers to questions all currently link to docs.python.org expecting to get the Python 2 documentation at unqualified URLs. Links from answers that relate to Python 3 are explicitly qualified with /py3k/ in the path component.

Proposal

This PEP (based on an idea originally put forward back in May [3]) is to not migrate the Python 2 specific deep links at all, and instead adopt a scheme where all URLs presented to users on docs.python.org are qualified appropriately with the relevant release series.

Visitors to the root URL at http://docs.python.org will be automatically redirected to http://docs.python.org/3/, but unqualified deep links, such as http://docs.python.org/library/os, will instead be redirected to a Python 2 specific link such as http://docs.python.org/2/library/os.

The specific subpaths which will be redirected to explicitly qualified paths for the Python 2 docs are:

  • /c-api/
  • /distutils/
  • /extending/
  • /faq/
  • /howto/
  • /library/
  • /reference/
  • /tutorial/
  • /using/
  • /whatsnew/
  • /about.html
  • /bugs.html
  • /contents.html
  • /copyright.html
  • /license.html
  • /genindex.html
  • /glossary.html
  • /py-modindex.html
  • /search.html

The existing /py3k/ subpath will be redirected to the new /3/ subpath.

Presented URLs

With this scheme, the following URLs would be presented to users after resolution of any aliasing and rewriting rules:

  • http://docs.python.org/x/*
  • http://docs.python.org/x.y/*
  • http://docs.python.org/dev/*
  • http://docs.python.org/release/x.y.z/*
  • http://docs.python.org/devguide

The /x/ URLs mean "give me the latest documentation for a released version in this release series". It will draw the documentation from the relevant maintenance branch in source control (this will always be the 2.7 branch for Python 2 and is currently 3.3 for Python 3). Differences relative to previous versions in the release series will be available through "version added" and "version changed" markers.

The /x.y/ URLs mean "give me the latest documentation for this release". It will draw the documentation from the relevant maintenance branch in source control (or the default branch for the currently in development version). It differs from the status quo in that the URLs will actually remain available in the user's browser for easy copy and pasting. (Currently, references to specific versions that are not the latest in their release series will resolve to a stable URL for a specific maintenance version in the "release" hierarchy, while the current latest version in the release series resolves to the release series URL. This makes it hard to get a "latest version specific URL", since it is always necessary to construct them manually).

The /dev/ URL means the documentation for the default branch in source control.

The /release/x.y.z/ URLs will refer to the documentation of those releases, exactly as it was at the time of the release.

The developer's guide is not version specific, and thus retains its own stable /devguide/ URL.

Rationale

There is some desire to switch the unqualified references to mean Python 3 as a sign of confidence in Python 3. Such a move would either break a lot of things, or else involve an awful lot of work to avoid breaking things.

I believe we can get much the same effect without breaking the world by:

  1. Deprecating the use of unqualified references to the online documentation (while promising to preserve the meaning of such references indefinitely)
  2. Updating all python.org and python-dev controlled links to use qualified references (excluding archived email)
  3. Redirecting visitors to the root of http://docs.python.org to http://docs.python.org/3.x

Most importantly, because this scheme doesn't alter the behaviour of any existing deep links, it could be implemented with a significantly shorter warning period than would be required for a scheme that risked breaking deep links, or started to redirect unqualified links to Python 3. The only part of the scheme which would require any warning at all is the step of redirecting the "http://docs.python.org/" landing page to the Python 3.3 documentation.

Namespaces are one honking great idea - let's do more of those.

Note that the approach described in this PEP gives two ways to access the content of the default branch: as /dev/ or using the appropriate /x.y/ reference. This is deliberate, as the default branch is referenced for two different purposes:

  • to provide additional information when discussing an upcoming feature of the next release (a /x.y/ URL is appropriate)
  • to provide a stable destination for developers to access the documentation of the next feature release, regardless of the version (a /dev/ URL is appropriate)

Implementation

The URLs on docs.python.org are controlled by the python.org infrastructure team rather than through the CPython source repo, so acceptance and implementation of the ideas in this PEP will be up to the team.

pep-0431 Time zone support improvements

PEP:431
Title:Time zone support improvements
Version:$Revision$
Last-Modified:$Date$
Author:Lennart Regebro <regebro at gmail.com>
BDFL-Delegate:Barry Warsaw <barry@python.org>
Status:Draft
Type:Standards Track
Content-Type:text/x-rst
Created:11-Dec-2012
Post-History:11-Dec-2012, 28-Dec-2012, 28-Jan-2013

Abstract

This PEP proposes the implementation of concrete time zone support in the Python standard library, and also improvements to the time zone API to deal with ambiguous time specifications during DST changes.

Proposal

Concrete time zone support

The time zone support in Python has no concrete implementation in the standard library outside of a tzinfo base class that supports fixed offsets. To properly support time zones you need to include a database of all time zones, both current and historical, including daylight saving changes. But such information changes frequently, so even if we include the latest information in a Python release, that information would be outdated just a few months later.

Time zone support has therefore only been available through two third-party modules, pytz and dateutil, both of which include and wrap the "zoneinfo" database. This database, also called "tz" or "the Olson database", is the de facto standard time zone database, and it is included in most Unix and Unix-like operating systems, including OS X.

This gives us the opportunity to include the code that supports the zoneinfo data in the standard library, but by default use the operating system's copy of the data, which typically will be kept updated by the updating mechanism of the operating system or distribution.

For those who have an operating system that does not include the zoneinfo database, for example Windows, the Python source distribution will include a copy of the zoneinfo database, and a distribution containing the latest zoneinfo database will also be available at the Python Package Index, so it can be easily installed with the Python packaging tools such as easy_install or pip. This could also be done on Unices that are no longer receiving updates and therefore have an outdated database.

With such a mechanism Python would have full time zone support in the standard library on any platform, and a simple package installation would provide an updated time zone database on those platforms where the zoneinfo database isn't included, such as Windows, or on platforms where OS updates are no longer provided.

The time zone support will be implemented by making the datetime module into a package, and adding time zone support to datetime based on Stuart Bishop's pytz module.

Getting the local time zone

On Unix there is no standard way of finding the name of the time zone that is being used. The only information available is the time zone abbreviations, such as EST and PDT, but many of those abbreviations are ambiguous, so you can't rely on them to figure out which time zone you are located in.

There is however a standard for finding the compiled time zone information since it's located in /etc/localtime. Therefore it is possible to create a local time zone object with the correct time zone information even though you don't know the name of the time zone. A function in datetime should be provided to return the local time zone.

The support for this will be made by integrating Lennart Regebro's tzlocal module into the new datetime module.

For Windows it will look up the local Windows time zone name, and use a mapping between Windows time zone names and zoneinfo time zone names provided by the Unicode consortium to convert that to a zoneinfo time zone.

The mapping should be updated before each major or bugfix release, scripts for doing so will be provided in the Tools/ directory.

Ambiguous times

When changing over from daylight saving time (DST) the clock is turned back one hour. This means that the times during that hour happen twice, once with DST and then once without DST. Similarly, when changing to daylight saving time, one hour goes missing.

The current time zone API cannot differentiate between the two ambiguous times during a change from DST. For example, in Stockholm the time 2012-10-28 02:00:00 happens twice, both at UTC 2012-10-28 00:00:00 and at UTC 2012-10-28 01:00:00.

The current time zone API cannot disambiguate this, and therefore it's unclear which time should be returned:

# This could be either 00:00 or 01:00 UTC:
>>> dt = datetime(2012, 10, 28, 2, 0, tzinfo=zoneinfo('Europe/Stockholm'))
# But we can not specify which:
>>> dt.astimezone(zoneinfo('UTC'))
datetime.datetime(2012, 10, 28, 1, 0, tzinfo=<UTC>)
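
The two candidate UTC instants can be computed by hand with fixed-offset tzinfo objects (a sketch; the zoneinfo() call above is the API proposed by this PEP and does not exist yet):

```python
from datetime import datetime, timedelta, timezone

# Stockholm is UTC+2 (CEST) before the 2012-10-28 changeover and
# UTC+1 (CET) after it, so the 02:00 wall time occurs twice.
wall = datetime(2012, 10, 28, 2, 0)
before = wall.replace(tzinfo=timezone(timedelta(hours=2)))
after = wall.replace(tzinfo=timezone(timedelta(hours=1)))

utc_before = before.astimezone(timezone.utc)  # 00:00 UTC
utc_after = after.astimezone(timezone.utc)    # 01:00 UTC
```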

pytz solved this problem by adding is_dst parameters to several methods of the tzinfo objects to make it possible to disambiguate times when this is desired.

This PEP proposes to add these is_dst parameters to the relevant methods of the datetime API, and therefore add this functionality directly to datetime. This is likely the hardest part of this PEP, as it involves updating the C implementation of the datetime module, which means writing new code rather than just reorganizing existing external libraries.

Implementation API

The zoneinfo database

The latest version of the zoneinfo database should exist in the Lib/tzdata directory of the Python source control system. This copy of the database should be updated before every Python feature and bug-fix release, but not for releases of Python versions that are in security-fix-only-mode.

Scripts to update the database will be provided in Tools/, and the release instructions will be updated to include this update.

New configure options --enable-internal-timezone-database and --disable-internal-timezone-database will be implemented to enable and disable the installation of this database when installing from source. A source install will default to installing them.

Binary installers for systems that have a system-provided zoneinfo database may skip installing the included database since it would never be used for these platforms. For other platforms, for example Windows, binary installers must install the included database.

Changes in the datetime-module

The public API of the new time zone support contains one new class, one new function, one new exception and four new collections. In addition, several methods on the datetime object get a new is_dst parameter.

New class dsttimezone

This class provides a concrete implementation of the tzinfo base class that implements DST support.

New function zoneinfo(name=None, db_path=None)

This function takes a name argument that must be a string specifying a valid zoneinfo time zone, e.g. "US/Eastern", "Europe/Warsaw" or "Etc/GMT". If not given, the local time zone will be looked up. If an invalid zone name is given, or the local time zone cannot be retrieved, the function raises UnknownTimeZoneError.

The function also takes an optional path to the location of the zoneinfo database which should be used. If not specified, the function will look for databases in the following order:

  1. Check if the tzdata-update module is installed, and then use that database.
  2. Use the database in /usr/share/zoneinfo, if it exists.
  3. Use the Python-provided database in Lib/tzdata.

If no database is found an UnknownTimeZoneError or subclass thereof will be raised with a message explaining that no zoneinfo database can be found, but that you can install one with the tzdata-update package.
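
The lookup order can be sketched as a small function. This is a hedged illustration only: the tzdata_update module name and the Lib/tzdata fallback come from the PEP's description and are hypothetical, not an existing installed API.

```python
import os

def find_zoneinfo_db(explicit_path=None, stdlib_tzdata='Lib/tzdata'):
    if explicit_path is not None:
        return explicit_path            # caller-specified db_path wins
    try:
        import tzdata_update            # 1. hypothetical updatable package
        return tzdata_update.db_path
    except ImportError:
        pass
    if os.path.isdir('/usr/share/zoneinfo'):
        return '/usr/share/zoneinfo'    # 2. system-provided database
    return stdlib_tzdata                # 3. Python's bundled copy
```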

New parameter is_dst

A new is_dst parameter is added to several methods to handle time ambiguity during DST changeovers.

  • tzinfo.utcoffset(dt, is_dst=False)
  • tzinfo.dst(dt, is_dst=False)
  • tzinfo.tzname(dt, is_dst=False)
  • datetime.astimezone(tz, is_dst=False)

The is_dst parameter can be False (default), True, or None.

False will specify that the given datetime should be interpreted as not happening during daylight saving time, i.e. that the time specified is after the change from DST. This is the default, to preserve existing behavior.

True will specify that the given datetime should be interpreted as happening during daylight saving time, i.e. that the time specified is before the change from DST.

None will raise an AmbiguousTimeError exception if the time specified falls during a DST changeover. It will also raise a NonExistentTimeError if a time is specified during the "missing time" in a change to DST.
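The three-valued is_dst handling can be sketched in pure Python with a single hard-coded changeover (illustrative only: the offsets and the 2012 Europe/Stockholm transition times are assumptions baked into the example, not part of the proposed API):

```python
from datetime import datetime, timedelta

class InvalidTimeError(ValueError):
    pass

class AmbiguousTimeError(InvalidTimeError):
    pass

class NonExistentTimeError(InvalidTimeError):
    pass

# Assumed, hard-coded 2012 transitions for Europe/Stockholm:
# DST starts 2012-03-25 02:00 (local) and ends 2012-10-28 03:00 (local).
STD_OFFSET = timedelta(hours=1)  # CET
DST_OFFSET = timedelta(hours=2)  # CEST

def utcoffset(dt, is_dst=False):
    """Resolve the UTC offset of a naive datetime near a DST changeover."""
    spring = datetime(2012, 3, 25, 2, 0)  # start of the "missing" hour
    fall = datetime(2012, 10, 28, 2, 0)   # start of the repeated hour
    if spring <= dt < spring + timedelta(hours=1):
        # Local times skipped by the jump forward do not exist
        if is_dst is None:
            raise NonExistentTimeError(dt)
        return DST_OFFSET if is_dst else STD_OFFSET
    if fall <= dt < fall + timedelta(hours=1):
        # Local times at the jump back occur twice:
        # True means before the change (still DST), False means after
        if is_dst is None:
            raise AmbiguousTimeError(dt)
        return DST_OFFSET if is_dst else STD_OFFSET
    return DST_OFFSET if spring < dt < fall else STD_OFFSET
```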

New exceptions

  • UnknownTimeZoneError

    This exception is a subclass of KeyError and is raised when a time zone specification is given that can't be found:

    >>> datetime.zoneinfo('Europe/New_York')
    Traceback (most recent call last):
    ...
    UnknownTimeZoneError: There is no time zone called 'Europe/New_York'
    
  • InvalidTimeError

    This exception serves as a base for AmbiguousTimeError and NonExistentTimeError, to enable you to trap these two separately. It will subclass from ValueError, so that you can catch these errors together with inputs like the 29th of February 2011.

  • AmbiguousTimeError

    This exception is raised when giving a datetime specification that is ambiguous while setting is_dst to None:

    >>> datetime(2012, 10, 28, 2, 0, tzinfo=zoneinfo('Europe/Stockholm'), is_dst=None)
    Traceback (most recent call last):
    ...
    AmbiguousTimeError: 2012-10-28 02:00:00 is ambiguous in time zone Europe/Stockholm
    
  • NonExistentTimeError

    This exception is raised when giving a datetime specification for a time that, due to daylight saving time, does not exist, while setting is_dst to None:

    >>> datetime(2012, 3, 25, 2, 0, tzinfo=zoneinfo('Europe/Stockholm'), is_dst=None)
    Traceback (most recent call last):
    ...
    NonExistentTimeError: 2012-03-25 02:00:00 does not exist in time zone Europe/Stockholm
    

New collections

  • all_timezones is the exhaustive list of the time zone names that can be used, listed alphabetically.
  • common_timezones is a list of useful, current time zones, listed alphabetically.

The tzdata-update package

The zoneinfo database will be packaged for easy installation with easy_install/pip/buildout. This package will not install any Python code, and will not contain any Python code except that which is needed for installation.

It will be kept updated with the same tools as the internal database, but released whenever the zoneinfo database is updated, and will use the same version scheme.

Differences from the pytz API

  • pytz has the functions localize() and normalize() to work around the fact that tzinfo has no is_dst support. With is_dst implemented directly in datetime.tzinfo, they are no longer needed.
  • The timezone() function is called zoneinfo() to avoid clashing with the timezone class introduced in Python 3.2.
  • zoneinfo() will return the local time zone if called without arguments.
  • The class pytz.StaticTzInfo is there to provide the is_dst support for static time zones. When is_dst support is included in datetime.tzinfo it is no longer needed.
  • InvalidTimeError subclasses from ValueError.

pep-0432 Simplifying the CPython startup sequence

PEP:432
Title:Simplifying the CPython startup sequence
Version:$Revision$
Last-Modified:$Date$
Author:Nick Coghlan <ncoghlan at gmail.com>
Status:Draft
Type:Standards Track
Content-Type:text/x-rst
Created:28-Dec-2012
Python-Version:3.5
Post-History:28-Dec-2012, 2-Jan-2013

Abstract

This PEP proposes a mechanism for simplifying the startup sequence for CPython, making it easier to modify the initialization behaviour of the reference interpreter executable, as well as making it easier to control CPython's startup behaviour when creating an alternate executable or embedding it as a Python execution engine inside a larger application.

Note: TBC = To Be Confirmed, TBD = To Be Determined. The appropriate resolution for most of these should become clearer as the reference implementation is developed.

Proposal

This PEP proposes that CPython move to an explicit multi-phase initialization process, where a preliminary interpreter is put in place with limited OS interaction capabilities early in the startup sequence. This essential core remains in place while all of the configuration settings are determined, until a final configuration call takes those settings and finishes bootstrapping the interpreter immediately before locating and executing the main module.

In the new design, the interpreter will move through the following well-defined phases during the initialization sequence:

  • Pre-Initialization - no interpreter available
  • Initializing - interpreter partially available
  • Initialized - interpreter available, __main__ related metadata incomplete

With the interpreter itself fully initialised, main module execution will then proceed through two phases:

  • Main Preparation - __main__ related metadata populated
  • Main Execution - bytecode executing in the __main__ module namespace

(Embedding applications may choose not to use the Main Preparation and Execution phases)

As a concrete use case to help guide any design changes, and to solve a known problem where the appropriate defaults for system utilities differ from those for running user scripts, this PEP also proposes the creation and distribution of a separate system Python (pysystem) executable which, by default, ignores user site directories and environment variables, and does not implicitly set sys.path[0] based on the current directory or the script being executed (it will, however, still support virtual environments).

To keep the implementation complexity under control, this PEP does not propose wholesale changes to the way the interpreter state is accessed at runtime. Changing the order in which the existing initialization steps occur in order to make the startup sequence easier to maintain is already a substantial change, and attempting to make those other changes at the same time will make the change significantly more invasive and much harder to review. However, such proposals may be suitable topics for follow-on PEPs or patches - one key benefit of this PEP is decreasing the coupling between the internal storage model and the configuration interface, so such changes should be easier once this PEP has been implemented.

Background

Over time, CPython's initialization sequence has become progressively more complicated, offering more options, as well as performing more complex tasks (such as configuring the Unicode settings for OS interfaces in Python 3 [10], bootstrapping a pure Python implementation of the import system, and implementing an isolated mode more suitable for system applications that run with elevated privileges [6]).

Much of this complexity is formally accessible only through the Py_Main and Py_Initialize APIs, offering embedding applications little opportunity for customisation. This creeping complexity also makes life difficult for maintainers, as much of the configuration needs to take place prior to the Py_Initialize call, meaning much of the Python C API cannot be used safely.

A number of proposals are on the table for even more sophisticated startup behaviour, such as better control over sys.path initialization (easily adding additional directories on the command line in a cross-platform fashion [7], as well as controlling the configuration of sys.path[0] [8]), and easier configuration of utilities like coverage tracing when launching Python subprocesses [9].

Rather than continuing to bolt such behaviour onto an already complicated system, this PEP proposes to start simplifying the status quo by introducing a more structured startup sequence, with the aim of making these further feature requests easier to implement.

Key Concerns

There are a couple of key concerns that any change to the startup sequence needs to take into account.

Maintainability

The current CPython startup sequence is difficult to understand, and even more difficult to modify. It is not clear what state the interpreter is in while much of the initialization code executes, leading to behaviour such as lists, dictionaries and Unicode values being created prior to the call to Py_Initialize when the -X or -W options are used [1].

By moving to an explicitly multi-phase startup sequence, developers should only need to understand which features are not available in the core bootstrapping phase, as the vast majority of the configuration process will now take place during that phase.

By basing the new design on a combination of C structures and Python data types, it should also be easier to modify the system in the future to add new configuration options.

Performance

CPython is used heavily to run short scripts where the runtime is dominated by the interpreter initialization time. Any changes to the startup sequence should minimise their impact on the startup overhead.

Experience with the importlib migration suggests that the startup time is dominated by IO operations. However, to monitor the impact of any changes, a simple benchmark can be used to check how long it takes to start and then tear down the interpreter:

python3 -m timeit -s "from subprocess import call" "call(['./python', '-c', 'pass'])"

Current numbers on my system for Python 3.5 (using the 3.4 subprocess and timeit modules to execute the check, all with non-debug builds):

$ python3 -m timeit -s "from subprocess import call" "call(['./python', '-c', 'pass'])"
10 loops, best of 3: 18.2 msec per loop

This PEP is not expected to have any significant effect on the startup time, as it is aimed primarily at reordering the existing initialization sequence, without making substantial changes to the individual steps.

However, if this simple check suggests that the proposed changes to the initialization sequence may pose a performance problem, then a more sophisticated microbenchmark will be developed to assist in investigation.

Required Configuration Settings

A comprehensive configuration scheme requires that an embedding application be able to control the following aspects of the final interpreter state:

  • Whether or not to use randomised hashes (and if used, potentially specify a specific random seed)
  • Whether or not to enable the import system (required by CPython's build process when freezing the importlib._bootstrap bytecode)
  • The "Where is Python located?" elements in the sys module: * sys.executable * sys.base_exec_prefix * sys.base_prefix * sys.exec_prefix * sys.prefix
  • The path searched for imports from the filesystem (and other path hooks): * sys.path
  • The command line arguments seen by the interpeter: * sys.argv
  • The filesystem encoding used by: * sys.getfsencoding * os.fsencode * os.fsdecode
  • The IO encoding (if any) and the buffering used by: * sys.stdin * sys.stdout * sys.stderr
  • The initial warning system state: * sys.warnoptions
  • Arbitrary extended options (e.g. to automatically enable faulthandler): * sys._xoptions
  • Whether or not to implicitly cache bytecode files: * sys.dont_write_bytecode
  • Whether or not to enforce correct case in filenames on case-insensitive platforms * os.environ["PYTHONCASEOK"]
  • The other settings exposed to Python code in sys.flags:
    • debug (Enable debugging output in the pgen parser)
    • inspect (Enter interactive interpreter after __main__ terminates)
    • interactive (Treat stdin as a tty)
    • optimize (__debug__ status, write .pyc or .pyo, strip doc strings)
    • no_user_site (don't add the user site directory to sys.path)
    • no_site (don't implicitly import site during startup)
    • ignore_environment (whether environment vars are used during config)
    • verbose (enable all sorts of random output)
    • bytes_warning (warnings/errors for implicit str/bytes interaction)
    • quiet (disable banner output even if verbose is also enabled or stdin is a tty and the interpreter is launched in interactive mode)
  • Whether or not CPython's signal handlers should be installed
  • What code (if any) should be executed as __main__:
    • Nothing (just create an empty module)
    • A filesystem path referring to a Python script (source or bytecode)
    • A filesystem path referring to a valid sys.path entry (typically a directory or zipfile)
    • A given string (equivalent to the "-c" option)
    • A module or package (equivalent to the "-m" option)
    • Standard input as a script (i.e. a non-interactive stream)
    • Standard input as an interactive interpreter session

<TBD: Did I miss anything?>

Note that this just covers settings that are currently configurable in some manner when using the main CPython executable. While this PEP aims to make adding additional configuration settings easier in the future, it deliberately avoids adding any new settings of its own (except where such additional settings arise naturally in the course of migrating existing settings to the new structure).

Design Details

(Note: details here are still very much in flux, but preliminary feedback is appreciated anyway)

The main theme of this proposal is to create the interpreter state for the main interpreter much earlier in the startup process. This will allow most of the CPython API to be used during the remainder of the initialization process, potentially simplifying a number of operations that currently need to rely on basic C functionality rather than being able to use the richer data structures provided by the CPython C API.

In the following, the term "embedding application" also covers the standard CPython command line application.

Interpreter Initialization Phases

Three distinct interpreter initialisation phases are proposed:

  • Pre-Initialization:
    • no interpreter is available.
    • Py_IsInitializing() returns 0
    • Py_IsInitialized() returns 0
    • The embedding application determines the settings required to create the main interpreter and moves to the next phase by calling Py_BeginInitialization.
  • Initializing:
    • the main interpreter is available, but only partially configured.
    • Py_IsInitializing() returns 1
    • Py_IsInitialized() returns 0
    • The embedding application determines and applies the settings required to complete the initialization process by calling Py_ReadConfig and Py_EndInitialization.
  • Initialized:
    • the main interpreter is available and fully operational, but __main__ related metadata is incomplete
    • Py_IsInitializing() returns 0
    • Py_IsInitialized() returns 1
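
As a rough illustration, the relationship between the phases and the two query functions can be modelled in Python (a toy sketch only; the real API consists of the C-level functions named above):

```python
class InterpreterPhases:
    """Toy model of the phase transitions and query functions. The method
    names mirror the proposed C API; this is illustrative, not real CPython."""

    def __init__(self):
        self._initializing = False
        self._initialized = False

    def is_initializing(self):  # models Py_IsInitializing()
        return 1 if self._initializing else 0

    def is_initialized(self):   # models Py_IsInitialized()
        return 1 if self._initialized else 0

    def begin_initialization(self):  # models Py_BeginInitialization()
        # Calling this again while initializing/initialized is a fatal
        # error in the proposed C API
        if self._initializing or self._initialized:
            raise RuntimeError("interpreter already (being) initialized")
        self._initializing = True

    def end_initialization(self):    # models Py_EndInitialization()
        if not self._initializing or self._initialized:
            raise RuntimeError("not in the Initializing phase")
        self._initializing = False
        self._initialized = True
```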

Main Execution Phases

After initializing the interpreter, the embedding application may continue on to execute code in the __main__ module namespace.

  • Main Preparation:
    • subphase of Initialized (not separately identified at runtime)
    • fully populates __main__ related metadata
    • may execute code in __main__ namespace (e.g. PYTHONSTARTUP)
    • invoked as PyRun_PrepareMain
  • Main Execution:
    • subphase of Initialized (not separately identified at runtime)
    • user supplied bytecode is being executed in the __main__ namespace
    • invoked as PyRun_ExecMain

Invocation of Phases

All listed phases will be used by the standard CPython interpreter and the proposed System Python interpreter. Other embedding applications may choose to skip the step of executing code in the __main__ namespace.

An embedding application may still leave initialization almost entirely under CPython's control by using the existing Py_Initialize API. Alternatively, an embedding application that wants greater control over CPython's initial state will be able to use the new, finer grained API:

/* Phase 1: Pre-Initialization */
PyCoreConfig core_config = PyCoreConfig_INIT;
PyConfig config = PyConfig_INIT;
/* Easily control the core configuration */
core_config.ignore_environment = 1; /* Ignore environment variables */
core_config.use_hash_seed = 0;      /* Full hash randomisation */
Py_BeginInitialization(&core_config);
/* Phase 2: Initialization */
/* Optionally preconfigure some settings here - they will then be
 * used to derive other settings */
Py_ReadConfig(&config);
/* Can completely override derived settings here */
Py_EndInitialization(&config);
/* Phase 3: Initialized */
/* If an embedding application has no real concept of a main module
 * it can just stop the initialization process here.
 * Alternatively, it can launch __main__ via the PyRun_*Main functions.
 */

Pre-Initialization Phase

The pre-initialization phase is where an embedding application determines the settings which are absolutely required before the interpreter can be initialized at all. Currently, the primary configuration settings in this category are those related to the randomised hash algorithm - the hash algorithms must be consistent for the lifetime of the process, and so they must be in place before the core interpreter is created.

The specific settings needed are a flag indicating whether or not to use a specific seed value for the randomised hashes, and if so, the specific value for the seed (a seed value of zero disables randomised hashing). In addition, due to the possible use of PYTHONHASHSEED in configuring the hash randomisation, the question of whether or not to consider environment variables must also be addressed early. Finally, to support the CPython build process, an option is offered to completely disable the import system.

The proposed API for this step in the startup sequence is:

void Py_BeginInitialization(const PyCoreConfig *config);

Like Py_Initialize, this part of the new API treats initialization failures as fatal errors. While that's still not particularly embedding friendly, the operations in this step really shouldn't be failing, and changing them to return error codes instead of aborting would be an even larger task than the one already being proposed.

The new PyCoreConfig struct holds the settings required for preliminary configuration:

/* Note: if changing anything in PyCoreConfig, also update
 * PyCoreConfig_INIT */
typedef struct {
    int ignore_environment;   /* -E switch, -I switch */
    int use_hash_seed;        /* PYTHONHASHSEED */
    unsigned long hash_seed;  /* PYTHONHASHSEED */
    int _disable_importlib;   /* Needed by freeze_importlib */
} PyCoreConfig;

#define PyCoreConfig_INIT {0, -1, 0, 0}

The core configuration settings pointer may be NULL, in which case the defaults from PyCoreConfig_INIT are used (in particular, ignore_environment = 0 and use_hash_seed = -1).

The PyCoreConfig_INIT macro is designed to allow easy initialization of a struct instance with sensible defaults:

PyCoreConfig core_config = PyCoreConfig_INIT;

ignore_environment controls the processing of all Python related environment variables. If the flag is zero, then environment variables are processed normally. Otherwise, all Python-specific environment variables are considered undefined (exceptions may be made for some OS specific environment variables, such as those used on Mac OS X to communicate between the App bundle and the main Python binary).

use_hash_seed controls the configuration of the randomised hash algorithm. If it is zero, then randomised hashes with a random seed will be used. If it is positive, then the value in hash_seed will be used to seed the random number generator. If hash_seed is zero in this case, then randomised hashing is disabled completely.

If use_hash_seed is negative (and ignore_environment is zero), then CPython will inspect the PYTHONHASHSEED environment variable. If the environment variable is not set, is set to the empty string, or to the value "random", then randomised hashes with a random seed will be used. If the environment variable is set to the string "0" the randomised hashing will be disabled. Otherwise, the hash seed is expected to be a string representation of an integer in the range [0; 4294967295].

To make it easier for embedding applications to use the PYTHONHASHSEED processing with a different data source, the following helper function will be added to the C API:

int Py_ReadHashSeed(char *seed_text,
                    int *use_hash_seed,
                    unsigned long *hash_seed);

This function accepts a seed string in seed_text and converts it to the appropriate flag and seed values. If seed_text is NULL, the empty string or the value "random", both use_hash_seed and hash_seed will be set to zero. Otherwise, use_hash_seed will be set to 1 and the seed text will be interpreted as an integer and reported as hash_seed. On success the function will return zero. A non-zero return value indicates an error (most likely in the conversion to an integer).
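The intended semantics can be modelled in pure Python (illustrative only; the C function reports failure via a non-zero return value rather than an exception, and the range check follows the [0, 4294967295] bound stated above):

```python
def read_hash_seed(seed_text):
    """Pure-Python model of the proposed Py_ReadHashSeed behaviour.

    Returns a (use_hash_seed, hash_seed) pair; a malformed seed raises
    ValueError here, where the C function would return non-zero instead.
    """
    if seed_text is None or seed_text in ("", "random"):
        return 0, 0  # use randomised hashes with a random seed
    seed = int(seed_text)  # may raise ValueError for non-integer text
    if not 0 <= seed <= 4294967295:
        raise ValueError("hash seed must be in the range [0, 4294967295]")
    return 1, seed  # note: a seed of 0 disables randomisation entirely
```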

The _disable_importlib setting is used as part of the CPython build process to create an interpreter with no import capability at all. It is considered private to the CPython development team (hence the leading underscore), as the only known use case is to permit compiler changes that invalidate the previously frozen bytecode for importlib._bootstrap without breaking the build process.

The aim is to keep this initial level of configuration as small as possible in order to keep the bootstrapping environment consistent across different embedding applications. If we can create a valid interpreter state without the setting, then the setting should go in the configuration passed to Py_EndInitialization() rather than in the core configuration.

A new query API will allow code to determine if the interpreter is in the bootstrapping state between the creation of the interpreter state and the completion of the bulk of the initialization process:

int Py_IsInitializing();

Attempting to call Py_BeginInitialization() again when Py_IsInitializing() or Py_IsInitialized() is true is a fatal error.

While in the initializing state, the interpreter should be fully functional except that:

  • compilation is not allowed (as the parser and compiler are not yet configured properly)
  • creation of subinterpreters is not allowed
  • creation of additional thread states is not allowed
  • The following attributes in the sys module are all either missing or None:
    • sys.path
    • sys.argv
    • sys.executable
    • sys.base_exec_prefix
    • sys.base_prefix
    • sys.exec_prefix
    • sys.prefix
    • sys.warnoptions
    • sys.flags
    • sys.dont_write_bytecode
    • sys.stdin
    • sys.stdout
  • The filesystem encoding is not yet defined
  • The IO encoding is not yet defined
  • CPython signal handlers are not yet installed
  • only builtin and frozen modules may be imported (due to above limitations)
  • sys.stderr is set to a temporary IO object using unbuffered binary mode
  • The warnings module is not yet initialized
  • The __main__ module does not yet exist

<TBD: identify any other notable missing functionality>

The main things made available by this step will be the core Python datatypes, in particular dictionaries, lists and strings. This allows them to be used safely for all of the remaining configuration steps (unlike the status quo).

In addition, the current thread will possess a valid Python thread state, allowing any further configuration data to be stored on the interpreter object rather than in C process globals.

Any call to Py_BeginInitialization() must have a matching call to Py_Finalize(). It is acceptable to skip calling Py_EndInitialization() in between (e.g. if attempting to read the configuration settings fails).

Determining the remaining configuration settings

The next step in the initialization sequence is to determine the full settings needed to complete the process. No changes are made to the interpreter state at this point. The core API for this step is:

int Py_ReadConfig(PyConfig *config);

The config argument should be a pointer to a config struct (which may be a temporary one stored on the C stack). For any already configured value (i.e. non-NULL pointer or non-negative numeric value), CPython will sanity check the supplied value, but otherwise accept it as correct.

A struct is used rather than a Python dictionary as the struct is easier to work with from C, the list of supported fields is fixed for a given CPython version and only a read-only view needs to be exposed to Python code (which is relatively straightforward, thanks to the infrastructure already put in place to expose sys.implementation).

Unlike Py_Initialize and Py_BeginInitialization, this call will raise an exception and report an error return, rather than aborting with a fatal error, if a problem is found with the config data.

Any supported configuration setting which is not already set will be populated appropriately in the supplied configuration struct. The default configuration can be overridden entirely by setting a value before calling Py_ReadConfig. The provided value will then also be used in calculating any other settings derived from it.

Alternatively, settings may be overridden after the Py_ReadConfig call (this can be useful if an embedding application wants to adjust a setting rather than replace it completely, such as removing sys.path[0]).

Merely reading the configuration has no effect on the interpreter state: it only modifies the passed in configuration struct. The settings are not applied to the running interpreter until the Py_EndInitialization call (see below).
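The fill-in rule can be modelled with a small Python sketch (the field names and dict representation are illustrative; the real struct fields use NULL pointers and -1 flags as the "not set" markers):

```python
UNSET = None  # pointer fields; numeric flags use -1 for "not set"

def read_config(config, computed_defaults):
    """Model of the Py_ReadConfig fill-in rule: fields already set by
    the embedding application are kept unchanged, while unset fields
    receive the computed default value."""
    filled = dict(config)
    for name, default in computed_defaults.items():
        value = filled.get(name, UNSET)
        if value is UNSET or value == -1:
            filled[name] = default
    return filled
```

An embedding application would thus preset only the fields it cares about and let the call derive the rest.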

Supported configuration settings

The new PyConfig struct holds the settings required to complete the interpreter configuration. All fields are either pointers to Python data types (not set == NULL) or numeric flags (not set == -1):

/* Note: if changing anything in PyConfig, also update PyConfig_INIT */
typedef struct {
    /* Argument processing */
    PyListObject *raw_argv;
    PyListObject *argv;
    PyListObject *warnoptions; /* -W switch, PYTHONWARNINGS */
    PyDictObject *xoptions;    /* -X switch */

    /* Filesystem locations */
    PyUnicodeObject *program_name;
    PyUnicodeObject *executable;
    PyUnicodeObject *prefix;           /* PYTHONHOME */
    PyUnicodeObject *exec_prefix;      /* PYTHONHOME */
    PyUnicodeObject *base_prefix;      /* pyvenv.cfg */
    PyUnicodeObject *base_exec_prefix; /* pyvenv.cfg */

    /* Site module */
    int enable_site_config;  /* -S switch (inverted) */
    int no_user_site;        /* -s switch, PYTHONNOUSERSITE */

    /* Import configuration */
    int dont_write_bytecode;        /* -B switch, PYTHONDONTWRITEBYTECODE */
    int ignore_module_case;         /* PYTHONCASEOK */
    PyListObject    *import_path;   /* PYTHONPATH (etc) */

    /* Standard streams */
    int use_unbuffered_io;            /* -u switch, PYTHONUNBUFFEREDIO */
    PyUnicodeObject *stdin_encoding;  /* PYTHONIOENCODING */
    PyUnicodeObject *stdin_errors;    /* PYTHONIOENCODING */
    PyUnicodeObject *stdout_encoding; /* PYTHONIOENCODING */
    PyUnicodeObject *stdout_errors;   /* PYTHONIOENCODING */
    PyUnicodeObject *stderr_encoding; /* PYTHONIOENCODING */
    PyUnicodeObject *stderr_errors;   /* PYTHONIOENCODING */

    /* Filesystem access */
    PyUnicodeObject *fs_encoding;

    /* Debugging output */
    int debug_parser;    /* -d switch, PYTHONDEBUG */
    int verbosity;       /* -v switch */

    /* Code generation */
    int bytes_warnings;  /* -b switch */
    int optimize;        /* -O switch */

    /* Signal handling */
    int install_signal_handlers;

    /* Implicit execution */
    PyUnicodeObject *startup_file;  /* PYTHONSTARTUP */

    /* Main module
     *
     * If prepare_main is set, at most one of the main_* settings should
     * be set before calling PyRun_PrepareMain (Py_ReadConfiguration will
     * set one of them based on the command line arguments if prepare_main
     * is non-zero when that API is called).
     */
    int prepare_main;
    PyUnicodeObject *main_source; /* -c switch */
    PyUnicodeObject *main_path;   /* filesystem path */
    PyUnicodeObject *main_module; /* -m switch */
    PyCodeObject    *main_code;   /* Run directly from a code object */
    PyObject        *main_stream; /* Run from stream */
    int run_implicit_code;        /* Run implicit code during prep */

    /* Interactive main
     *
     * Note: Settings related to interactive mode are very much in flux.
     */
    PyObject *prompt_stream;      /* Output interactive prompt */
    int show_banner;              /* -q switch (inverted) */
    int inspect_main;             /* -i switch, PYTHONINSPECT */

} PyConfig;


/* Struct initialization is pretty ugly in C89. Avoiding this mess would
 * be the most attractive aspect of using a PyDictObject* instead... */
#define _PyArgConfig_INIT  NULL, NULL, NULL, NULL
#define _PyLocationConfig_INIT  NULL, NULL, NULL, NULL, NULL, NULL
#define _PySiteConfig_INIT  -1, -1
#define _PyImportConfig_INIT  -1, -1, NULL
#define _PyStreamConfig_INIT  -1, NULL, NULL, NULL, NULL, NULL, NULL
#define _PyFilesystemConfig_INIT  NULL
#define _PyDebuggingConfig_INIT  -1, -1
#define _PyCodeGenConfig_INIT  -1, -1
#define _PySignalConfig_INIT  -1
#define _PyImplicitConfig_INIT  NULL
#define _PyMainConfig_INIT  -1, NULL, NULL, NULL, NULL, NULL, -1
#define _PyInteractiveConfig_INIT  NULL, -1, -1

#define PyConfig_INIT {_PyArgConfig_INIT, _PyLocationConfig_INIT, \
                       _PySiteConfig_INIT, _PyImportConfig_INIT, \
                       _PyStreamConfig_INIT, _PyFilesystemConfig_INIT, \
                       _PyDebuggingConfig_INIT, _PyCodeGenConfig_INIT, \
                       _PySignalConfig_INIT, _PyImplicitConfig_INIT, \
                       _PyMainConfig_INIT, _PyInteractiveConfig_INIT}

<TBD: did I miss anything?>

Completing the interpreter initialization

The final step in the initialization process is to actually put the configuration settings into effect and finish bootstrapping the interpreter up to full operation:

int Py_EndInitialization(const PyConfig *config);

Like Py_ReadConfig, this call will raise an exception and report an error return, rather than aborting with a fatal error, if a problem is found with the config data.

All configuration settings are required - the configuration struct should always be passed through Py_ReadConfig() to ensure it is fully populated.

After a successful call, Py_IsInitializing() will be false, while Py_IsInitialized() will become true. The caveats described above for the interpreter during the initialization phase will no longer hold.

Attempting to call Py_EndInitialization() again when Py_IsInitializing() is false or Py_IsInitialized() is true is an error.

However, some metadata related to the __main__ module may still be incomplete:

  • sys.argv[0] may not yet have its final value

    • it will be -m when executing a module or package with CPython

    • it will be the same as sys.path[0] rather than the location of the __main__ module when executing a valid sys.path entry (typically a zipfile or directory)

    • otherwise, it will be accurate:

      • the script name if running an ordinary script
      • -c if executing a supplied string
      • - or the empty string if running from stdin
  • the metadata in the __main__ module will still indicate it is a builtin module

This function will normally implicitly import site as its final operation (after Py_IsInitialized() is already set). Clearing the "enable_site_config" flag in the configuration settings will disable this behaviour, as well as eliminating any side effects on global state if import site is later explicitly executed in the process.

Preparing the main module

This subphase completes the population of the __main__ module related metadata, without actually starting execution of the __main__ module code.

It is handled by calling the following API:

int PyRun_PrepareMain();

The actual processing is driven by the main related settings stored in the interpreter state as part of the configuration struct.

If prepare_main is zero, this call does nothing.

If all of main_source, main_path, main_module, main_stream and main_code are NULL, this call does nothing.

If more than one of main_source, main_path, main_module, main_stream or main_code are set, RuntimeError will be reported.

If main_code is already set, then this call does nothing.

If main_stream is set, and run_implicit_code is also set, then the file identified in startup_file will be read, compiled and executed in the __main__ namespace.

If main_source, main_path or main_module are set, then this call will take whatever steps are needed to populate main_code:

  • For main_source, the supplied string will be compiled and saved to main_code.

  • For main_path:
    • if the supplied path is recognised as a valid sys.path entry, it is inserted as sys.path[0], main_module is set to __main__ and processing continues as for main_module below.
    • otherwise, path is read as a CPython bytecode file
    • if that fails, it is read as a Python source file and compiled
    • in the latter two cases, the code object is saved to main_code and __main__.__file__ is set appropriately
  • For main_module:
    • any parent package is imported
    • the loader for the module is determined
    • if the loader indicates the module is a package, add .__main__ to the end of main_module and try again (if the final name segment is already .__main__ then fail immediately)
    • once the module source code is located, save the compiled module code as main_code and populate the following attributes in __main__ appropriately: __name__, __loader__, __file__, __cached__, __package__.

(Note: the behaviour described in this section isn't new, it's a write-up of the current behaviour of the CPython interpreter adjusted for the new configuration system)
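As an illustrative sketch only, the consistency checks described above for PyRun_PrepareMain() might be rendered in Python as follows; the function name and return values here are invented for illustration, not part of the proposed C API:

```python
# Hypothetical Python rendering of PyRun_PrepareMain's decision logic.
def check_main_config(main_source=None, main_path=None, main_module=None,
                      main_stream=None, main_code=None, prepare_main=1):
    if not prepare_main:
        return "nothing to do"
    settings = (main_source, main_path, main_module, main_stream, main_code)
    # If none of the main execution settings are provided, there is no work.
    if all(s is None for s in settings):
        return "nothing to do"
    # Setting more than one of them is reported as an error.
    if sum(s is not None for s in settings) > 1:
        raise RuntimeError("conflicting __main__ configuration settings")
    # If main_code is already populated, nothing further is needed.
    if main_code is not None:
        return "nothing to do"
    return "populate main_code"

print(check_main_config(main_source="print('hi')"))  # populate main_code
```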

Executing the main module

This subphase covers the execution of the actual __main__ module code.

It is handled by calling the following API:

int PyRun_ExecMain();

The actual processing is driven by the main related settings stored in the interpreter state as part of the configuration struct.

If both main_stream and main_code are NULL, this call does nothing.

If both main_stream and main_code are set, RuntimeError will be reported.

If main_stream and prompt_stream are both set, main execution will be delegated to a new API:

int PyRun_InteractiveMain(PyObject *input, PyObject* output);

If main_stream is set and prompt_stream is NULL, main execution will be delegated to a new API:

int PyRun_StreamInMain(PyObject *input);

If main_code is set, main execution will be delegated to a new API:

int PyRun_CodeInMain(PyCodeObject *code);

After execution of main completes, if inspect_main is set, or the PYTHONINSPECT environment variable has been set, then PyRun_ExecMain will invoke PyRun_InteractiveMain(sys.__stdin__, sys.__stdout__).
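The dispatch rules above can be summarised in an illustrative Python sketch; the function name and return values are invented for illustration, with the strings naming the delegated APIs described in this section:

```python
# Hypothetical Python rendering of PyRun_ExecMain's dispatch rules.
def choose_exec_api(main_stream=None, main_code=None, prompt_stream=None):
    if main_stream is None and main_code is None:
        return "nothing to do"
    if main_stream is not None and main_code is not None:
        raise RuntimeError("main_stream and main_code are mutually exclusive")
    if main_stream is not None:
        # A prompt stream selects the interactive API.
        if prompt_stream is not None:
            return "PyRun_InteractiveMain"
        return "PyRun_StreamInMain"
    return "PyRun_CodeInMain"

print(choose_exec_api(main_stream=object(), prompt_stream=object()))
```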

Internal Storage of Configuration Data

The interpreter state will be updated to include details of the configuration settings supplied during initialization by extending the interpreter state object with an embedded copy of the PyCoreConfig and PyConfig structs.

For debugging purposes, the configuration settings will be exposed as a sys._configuration simple namespace (similar to sys.flags and sys.implementation). Field names will match those in the configuration structs, except for hash_seed, which will be deliberately excluded.

An underscored attribute is chosen deliberately, as these configuration settings are part of the CPython implementation, rather than part of the Python language definition. If settings are needed to support cross-implementation compatibility in the standard library, then those should be agreed with the other implementations and exposed as new required attributes on sys.implementation, as described in PEP 421.

These are snapshots of the initial configuration settings. They are not modified by the interpreter during runtime (except as noted above).

Creating and Configuring Subinterpreters

As the new configuration settings are stored in the interpreter state, they need to be initialised when a new subinterpreter is created. This turns out to be trickier than one might think due to PyThreadState_Swap(NULL) (which is fortunately exercised by CPython's own embedding tests, allowing this problem to be detected during development).

To provide a straightforward solution for this case, the PEP proposes to add a new API:

PyInterpreterState *PyInterpreterState_Main();

This will be a counterpart to PyInterpreterState_Head(), reporting the oldest currently existing interpreter rather than the newest. If Py_NewInterpreter() is called from a thread with an existing thread state, then the interpreter configuration for that thread will be used when initialising the new subinterpreter. If there is no current thread state, the configuration from PyInterpreterState_Main() will be used.

While the existing Py_InterpreterState_Head() API could be used instead, that reference changes as subinterpreters are created and destroyed, while PyInterpreterState_Main() will always refer to the initial interpreter state created in Py_BeginInitialization().

A new constraint is also added to the embedding API: attempting to delete the main interpreter while subinterpreters still exist will now be a fatal error.

Stable ABI

Most of the APIs proposed in this PEP are excluded from the stable ABI, as embedding a Python interpreter involves a much higher degree of coupling than merely writing an extension.

The only newly exposed API that will be part of the stable ABI is the Py_IsInitializing() query.

Build time configuration

This PEP makes no changes to the handling of build time configuration settings, and thus has no effect on the contents of sys.implementation or the result of sysconfig.get_config_vars().

Backwards Compatibility

Backwards compatibility will be preserved primarily by ensuring that Py_ReadConfig() interrogates all the previously defined configuration settings stored in global variables and environment variables, and that Py_EndInitialization() writes affected settings back to the relevant locations.

One acknowledged incompatibility is that some environment variables which are currently read lazily may instead be read once during interpreter initialization. As the PEP matures, these will be discussed in more detail on a case-by-case basis. The environment variables which are currently known to be looked up dynamically are:

  • PYTHONCASEOK: writing to os.environ['PYTHONCASEOK'] will no longer dynamically alter the interpreter's handling of filename case differences on import (TBC)
  • PYTHONINSPECT: os.environ['PYTHONINSPECT'] will still be checked after execution of the __main__ module terminates

The Py_Initialize() style of initialization will continue to be supported. It will use (at least some elements of) the new API internally, but will continue to exhibit the same behaviour as it does today, ensuring that sys.argv is not populated until a subsequent PySys_SetArgv call. All APIs that currently support being called prior to Py_Initialize() will continue to do so, and will also support being called prior to Py_BeginInitialization().

To minimise unnecessary code churn, and to ensure that backwards compatibility is well tested, the main CPython executable may continue to use some elements of the old style initialization API. (very much TBC)

A System Python Executable

When executing system utilities with administrative access to a system, many of the default behaviours of CPython are undesirable, as they may allow untrusted code to execute with elevated privileges. The most problematic aspects are the fact that user site directories are enabled, environment variables are trusted and that the directory containing the executed file is placed at the beginning of the import path.

Issue 16499 [6] proposes adding a -I option to change the behaviour of the normal CPython executable, but this is a hard-to-discover solution (and adds yet another option to an already complex CLI). This PEP proposes to instead add a separate pysystem executable.

Currently, providing a separate executable with different default behaviour would be prohibitively hard to maintain. One of the goals of this PEP is to make it possible to replace much of the hard to maintain bootstrapping code with more normal CPython code, as well as making it easier for a separate application to make use of key components of Py_Main. Including this change in the PEP is designed to help avoid acceptance of a design that sounds good in theory but proves to be problematic in practice.

Cleanly supporting this kind of "alternate CLI" is the main reason for the proposed changes to better expose the core logic for deciding between the different execution modes supported by CPython:

  • script execution
  • directory/zipfile execution
  • command execution ("-c" switch)
  • module or package execution ("-m" switch)
  • execution from stdin (non-interactive)
  • interactive stdin

Actually implementing this may also reveal the need for some better argument parsing infrastructure for use during the initializing phase.

Open Questions

  • Error details for Py_ReadConfiguration and Py_EndInitialization (these should become clear as the implementation progresses)
  • Should there be Py_PreparingMain() and Py_RunningMain() query APIs?
  • Should the answer to Py_IsInitialized() be exposed via the sys module?
  • Is initialisation of the PyConfig struct too unwieldy to be maintainable? Would a Python dictionary be a better choice, despite being harder to work with from C code?
  • Would it be better to manage the flag variables in PyConfig as Python integers or as "negative means false, positive means true, zero means not set" so the struct can be initialized with a simple memset(&config, 0, sizeof(*config)), eliminating the need to update both PyConfig and PyConfig_INIT when adding new fields?
  • The name of the new system Python executable is a bikeshed waiting to be painted. The 3 options considered so far are spython, pysystem and python-minimal. The PEP text reflects my current preferred choice (pysystem).

Implementation

The reference implementation is being developed as a feature branch in my BitBucket sandbox [2]. Pull requests to fix the inevitably broken Windows builds are welcome, but the basic design is still in too much flux for other pull requests to be feasible just yet. Once the overall design settles down and it's a matter of migrating individual settings over to the new design, that level of collaboration should become more practical.

As the number of application binaries created by the build process is now four, the reference implementation also creates a new top level "Apps" directory in the CPython source tree. The source files for the main python binary and the new pysystem binary will be located in that directory. The source files for the _freeze_importlib binary and the _testembed binary have been moved out of the Modules directory (which is intended for CPython builtin and extension modules) and into the Tools directory.

The Status Quo

The current mechanisms for configuring the interpreter have accumulated in a fairly ad hoc fashion over the past 20+ years, leading to a rather inconsistent interface with varying levels of documentation.

(Note: some of the info below could probably be cleaned up and added to the C API documentation for at least 3.3. - it's all CPython specific, so it doesn't belong in the language reference)

Ignoring Environment Variables

The -E command line option allows all environment variables to be ignored when initializing the Python interpreter. An embedding application can enable this behaviour by setting Py_IgnoreEnvironmentFlag before calling Py_Initialize().

In the CPython source code, the Py_GETENV macro implicitly checks this flag, and always produces NULL if it is set.

<TBD: I believe PYTHONCASEOK is checked regardless of this setting > <TBD: Does -E also ignore Windows registry keys? >

Randomised Hashing

The randomised hashing is controlled via the -R command line option (in releases prior to 3.3), as well as the PYTHONHASHSEED environment variable.

In Python 3.3, only the environment variable remains relevant. It can be used to disable randomised hashing (by using a seed value of 0) or else to force a specific hash value (e.g. for repeatability of testing, or to share hash values between processes)

However, embedding applications must use the Py_HashRandomizationFlag to explicitly request hash randomisation (CPython sets it in Py_Main() rather than in Py_Initialize()).

The new configuration API should make it straightforward for an embedding application to reuse the PYTHONHASHSEED processing with a text based configuration setting provided by other means (e.g. a config file or separate environment variable).
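For example, the cross-process repeatability that PYTHONHASHSEED already enables can be demonstrated from Python with the subprocess module:

```python
import os
import subprocess
import sys

def child_hash(seed):
    # Run a fresh interpreter with a fixed PYTHONHASHSEED and report
    # hash('example') as computed in that child process.
    env = dict(os.environ, PYTHONHASHSEED=seed)
    return subprocess.run(
        [sys.executable, "-c", "print(hash('example'))"],
        capture_output=True, text=True, env=env,
    ).stdout.strip()

# With the same seed, two separate processes agree on the hash value.
print(child_hash("42") == child_hash("42"))  # True
```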

Locating Python and the standard library

The location of the Python binary and the standard library is influenced by several elements. The algorithm used to perform the calculation is not documented anywhere other than in the source code [3], [4]. Even that description is incomplete, as it was not updated for the virtual environment support added in Python 3.3 (detailed in PEP 405).

These calculations are affected by the following function calls (made prior to calling Py_Initialize()) and environment variables:

  • Py_SetProgramName()
  • Py_SetPythonHome()
  • PYTHONHOME

The filesystem is also inspected for pyvenv.cfg files (see PEP 405) or, failing that, a lib/os.py (Windows) or lib/python$VERSION/os.py file.

The build time settings for PREFIX and EXEC_PREFIX are also relevant, as are some registry settings on Windows. The hardcoded fallbacks are based on the layout of the CPython source tree and build output when working in a source checkout.

Configuring sys.path

An embedding application may call Py_SetPath() prior to Py_Initialize() to completely override the calculation of sys.path. It is not straightforward to only allow some of the calculations, as modifying sys.path after initialization is already complete means those modifications will not be in effect when standard library modules are imported during the startup sequence.

If Py_SetPath() is not used prior to the first call to Py_GetPath() (implicit in Py_Initialize()), then it builds on the location data calculations above to calculate suitable path entries, along with the PYTHONPATH environment variable.

<TBD: On Windows, there's also a bunch of stuff to do with the registry>

The site module, which is implicitly imported at startup (unless disabled via the -S option) adds additional paths to this initial set of paths, as described in its documentation [5].

The -s command line option can be used to exclude the user site directory from the list of directories added. Embedding applications can control this by setting the Py_NoUserSiteDirectory global variable.

The following commands can be used to check the default path configurations for a given Python executable on a given system:

  • ./python -c "import sys, pprint; pprint.pprint(sys.path)" - standard configuration
  • ./python -s -c "import sys, pprint; pprint.pprint(sys.path)" - user site directory disabled
  • ./python -S -c "import sys, pprint; pprint.pprint(sys.path)" - all site path modifications disabled

(Note: you can see similar information using -m site instead of -c, but this is slightly misleading as it calls os.path.abspath on all of the path entries, making relative path entries look absolute. Using the site module also causes problems in the last case, as on Python versions prior to 3.3, explicitly importing site will carry out the path modifications -S avoids, while on 3.3+ combining -m site with -S currently fails)

The calculation of sys.path[0] is comparatively straightforward:

  • For an ordinary script (Python source or compiled bytecode), sys.path[0] will be the directory containing the script.
  • For a valid sys.path entry (typically a zipfile or directory), sys.path[0] will be that path
  • For an interactive session, running from stdin or when using the -c or -m switches, sys.path[0] will be the empty string, which the import system interprets as allowing imports from the current directory
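The first of these rules can be verified with a short sketch that writes a throwaway script to a temporary directory and checks the sys.path[0] the child interpreter reports:

```python
import os
import subprocess
import sys
import tempfile

# For an ordinary script, sys.path[0] is the directory containing it.
with tempfile.TemporaryDirectory() as tmp:
    script = os.path.join(tmp, "show_path0.py")
    with open(script, "w") as f:
        f.write("import sys; print(sys.path[0])\n")
    path0 = subprocess.run(
        [sys.executable, script], capture_output=True, text=True,
    ).stdout.strip()
    # realpath() smooths over symlinked temp directories on some platforms.
    matches = os.path.realpath(path0) == os.path.realpath(tmp)
print(matches)  # True
```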

Configuring sys.argv

Unlike most other settings discussed in this PEP, sys.argv is not set implicitly by Py_Initialize(). Instead, it must be set via an explicit call to PySys_SetArgv().

CPython calls this in Py_Main() after calling Py_Initialize(). The calculation of sys.argv[1:] is straightforward: they're the command line arguments passed after the script name or the argument to the -c or -m options.

The calculation of sys.argv[0] is a little more complicated:

  • For an ordinary script (source or bytecode), it will be the script name
  • For a sys.path entry (typically a zipfile or directory) it will initially be the zipfile or directory name, but will later be changed by the runpy module to the full path to the imported __main__ module.
  • For a module specified with the -m switch, it will initially be the string "-m", but will later be changed by the runpy module to the full path to the executed module.
  • For a package specified with the -m switch, it will initially be the string "-m", but will later be changed by the runpy module to the full path to the executed __main__ submodule of the package.
  • For a command executed with -c, it will be the string "-c"
  • For explicitly requested input from stdin, it will be the string "-"
  • Otherwise, it will be the empty string

Embedding applications must call PySys_SetArgv themselves. The CPython logic for doing so is part of Py_Main() and is not exposed separately. However, the runpy module does provide roughly equivalent logic in runpy.run_module and runpy.run_path.
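For example, runpy.run_path executes a filesystem path with the same kind of __main__-style setup, returning the resulting module globals:

```python
import os
import runpy
import tempfile

# Write a throwaway script, then execute it via runpy.run_path.
with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
    f.write("result = 6 * 7\n")
    script = f.name
try:
    namespace = runpy.run_path(script)
finally:
    os.unlink(script)
print(namespace["result"])  # 42
```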

Other configuration settings

TBD: Cover the initialization of the following in more detail:

  • Completely disabling the import system
  • The initial warning system state: sys.warnoptions (-W option, PYTHONWARNINGS)
  • Arbitrary extended options (e.g. to automatically enable faulthandler): sys._xoptions (-X option)
  • The filesystem encoding used by sys.getfilesystemencoding(), os.fsencode() and os.fsdecode()
  • The IO encoding and buffering used by sys.stdin, sys.stdout and sys.stderr (-u option, PYTHONIOENCODING, PYTHONUNBUFFEREDIO)
  • Whether or not to implicitly cache bytecode files: sys.dont_write_bytecode (-B option, PYTHONDONTWRITEBYTECODE)
  • Whether or not to enforce correct case in filenames on case-insensitive platforms (os.environ["PYTHONCASEOK"])
  • The other settings exposed to Python code in sys.flags:
    • debug (Enable debugging output in the pgen parser)
    • inspect (Enter interactive interpreter after __main__ terminates)
    • interactive (Treat stdin as a tty)
    • optimize (__debug__ status, write .pyc or .pyo, strip doc strings)
    • no_user_site (don't add the user site directory to sys.path)
    • no_site (don't implicitly import site during startup)
    • ignore_environment (whether environment vars are used during config)
    • verbose (enable all sorts of random output)
    • bytes_warning (warnings/errors for implicit str/bytes interaction)
    • quiet (disable banner output even if verbose is also enabled or stdin is a tty and the interpreter is launched in interactive mode)
  • Whether or not CPython's signal handlers should be installed
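The sys.flags settings listed above can be inspected at runtime from Python code:

```python
import sys

# sys.flags exposes each of the settings listed above as an integer field.
for name in ("debug", "inspect", "interactive", "optimize",
             "no_user_site", "no_site", "ignore_environment",
             "verbose", "bytes_warning", "quiet"):
    print(name, getattr(sys.flags, name))
```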

Much of the configuration of CPython is currently handled through C level global variables:

Py_BytesWarningFlag (-b)
Py_DebugFlag (-d option)
Py_InspectFlag (-i option, PYTHONINSPECT)
Py_InteractiveFlag (property of stdin, cannot be overridden)
Py_OptimizeFlag (-O option, PYTHONOPTIMIZE)
Py_DontWriteBytecodeFlag (-B option, PYTHONDONTWRITEBYTECODE)
Py_NoUserSiteDirectory (-s option, PYTHONNOUSERSITE)
Py_NoSiteFlag (-S option)
Py_UnbufferedStdioFlag (-u, PYTHONUNBUFFEREDIO)
Py_VerboseFlag (-v option, PYTHONVERBOSE)

For the above variables, the conversion of command line options and environment variables to C global variables is handled by Py_Main, so each embedding application must set those appropriately in order to change them from their defaults.

Some configuration can only be provided as OS level environment variables:

PYTHONSTARTUP
PYTHONCASEOK
PYTHONIOENCODING

The Py_InitializeEx() API also accepts a boolean flag to indicate whether or not CPython's signal handlers should be installed.

Finally, some interactive behaviour (such as printing the introductory banner) is triggered only when standard input is reported as a terminal connection by the operating system.

TBD: Document how the "-x" option is handled (skips processing of the first comment line in the main script)

Also see detailed sequence of operations notes at [1]

References

[1]CPython interpreter initialization notes (http://wiki.python.org/moin/CPythonInterpreterInitialization)
[2]BitBucket Sandbox (https://bitbucket.org/ncoghlan/cpython_sandbox/compare/pep432_modular_bootstrap..default#commits)
[3]*nix getpath implementation (http://hg.python.org/cpython/file/default/Modules/getpath.c)
[4]Windows getpath implementation (http://hg.python.org/cpython/file/default/PC/getpathp.c)
[5]Site module documentation (http://docs.python.org/3/library/site.html)
[6]Proposed CLI option for isolated mode (http://bugs.python.org/issue16499)
[7]Adding to sys.path on the command line (http://mail.python.org/pipermail/python-ideas/2010-October/008299.html) (http://mail.python.org/pipermail/python-ideas/2012-September/016128.html)
[8]Control sys.path[0] initialisation (http://bugs.python.org/issue13475)
[9]Enabling code coverage in subprocesses when testing (http://bugs.python.org/issue14803)
[10]Problems with PYTHONIOENCODING in Blender (http://bugs.python.org/issue16129)

pep-0433 Easier suppression of file descriptor inheritance

PEP:433
Title:Easier suppression of file descriptor inheritance
Version:$Revision$
Last-Modified:$Date$
Author:Victor Stinner <victor.stinner at gmail.com>
Status:Superseded
Type:Standards Track
Content-Type:text/x-rst
Created:10-January-2013
Python-Version:3.4
Superseded-By:446

Abstract

Add a new optional cloexec parameter on functions creating file descriptors, add different ways to change default values of this parameter, and add four new functions:

  • os.get_cloexec(fd)
  • os.set_cloexec(fd, cloexec=True)
  • sys.getdefaultcloexec()
  • sys.setdefaultcloexec(cloexec)

Rationale

A file descriptor has a close-on-exec flag which indicates if the file descriptor will be inherited or not.

On UNIX, if the close-on-exec flag is set, the file descriptor is not inherited: it will be closed at the execution of child processes; otherwise the file descriptor is inherited by child processes.

On Windows, if the close-on-exec flag is set, the file descriptor is not inherited; the file descriptor is inherited by child processes if the close-on-exec flag is cleared and CreateProcess() is called with the bInheritHandles parameter set to TRUE (for example, when subprocess.Popen is created with close_fds=False). Windows does not have a "close-on-exec" flag as such, but an inheritance flag with the opposite meaning: setting the close-on-exec flag means clearing the HANDLE_FLAG_INHERIT flag of a handle.

Status in Python 3.3

On UNIX, the subprocess module closes file descriptors greater than 2 by default since Python 3.2 [1]. All file descriptors created by the parent process are automatically closed in the child process.

xmlrpc.server.SimpleXMLRPCServer sets the close-on-exec flag of the listening socket, whereas its parent class, socketserver.TCPServer, does not set this flag.

There are other cases of creating a subprocess or executing a new program where file descriptors are not closed: the os.spawn*() and os.exec*() families of functions, and third party modules calling exec() or fork() + exec(). In these cases, file descriptors are shared between the parent and child processes, which is usually unexpected and causes various issues.

This PEP proposes to continue the work started with the subprocess change in Python 3.2, fixing the issue in any code, not just code using subprocess.

Inherited file descriptors issues

Closing the file descriptor in the parent process does not close the related resource (file, socket, ...) because it is still open in the child process.

The listening socket of TCPServer is not closed on exec(): the child process is able to accept connections from new clients; if the parent closes the listening socket and creates a new listening socket on the same address, it will get an "address already in use" error.

Not closing file descriptors can lead to resource exhaustion: even if the parent closes all files, creating a new file descriptor may fail with "too many files" because files are still open in the child process.

See also the following issues:

Security

Leaking file descriptors is a major security vulnerability. An untrusted child process can read sensitive data like passwords and take control of the parent process through leaked file descriptors. Leaked file descriptors are, for example, a known way to escape from a chroot.

See also the CERT recommendation: FIO42-C. Ensure files are properly closed when they are no longer needed.

Examples of vulnerabilities:

Atomicity

Using fcntl() to set the close-on-exec flag is not safe in a multithreaded application. If a thread calls fork() and exec() between the creation of the file descriptor and the call to fcntl(fd, F_SETFD, new_flags), the file descriptor will be inherited by the child process. Modern operating systems offer functions to set the flag during the creation of the file descriptor, which avoids the race condition.

Portability

Python 3.2 added the socket.SOCK_CLOEXEC flag; Python 3.3 added the os.O_CLOEXEC flag and the os.pipe2() function. In Python 3.3 it is thus already possible to atomically set the close-on-exec flag when opening a file and creating a pipe or socket.

The problem is that these flags and functions are not portable: only recent versions of operating systems support them. The O_CLOEXEC and SOCK_CLOEXEC flags are ignored by old Linux versions, so the FD_CLOEXEC flag must be checked using fcntl(fd, F_GETFD). If the kernel ignores the O_CLOEXEC or SOCK_CLOEXEC flag, a call to fcntl(fd, F_SETFD, flags) is required to set the close-on-exec flag.
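The fallback dance looks like this in Python (a POSIX-only sketch using the fcntl module):

```python
import fcntl
import os

# Request O_CLOEXEC at open() time when available, then verify via
# fcntl() and set FD_CLOEXEC manually if the kernel ignored the flag.
fd = os.open(os.devnull, os.O_RDONLY | getattr(os, "O_CLOEXEC", 0))
flags = fcntl.fcntl(fd, fcntl.F_GETFD)
if not flags & fcntl.FD_CLOEXEC:
    fcntl.fcntl(fd, fcntl.F_SETFD, flags | fcntl.FD_CLOEXEC)
cloexec = bool(fcntl.fcntl(fd, fcntl.F_GETFD) & fcntl.FD_CLOEXEC)
os.close(fd)
print(cloexec)  # True
```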

Note

OpenBSD older than 5.2 does not close the file descriptor with the close-on-exec flag set if fork() is used before exec(), but it works correctly if exec() is called without fork(). Try openbsd_bug.py.

Scope

Applications still have to explicitly close file descriptors after a fork(). The close-on-exec flag only closes file descriptors after exec(), and so after fork() + exec().

This PEP only changes the close-on-exec flag of file descriptors created by the Python standard library, or by modules using the standard library. Third party modules not using the standard library should be modified to conform to this PEP; the new os.set_cloexec() function can be used, for example.

Note

See Close file descriptors after fork for a possible solution for fork() without exec().

Proposal

Add a new optional cloexec parameter on functions creating file descriptors and different ways to change default value of this parameter.

Add new functions:

  • os.get_cloexec(fd:int) -> bool: get the close-on-exec flag of a file descriptor. Not available on all platforms.
  • os.set_cloexec(fd:int, cloexec:bool=True): set or clear the close-on-exec flag on a file descriptor. Not available on all platforms.
  • sys.getdefaultcloexec() -> bool: get the current default value of the cloexec parameter
  • sys.setdefaultcloexec(cloexec: bool): set the default value of the cloexec parameter
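As a POSIX-only sketch of what the proposed os.get_cloexec() and os.set_cloexec() could do internally, the fcntl module already provides the necessary primitives (illustrative only; this PEP was ultimately superseded by PEP 446):

```python
import fcntl
import os

def get_cloexec(fd):
    # Report whether FD_CLOEXEC is set on the descriptor.
    return bool(fcntl.fcntl(fd, fcntl.F_GETFD) & fcntl.FD_CLOEXEC)

def set_cloexec(fd, cloexec=True):
    # Set or clear FD_CLOEXEC, preserving any other descriptor flags.
    flags = fcntl.fcntl(fd, fcntl.F_GETFD)
    if cloexec:
        flags |= fcntl.FD_CLOEXEC
    else:
        flags &= ~fcntl.FD_CLOEXEC
    fcntl.fcntl(fd, fcntl.F_SETFD, flags)

r, w = os.pipe()
set_cloexec(r)
flag_after_set = get_cloexec(r)
set_cloexec(r, False)
flag_after_clear = get_cloexec(r)
os.close(r)
os.close(w)
print(flag_after_set, flag_after_clear)  # True False
```

Note that this non-atomic fcntl() approach is exactly what the Atomicity section warns about; the proposed stdlib functions exist to hide the platform-specific atomic alternatives where they are available.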

Add a new optional cloexec parameter to:

  • asyncore.dispatcher.create_socket()
  • io.FileIO
  • io.open()
  • open()
  • os.dup()
  • os.dup2()
  • os.fdopen()
  • os.open()
  • os.openpty()
  • os.pipe()
  • select.devpoll()
  • select.epoll()
  • select.kqueue()
  • socket.socket()
  • socket.socket.accept()
  • socket.socket.dup()
  • socket.socket.fromfd
  • socket.socketpair()

The default value of the cloexec parameter is sys.getdefaultcloexec().

Add a new command line option -e and an environment variable PYTHONCLOEXEC to set the close-on-exec flag by default.

subprocess clears the close-on-exec flag of file descriptors of the pass_fds parameter.

All functions creating file descriptors in the standard library must respect the default value of the cloexec parameter: sys.getdefaultcloexec().

File descriptors 0 (stdin), 1 (stdout) and 2 (stderr) are expected to be inherited, but Python does not handle them differently. When os.dup2() is used to replace standard streams, cloexec=False must be specified explicitly.

Drawbacks of the proposal:

  • It is no longer possible to know whether the close-on-exec flag will be set on a newly created file descriptor just by reading the source code.
  • If the inheritance of a file descriptor matters, the cloexec parameter must now be specified explicitly, or the library or the application will not work depending on the default value of the cloexec parameter.

Alternatives

Inheritance enabled by default, default not configurable

Add a new optional cloexec parameter on functions creating file descriptors. The default value of the cloexec parameter is False, and this default cannot be changed. Inheritance enabled by default also matches the default behaviour on POSIX and on Windows. This alternative is the most conservative option.

This option does not solve the issues listed in the Rationale section; it only provides a helper to fix them. All functions creating file descriptors would have to be modified to set cloexec=True in each module used by an application to fix all these issues.

Inheritance enabled by default, default can only be set to True

This alternative is based on the proposal: the only difference is that sys.setdefaultcloexec() does not take any argument, it can only be used to set the default value of the cloexec parameter to True.

Disable inheritance by default

This alternative is based on the proposal: the only difference is that the default value of the cloexec parameter is True (instead of False).

If a file must be inherited by child processes, cloexec=False parameter can be used.

Advantages of setting close-on-exec flag by default:

Drawbacks of setting close-on-exec flag by default:

  • It violates the principle of least surprise. Developers using the os module may expect that Python respects the POSIX standard and so that close-on-exec flag is not set by default.
  • The os module is written as a thin wrapper to system calls (to functions of the C standard library). If atomic flags to set close-on-exec flag are not supported (see Appendix: Operating system support), a single Python function call may call 2 or 3 system calls (see Performances section).
  • Extra system calls, if any, may slow down Python: see Performances.

Backward compatibility: only a few programs rely on inheritance of file descriptors, and they only pass a few file descriptors, usually just one. These programs will fail immediately with an EBADF error, and it will be simple to fix them: add the cloexec=False parameter or use os.set_cloexec(fd, False).

The subprocess module will be changed anyway to clear the close-on-exec flag on file descriptors listed in the pass_fds parameter of the Popen constructor. So it is possible that these programs will not need any fix if they use the subprocess module.

Close file descriptors after fork

This PEP does not fix issues with applications using fork() without exec(). Python needs a generic mechanism to register callbacks which would be called after a fork; see #16500: Add an atfork module [2]. Such a registry could be used to close file descriptors just after a fork().

Drawbacks:

  • It does not solve the problem on Windows: fork() does not exist on Windows.
  • This alternative does not solve the problem for programs using exec() without fork().
  • A third party module may call the C function fork() directly, which will not call the "atfork" callbacks.
  • All functions creating file descriptors must be changed to register a callback and then unregister it when the file is closed. Alternatively, a list of all open file descriptors must be maintained.
  • The operating system is a better place than Python to close file descriptors automatically. For example, it is not easy to avoid a race condition between closing the file and unregistering the callback that closes the file.

open(): add "e" flag to mode

A new "e" mode would set the close-on-exec flag (best-effort).

This alternative only solves the problem for open(); socket.socket() and os.pipe() do not have a mode parameter, for example.

Since version 2.7, the GNU libc supports the "e" flag for fopen(). It uses O_CLOEXEC if available, or falls back to fcntl(fd, F_SETFD, FD_CLOEXEC). With Visual Studio, fopen() accepts an "N" flag which uses O_NOINHERIT.

Bikeshedding on the name of the new parameter

  • inherit, inherited: closer to Windows definition
  • sensitive
  • sterile: "Does not produce offspring."

Applications using inheritance of file descriptors

Most developers don't know that file descriptors are inherited by default. Most programs do not rely on inheritance of file descriptors. For example, subprocess.Popen was changed in Python 3.2 to close all file descriptors greater than 2 in the child process by default. No user has complained about this behavior change yet.

Network servers using fork may want to pass the client socket to the child process. For example, on UNIX a CGI server passes the client socket through file descriptors 0 (stdin) and 1 (stdout) using dup2().

To access a restricted resource, like creating a socket listening on a TCP port lower than 1024 or reading a file containing sensitive data like passwords, a common practice is: start as the root user, create a file descriptor, create a child process, drop privileges (e.g. change the current user), pass the file descriptor to the child process, and exit the parent process.

Security is very important in such a use case: leaking another file descriptor would be a critical security vulnerability (see Security). The root process may not exit but monitor the child process instead, restarting a new child process and passing the same file descriptor if the previous child process crashed.

Examples of programs taking file descriptors from the parent process using a command line option:

  • gpg: --status-fd <fd>, --logger-fd <fd>, etc.
  • openssl: -pass fd:<fd>
  • qemu: -add-fd <fd>
  • valgrind: --log-fd=<fd>, --input-fd=<fd>, etc.
  • xterm: -S <fd>

On Linux, it is possible to use the "/dev/fd/<fd>" filename to pass a file descriptor to a program expecting a filename.
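In Python 3, the subprocess module cooperates with such programs through its pass_fds parameter (mentioned above). A minimal sketch, assuming a POSIX system, of handing a pipe's read end to a child process:

```python
import os
import subprocess
import sys

# Create a pipe and write some data into it from the parent.
r, w = os.pipe()
os.write(w, b"hello from parent")
os.close(w)

# subprocess.Popen closes descriptors above 2 by default (close_fds=True),
# so the inherited descriptor must be listed explicitly in pass_fds.
child = subprocess.Popen(
    [sys.executable, "-c",
     "import os, sys; print(os.read(int(sys.argv[1]), 100).decode())",
     str(r)],
    pass_fds=(r,),            # keep this descriptor open in the child
    stdout=subprocess.PIPE)
out, _ = child.communicate()
os.close(r)
print(out.decode().strip())   # hello from parent
```

Since Python 3.4, descriptors listed in pass_fds are made inheritable automatically, which matches the behavior this PEP anticipates for the subprocess module.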

Performances

Setting the close-on-exec flag may require additional system calls for each creation of a new file descriptor. The number of additional system calls depends on the method used to set the flag:

  • O_NOINHERIT: no additional system call
  • O_CLOEXEC: one additional system call, but only at the creation of the first file descriptor, to check if the flag is supported. If the flag is not supported, Python has to fall back to the next method.
  • ioctl(fd, FIOCLEX): one additional system call per file descriptor
  • fcntl(fd, F_SETFD, flags): two additional system calls per file descriptor, one to get old flags and one to set new flags

On Linux, setting the close-on-exec flag has a low performance overhead. Results of bench_cloexec.py on Linux 3.6:

  • close-on-exec flag not set: 7.8 us
  • O_CLOEXEC: 1% slower (7.9 us)
  • ioctl(): 3% slower (8.0 us)
  • fcntl(): 3% slower (8.0 us)
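The figures above come from bench_cloexec.py, which is not reproduced here. A rough, hypothetical reconstruction of that kind of micro-benchmark, measuring only the fcntl() fallback path against /dev/null, could look like:

```python
import os
import fcntl
import timeit

def open_close():
    # Baseline: create and destroy a descriptor, no close-on-exec flag.
    fd = os.open("/dev/null", os.O_RDONLY)
    os.close(fd)

def open_fcntl_close():
    # fcntl() method: two extra syscalls (F_GETFD + F_SETFD) per descriptor.
    fd = os.open("/dev/null", os.O_RDONLY)
    flags = fcntl.fcntl(fd, fcntl.F_GETFD)
    fcntl.fcntl(fd, fcntl.F_SETFD, flags | fcntl.FD_CLOEXEC)
    os.close(fd)

n = 10000
base = timeit.timeit(open_close, number=n) / n * 1e6
with_fcntl = timeit.timeit(open_fcntl_close, number=n) / n * 1e6
print("baseline: %.2f us, fcntl: %.2f us" % (base, with_fcntl))
```

Absolute numbers depend heavily on the kernel and hardware; only the relative difference between the two loops is meaningful.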

Implementation

os.get_cloexec(fd)

Get the close-on-exec flag of a file descriptor.

Pseudo-code:

if os.name == 'nt':
    def get_cloexec(fd):
        handle = _winapi._get_osfhandle(fd)
        flags = _winapi.GetHandleInformation(handle)
        return not(flags & _winapi.HANDLE_FLAG_INHERIT)
else:
    try:
        import fcntl
    except ImportError:
        pass
    else:
        def get_cloexec(fd):
            flags = fcntl.fcntl(fd, fcntl.F_GETFD)
            return bool(flags & fcntl.FD_CLOEXEC)

os.set_cloexec(fd, cloexec=True)

Set or clear the close-on-exec flag on a file descriptor. The flag is set after the creation of the file descriptor, so the operation is not atomic.

Pseudo-code:

if os.name == 'nt':
    def set_cloexec(fd, cloexec=True):
        handle = _winapi._get_osfhandle(fd)
        mask = _winapi.HANDLE_FLAG_INHERIT
        if cloexec:
            flags = 0
        else:
            flags = mask
        _winapi.SetHandleInformation(handle, mask, flags)
else:
    fcntl = None
    ioctl = None
    try:
        import ioctl
    except ImportError:
        try:
            import fcntl
        except ImportError:
            pass
    if ioctl is not None and hasattr(ioctl, 'FIOCLEX'):
        def set_cloexec(fd, cloexec=True):
            if cloexec:
                ioctl.ioctl(fd, ioctl.FIOCLEX)
            else:
                ioctl.ioctl(fd, ioctl.FIONCLEX)
    elif fcntl is not None:
        def set_cloexec(fd, cloexec=True):
            flags = fcntl.fcntl(fd, fcntl.F_GETFD)
            if cloexec:
                flags |= fcntl.FD_CLOEXEC
            else:
                flags &= ~fcntl.FD_CLOEXEC
            fcntl.fcntl(fd, fcntl.F_SETFD, flags)

ioctl is preferred over fcntl because it requires only one syscall, instead of two syscalls for fcntl.

Note

fcntl(fd, F_SETFD, flags) currently supports only one flag (FD_CLOEXEC), so it would be possible to avoid fcntl(fd, F_GETFD). But new flags may be added in the future, so it is safer to keep the two function calls.
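The F_GETFD/F_SETFD pair from the note can be exercised directly on a real descriptor (a short runnable demonstration, not part of the proposed API):

```python
import os
import fcntl

# Set, then read back, the close-on-exec flag with the two-syscall
# fcntl() method described in the pseudo-code above.
r, w = os.pipe()
flags = fcntl.fcntl(r, fcntl.F_GETFD)
fcntl.fcntl(r, fcntl.F_SETFD, flags | fcntl.FD_CLOEXEC)
print(bool(fcntl.fcntl(r, fcntl.F_GETFD) & fcntl.FD_CLOEXEC))  # True
os.close(r)
os.close(w)
```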

Note

The fopen() function of the GNU libc ignores the error if fcntl(fd, F_SETFD, flags) fails.

open()

  • Windows: open() with O_NOINHERIT flag [atomic]
  • open() with O_CLOEXEC flag [atomic]
  • open() + os.set_cloexec(fd, True) [best-effort]
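The "atomic if possible, best-effort otherwise" chain above can be sketched as a small helper (open_cloexec is a hypothetical name used for illustration, not a proposed API):

```python
import os
import fcntl

def open_cloexec(path, flags):
    # Preferred: atomic O_CLOEXEC at open() time, where available.
    if hasattr(os, "O_CLOEXEC"):
        fd = os.open(path, flags | os.O_CLOEXEC)
        # On Linux older than 2.6.23 the flag is silently ignored,
        # so verify that it was really applied.
        if fcntl.fcntl(fd, fcntl.F_GETFD) & fcntl.FD_CLOEXEC:
            return fd
    else:
        fd = os.open(path, flags)
    # Best-effort fallback: set the flag after creation (not atomic).
    old = fcntl.fcntl(fd, fcntl.F_GETFD)
    fcntl.fcntl(fd, fcntl.F_SETFD, old | fcntl.FD_CLOEXEC)
    return fd

fd = open_cloexec("/dev/null", os.O_RDONLY)
print(bool(fcntl.fcntl(fd, fcntl.F_GETFD) & fcntl.FD_CLOEXEC))  # True
os.close(fd)
```

The window between open() and fcntl() in the fallback path is exactly the race condition that the atomic flags eliminate.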

os.dup()

  • Windows: DuplicateHandle() [atomic]
  • fcntl(fd, F_DUPFD_CLOEXEC) [atomic]
  • dup() + os.set_cloexec(fd, True) [best-effort]

os.dup2()

  • fcntl(fd, F_DUP2FD_CLOEXEC, fd2) [atomic]
  • dup3() with O_CLOEXEC flag [atomic]
  • dup2() + os.set_cloexec(fd2, True) [best-effort]

os.pipe()

  • Windows: CreatePipe() with SECURITY_ATTRIBUTES.bInheritHandle=FALSE, or _pipe() with O_NOINHERIT flag [atomic]
  • pipe2() with O_CLOEXEC flag [atomic]
  • pipe() + os.set_cloexec(fd, True) [best-effort]

socket.socket()

  • Windows: WSASocket() with WSA_FLAG_NO_HANDLE_INHERIT flag [atomic]
  • socket() with SOCK_CLOEXEC flag [atomic]
  • socket() + os.set_cloexec(fd, True) [best-effort]

socket.socketpair()

  • socketpair() with SOCK_CLOEXEC flag [atomic]
  • socketpair() + os.set_cloexec(fd, True) [best-effort]

socket.socket.accept()

  • accept4() with SOCK_CLOEXEC flag [atomic]
  • accept() + os.set_cloexec(fd, True) [best-effort]

Backward compatibility

There is no backward incompatible change. The default behaviour is unchanged: the close-on-exec flag is not set by default.

Appendix: Operating system support

Windows

Windows has an O_NOINHERIT flag: "Do not inherit in child processes".

For example, it is supported by open() and _pipe().

The flag can be cleared using SetHandleInformation(fd, HANDLE_FLAG_INHERIT, 0).

CreateProcess() has a bInheritHandles parameter: if it is FALSE, the handles are not inherited. If it is TRUE, handles with the HANDLE_FLAG_INHERIT flag set are inherited. subprocess.Popen uses the close_fds option to define bInheritHandles.

ioctl

Functions:

  • ioctl(fd, FIOCLEX, 0): set the close-on-exec flag
  • ioctl(fd, FIONCLEX, 0): clear the close-on-exec flag

Availability: Linux, Mac OS X, QNX, NetBSD, OpenBSD, FreeBSD.

fcntl

Functions:

  • flags = fcntl(fd, F_GETFD); fcntl(fd, F_SETFD, flags | FD_CLOEXEC): set the close-on-exec flag
  • flags = fcntl(fd, F_GETFD); fcntl(fd, F_SETFD, flags & ~FD_CLOEXEC): clear the close-on-exec flag

Availability: AIX, Digital UNIX, FreeBSD, HP-UX, IRIX, Linux, Mac OS X, OpenBSD, Solaris, SunOS, Unicos.

Atomic flags

New flags:

  • O_CLOEXEC: available on Linux (2.6.23), FreeBSD (8.3), OpenBSD 5.0, Solaris 11, QNX, BeOS, next NetBSD release (6.1?). This flag is part of POSIX.1-2008.
  • SOCK_CLOEXEC flag for socket() and socketpair(), available on Linux 2.6.27, OpenBSD 5.2, NetBSD 6.0.
  • WSA_FLAG_NO_HANDLE_INHERIT flag for WSASocket(): supported on Windows 7 with SP1, Windows Server 2008 R2 with SP1, and later
  • fcntl(): F_DUPFD_CLOEXEC flag, available on Linux 2.6.24, OpenBSD 5.0, FreeBSD 9.1, NetBSD 6.0, Solaris 11. This flag is part of POSIX.1-2008.
  • fcntl(): F_DUP2FD_CLOEXEC flag, available on FreeBSD 9.1 and Solaris 11.
  • recvmsg(): MSG_CMSG_CLOEXEC, available on Linux 2.6.23, NetBSD 6.0.

On Linux older than 2.6.23, the O_CLOEXEC flag is simply ignored. So we have to check that the flag is supported by calling fcntl(). If it does not work, we have to set the flag using ioctl() or fcntl().

On Linux older than 2.6.27, if the SOCK_CLOEXEC flag is set in the socket type, socket() or socketpair() fail and errno is set to EINVAL.

On Windows XP SP3, WSASocket() fails with WSAEPROTOTYPE when the WSA_FLAG_NO_HANDLE_INHERIT flag is used.

New functions:

  • dup3(): available on Linux 2.6.27 (and glibc 2.9)
  • pipe2(): available on Linux 2.6.27 (and glibc 2.9)
  • accept4(): available on Linux 2.6.28 (and glibc 2.10)

If accept4() is called on Linux older than 2.6.28, accept4() returns -1 (fail) and errno is set to ENOSYS.

Footnotes

[1]On UNIX since Python 3.2, subprocess.Popen() closes all file descriptors by default: close_fds=True. It closes file descriptors in range 3 inclusive to local_max_fd exclusive, where local_max_fd is fcntl(0, F_MAXFD) on NetBSD, or sysconf(_SC_OPEN_MAX) otherwise. If the error pipe has a descriptor smaller than 3, ValueError is raised.

pep-0434 IDLE Enhancement Exception for All Branches

PEP:434
Title:IDLE Enhancement Exception for All Branches
Version:$Revision$
Last-Modified:$Date$
Author:Todd Rovito <rovitotv at gmail.com>, Terry Reedy <tjreedy at udel.edu>
BDFL-Delegate:Nick Coghlan
Status:Active
Type:Informational
Content-Type:text/x-rst
Created:16-Feb-2013
Post-History:16-Feb-2013 03-Mar-2013 21-Mar-2013 30-Mar-2013
Resolution:http://mail.python.org/pipermail/python-dev/2013-March/125003.html

Abstract

Most CPython tracker issues are classified as behavior or enhancement. Most behavior patches are backported to branches for existing versions. Enhancement patches are restricted to the default branch that becomes the next Python version.

This PEP proposes that the restriction on applying enhancements be relaxed for IDLE code, residing in .../Lib/idlelib/. In practice, this would mean that IDLE developers would not have to classify or agree on the classification of a patch but could instead focus on what is best for IDLE users and future IDLE development. It would also mean that IDLE patches would not necessarily have to be split into 'bugfix' changes and enhancement changes.

The PEP would apply to changes in existing features and addition of small features, such as would require a new menu entry, but not necessarily to possible major re-writes such as switching to themed widgets or tabbed windows.

Motivation

This PEP was prompted by controversy on both the tracker and pydev list over adding Cut, Copy, and Paste to right-click context menus (Issue 1207589, opened in 2005 [1]; pydev thread [2]). The features were available as keyboard shortcuts but not on the context menu. It is standard, at least on Windows, that they should be when applicable (a read-only window would only have Copy), so users do not have to shift to the keyboard after selecting text for cutting or copying or a slice point for pasting. The context menu was not documented until 10 days before the new options were added (Issue 10405 [5]).

Normally, behavior is called a bug if it conflicts with documentation judged to be correct. But if there is no documentation, what is the standard? If the code is its own documentation, most IDLE issues on the tracker are enhancement issues. If we substitute reasonable user expectation (which can, of course, be its own subject of disagreement), many more issues are behavior issues.

For context menus, people disagreed on the status of the additions -- bugfix or enhancement. Even people who called it an enhancement disagreed as to whether the patch should be backported. This PEP proposes to make the status disagreement irrelevant by explicitly allowing more liberal backporting than for other stdlib modules.

Python does have many advanced features, yet Python is well known for being an easy computer language for beginners [3]. A major Python philosophy is "batteries included", which is best demonstrated in Python's standard library with many modules that are not typically included with other programming languages [4]. IDLE is an important "battery" in the Python toolbox because it allows a beginner to get started quickly without downloading and configuring a third party IDE. IDLE represents a commitment by the Python community to encourage the use of Python as a teaching language both inside and outside of formal educational settings. The recommended teaching experience is to have a learner start with IDLE. This PEP and the work that it will enable will allow the Python community to make that learner's experience with IDLE awesome by making IDLE a simple tool for beginners to get started with Python.

Rationale

People primarily use IDLE by running the graphical user interface (GUI) application, rather than by directly importing the effectively private (undocumented) implementation modules in idlelib. Whether they use the shell, the editor, or both, we believe they will benefit more from consistency across the latest releases of current Python versions than from consistency within the bugfix releases for one Python version. This is especially true when existing behavior is clearly unsatisfactory.

When people use the standard interpreter, the OS-provided frame works the same for all Python versions. If, for instance, Microsoft were to upgrade the Command Prompt GUI, the improvements would be present regardless of which Python were running within it. Similarly, if one edits Python code with editor X, behaviors such as the right-click context menu and the search-replace box do not depend on the version of Python being edited or even the language being edited.

The benefit for IDLE developers is mixed. On the one hand, testing more versions and possibly having to adjust a patch, especially for 2.7, is more work. (There is, of course, the option of not backporting everything. For issue 12510, some changes to calltips for classes were not included in the 2.7 patch because of issues with old-style classes [6].) On the other hand, bike-shedding can be an energy drain. If the obvious fix for a bug looks like an enhancement, writing a separate bugfix-only patch is more work. And making the code diverge between versions makes future multi-version patches more difficult.

These issues are illustrated by the search-and-replace dialog box. It used to raise an exception for certain user entries [7]. The uncaught exception caused IDLE to exit. At least on Windows, the exit was silent (no visible traceback) and looked like a crash if IDLE was started normally, from an icon.

Was this a bug? IDLE Help (on the current Help submenu) just says "Replace... Open a search-and-replace dialog box", and a box was opened. It is not, in general, a bug for a library method to raise an exception. And it is not, in general, a bug for a library method to ignore an exception raised by functions it calls. So if we were to adopt the 'code = doc' philosophy in the absence of detailed docs, one might say 'No'.

However, IDLE exiting when it does not need to is definitely obnoxious. So four of us agreed that it should be prevented. But there was still the question of what to do instead? Catch the exception? Just not raise the exception? Beep? Display an error message box? Or try to do something useful with the user's entry? Would replacing a 'crash' with useful behavior be an enhancement, limited to future Python releases? Should IDLE developers have to ask that?

Backwards Compatibility

For IDLE, there are three types of users who might be concerned about back compatibility. First are people who run IDLE as an application. We have already discussed them above.

Second are people who import one of the idlelib modules. As far as we know, this is only done to start the IDLE application, and we do not propose breaking such use. Otherwise, the modules are undocumented and effectively private implementations. If an IDLE module were defined as public, documented, and perhaps moved to the tkinter package, it would then follow the normal rules. (Documenting the private interfaces for the benefit of people working on the IDLE code is a separate issue.)

Third are people who write IDLE extensions. The guaranteed extension interface is given in idlelib/extension.txt. This should be respected at least in existing versions, and not frivolously changed in future versions. But there is a warning that "The extension cannot assume much about this [EditorWindow] argument." This guarantee should rarely be an issue with patches, and the issue is not specific to 'enhancement' versus 'bugfix' patches.

As it happens, after the context menu patch was applied, it came up that extensions that added items to the context menu (rare) would be broken because the patch a) added a new item to standard rmenu_specs and b) expected every rmenu_spec to be lengthened. It is not clear whether this violates the guarantee, but there is a second patch that fixes assumption b). It should be applied when it is clear that the first patch will not have to be reverted.

References

[1]IDLE: Right Click Context Menu, Foord, Michael (http://bugs.python.org/issue1207589)
[2]Cut/Copy/Paste items in IDLE right click context menu (http://mail.python.org/pipermail/python-dev/2012-November/122514.html)
[3]Getting Started with Python (http://www.python.org/about/gettingstarted/)
[4]Batteries Included (http://docs.python.org/2/tutorial/stdlib.html#batteries-included)
[5]IDLE breakpoint facility undocumented, Deily, Ned (http://bugs.python.org/issue10405)
[6]IDLE: calltips mishandle raw strings and other examples, Reedy, Terry (http://bugs.python.org/issue12510)
[7]IDLE: replace ending with '' causes crash, Reedy, Terry (http://bugs.python.org/issue13052)

pep-0435 Adding an Enum type to the Python standard library

PEP:435
Title:Adding an Enum type to the Python standard library
Version:$Revision$
Last-Modified:$Date$
Author:Barry Warsaw <barry at python.org>, Eli Bendersky <eliben at gmail.com>, Ethan Furman <ethan at stoneleaf.us>
Status:Final
Type:Standards Track
Content-Type:text/x-rst
Created:2013-02-23
Python-Version:3.4
Post-History:2013-02-23, 2013-05-02
Resolution:http://mail.python.org/pipermail/python-dev/2013-May/126112.html

Abstract

This PEP proposes adding an enumeration type to the Python standard library.

An enumeration is a set of symbolic names bound to unique, constant values. Within an enumeration, the values can be compared by identity, and the enumeration itself can be iterated over.

Status of discussions

The idea of adding an enum type to Python is not new - PEP 354 [2] is a previous attempt that was rejected in 2005. Recently a new set of discussions was initiated [3] on the python-ideas mailing list. Many new ideas were proposed in several threads; after a lengthy discussion Guido proposed adding flufl.enum to the standard library [4]. During the PyCon 2013 language summit the issue was discussed further. It became clear that many developers want to see an enum that subclasses int, which can allow us to replace many integer constants in the standard library by enums with friendly string representations, without ceding backwards compatibility. An additional discussion among several interested core developers led to the proposal of having IntEnum as a special case of Enum.

The key dividing issue between Enum and IntEnum is whether comparing to integers is semantically meaningful. For most uses of enumerations, it's a feature to reject comparison to integers; enums that compare to integers lead, through transitivity, to comparisons between enums of unrelated types, which isn't desirable in most cases. For some uses, however, greater interoperability with integers is desired. For instance, this is the case for replacing existing standard library constants (such as socket.AF_INET) with enumerations.

Further discussion in late April 2013 led to the conclusion that enumeration members should belong to the type of their enum: type(Color.red) == Color. Guido has pronounced a decision on this issue [5], as well as on the related issue of disallowing subclassing of enums [6], unless they define no enumeration members [7].

The PEP was accepted by Guido on May 10th, 2013 [1].

Motivation

[Based partly on the Motivation stated in PEP 354]

The properties of an enumeration are useful for defining an immutable, related set of constant values that may or may not have a semantic meaning. Classic examples are days of the week (Sunday through Saturday) and school assessment grades ('A' through 'D', and 'F'). Other examples include error status values and states within a defined process.

It is possible to simply define a sequence of values of some other basic type, such as int or str, to represent discrete arbitrary values. However, an enumeration ensures that such values are distinct from any others including, importantly, values within other enumerations, and that operations without meaning ("Wednesday times two") are not defined for these values. It also provides a convenient printable representation of enum values without requiring tedious repetition while defining them (i.e. no GREEN = 'green').

Module and type name

We propose to add a module named enum to the standard library. The main type exposed by this module is Enum. Hence, to import the Enum type user code will run:

>>> from enum import Enum

Proposed semantics for the new enumeration type

Creating an Enum

Enumerations are created using the class syntax, which makes them easy to read and write. An alternative creation method is described in Functional API. To define an enumeration, subclass Enum as follows:

>>> from enum import Enum
>>> class Color(Enum):
...     red = 1
...     green = 2
...     blue = 3

A note on nomenclature: we call Color an enumeration (or enum) and Color.red, Color.green are enumeration members (or enum members). Enumeration members also have values (the value of Color.red is 1, etc.)

Enumeration members have human readable string representations:

>>> print(Color.red)
Color.red

...while their repr has more information:

>>> print(repr(Color.red))
<Color.red: 1>

The type of an enumeration member is the enumeration it belongs to:

>>> type(Color.red)
<Enum 'Color'>
>>> isinstance(Color.green, Color)
True
>>>

Enums also have a property that contains just their item name:

>>> print(Color.red.name)
red

Enumerations support iteration, in definition order:

>>> class Shake(Enum):
...   vanilla = 7
...   chocolate = 4
...   cookies = 9
...   mint = 3
...
>>> for shake in Shake:
...   print(shake)
...
Shake.vanilla
Shake.chocolate
Shake.cookies
Shake.mint

Enumeration members are hashable, so they can be used in dictionaries and sets:

>>> apples = {}
>>> apples[Color.red] = 'red delicious'
>>> apples[Color.green] = 'granny smith'
>>> apples
{<Color.red: 1>: 'red delicious', <Color.green: 2>: 'granny smith'}

Programmatic access to enumeration members

Sometimes it's useful to access members in enumerations programmatically (i.e. situations where Color.red won't do because the exact color is not known at program-writing time). Enum allows such access:

>>> Color(1)
<Color.red: 1>
>>> Color(3)
<Color.blue: 3>

If you want to access enum members by name, use item access:

>>> Color['red']
<Color.red: 1>
>>> Color['green']
<Color.green: 2>

Duplicating enum members and values

Having two enum members with the same name is invalid:

>>> class Shape(Enum):
...   square = 2
...   square = 3
...
Traceback (most recent call last):
...
TypeError: Attempted to reuse key: square

However, two enum members are allowed to have the same value. Given two members A and B with the same value (and A defined first), B is an alias to A. By-value lookup of the value of A and B will return A. By-name lookup of B will also return A:

>>> class Shape(Enum):
...   square = 2
...   diamond = 1
...   circle = 3
...   alias_for_square = 2
...
>>> Shape.square
<Shape.square: 2>
>>> Shape.alias_for_square
<Shape.square: 2>
>>> Shape(2)
<Shape.square: 2>

Iterating over the members of an enum does not provide the aliases:

>>> list(Shape)
[<Shape.square: 2>, <Shape.diamond: 1>, <Shape.circle: 3>]

The special attribute __members__ is an ordered dictionary mapping names to members. It includes all names defined in the enumeration, including the aliases:

>>> for name, member in Shape.__members__.items():
...   name, member
...
('square', <Shape.square: 2>)
('diamond', <Shape.diamond: 1>)
('circle', <Shape.circle: 3>)
('alias_for_square', <Shape.square: 2>)

The __members__ attribute can be used for detailed programmatic access to the enumeration members. For example, finding all the aliases:

>>> [name for name, member in Shape.__members__.items() if member.name != name]
['alias_for_square']

Comparisons

Enumeration members are compared by identity:

>>> Color.red is Color.red
True
>>> Color.red is Color.blue
False
>>> Color.red is not Color.blue
True

Ordered comparisons between enumeration values are not supported. Enums are not integers (but see IntEnum below):

>>> Color.red < Color.blue
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unorderable types: Color() < Color()

Equality comparisons are defined though:

>>> Color.blue == Color.red
False
>>> Color.blue != Color.red
True
>>> Color.blue == Color.blue
True

Comparisons against non-enumeration values will always compare not equal (again, IntEnum was explicitly designed to behave differently, see below):

>>> Color.blue == 2
False

Allowed members and attributes of enumerations

The examples above use integers for enumeration values. Using integers is short and handy (and provided by default by the Functional API), but not strictly enforced. In the vast majority of use-cases, one doesn't care what the actual value of an enumeration is. But if the value is important, enumerations can have arbitrary values.

Enumerations are Python classes, and can have methods and special methods as usual. If we have this enumeration:

class Mood(Enum):
  funky = 1
  happy = 3

  def describe(self):
    # self is the member here
    return self.name, self.value

  def __str__(self):
    return 'my custom str! {0}'.format(self.value)

  @classmethod
  def favorite_mood(cls):
    # cls here is the enumeration
    return cls.happy

Then:

>>> Mood.favorite_mood()
<Mood.happy: 3>
>>> Mood.happy.describe()
('happy', 3)
>>> str(Mood.funky)
'my custom str! 1'

The rules for what is allowed are as follows: all attributes defined within an enumeration will become members of this enumeration, with the exception of __dunder__ names and descriptors [9]; methods are descriptors too.

Restricted subclassing of enumerations

Subclassing an enumeration is allowed only if the enumeration does not define any members. So this is forbidden:

>>> class MoreColor(Color):
...   pink = 17
...
TypeError: Cannot extend enumerations

But this is allowed:

>>> class Foo(Enum):
...   def some_behavior(self):
...     pass
...
>>> class Bar(Foo):
...   happy = 1
...   sad = 2
...

The rationale for this decision was given by Guido in [6]. Allowing subclassing of enums that define members would lead to a violation of some important invariants of types and instances. On the other hand, it makes sense to allow sharing some common behavior between a group of enumerations, and subclassing empty enumerations is also used to implement IntEnum.

IntEnum

A variation of Enum is proposed which is also a subclass of int. Members of an IntEnum can be compared to integers; by extension, integer enumerations of different types can also be compared to each other:

>>> from enum import IntEnum
>>> class Shape(IntEnum):
...   circle = 1
...   square = 2
...
>>> class Request(IntEnum):
...   post = 1
...   get = 2
...
>>> Shape == 1
False
>>> Shape.circle == 1
True
>>> Shape.circle == Request.post
True

However they still can't be compared to Enum:

>>> class Shape(IntEnum):
...   circle = 1
...   square = 2
...
>>> class Color(Enum):
...   red = 1
...   green = 2
...
>>> Shape.circle == Color.red
False

IntEnum values behave like integers in other ways you'd expect:

>>> int(Shape.circle)
1
>>> ['a', 'b', 'c'][Shape.circle]
'b'
>>> [i for i in range(Shape.square)]
[0, 1]

For the vast majority of code, Enum is strongly recommended, since IntEnum breaks some semantic promises of an enumeration (by being comparable to integers, and thus by transitivity to other unrelated enumerations). It should be used only in special cases where there's no other choice; for example, when integer constants are replaced with enumerations and backwards compatibility is required with code that still expects integers.

Other derived enumerations

IntEnum will be part of the enum module. However, it would be very simple to implement independently:

class IntEnum(int, Enum):
    pass

This demonstrates how similar derived enumerations can be defined, for example a StrEnum that mixes in str instead of int.
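As a sketch of that idea (StrEnum and HttpMethod are hypothetical names, not part of the proposed enum module), mixing in str works analogously to IntEnum:

```python
from enum import Enum

# A derived enumeration mixing in str, following the same pattern as
# IntEnum above: the mix-in type comes before Enum in the bases.
class StrEnum(str, Enum):
    pass

class HttpMethod(StrEnum):
    get = 'GET'
    post = 'POST'

print(HttpMethod.get == 'GET')   # True: members compare equal to plain strings
print(HttpMethod.get.lower())    # 'get': str methods are available on members
```

Per rule 2 below, all values of such an enumeration must then be strings.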

Some rules:

  1. When subclassing Enum, mix-in types must appear before Enum itself in the sequence of bases, as in the IntEnum example above.
  2. While Enum can have members of any type, once you mix in an additional type, all the members must have values of that type, e.g. int above. This restriction does not apply to mix-ins which only add methods and don't specify another data type such as int or str.

Pickling

Enumerations can be pickled and unpickled:

>>> from enum.tests.fruit import Fruit
>>> from pickle import dumps, loads
>>> Fruit.tomato is loads(dumps(Fruit.tomato))
True

The usual restrictions for pickling apply: picklable enums must be defined in the top level of a module, since unpickling requires them to be importable from that module.

Functional API

The Enum class is callable, providing the following functional API:

>>> Animal = Enum('Animal', 'ant bee cat dog')
>>> Animal
<Enum 'Animal'>
>>> Animal.ant
<Animal.ant: 1>
>>> Animal.ant.value
1
>>> list(Animal)
[<Animal.ant: 1>, <Animal.bee: 2>, <Animal.cat: 3>, <Animal.dog: 4>]

The semantics of this API resemble namedtuple. The first argument of the call to Enum is the name of the enumeration. Pickling enums created with the functional API will work on CPython and PyPy, but for IronPython and Jython you may need to specify the module name explicitly as follows:

>>> Animals = Enum('Animals', 'ant bee cat dog', module=__name__)

The second argument is the source of enumeration member names. It can be a whitespace-separated string of names, a sequence of names, a sequence of 2-tuples with key/value pairs, or a mapping (e.g. dictionary) of names to values. The last two options enable assigning arbitrary values to enumerations; the others auto-assign increasing integers starting with 1. A new class derived from Enum is returned. In other words, the above assignment to Animal is equivalent to:

>>> class Animals(Enum):
...   ant = 1
...   bee = 2
...   cat = 3
...   dog = 4

The reason for defaulting to 1 as the starting number and not 0 is that 0 is False in a boolean sense, but enum members all evaluate to True.
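The 2-tuple and mapping forms mentioned above allow arbitrary values; for illustration (Status and Level are hypothetical example names):

```python
from enum import Enum

# A sequence of (name, value) 2-tuples:
Status = Enum('Status', [('ok', 200), ('not_found', 404)])

# A mapping of names to values:
Level = Enum('Level', {'low': 1, 'high': 10})

print(Status.not_found.value)  # 404
print(Level.high.value)        # 10
```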

Proposed variations

Some variations were proposed during the discussions on the mailing list. Here are some of the more popular ones.

flufl.enum

flufl.enum was the reference implementation upon which this PEP was originally based. Eventually, it was decided against the inclusion of flufl.enum because its design separated enumeration members from enumerations, so the former are not instances of the latter. Its design also explicitly permits subclassing enumerations for extending them with more members (due to the member/enum separation, the type invariants are not violated in flufl.enum with such a scheme).

Not having to specify values for enums

Michael Foord proposed (and Tim Delaney provided a proof-of-concept implementation) to use metaclass magic that makes this possible:

class Color(Enum):
    red, green, blue

The values are actually assigned only when first looked up.

Pros: cleaner syntax that requires less typing for a very common task (just listing enumeration names without caring about the values).

Cons: involves much magic in the implementation, which makes even the definition of such enums baffling when first seen. Besides, explicit is better than implicit.

Using special names or forms to auto-assign enum values

A different approach to avoiding explicit enum values is to use a special name or form to auto-assign them. For example:

class Color(Enum):
    red = None          # auto-assigned to 0
    green = None        # auto-assigned to 1
    blue = None         # auto-assigned to 2

More flexibly:

class Color(Enum):
    red = 7
    green = None        # auto-assigned to 8
    blue = 19
    purple = None       # auto-assigned to 20

Some variations on this theme:

  1. A special name auto imported from the enum package.
  2. Georg Brandl proposed ellipsis (...) instead of None to achieve the same effect.

Pros: no need to manually enter values. Makes it easier to change the enum and extend it, especially for large enumerations.

Cons: actually longer to type in many simple cases. The argument of explicit vs. implicit applies here as well.

Use-cases in the standard library

The Python standard library has many places where enums would be a beneficial replacement for the idioms currently used to represent constants. Such usages can be divided into two categories: user-code-facing constants and internal constants.

User-code facing constants like os.SEEK_*, socket module constants, decimal rounding modes and HTML error codes could require backwards compatibility since user code may expect integers. IntEnum as described above provides the required semantics; being a subclass of int, it does not affect user code that expects integers, while on the other hand allowing printable representations for enumeration values:

>>> import socket
>>> family = socket.AF_INET
>>> family == 2
True
>>> print(family)
SocketFamily.AF_INET
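Until such a conversion happens, the effect can be sketched with the stdlib IntEnum directly; the Seek class below is a hypothetical stand-in for the os.SEEK_* constants:

```python
from enum import IntEnum

class Seek(IntEnum):
    # Hypothetical replacement for os.SEEK_SET / SEEK_CUR / SEEK_END.
    SET = 0
    CUR = 1
    END = 2
```

User code comparing against plain integers keeps working (Seek.SET == 0, and members pass isinstance(..., int) checks), while debugging output gains the member name.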

Internal constants are not seen by user code but are employed internally by stdlib modules. These can be implemented with Enum. Some examples uncovered by a very partial skim through the stdlib: binhex, imaplib, http/client, urllib/robotparser, idlelib, concurrent.futures, turtledemo.

In addition, looking at the code of the Twisted library, there are many use cases for replacing internal state constants with enums. The same can be said about a lot of networking code (especially implementation of protocols) and can be seen in test protocols written with the Tulip library as well.

Acknowledgments

This PEP initially proposed including Barry Warsaw's flufl.enum package [8] in the stdlib, and is inspired in large part by it. Ben Finney is the author of the earlier enumeration PEP 354.

References

[1] http://mail.python.org/pipermail/python-dev/2013-May/126112.html
[2] http://www.python.org/dev/peps/pep-0354/
[3] http://mail.python.org/pipermail/python-ideas/2013-January/019003.html
[4] http://mail.python.org/pipermail/python-ideas/2013-February/019373.html
[5] To make enums behave similarly to Python classes like bool, and behave in a more intuitive way. It would be surprising if the type of Color.red would not be Color. (Discussion in http://mail.python.org/pipermail/python-dev/2013-April/125687.html)
[6] Subclassing enums and adding new members creates an unresolvable situation; on one hand MoreColor.red and Color.red should not be the same object, and on the other isinstance checks become confusing if they are not. The discussion also links to Stack Overflow discussions that make additional arguments. (http://mail.python.org/pipermail/python-dev/2013-April/125716.html)
[7] It may be useful to have a class defining some behavior (methods, with no actual enumeration members) mixed into an enum, and this would not create the problem discussed in [6]. (Discussion in http://mail.python.org/pipermail/python-dev/2013-May/125859.html)
[8] http://pythonhosted.org/flufl.enum/
[9] http://docs.python.org/3/howto/descriptor.html

pep-0436 The Argument Clinic DSL

PEP:436
Title:The Argument Clinic DSL
Version:$Revision$
Last-Modified:$Date$
Author:Larry Hastings <larry at hastings.org>
Discussions-To:Python-Dev <python-dev at python.org>
Status:Draft
Type:Standards Track
Content-Type:text/x-rst
Created:22-Feb-2013

Abstract

This document proposes "Argument Clinic", a DSL to facilitate argument processing for built-in functions in the implementation of CPython.

Rationale and Goals

The primary implementation of Python, "CPython", is written in a mixture of Python and C. One implementation detail of CPython is what are called "built-in" functions -- functions available to Python programs but written in C. When a Python program calls a built-in function and passes in arguments, those arguments must be translated from Python values into C values. This process is called "parsing arguments".

As of CPython 3.3, builtin functions nearly always parse their arguments with one of two functions: the original PyArg_ParseTuple(), [1] and the more modern PyArg_ParseTupleAndKeywords(). [2] The former only handles positional parameters; the latter also accommodates keyword and keyword-only parameters, and is preferred for new code.

With either function, the caller specifies the translation for parsing arguments in a "format string": [3] each parameter corresponds to a "format unit", a short character sequence telling the parsing function what Python types to accept and how to translate them into the appropriate C value for that parameter.

PyArg_ParseTuple() was reasonable when it was first conceived. There were only a dozen or so of these "format units"; each one was distinct, and easy to understand and remember. But over the years the PyArg_Parse interface has been extended in numerous ways. The modern API is complex, to the point that it is somewhat painful to use. Consider:

  • There are now forty different "format units"; a few are even three characters long. This makes it difficult for the programmer to understand what the format string says--or even perhaps to parse it--without constantly cross-indexing it with the documentation.
  • There are also six meta-format units that may be buried in the format string. (They are: "()|$:;".)
  • The more format units are added, the less likely it is the implementer can pick an easy-to-use mnemonic for the format unit, because the character of choice is probably already in use. In other words, the more format units we have, the more obtuse the format units become.
  • Several format units are nearly identical to others, having only subtle differences. This makes understanding the exact semantics of the format string even harder, and can make it difficult to figure out exactly which format unit you want.
  • The docstring is specified as a static C string, making it mildly bothersome to read and edit since it must obey C string quoting rules.
  • When adding a new parameter to a function using PyArg_ParseTupleAndKeywords(), it's necessary to touch six different places in the code: [4]
    • Declaring the variable to store the argument.
    • Passing in a pointer to that variable in the correct spot in PyArg_ParseTupleAndKeywords(), also passing in any "length" or "converter" arguments in the correct order.
    • Adding the name of the argument in the correct spot of the "keywords" array passed in to PyArg_ParseTupleAndKeywords().
    • Adding the format unit to the correct spot in the format string.
    • Adding the parameter to the prototype in the docstring.
    • Documenting the parameter in the docstring.
  • There is currently no mechanism for builtin functions to provide their "signature" information (see inspect.getfullargspec and inspect.Signature). Adding this information using a mechanism similar to the existing PyArg_Parse functions would require repeating ourselves yet again.

The goal of Argument Clinic is to replace this API with a mechanism inheriting none of these downsides:

  • You need specify each parameter only once.
  • All information about a parameter is kept together in one place.
  • For each parameter, you specify a conversion function; Argument Clinic handles the translation from Python value into C value for you.
  • Argument Clinic also allows for fine-tuning of argument processing behavior with parameterized conversion functions.
  • Docstrings are written in plain text. Function docstrings are required; per-parameter docstrings are encouraged.
  • From this, Argument Clinic generates for you all the mundane, repetitious code and data structures CPython needs internally. Once you've specified the interface, the next step is simply to write your implementation using native C types. Every detail of argument parsing is handled for you.

Argument Clinic is implemented as a preprocessor. It draws inspiration for its workflow directly from [Cog] by Ned Batchelder. To use Clinic, add a block comment to your C source code beginning and ending with special text strings, then run Clinic on the file. Clinic will find the block comment, process the contents, and write the output back into your C source file directly after the comment. The intent is that Clinic's output becomes part of your source code; it's checked in to revision control, and distributed with source packages. This means that Python will still ship ready-to-build. It does complicate development slightly; in order to add a new function, or modify the arguments or documentation of an existing function using Clinic, you'll need a working Python 3 interpreter.

Future goals of Argument Clinic include:

  • providing signature information for builtins,
  • enabling alternative implementations of Python to create automated library compatibility tests, and
  • speeding up argument parsing with improvements to the generated code.

DSL Syntax Summary

The Argument Clinic DSL is specified as a comment embedded in a C file, as follows. The "Example" column on the right shows you sample input to the Argument Clinic DSL, and the "Section" column on the left specifies what each line represents in turn.

Argument Clinic's DSL syntax mirrors the Python def statement, lending it some familiarity to Python core developers.

+-----------------------+-----------------------------------------------------------------+
| Section               | Example                                                         |
+-----------------------+-----------------------------------------------------------------+
| Clinic DSL start      | /*[clinic]                                                      |
| Module declaration    | module module_name                                              |
| Class declaration     | class module_name.class_name                                    |
| Function declaration  | module_name.function_name  -> return_annotation                 |
| Parameter declaration |       name : converter(param=value)                             |
| Parameter docstring   |           Lorem ipsum dolor sit amet, consectetur               |
|                       |           adipisicing elit, sed do eiusmod tempor               |
| Function docstring    | Lorem ipsum dolor sit amet, consectetur adipisicing             |
|                       | elit, sed do eiusmod tempor incididunt ut labore et             |
| Clinic DSL end        | [clinic]*/                                                      |
| Clinic output         | ...                                                             |
| Clinic output end     | /*[clinic end output:<checksum>]*/                              |
+-----------------------+-----------------------------------------------------------------+

To give some flavor of the proposed DSL syntax, here are some sample Clinic code blocks. This first block reflects the normally preferred style, including blank lines between parameters and per-argument docstrings. It also includes a user-defined converter (path_t) created locally:

/*[clinic]
os.stat as os_stat_fn -> stat result

   path: path_t(allow_fd=1)
       Path to be examined; can be string, bytes, or open-file-descriptor int.

   *

   dir_fd: OS_STAT_DIR_FD_CONVERTER = DEFAULT_DIR_FD
       If not None, it should be a file descriptor open to a directory,
       and path should be a relative string; path will then be relative to
       that directory.

   follow_symlinks: bool = True
       If False, and the last element of the path is a symbolic link,
       stat will examine the symbolic link itself instead of the file
       the link points to.

Perform a stat system call on the given path.

{parameters}

dir_fd and follow_symlinks may not be implemented
  on your platform.  If they are unavailable, using them will raise a
  NotImplementedError.

It's an error to use dir_fd or follow_symlinks when specifying path as
  an open file descriptor.

[clinic]*/

This second example shows a minimal Clinic code block, omitting all parameter docstrings and non-significant blank lines:

/*[clinic]
os.access
   path: path
   mode: int
   *
   dir_fd: OS_ACCESS_DIR_FD_CONVERTER = 1
   effective_ids: bool = False
   follow_symlinks: bool = True
Use the real uid/gid to test for access to a path.
Returns True if granted, False otherwise.

{parameters}

dir_fd, effective_ids, and follow_symlinks may not be implemented
  on your platform.  If they are unavailable, using them will raise a
  NotImplementedError.

Note that most operations will use the effective uid/gid, therefore this
  routine can be used in a suid/sgid environment to test if the invoking user
  has the specified access to the path.

[clinic]*/

This final example shows a Clinic code block handling groups of optional parameters, including parameters on the left:

/*[clinic]
curses.window.addch

   [
   y: int
     Y-coordinate.

   x: int
     X-coordinate.
   ]

   ch: char
     Character to add.

   [
   attr: long
     Attributes for the character.
   ]

   /

Paint character ch at (y, x) with attributes attr,
overwriting any character previously painted at that location.
By default, the character position and attributes are the
current settings for the window object.
[clinic]*/

General Behavior Of the Argument Clinic DSL

All lines support # as a line-comment delimiter, except inside docstrings. Blank lines are always ignored.

As in Python itself, leading whitespace is significant in the Argument Clinic DSL. The first line of the "function" section is the function declaration. Indented lines below the function declaration declare parameters, one per line; lines below those that are indented even further are per-parameter docstrings. Finally, the first line dedented back to column 0 ends the parameter declarations and starts the function docstring.

Parameter docstrings are optional; function docstrings are not. Functions that specify no arguments may simply specify the function declaration followed by the docstring.

Module and Class Declarations

When a C file implements a module or class, this should be declared to Clinic. The syntax is simple:

module module_name

or

class module_name.class_name

(Note that these are not actually special syntax; they are implemented as Directives.)

The module name or class name should always be the full dotted path from the top-level module. Nested modules and classes are supported.

Function Declaration

The full form of the function declaration is as follows:

dotted.name [ as legal_c_id ] [ -> return_annotation ]

The dotted name should be the full name of the function, starting with the highest-level package (e.g. "os.stat" or "curses.window.addch").

The "as legal_c_id" syntax is optional. Argument Clinic uses the name of the function to create the names of the generated C functions. In some circumstances, the generated name may collide with other global names in the C program's namespace. The "as legal_c_id" syntax allows you to override the generated name with your own; substitute "legal_c_id" with any legal C identifier. If skipped, the "as" keyword must also be omitted.

The return annotation is also optional. If skipped, the arrow ("->") must also be omitted. If specified, the value for the return annotation must be compatible with ast.literal_eval, and it is interpreted as a return converter.

Parameter Declaration

The full form of the parameter declaration line is as follows:

name: converter [ (parameter=value [, parameter2=value2]) ] [ = default]

The "name" must be a legal C identifier. Whitespace is permitted between the name and the colon (though this is not the preferred style). Whitespace is permitted (and encouraged) between the colon and the converter.

The "converter" is the name of one of the "converter functions" registered with Argument Clinic. Clinic will ship with a number of built-in converters; new converters can also be added dynamically. In choosing a converter, you are automatically constraining what Python types are permitted on the input, and specifying what type the output variable (or variables) will be. Although many of the converters will resemble the names of C types or perhaps Python types, the name of a converter may be any legal Python identifier.

If the converter is followed by parentheses, these parentheses enclose parameters to the conversion function. The syntax mirrors providing arguments in a Python function call: the parameters must always be named, as if they were "keyword-only parameters", and the values provided for the parameters will syntactically resemble Python literal values. These parameters are always optional, permitting all conversion functions to be called without any parameters. In this case, you may also omit the parentheses entirely; this is always equivalent to specifying empty parentheses. The values supplied for these parameters must be compatible with ast.literal_eval.
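Being "compatible with ast.literal_eval" means the values must be Python literals or literal containers; anything requiring evaluation is rejected. A quick illustration with the stdlib function:

```python
import ast

# Literal values parse fine.
assert ast.literal_eval("1") == 1
assert ast.literal_eval("'ascii'") == 'ascii'
assert ast.literal_eval("(0, 1)") == (0, 1)

# A function call is not a literal, so it is rejected.
try:
    ast.literal_eval("open('x')")
    rejected = False
except ValueError:
    rejected = True
assert rejected
```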

The "default" is a Python literal value. Default values are optional; if not specified you must omit the equals sign too. Parameters which don't have a default are implicitly required. The default value is dynamically assigned, "live" in the generated C code, and although it's specified as a Python value, it's translated into a native C value in the generated C code. Few default values are permitted, owing to this manual translation step.

If this were a Python function declaration, a parameter declaration would be delimited by either a trailing comma or a closing parenthesis. However, Argument Clinic uses neither; parameter declarations are delimited by newlines. A trailing comma or right parenthesis is not permitted.

The first parameter declaration establishes the indent for all parameter declarations in a particular Clinic code block. All subsequent parameters must be indented to the same level.

Legacy Converters

For convenience's sake in converting existing code to Argument Clinic, Clinic provides a set of legacy converters that match PyArg_ParseTuple format units. They are specified as a C string containing the format unit. For example, to specify a parameter "foo" as taking a Python "int" and emitting a C int, you could specify:

foo : "i"

(To more closely resemble a C string, these must always use double quotes.)

Although these resemble PyArg_ParseTuple format units, no guarantee is made that the implementation will call a PyArg_Parse function for parsing.

This syntax does not support parameters. Therefore it doesn't support any of the format units that require input parameters ("O!", "O&", "es", "es#", "et", "et#"). Parameters requiring one of these conversions cannot use the legacy syntax. (You may still, however, supply a default value.)

Parameter Docstrings

All lines that appear below and are indented further than a parameter declaration are the docstring for that parameter. All such lines are "dedented" until the first line is flush left.

Special Syntax For Parameter Lines

There are four special symbols that may be used in the parameter section. Each of these must appear on a line by itself, indented to the same level as parameter declarations. The four symbols are:

*
Establishes that all subsequent parameters are keyword-only.
[
Establishes the start of an optional "group" of parameters. Note that "groups" may nest inside other "groups". See Functions With Positional-Only Parameters below. Note that currently [ is only legal for use in functions where all parameters are marked positional-only, see / below.
]
Ends an optional "group" of parameters.
/
Establishes that all the preceding arguments are positional-only. For now, Argument Clinic does not support functions with both positional-only and non-positional-only arguments. Therefore: if / is specified for a function, it must currently always be after the last parameter. Also, Argument Clinic does not currently support default values for positional-only parameters.

(The semantics of / follow a syntax for positional-only parameters in Python once proposed by Guido. [5] )

Function Docstring

The first line with no leading whitespace after the function declaration is the first line of the function docstring. All subsequent lines of the Clinic block are considered part of the docstring, and their leading whitespace is preserved.

If the string {parameters} appears on a line by itself inside the function docstring, Argument Clinic will insert a list of all parameters that have docstrings, each such parameter followed by its docstring. The name of the parameter is on a line by itself; the docstring starts on a subsequent line, and all lines of the docstring are indented by two spaces. (Parameters with no per-parameter docstring are suppressed.) The entire list is indented by the leading whitespace that appeared before the {parameters} token.

If the string {parameters} doesn't appear in the docstring, Argument Clinic will append one to the end of the docstring, inserting a blank line above it if the docstring does not end with a blank line, and with the parameter list at column 0.
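The substitution just described can be sketched in Python (a hypothetical helper, not Clinic's actual implementation):

```python
def insert_parameters(docstring, params, indent=''):
    """Expand a {parameters} token as described above.

    params is a list of (name, docstring) pairs; parameters with no
    docstring are suppressed.  indent is the leading whitespace that
    appeared before the {parameters} token.
    """
    lines = []
    for name, doc in params:
        if not doc:
            continue
        lines.append(indent + name)                # name on a line by itself
        for line in doc.splitlines():
            lines.append(indent + '  ' + line)     # docstring indented two spaces
    return docstring.replace(indent + '{parameters}', '\n'.join(lines))
```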

Converters

Argument Clinic contains a pre-initialized registry of converter functions. Example converter functions:

int
Accepts a Python object implementing __int__; emits a C int.
byte
Accepts a Python int; emits an unsigned char. The integer must be in the range [0, 256).
str
Accepts a Python str object; emits a C char *. Automatically encodes the string using the ascii codec.
PyObject
Accepts any object; emits a C PyObject * without any conversion.

All converters accept the following parameters:

doc_default
The Python value to use in place of the parameter's actual default in Python contexts. In other words: when specified, this value will be used for the parameter's default in the docstring, and in the Signature. (TBD alternative semantics: If the string is a valid Python expression which can be rendered into a Python value using eval(), then the result of eval() on it will be used as the default in the Signature.) Ignored if there is no default.
required
Normally any parameter that has a default value is automatically optional. A parameter that has "required" set will be considered required (non-optional) even if it has a default value. The generated documentation will also not show any default value.

Additionally, converters may accept one or more of these optional parameters, on an individual basis:

annotation
Explicitly specifies the per-parameter annotation for this parameter. Normally it's the responsibility of the conversion function to generate the annotation (if any).
bitwise
For converters that accept unsigned integers. If the Python integer passed in is signed, copy the bits directly even if it is negative.
encoding
For converters that accept str. Encoding to use when encoding a Unicode string to a char *.
immutable
Only accept immutable values.
length
For converters that accept iterable types. Requests that the converter also emit the length of the iterable, passed in to the _impl function in a Py_ssize_t variable; its name will be this parameter's name appended with "_length".
nullable
This converter normally does not accept None, but in this case it should. If None is supplied on the Python side, the equivalent C argument will be NULL. (The _impl argument emitted by this converter will presumably be a pointer type.)
types

A list of strings representing acceptable Python types for this object. There are also four strings which represent Python protocols:

  • "buffer"
  • "mapping"
  • "number"
  • "sequence"
zeroes
For converters that accept string types. The converted value should be allowed to have embedded zeroes.

Return Converters

A return converter conceptually performs the inverse operation of a converter: it converts a native C value into its equivalent Python value.

Directives

Argument Clinic also permits "directives" in Clinic code blocks. Directives are similar to pragmas in C; they are statements that modify Argument Clinic's behavior.

The format of a directive is as follows:

directive_name [argument [second_argument [ ... ]]]

Directives only take positional arguments.

A Clinic code block must contain either one or more directives, or a function declaration. It may contain both, in which case all directives must come before the function declaration.

Internally, directives map directly to Python callables. The directive's arguments are passed to the callable as positional arguments, each of type str.

Example possible directives include the production, suppression, or redirection of Clinic output. Also, the "module" and "class" keywords are implemented as directives in the prototype.
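A hypothetical sketch of that mapping (illustrative only, not Clinic's actual internals):

```python
directives = {}

def directive(fn):
    """Register a callable as a directive under its own name."""
    directives[fn.__name__] = fn
    return fn

@directive
def module(name):
    # A stand-in for the real "module" directive.
    return ('module', name)

def run_directive(line):
    # Everything after the directive name arrives as positional strings.
    name, *args = line.split()
    return directives[name](*args)
```

run_directive('module os') would then invoke the registered module callable with the single string argument 'os'.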

Python Code

Argument Clinic also permits embedding Python code inside C files, which is executed in-place when Argument Clinic processes the file. Embedded code looks like this:

/*[python]

# this is python code!
print("/" + "* Hello world! *" + "/")

[python]*/
/* Hello world! */
/*[python end:da39a3ee5e6b4b0d3255bfef95601890afd80709]*/

The "/* Hello world! */" line above was generated by running the Python code in the preceding comment.

Any Python code is valid. Python code sections in Argument Clinic can also be used to directly interact with Clinic; see Argument Clinic Programmatic Interfaces.

Output

Argument Clinic writes its output inline in the C file, immediately after the section of Clinic code. For "python" sections, the output is everything printed using builtins.print. For "clinic" sections, the output is valid C code, including:

  • a #define providing the correct methoddef structure for the function
  • a prototype for the "impl" function -- this is what you'll write to implement this function
  • a function that handles all argument processing, which calls your "impl" function
  • the definition line of the "impl" function
  • and a comment indicating the end of output.

The intention is that you write the body of your impl function immediately after the output -- as in, you write a left curly brace immediately after the end-of-output comment and implement the builtin in the body there. (It's a bit strange at first, but oddly convenient.)

Argument Clinic will define the parameters of the impl function for you. The function will take the "self" parameter passed in originally, all the parameters you define, and possibly some extra generated parameters ("length" parameters; also "group" parameters, see next section).

Argument Clinic also writes a checksum for the output section. This is a valuable safety feature: if you modify the output by hand, Clinic will notice that the checksum doesn't match, and will refuse to overwrite the file. (You can force Clinic to overwrite with the "-f" command-line argument; Clinic will also ignore the checksums when using the "-o" command-line argument.)

Finally, Argument Clinic can also emit the boilerplate definition of the PyMethodDef array for the defined classes and modules.

Functions With Positional-Only Parameters

A significant fraction of Python builtins implemented in C use the older positional-only API for processing arguments (PyArg_ParseTuple()). In some instances, these builtins parse their arguments differently based on how many arguments were passed in. This can provide some bewildering flexibility: there may be groups of optional parameters, which must either all be specified or none specified. And occasionally these groups are on the left! (A representative example: curses.window.addch().)

Argument Clinic supports these legacy use-cases by allowing you to specify parameters in groups. Each optional group of parameters is marked with square brackets. Note that these groups are permitted on the right or left of any required parameters!

The impl function generated by Clinic will add an extra parameter for every group, "int group_{left|right}_<x>", where x is a monotonically increasing number assigned to each group as the groups expand away from the required arguments. This argument will be nonzero if the group was specified on this call, and zero if it was not.

Note that when operating in this mode, you cannot specify default arguments.

Also, note that it's possible to specify a set of groups to a function such that there are several valid mappings from the number of arguments to a valid set of groups. If this happens, Clinic will abort with an error message. This should not be a problem, as positional-only operation is only intended for legacy use cases, and all the legacy functions using this quirky behavior have unambiguous mappings.
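Since each group is either wholly present or wholly absent, and groups are consumed outward from the required parameters, the check can be sketched as asking whether two different selections of groups yield the same total argument count (hypothetical code, not Clinic's actual bookkeeping):

```python
def is_ambiguous(required, left_groups, right_groups):
    # left_groups / right_groups hold the sizes of the optional
    # groups, ordered outward from the required parameters.
    seen = set()
    for i in range(len(left_groups) + 1):
        for j in range(len(right_groups) + 1):
            total = required + sum(left_groups[:i]) + sum(right_groups[:j])
            if total in seen:
                return True
            seen.add(total)
    return False

# curses.window.addch: required ch, [y, x] on the left, [attr] on
# the right -- every argument count maps to exactly one selection.
assert not is_ambiguous(1, [2], [1])
```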

Current Status

As of this writing, there is a working prototype implementation of Argument Clinic available online (though the syntax may be out of date as you read this). [6] The prototype generates code using the existing PyArg_Parse APIs. It supports translating to all current format units except the mysterious "w*". Sample functions using Argument Clinic exercise all major features, including positional-only argument parsing.

Argument Clinic Programmatic Interfaces

The prototype also currently provides an experimental extension mechanism, allowing adding support for new types on-the-fly. See Modules/posixmodule.c in the prototype for an example of its use.

In the future, Argument Clinic is expected to be automatable enough to allow querying, modification, or outright new construction of function declarations through Python code. It may even permit dynamically adding your own custom DSL!

Notes / TBD

  • The API for supplying inspect.Signature metadata for builtins is currently under discussion. Argument Clinic will add support for the prototype when it becomes viable.

  • Nick Coghlan suggests that we a) only support at most one left-optional group per function, and b) in the face of ambiguity, prefer the left group over the right group. This would solve all our existing use cases including range().

  • Optimally we'd want Argument Clinic run automatically as part of the normal Python build process. But this presents a bootstrapping problem; if you don't have a system Python 3, you need a Python 3 executable to build Python 3. I'm sure this is a solvable problem, but I don't know what the best solution might be. (Supporting this will also require a parallel solution for Windows.)

  • On a related note: inspect.Signature has no way of representing blocks of arguments, like the left-optional block of y and x for curses.window.addch. How far are we going to go in supporting this admittedly aberrant parameter paradigm?

  • During the PyCon US 2013 Language Summit, there was discussion of having Argument Clinic also generate the actual documentation (in ReST, processed by Sphinx) for the function. The logistics of this are TBD, but it would require that the docstrings be written in ReST, and require that Python ship a ReST -> ascii converter. It would be best to come to a decision about this before we begin any large-scale conversion of the CPython source tree to using Clinic.

  • Guido proposed having the "function docstring" be hand-written inline, in the middle of the output, something like this:

    /*[clinic]
      ... prototype and parameters (including parameter docstrings) go here
    [clinic]*/
    ... some output ...
    /*[clinic docstring start]*/
    ... hand-edited function docstring goes here   <-- you edit this by hand!
    /*[clinic docstring end]*/
    ... more output
    /*[clinic output end]*/
    

    I tried it this way and don't like it -- I think it's clumsy. I prefer that everything you write goes in one place, rather than having an island of hand-edited stuff in the middle of the DSL output.

  • Argument Clinic does not support automatic tuple unpacking (the "(OOO)"-style format string for PyArg_ParseTuple()).

  • Argument Clinic removes some dynamism / flexibility. With PyArg_ParseTuple() one could theoretically pass in different encodings at runtime for the "es"/"et" format units. AFAICT CPython doesn't do this itself; however, it's possible external users might. (Trivia: there are no uses of "es" exercised by regrtest, and all the uses of "et" exercised are in socketmodule.c, except for one in _ssl.c. They're all static, specifying the encoding "idna".)

Acknowledgements

The PEP author wishes to thank Ned Batchelder for permission to shamelessly rip off his clever design for Cog--"my favorite tool that I've never gotten to use". Thanks also to everyone who provided feedback on the [bugtracker issue] and on python-dev. Special thanks to Nick Coghlan and Guido van Rossum for a rousing two-hour in-person deep dive on the topic at PyCon US 2013.

pep-0437 A DSL for specifying signatures, annotations and argument converters

PEP:437
Title:A DSL for specifying signatures, annotations and argument converters
Version:$Revision$
Last-Modified:$Date$
Author:Stefan Krah <skrah at bytereef.org>
Status:Rejected
Type:Standards Track
Content-Type:text/x-rst
Created:11-Mar-2013
Python-Version:3.4
Post-History:
Resolution:http://mail.python.org/pipermail/python-dev/2013-May/126117.html

Abstract

The Python C-API currently has no mechanism for specifying and auto-generating function signatures, annotations or custom argument converters.

There are several possible approaches to the problem. Cython uses cdef definitions in .pyx files to generate the required information. However, CPython's C-API functions often require additional initialization and cleanup snippets that would be hard to specify in a cdef.

PEP 436 proposes a domain specific language (DSL) enclosed in C comments that largely resembles a per-parameter configuration file. A preprocessor reads the comment and emits an argument parsing function, docstrings and a header for the function that utilizes the results of the parsing step.

The latter function is subsequently referred to as the implementation function.

Rejection Notice

This PEP was rejected by Guido van Rossum at PyCon US 2013. However, several of the specific issues raised by this PEP were taken into account when designing the second iteration of the PEP 436 DSL [3].

Rationale

Opinions differ regarding the suitability of the PEP 436 DSL in the context of a C file. This PEP proposes an alternative DSL. The specific issues with PEP 436 that spurred the counter proposal will be explained in the final section of this PEP.

Scope

The PEP focuses exclusively on the DSL. Topics like the output locations of docstrings or the generated code are outside the scope of this PEP.

It is however vital that the DSL is suitable for generating custom argument parsers, a feature that is already implemented in Cython. Therefore, one of the goals of this PEP is to keep the DSL close to existing solutions, thus facilitating a possible inclusion of the relevant parts of Cython into the CPython source tree.

DSL overview

Type safety and annotations

A conversion from a Python value to a C value is fully defined by the type of the converter function. The PyArg_Parse* family of functions accepts custom converters in addition to the well-known default converters "i", "f", etc.

This PEP views the default converters as abstract functions, regardless of how they are actually implemented.

Include/converters.h

Converter functions must be forward-declared. All converter functions shall be entered into the file Include/converters.h. The file is read by the preprocessor prior to translating .c files. This is an excerpt:

/*[converter]
##### Default converters #####
"s":  str                                -> const char *res;
"s*": [str, bytes, bytearray, rw_buffer] -> Py_buffer &res;
[...]
"es#": str -> (const char *res_encoding, char **res, Py_ssize_t *res_length);
[...]
##### Custom converters #####
path_converter:           [str, bytes, int]  -> path_t &res;
OS_STAT_DIR_FD_CONVERTER: [int, None]        -> int res;
[converter_end]*/

Converters are specified by their name, Python input type(s) and C output type(s). Default converters must have quoted names, custom converters must have regular names. A Python type is given by its name. If a function accepts multiple Python types, the set is written in list form.

Since the default converters may have multiple implicit return values, the C output type(s) are written according to the following convention:

The main return value must be named res. This is a placeholder for the actual variable name given later in the DSL. Additional implicit return values must be prefixed by res_.

By default the variables are passed by value to the implementation function. If the address should be passed instead, res must be prefixed with an ampersand.

Additional declarations may be placed into .c files. Duplicate declarations are allowed as long as the function types are identical.

It is encouraged to declare custom converter types a second time right above the converter function definition. The preprocessor will then catch any mismatch between the declarations.

In order to keep the converter complexity manageable, PY_SSIZE_T_CLEAN will be deprecated and Py_ssize_t will be assumed for all length arguments.

TBD: Make a list of fantasy types like rw_buffer.

Function specifications

Keyword arguments

This example contains the definition of os.stat. The individual sections will be explained in detail. Grammatically, the whole define block consists of a function specification and an output section. The function specification in turn consists of a declaration section, an optional C-declaration section and an optional cleanup code section. Sections within the function specification are separated in yacc style by '%%':

/*[define posix_stat]
def os.stat(path: path_converter, *, dir_fd: OS_STAT_DIR_FD_CONVERTER = None,
            follow_symlinks: "p" = True) -> os.stat_result: pass
%%
path_t path = PATH_T_INITIALIZE("stat", 0, 1);
int dir_fd = DEFAULT_DIR_FD;
int follow_symlinks = 1;
%%
path_cleanup(&path);
[define_end]*/

<literal C output>

/*[define_output_end]*/

Define block

The function specification block starts with a /*[define token, followed by an optional C function name, followed by a right bracket. If the C function name is not given, it is generated from the declaration name. In the example, omitting the name posix_stat would result in a C function name of os_stat.

Declaration

The required declaration is (almost) a valid Python function definition. The 'def' keyword and the function body are redundant, but the author of this PEP finds the definition more readable if they are present.

The function name may be a path instead of a plain identifier. Each argument is annotated with the name of the converter function that will be applied to it.

Default values are given in the usual Python manner and may be any valid Python expression.

The return value may be any Python expression. Usually it will be the name of an object, but alternative return values could be specified in list form.

C-declarations

This optional section contains C variable declarations. Since the converter functions have been declared beforehand, the preprocessor can type-check the declarations.

Cleanup

The optional cleanup section contains literal C code that will be inserted unmodified after the implementation function.

Output

The output section contains the code emitted by the preprocessor.

Positional-only arguments

Functions that do not take keyword arguments are indicated by the presence of the slash special parameter:

/*[define stat_float_times]
def os.stat_float_times(/, newval: "i") -> os.stat_result: pass
%%
int newval = -1;
[define_end]*/

The preprocessor translates this definition to a PyArg_ParseTuple() call. All arguments to the right of the slash are optional arguments.

Left and right optional arguments

Some legacy functions contain optional argument groups both to the left and right of a central parameter. It is debatable whether a new tool should support such functions. For completeness' sake, this is the proposed syntax:

/*[define]
def curses.window.addch(y: "i", x: "i", ch: "O", attr: "l") -> None: pass
where groups = [[ch], [ch, attr], [y, x, ch], [y, x, ch, attr]]
[define_end]*/

Here ch is the central parameter, attr can optionally be added on the right, and the group [y, x] can optionally be added on the left.

Essentially the rule is that all ordered combinations of the central parameter and the optional groups must be possible such that no two combinations have the same length.

This is concisely expressed by putting the central parameter first in the list and subsequently adding the optional argument groups to the left and right.
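The "no two combinations have the same length" rule can be checked mechanically. The sketch below is illustrative only (the names `combinations` and `unambiguous` are not part of the proposed tool): it enumerates every ordered combination of a central parameter with cumulative left and right optional groups and verifies that argument counts map unambiguously onto combinations.

```python
def combinations(central, left_groups=(), right_groups=()):
    """Enumerate every valid parameter list.  Groups are cumulative:
    taking the second left group implies taking the first, and so on."""
    combos = []
    for nl in range(len(left_groups) + 1):
        for nr in range(len(right_groups) + 1):
            left = [p for g in left_groups[:nl] for p in g]
            right = [p for g in right_groups[:nr] for p in g]
            combos.append(left + [central] + right)
    return combos

def unambiguous(combos):
    """True if the number of arguments uniquely selects a combination."""
    lengths = [len(c) for c in combos]
    return len(lengths) == len(set(lengths))

# curses.window.addch: central "ch", left group [y, x], right group [attr]
addch = combinations("ch", left_groups=[["y", "x"]], right_groups=[["attr"]])
```

Sorting the result by length reproduces the groups list from the addch example above.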

Flexibility in formatting

If the above os.stat example is considered too compact, it can easily be formatted this way:

/*[define posix_stat]
def os.stat(path: path_converter,
            *,
            dir_fd: OS_STAT_DIR_FD_CONVERTER = None,
            follow_symlinks: "p" = True)
-> os.stat_result: pass
%%
path_t path = PATH_T_INITIALIZE("stat", 0, 1);
int dir_fd = DEFAULT_DIR_FD;
int follow_symlinks = 1;
%%
path_cleanup(&path);
[define_end]*/

<literal C output>

/*[define_output_end]*/

Benefits of a compact notation

The advantages of a concise notation are especially obvious when a large number of parameters is involved. The argument parsing part of _posixsubprocess.fork_exec is fully specified by this definition:

/*[define subprocess_fork_exec]
def _posixsubprocess.fork_exec(
    process_args: "O", executable_list: "O",
    close_fds: "p", py_fds_to_keep: "O",
    cwd_obj: "O", env_list: "O",
    p2cread: "i", p2cwrite: "i", c2pread: "i", c2pwrite: "i",
    errread: "i", errwrite: "i", errpipe_read: "i", errpipe_write: "i",
    restore_signals: "i", call_setsid: "i", preexec_fn: "i", /) -> int: pass
[define_end]*/

Note that the preprocess tool currently emits a redundant C-declaration section for this example, so the output is longer than necessary.

Easy validation of the definition

How can an inexperienced user validate a definition like os.stat? Simply by changing os.stat to os_stat, defining missing converters and pasting the definition into the Python interactive interpreter!

In fact, a converters.py module could be auto-generated from converters.h.
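As a sketch of the validation idea (the stub converters below are placeholders, not the auto-generated module): once the converters exist as Python names, the renamed declaration is an ordinary annotated Python function, so the interpreter itself checks the syntax and inspect can recover the structure.

```python
import inspect
import os

# Stub converters standing in for a hypothetical auto-generated converters.py
def path_converter(obj): ...
def OS_STAT_DIR_FD_CONVERTER(obj): ...

# The os.stat declaration from the example, with os.stat renamed os_stat:
def os_stat(path: path_converter, *, dir_fd: OS_STAT_DIR_FD_CONVERTER = None,
            follow_symlinks: "p" = True) -> os.stat_result: pass

# Python has now validated the declaration; inspect sees the structure.
sig = inspect.signature(os_stat)
```

Any typo in the parameter list or annotations would raise a SyntaxError or NameError at this point.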

Reference implementation

A reference implementation is available at issue 16612 [1]. Since this PEP was written under time constraints and the author is unfamiliar with the PLY toolchain, the software is written in Standard ML and utilizes the ml-yacc/ml-lex toolchain.

The grammar is conflict-free and available in ml-yacc readable BNF form.

Two tools are available:

  • printsemant reads a converter header and a .c file and dumps the semantically checked parse tree to stdout.
  • preprocess reads a converter header and a .c file and dumps the preprocessed .c file to stdout.

Known deficiencies:

  • The Python 'test' expression is not semantically checked. The syntax however is checked since it is part of the grammar.
  • The lexer does not handle triple quoted strings.
  • C declarations are parsed in a primitive way. The final implementation should utilize 'declarator' and 'init-declarator' from the C grammar.
  • The preprocess tool does not emit code for the left-and-right optional arguments case. The printsemant tool can deal with this case.
  • Since the preprocess tool generates the output from the parse tree, the original indentation of the define block is lost.

Grammar

TBD: The grammar exists in ml-yacc readable form, but should probably be included here in EBNF notation.

Comparison with PEP 436

The author of this PEP has the following concerns about the DSL proposed in PEP 436:

  • The whitespace sensitive configuration file like syntax looks out of place in a C file.

  • The structure of the function definition gets lost in the per-parameter specifications. Keywords like positional-only, required and keyword-only are scattered across too many different places.

    By contrast, in the alternative DSL the structure of the function definition can be understood at a single glance.

  • The PEP 436 DSL has 14 documented flags and at least one undocumented (allow_fd) flag. Figuring out which of the 2**15 possible combinations are valid places an unnecessary burden on the user.

    Experience with the PEP-3118 buffer flags has shown that sorting out (and exhaustively testing!) valid combinations is an extremely tedious task. The PEP-3118 flags are still not well understood by many people.

    By contrast, the alternative DSL has a central file Include/converters.h that can be quickly searched for the desired converter. Many of the converters are already known, perhaps even memorized by people (due to frequent use).

  • The PEP 436 DSL allows too much freedom. Types can apparently be omitted, the preprocessor accepts (and ignores) unknown keywords, sometimes adding white space after a docstring results in an assertion error.

    The alternative DSL on the other hand allows no such freedoms. Omitting converter or return value annotations is plainly a syntax error. The LALR(1) grammar is unambiguous and specified for the complete translation unit.

pep-0438 Transitioning to release-file hosting on PyPI

PEP:438
Title:Transitioning to release-file hosting on PyPI
Version:$Revision$
Last-Modified:$Date$
Author:Holger Krekel <holger at merlinux.eu>, Carl Meyer <carl at oddbird.net>
BDFL-Delegate:Richard Jones <richard@python.org>
Discussions-To:distutils-sig at python.org
Status:Accepted
Type:Process
Content-Type:text/x-rst
Created:15-Mar-2013
Post-History:19-May-2013
Resolution:http://mail.python.org/pipermail/distutils-sig/2013-May/020773.html

Abstract

This PEP proposes a backward-compatible two-phase transition process to speed up, simplify and robustify installing from the pypi.python.org (PyPI) package index. To ease the transition and minimize client-side friction, no changes to distutils or existing installation tools are required in order to benefit from the first transition phase, which will result in faster, more reliable installs for most existing packages.

The first transition phase implements easy and explicit means for a package maintainer to control which release file links are served to present-day installation tools. The first phase also includes the implementation of analysis tools for present-day packages, to support communication with package maintainers and the automated setting of default modes for controlling release file links. The first phase also will default newly-registered projects on PyPI to only serve links to release files which were uploaded to PyPI.

The second transition phase concerns end-user installation tools, which shall default to only install release files that are hosted on PyPI and tell the user if external release files exist, offering a choice to automatically use those external files. External release files shall in the future be registered together with a checksum hash so that installation tools can verify the integrity of the eventual download (PyPI-hosted release files always carry such a checksum).

Alternative PyPI server implementations should implement the new simple index serving behaviour of transition phase 1 to avoid installation tools treating their release links as external ones in phase 2.

Rationale

History and motivations for external hosting

When PyPI went online, it offered release registration but had no facility to host release files itself. When hosting was added, no automated downloading tool existed yet. When Phillip Eby implemented automated downloading (through setuptools), he made the choice to allow people to use download hosts of their choice. The finding of externally-hosted packages was implemented as follows:

  1. The PyPI simple/ index for a package contains all links found by scraping that package's long_description metadata for any release. Links in the "Download-URL" and "Home-page" metadata fields are given rel=download and rel=homepage attributes, respectively.
  2. Any of these links whose target is a file whose name appears to be in the form of an installable source or binary distribution, with name in the form "packagename-version.ARCHIVEEXT", is considered a potential installation candidate by installation tools.
  3. Similarly, any links suffixed with an "#egg=packagename-version" fragment are considered an installation candidate.
  4. Additionally, the rel=homepage and rel=download links are crawled by installation tools and, if HTML, are themselves scraped for release-file links in the above formats.

See the easy_install documentation for a complete description of this behavior. [1]
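The filename heuristics in steps 2 and 3 can be approximated with a short sketch. This is an illustration of the described behavior, not the actual setuptools implementation (which lives in its package_index module); the extension list and regex are simplified assumptions.

```python
import re
from urllib.parse import urlparse

# Simplified stand-ins for setuptools' candidate-detection rules
ARCHIVE_EXTS = (".tar.gz", ".tgz", ".tar.bz2", ".zip", ".egg", ".exe")
EGG_RE = re.compile(r"#egg=(?P<name>[^-]+)-(?P<version>.+)$")

def is_candidate(url):
    """True if a scraped link looks like an installable distribution:
    either a "packagename-version.ARCHIVEEXT" file or an "#egg=" link."""
    if EGG_RE.search(url):
        return True
    path = urlparse(url).path
    return path.endswith(ARCHIVE_EXTS)
```

Under these rules both a direct archive link and a VCS link with an "#egg=" fragment count as installation candidates, while an ordinary homepage link does not.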

Today, most packages indexed on PyPI host their release files on PyPI. Out of 29,117 total projects on PyPI, only 2,581 (less than 10%) include any links to installable files that are available only off-PyPI. [2]

There are many reasons [3] why people have chosen external hosting. To cite just a few:

  • release processes and scripts have been developed already and upload to external sites
  • it takes too long to upload large files from some places in the world
  • export restrictions e.g. for crypto-related software
  • company policies which require offering open source packages through own sites
  • problems with integrating uploading to PyPI into one's release process (because of release policies)
  • desiring download statistics different from those maintained by PyPI
  • perceived bad reliability of PyPI
  • not aware that PyPI offers file-hosting

Irrespective of the present-day validity of these reasons, there is clearly a history of why people chose to host files externally, and for some time it was the only option. This PEP takes the position that some valid reasons for external hosting remain even today.

Problem

Today, Python package installers (pip, easy_install, buildout, and others) often need to query many non-PyPI URLs even if there are no externally hosted files. Apart from querying pypi.python.org's simple index pages, installers also crawl every homepage and download page ever specified with any release of a package. The need for installers to crawl external sites slows down installation and makes for a brittle and unreliable installation process. Those sites and packages also don't take part in the PEP 381 mirroring infrastructure, further decreasing the reliability and speed of automated installation processes around the world.

Most packages are hosted directly on pypi.python.org [2]. Even for these packages, installers still crawl their homepage and download-url, if specified. Many package uploaders are not aware that specifying the "homepage" or "download-url" in their package metadata will needlessly slow down the installation process for all users.

Relying on third party sites also opens up more attack vectors for injecting malicious packages into sites using automated installs. A simple attack might just involve getting hold of an old now-unused homepage domain and placing malicious packages there. Moreover, performing a Man-in-The-Middle (MITM) attack between an installation site and any of the download sites can inject malicious packages on the installation site. As many homepages and download locations are using HTTP and not HTTPS, such attacks are not hard to launch. Such MITM attacks can easily happen even for packages which never intended to host files externally as their homepages are contacted by installers anyway.

There is currently no way for package maintainers to avoid external-link crawling, other than removing all homepage/download url metadata for all historic releases. While a script [4] has been written to perform this action, it is not a good general solution because it removes useful metadata from PyPI releases.

Even if the sites referenced by "Homepage" and "Download-URL" links were not scraped for further links, there is no obvious way under the current system for a package owner to link to an installable file from a long_description metadata field (which is shown as package documentation on /pypi/PKG) without installation tools automatically considering that file a candidate for installation. Conversely, there is no way to explicitly register multiple external release files without putting them in metadata fields.

Goals

These are the goals to be achieved by implementation of this PEP:

  • Package owners should be able to explicitly control which files are presented by PyPI to installer tools as installation candidates. Installation should not be slowed and made less reliable by extensive and unnecessary crawling of links that package owners did not explicitly nominate as installation files.
  • It should remain possible for package owners to choose to host their release files on their own hosting, external to PyPI. It should be easy for a user to request the installation of such releases using automated installer tools, especially if the external release files were registered together with a checksum hash.
  • Automated installer tools should not install externally-hosted packages by default, but require explicit authorization to do so by the user. When tools refuse to install such a package by default, they should tell the user exactly which external link(s) the installer needs to follow, and what option(s) the user can provide to authorize the tool to follow those links. PyPI should provide all necessary metadata for installer tools to implement this easily and within a single request/reply interaction.
  • Migration from the status quo to the above points should be gradual and minimize breakage. This includes tooling that makes it easy for package owners with an existing release process that uploads to non-PyPI hosting to also upload those release files to PyPI.

Solution / two transition phases

The first transition phase introduces a "hosting-mode" field for each project on PyPI, allowing package owners explicit control of which release file links are served to present-day installation tools in the machine-readable simple/ index. The first transition will, after successful hosting-mode manipulations by individual early-adopters, set a default hosting mode for existing packages, based on automated analysis. Maintainers will be notified one month ahead of any such automated change. At completion of the first transition phase, all present-day existing release and installation processes and tools are expected to continue working. Any remaining errors or problems are expected to only relate to installation of individual packages and can be easily corrected by package maintainers or PyPI admins if maintainers are not reachable.

Also in the first phase, each link served in the simple/ index will be explicitly marked as rel="internal" if it is hosted by the index itself (even if on a separate domain, which may be the case if the index uses a CDN for file-serving). Any link not so marked will be considered an external link.

In the second transition phase, PyPI client installation tools shall be updated to default to only install rel="internal" packages unless a user specifies option(s) to permit installing from external links. See second transition phase for details on how installers should behave.

Maintainers of packages which currently host release files on non-PyPI sites shall receive instructions and tools to ease "re-hosting" of their historic and future package release files. This re-hosting tool MUST be available before automated hosting-mode changes are announced to package maintainers.

Implementation

Hosting modes

The foundation of the first transition phase is the introduction of three "modes" of PyPI hosting for a package, affecting which links are generated for the simple/ index. These modes are implemented without requiring changes to installation tools via changes to the algorithm for generating the machine-readable simple/ index.

The modes are:

  • pypi-scrape-crawl: no change from the current situation of generating machine-readable links for installation tools, as outlined in the history.
  • pypi-scrape: for a package in this mode, links to be added to the simple/ index are still scraped from package metadata. However, the "Home-page" and "Download-url" links are given rel=ext-homepage and rel=ext-download attributes instead of rel=homepage and rel=download. The effect of this (with no change in installation tools necessary) is that these links will not be followed and scraped for further candidate links by present-day installation tools: only installable files directly hosted from PyPI or linked directly from PyPI metadata will be considered for installation. Installation tools MAY evolve to offer an option to use the new rel-attribution to crawl external pages but MUST NOT default to it.
  • pypi-explicit: for a package in this mode, only links to release files uploaded to PyPI, and external links to release files explicitly nominated by the package owner, will be added to the simple/ index. PyPI will provide a new interface for package owners to supply external release-file URLs. These URLs MUST include a URL fragment in the form "#hashtype=hashvalue" specifying a hash of the externally-linked file which installer tools MUST use to validate that they have downloaded the intended file.

Thus the hope is that eventually all projects on PyPI can be migrated to the pypi-explicit mode, while preserving the ability to install release files hosted externally via installer tools. Deprecation of hosting modes to eventually only allow the pypi-explicit mode is NOT REGULATED by this PEP but is expected to become feasible some time after successful implementation of the transition phases described in this PEP. It is expected that deprecation requires a new process to deal with abandoned packages because of unreachable maintainers for still popular packages.
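The "#hashtype=hashvalue" verification required of installers in pypi-explicit mode can be sketched as follows; `verify_download` is a hypothetical name for illustration, not an API from any installer.

```python
import hashlib
from urllib.parse import urldefrag

def verify_download(url, payload):
    """Check downloaded bytes against the "#hashtype=hashvalue"
    fragment carried by an explicitly registered external URL."""
    _, frag = urldefrag(url)            # e.g. "md5=3e25960a79dbc69..."
    hashtype, _, expected = frag.partition("=")
    digest = hashlib.new(hashtype, payload).hexdigest()
    return digest == expected
```

Because the hash travels with the URL in the PyPI index (served over HTTPS), a tampered external download fails this check even when the external host itself is compromised.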

First transition phase (PyPI)

The proposed solution consists of multiple implementation and communication steps:

  1. Implement in PyPI the three modes described above, with an interface for package owners to select the mode for each package and register explicit external file URLs.
  2. For packages in all modes, label links in the simple/ index to index-hosted files with rel="internal", to make it easier for client tools to distinguish these links in the second phase.
  3. Add an HTML tag <meta name="api-version" value="2"> to all simple/ index pages, to allow clients to distinguish between indexes providing the rel="internal" metadata and older ones that do not.
  4. Default all newly-registered packages to pypi-explicit mode (package owners can still switch to the other modes as desired).
  5. Determine (via automated analysis [2]) which packages have all installable files available on PyPI itself (group A), which have all installable files on PyPI or linked directly from PyPI metadata (group B), and which have installable versions available that are linked only from external homepage/download HTML pages (group C).
  6. Send mail to maintainers of projects in group A that their project will be automatically configured to pypi-explicit mode in one month, and similarly to maintainers of projects in group B that their project will be automatically configured to pypi-scrape mode. Inform them that this change is not expected to affect installability of their project at all, but will result in faster and safer installs for their users. Encourage them to set this mode themselves sooner to benefit their users.
  7. Send mail to maintainers of packages in group C that their package hosting mode is pypi-scrape-crawl, list the URLs which currently are crawled, and suggest that they either re-host their packages directly on PyPI and switch to pypi-explicit, or at least provide direct links to release files in PyPI metadata and switch to pypi-scrape. Provide instructions and tools to help with these transitions.

Second transition phase (installer tools)

For the second transition phase, maintainers of installation tools are asked to release two updates.

The first update shall provide clear warnings if externally-hosted release files (that is, files whose link does not include rel="internal") are selected for download, for which projects and URLs exactly this happens, and warn that in future versions externally-hosted downloads will be disabled by default.

The second update should change the default mode to allow only installation of rel="internal" package files, and allow installation of externally-hosted packages only when the user supplies an option.

The installer should distinguish between verifiable and non-verifiable external links. A verifiable external link is a direct link to an installable file from the PyPI simple/ index that includes a hash in the URL fragment ("#hashtype=hashvalue") which can be used to verify the integrity of the downloaded file. A non-verifiable external link is any link (other than those explicitly supplied by the user of an installer tool) without a hash, scraped from external HTML, or injected into the search via some other non-PyPI source (e.g. setuptools' dependency_links feature).

Installers should provide a blanket option to allow installing any verifiable external link. Non-verifiable external links should only be installed if the user-provided option specifies exactly which external domains can be used or for which specific package names external links can be used.

When download of an externally-hosted package is disallowed by the default configuration, the user should be notified, with instructions for how to make the install succeed and warnings about the implication (that a file will be downloaded from a site that is not part of the package index). The warning given for non-verifiable links should clearly state that the installer cannot verify the integrity of the downloaded file. The warning given for verifiable external links should simply note that the file will be downloaded from an external URL, but that the file integrity can be verified by checksum.

Alternative PyPI-compatible index implementations should upgrade to begin providing the rel="internal" metadata and the <meta name="api-version" value="2"> tag as soon as possible. For alternative indexes which do not yet provide the meta tag in their simple/ pages, installation tools should provide backwards-compatible fallback behavior (treat links as internal as in pre-PEP times and provide a warning).
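The client-side classification, including the backwards-compatible fallback for indexes without the meta tag, can be sketched with the standard-library HTML parser. The class and function names here are illustrative, not from any real installer.

```python
from html.parser import HTMLParser

class SimpleIndexParser(HTMLParser):
    """Collect the api-version meta tag and (href, rel) link pairs
    from a simple/ index page."""
    def __init__(self):
        super().__init__()
        self.api_version = None
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and attrs.get("name") == "api-version":
            self.api_version = attrs.get("value")
        elif tag == "a" and "href" in attrs:
            self.links.append((attrs["href"], attrs.get("rel")))

def internal_links(page):
    parser = SimpleIndexParser()
    parser.feed(page)
    if parser.api_version is None:
        # Pre-PEP index: fall back to treating all links as internal
        # (a real installer would also emit a warning here).
        return [href for href, rel in parser.links]
    return [href for href, rel in parser.links if rel == "internal"]
```

With the api-version 2 tag present, only rel="internal" links survive; without it, the pre-PEP behavior is preserved.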

API For Submitting External Distribution URLs

New distribution URLs may be submitted by performing a HTTP POST to the URL:

https://pypi.python.org/pypi

With the following form-encoded data:

Name Value
:action The string "urls"
name The package name as a string
version The release version as a string
new-url The new URL to store
submit_new_url The string "yes"

The POST must be accompanied by an HTTP Basic Auth header encoding the username and password of the user authorized to maintain the package on PyPI.

The HTTP response to this request will be one of:

Code Meaning URL submission implications
200 OK Everything worked just fine
400 Bad request Data provided for submission was malformed
401 Unauthorised The username or password supplied were incorrect
403 Forbidden User does not have permission to update the package information (not Owner or Maintainer)
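The request described above can be sketched with the standard library. build_submit_request is a hypothetical helper, not an official client; a caller would pass its result to urllib.request.urlopen and interpret the status code according to the table above.

```python
import base64
import urllib.parse
import urllib.request

def build_submit_request(name, version, new_url, username, password):
    """Build the form-encoded POST for submitting a distribution URL."""
    # Form fields exactly as listed in the table above
    form = urllib.parse.urlencode({
        ":action": "urls",
        "name": name,
        "version": version,
        "new-url": new_url,
        "submit_new_url": "yes",
    }).encode("ascii")
    # HTTP Basic Auth header for the package's Owner or Maintainer
    credentials = base64.b64encode(
        "{}:{}".format(username, password).encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        "https://pypi.python.org/pypi",
        data=form,
        headers={"Authorization": "Basic " + credentials},
    )
```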

References

[1]Phillip Eby, easy_install 'Package Index "API"' documentation, http://peak.telecommunity.com/DevCenter/EasyInstall#package-index-api
[2]Donald Stufft, automated analysis of PyPI project links, https://github.com/dstufft/pypi.linkcheck
[3]Marc-Andre Lemburg, reasons for external hosting, http://mail.python.org/pipermail/catalog-sig/2013-March/005626.html
[4]Holger Krekel, script to remove homepage/download metadata for all releases, http://mail.python.org/pipermail/catalog-sig/2013-February/005423.html

Acknowledgments

Phillip Eby for precise information and the basic ideas to implement the transition via server-side changes only.

Donald Stufft for pushing away from external hosting and offering to implement both a Pull Request for the necessary PyPI changes and the analysis tool to drive the transition phase 1.

Marc-Andre Lemburg, Nick Coghlan and catalog-sig in general for thinking through issues regarding getting rid of "external hosting".

pep-0439 Inclusion of implicit pip bootstrap in Python installation

PEP:439
Title:Inclusion of implicit pip bootstrap in Python installation
Version:$Revision$
Last-Modified:$Date$
Author:Richard Jones <richard at python.org>
BDFL-Delegate:Nick Coghlan <ncoghlan@gmail.com>
Discussions-To:<distutils-sig at python.org>
Status:Rejected
Type:Standards Track
Content-Type:text/x-rst
Created:18-Mar-2013
Python-Version:3.4
Post-History:19-Mar-2013
Resolution:http://mail.python.org/pipermail/distutils-sig/2013-August/022527.html

Abstract

This PEP proposes the inclusion of a pip bootstrap executable in the Python installation to simplify the use of 3rd-party modules by Python users.

This PEP does not propose to include the pip implementation in the Python standard library. Nor does it propose to implement any package management or installation mechanisms beyond those provided by PEP 427 ("The Wheel Binary Package Format 1.0") and TODO distlib PEP.

PEP Rejection

This PEP has been rejected in favour of a more explicit mechanism that should achieve the same end result in a more reliable fashion. The more explicit bootstrapping mechanism is described in PEP 453.

Rationale

Currently the user story for installing 3rd-party Python modules is not as simple as it could be. It requires that all 3rd-party modules inform the user of how to install the installer, typically via a link to the installer. That link may be out of date or the steps required to perform the install of the installer may be enough of a roadblock to prevent the user from further progress.

Large Python projects which emphasise a low barrier to entry have shied away from depending on third party packages because of the introduction of this potential stumbling block for new users.

With the inclusion of the package installer command in the standard Python installation the barrier to installing additional software is considerably reduced. It is hoped that this will therefore increase the likelihood that Python projects will reuse third party software.

The Python community also has an issue of complexity around the current bootstrap procedure for pip and setuptools. They all have their own bootstrap download file with slightly different usages and even refer to each other in some cases. Having a single bootstrap which is common amongst them all, with a simple usage, would be far preferable.

It is also hoped that this reduces the number of proposals to include more and more software in the Python standard library, so that popular Python software remains easily upgradeable without requiring Python installation upgrades.

Proposal

The bootstrap will install the pip implementation and setuptools by downloading their installation files from PyPI.

This proposal affects two components of packaging: the pip bootstrap and, thanks to easier package installation, modifications to publishing packages.

The core of this proposal is that the user experience of using pip should not require the user to install pip.

The pip bootstrap

The Python installation includes an executable called "pip3" (see PEP 394 for naming rationale etc.) that attempts to import pip machinery. If it can, the pip command proceeds as normal. If it cannot, it will bootstrap pip by downloading the pip implementation and setuptools wheel files. Hereafter the installation of the "pip implementation" will imply installation of setuptools and virtualenv. Once installed, the pip command proceeds as normal. Once the bootstrap process is complete, the "pip3" command is no longer the bootstrap but rather the full pip command.

A bootstrap is used in place of the full pip code so that we don't have to bundle pip, and so that pip remains upgradeable outside of the regular Python upgrade timeframe and processes.

To avoid issues with sudo we will have the bootstrap default to installing the pip implementation to the per-user site-packages directory defined in PEP 370 and implemented in Python 2.6/3.0. Since we avoid installing to the system Python we also avoid conflicting with any other packaging system (on Linux systems, for example.) If the user is inside a virtual environment [1] then the pip implementation will be installed into that virtual environment.

The bootstrap process will proceed as follows:

  1. The user system has Python (3.4+) installed. In the "scripts" directory of the Python installation there is the bootstrap script called "pip3".
  2. The user will invoke a pip command, typically "pip3 install <package>", for example "pip3 install Django".
  3. The bootstrap script will attempt to import the pip implementation. If this succeeds, the pip command is processed normally. Stop.
  4. On failing to import the pip implementation, the bootstrap notifies the user that it needs to "install pip". It will ask the user whether it should install pip into the system-wide site-packages directory or as a user-only package. This choice will also be present as a command-line option to pip so non-interactive use is possible.
  5. The bootstrap will then contact PyPI to obtain the latest download wheel file (see PEP 427.)
  6. Upon downloading the file it is installed using "python setup.py install".
  7. The pip tool may now import the pip implementation and continues to process the requested user command normally.
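The control flow of the steps above can be sketched as follows. run_pip3 and its callable parameters are illustrative stand-ins rather than actual pip internals; the real bootstrap would perform the download and installation in steps 4-6.

```python
def run_pip3(argv, import_pip, bootstrap, run):
    """Sketch of the bootstrap control flow described above.

    import_pip, bootstrap and run stand in for the real machinery:
    import_pip() returns the pip module or raises ImportError,
    bootstrap(argv) downloads and installs the wheel (steps 4-6), and
    run(pip, argv) dispatches the user's command (steps 3 and 7).
    """
    try:
        pip = import_pip()      # step 3: is the implementation installed?
    except ImportError:
        bootstrap(argv)         # steps 4-6: fetch the wheel and install it
        pip = import_pip()      # the implementation is now importable
    return run(pip, argv)       # step 7: process the command normally
```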

Users may be running in an environment which cannot access the public Internet and are relying solely on a local package repository. They would use the "-i" (Base URL of Python Package Index) argument to the "pip3 install" command. This simply overrides the default index URL pointing to PyPI.

Some users may have no Internet access suitable for fetching the pip implementation file. These users can manually download and install the setuptools and pip tar files. Adding specific support for this use-case is unnecessary.

The download of the pip implementation install file will be performed securely. The transport from pypi.python.org will be done over HTTPS with the CA certificate check performed. This facility will be present in Python 3.4+ using Operating System certificates (see PEP XXXX).

Beyond those arguments controlling index location and download options, the "pip3" bootstrap command may support further standard pip options for verbosity, quietness and logging.

The "pip3" command will support two new command-line options that are used in the bootstrapping, and otherwise ignored. They control where the pip implementation is installed:

--bootstrap Install to the user's packages directory. The name of this option is chosen to promote it as the preferred installation option.
--bootstrap-to-system
 Install to the system site-packages directory.

These command-line options will also need to be implemented, but otherwise ignored, in the pip implementation.

Consideration should be given to defaulting pip to install packages to the user's packages directory if pip is installed in that location.

The "--no-install" option to the "pip3" command will not affect the bootstrapping process.

Modifications to publishing packages

An additional new Python package is proposed, "pypublish", which will be a tool for publishing packages to PyPI. It would replace the current "python setup.py register" and "python setup.py upload" distutils commands. Again because of the measured Python release cycle and extensive existing Python installations these commands are difficult to bugfix and extend. Additionally it is desired that the "register" and "upload" commands be able to be performed over HTTPS with certificate validation. Since shipping CA certificate keychains with Python is not really feasible (updating the keychain is quite difficult to manage) it is desirable that those commands, and the accompanying keychain, be made installable and upgradeable outside of Python itself.

The existing distutils mechanisms for package registration and upload would remain, though with a deprecation warning.

Implementation

The changes to pip required by this PEP are being tracked in that project's issue tracker [2]. Most notably, the addition of --bootstrap and --bootstrap-to-system to the pip command-line.

It would be preferable that the pip and setuptools projects distribute a wheel format download.

The required code for this implementation is the "pip3" command described above. The additional pypublish can be developed outside of the scope of this PEP's work.

Finally, it would be desirable that "pip3" be ported to Python 2.6+ to allow the single command to replace existing pip, setuptools and virtualenv (which would be added to the bootstrap) bootstrap scripts. Having that bootstrap included in a future Python 2.7 release would also be highly desirable.

Risks

The key that is used to sign the pip implementation download might be compromised and this PEP currently proposes no mechanism for key revocation.

There is a Perl package installer also named "pip". It is quite rare and not commonly used. The Fedora variant of Linux has historically named Python's "pip" as "python-pip" and Perl's "pip" as "perl-pip". This policy has been altered [3] so that future and upgraded Fedora installations will use the name "pip" for Python's "pip". Existing (non-upgraded) installations will still have the old name for the Python "pip", though the potential for confusion is now much reduced.

References

[1]PEP 405, Python Virtual Environments http://www.python.org/dev/peps/pep-0405/
[2]pip issue tracking work needed for this PEP https://github.com/pypa/pip/issues/863
[3]Fedora's python-pip package does not provide /usr/bin/pip https://bugzilla.redhat.com/show_bug.cgi?id=958377

Acknowledgments

Nick Coghlan for his thoughts on the proposal and dealing with the Red Hat issue.

Jannis Leidel and Carl Meyer for their thoughts. Marcus Smith for feedback.

Marcela Mašláňová for resolving the Fedora issue.

pep-0440 Version Identification and Dependency Specification

PEP:440
Title:Version Identification and Dependency Specification
Version:$Revision$
Last-Modified:$Date$
Author:Nick Coghlan <ncoghlan at gmail.com>, Donald Stufft <donald at stufft.io>
BDFL-Delegate:Nick Coghlan <ncoghlan@gmail.com>
Discussions-To:Distutils SIG <distutils-sig at python.org>
Status:Accepted
Type:Informational
Content-Type:text/x-rst
Created:18 Mar 2013
Post-History:30 Mar 2013, 27 May 2013, 20 Jun 2013, 21 Dec 2013, 28 Jan 2014, 08 Aug 2014, 22 Aug 2014
Replaces:386
Resolution:https://mail.python.org/pipermail/distutils-sig/2014-August/024673.html

Contents

Abstract

This PEP describes a scheme for identifying versions of Python software distributions, and declaring dependencies on particular versions.

This document addresses several limitations of the previous attempt at a standardized approach to versioning, as described in PEP 345 and PEP 386.

Definitions

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119.

The following terms are to be interpreted as described in PEP 426:

  • "Distributions"
  • "Releases"
  • "Build tools"
  • "Index servers"
  • "Publication tools"
  • "Installation tools"
  • "Automated tools"
  • "Projects"

Version scheme

Distributions are identified by a public version identifier which supports all defined version comparison operations.

The version scheme is used both to describe the distribution version provided by a particular distribution archive, as well as to place constraints on the version of dependencies needed in order to build or run the software.

Public version identifiers

The canonical public version identifiers MUST comply with the following scheme:

[N!]N(.N)*[{a|b|rc}N][.postN][.devN]

Public version identifiers MUST NOT include leading or trailing whitespace.

Public version identifiers MUST be unique within a given distribution.

Installation tools SHOULD ignore any public versions which do not comply with this scheme but MUST also include the normalizations specified below. Installation tools MAY warn the user when non-compliant or ambiguous versions are detected.

Public version identifiers are separated into up to five segments:

  • Epoch segment: N!
  • Release segment: N(.N)*
  • Pre-release segment: {a|b|rc}N
  • Post-release segment: .postN
  • Development release segment: .devN

Any given release will be a "final release", "pre-release", "post-release" or "developmental release" as defined in the following sections.

All numeric components MUST be non-negative integers.

All numeric components MUST be interpreted and ordered according to their numeric value, not as text strings.

All numeric components MAY be zero. Except as described below for the release segment, a numeric component of zero has no special significance aside from always being the lowest possible value in the version ordering.
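The canonical scheme can be transcribed directly into a regular expression. The sketch below checks canonical form only; it deliberately does not accept the alternative spellings and separators covered under Normalization later in this PEP, and it omits local version identifiers.

```python
import re

# Direct transcription of [N!]N(.N)*[{a|b|rc}N][.postN][.devN]
CANONICAL = re.compile(
    r"^(?:(?P<epoch>\d+)!)?"        # epoch segment: N!
    r"(?P<release>\d+(?:\.\d+)*)"   # release segment: N(.N)*
    r"(?P<pre>(?:a|b|rc)\d+)?"      # pre-release segment: {a|b|rc}N
    r"(?P<post>\.post\d+)?"         # post-release segment: .postN
    r"(?P<dev>\.dev\d+)?$"          # development release segment: .devN
)

def is_canonical(version):
    """Return True if the version is in canonical public form."""
    return CANONICAL.match(version) is not None
```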

Note

Some hard to read version identifiers are permitted by this scheme in order to better accommodate the wide range of versioning practices across existing public and private Python projects.

Accordingly, some of the versioning practices which are technically permitted by the PEP are strongly discouraged for new projects. Where this is the case, the relevant details are noted in the following sections.

Local version identifiers

Local version identifiers MUST comply with the following scheme:

<public version identifier>[+<local version label>]

They consist of a normal public version identifier (as defined in the previous section), along with an arbitrary "local version label", separated from the public version identifier by a plus. Local version labels have no specific semantics assigned, but some syntactic restrictions are imposed.

Local version identifiers are used to denote fully API (and, if applicable, ABI) compatible patched versions of upstream projects. For example, these may be created by application developers and system integrators by applying specific backported bug fixes when upgrading to a new upstream release would be too disruptive to the application or other integrated system (such as a Linux distribution).

The inclusion of the local version label makes it possible to differentiate upstream releases from potentially altered rebuilds by downstream integrators. The use of a local version identifier does not affect the kind of a release but, when applied to a source distribution, does indicate that it may not contain the exact same code as the corresponding upstream release.

To ensure local version identifiers can be readily incorporated as part of filenames and URLs, and to avoid formatting inconsistencies in hexadecimal hash representations, local version labels MUST be limited to the following set of permitted characters:

  • ASCII letters ([a-zA-Z])
  • ASCII digits ([0-9])
  • periods (.)

Local version labels MUST start and end with an ASCII letter or digit.

Comparison and ordering of local versions considers each segment of the local version (divided by a .) separately. If a segment consists entirely of ASCII digits then that segment should be considered an integer for comparison purposes, and if a segment contains any ASCII letters then that segment is compared lexicographically with case insensitivity. When comparing a numeric and lexicographic segment, the numeric segment always compares as greater than the lexicographic segment. Additionally a local version with a greater number of segments will always compare as greater than a local version with fewer segments, as long as the shorter local version's segments match the beginning of the longer local version's segments exactly.
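These rules can be captured in a small sort key; local_version_key is a hypothetical helper name. Representing each segment as a (kind, value) pair makes numeric segments (kind 1) outrank lexicographic ones (kind 0), and Python's tuple comparison handles the shorter-prefix rule.

```python
def local_version_key(label):
    """Sort key implementing the local version ordering rules above."""
    key = []
    for segment in label.split("."):
        if segment.isdigit():
            # numeric segments compare by value and always outrank
            # lexicographic segments, hence the leading 1
            key.append((1, int(segment)))
        else:
            # lexicographic segments compare case-insensitively
            key.append((0, segment.lower()))
    return tuple(key)
```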

An "upstream project" is a project that defines its own public versions. A "downstream project" is one which tracks and redistributes an upstream project, potentially backporting security and bug fixes from later versions of the upstream project.

Local version identifiers SHOULD NOT be used when publishing upstream projects to a public index server, but MAY be used to identify private builds created directly from the project source. Local version identifiers SHOULD be used by downstream projects when releasing a version that is API compatible with the version of the upstream project identified by the public version identifier, but contains additional changes (such as bug fixes). As the Python Package Index is intended solely for indexing and hosting upstream projects, it MUST NOT allow the use of local version identifiers.

Source distributions using a local version identifier SHOULD provide the python.integrator extension metadata (as defined in PEP 459).

Final releases

A version identifier that consists solely of a release segment and optionally an epoch identifier is termed a "final release".

The release segment consists of one or more non-negative integer values, separated by dots:

N(.N)*

Final releases within a project MUST be numbered in a consistently increasing fashion, otherwise automated tools will not be able to upgrade them correctly.

Comparison and ordering of release segments considers the numeric value of each component of the release segment in turn. When comparing release segments with different numbers of components, the shorter segment is padded out with additional zeros as necessary.

While any number of additional components after the first are permitted under this scheme, the most common variants are to use two components ("major.minor") or three components ("major.minor.micro").

For example:

0.9
0.9.1
0.9.2
...
0.9.10
0.9.11
1.0
1.0.1
1.1
2.0
2.0.1
...

A release series is any set of final release numbers that start with a common prefix. For example, 3.3.1, 3.3.5 and 3.3.9.45 are all part of the 3.3 release series.

Note

X.Y and X.Y.0 are not considered distinct release numbers, as the release segment comparison rules implicitly expand the two component form to X.Y.0 when comparing it to any release segment that includes three components.

Date based release segments are also permitted. An example of a date based release scheme using the year and month of the release:

2012.04
2012.07
2012.10
2013.01
2013.06
...

Pre-releases

Some projects use an "alpha, beta, release candidate" pre-release cycle to support testing by their users prior to a final release.

If used as part of a project's development cycle, these pre-releases are indicated by including a pre-release segment in the version identifier:

X.YaN   # Alpha release
X.YbN   # Beta release
X.YrcN  # Release Candidate
X.Y     # Final release

A version identifier that consists solely of a release segment and a pre-release segment is termed a "pre-release".

The pre-release segment consists of an alphabetical identifier for the pre-release phase, along with a non-negative integer value. Pre-releases for a given release are ordered first by phase (alpha, beta, release candidate) and then by the numerical component within that phase.

Installation tools MAY accept both c and rc releases for a common release segment in order to handle some existing legacy distributions.

Installation tools SHOULD interpret c versions as being equivalent to rc versions (that is, c1 indicates the same version as rc1).

Build tools, publication tools and index servers SHOULD disallow the creation of both rc and c releases for a common release segment.

Post-releases

Some projects use post-releases to address minor errors in a final release that do not affect the distributed software (for example, correcting an error in the release notes).

If used as part of a project's development cycle, these post-releases are indicated by including a post-release segment in the version identifier:

X.Y.postN    # Post-release

A version identifier that includes a post-release segment without a developmental release segment is termed a "post-release".

The post-release segment consists of the string .post, followed by a non-negative integer value. Post-releases are ordered by their numerical component, immediately following the corresponding release, and ahead of any subsequent release.

Note

The use of post-releases to publish maintenance releases containing actual bug fixes is strongly discouraged. In general, it is better to use a longer release number and increment the final component for each maintenance release.

Post-releases are also permitted for pre-releases:

X.YaN.postM   # Post-release of an alpha release
X.YbN.postM   # Post-release of a beta release
X.YrcN.postM  # Post-release of a release candidate

Note

Creating post-releases of pre-releases is strongly discouraged, as it makes the version identifier difficult to parse for human readers. In general, it is substantially clearer to simply create a new pre-release by incrementing the numeric component.

Developmental releases

Some projects make regular developmental releases, and system packagers (especially for Linux distributions) may wish to create early releases directly from source control which do not conflict with later project releases.

If used as part of a project's development cycle, these developmental releases are indicated by including a developmental release segment in the version identifier:

X.Y.devN    # Developmental release

A version identifier that includes a developmental release segment is termed a "developmental release".

The developmental release segment consists of the string .dev, followed by a non-negative integer value. Developmental releases are ordered by their numerical component, immediately before the corresponding release (and before any pre-releases with the same release segment), and following any previous release (including any post-releases).

Developmental releases are also permitted for pre-releases and post-releases:

X.YaN.devM       # Developmental release of an alpha release
X.YbN.devM       # Developmental release of a beta release
X.YrcN.devM      # Developmental release of a release candidate
X.Y.postN.devM   # Developmental release of a post-release

Note

While they may be useful for continuous integration purposes, publishing developmental releases of pre-releases to general purpose public index servers is strongly discouraged, as it makes the version identifier difficult to parse for human readers. If such a release needs to be published, it is substantially clearer to instead create a new pre-release by incrementing the numeric component.

Developmental releases of post-releases are also strongly discouraged, but they may be appropriate for projects which use the post-release notation for full maintenance releases which may include code changes.

Version epochs

If included in a version identifier, the epoch appears before all other components, separated from the release segment by an exclamation mark:

E!X.Y  # Version identifier with epoch

If no explicit epoch is given, the implicit epoch is 0.

Most version identifiers will not include an epoch, as an explicit epoch is only needed if a project changes the way it handles version numbering in a way that means the normal version ordering rules will give the wrong answer. For example, if a project is using date based versions like 2014.04 and would like to switch to semantic versions like 1.0, then the new releases would be identified as older than the date based releases when using the normal sorting scheme:

1.0
1.1
2.0
2013.10
2014.04

However, by specifying an explicit epoch, the sort order can be changed appropriately, as all versions from a later epoch are sorted after versions from an earlier epoch:

2013.10
2014.04
1!1.0
1!1.1
1!2.0
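The epoch rule can be sketched as a sort key over (epoch, release segment); epoch_key is an illustrative helper that handles only final releases, not the full identifier grammar.

```python
def epoch_key(version):
    """Split a final release into (epoch, release tuple) for sorting."""
    if "!" in version:
        epoch, _, rest = version.partition("!")
        epoch = int(epoch)
    else:
        epoch, rest = 0, version   # the implicit epoch is 0
    return epoch, tuple(int(part) for part in rest.split("."))

# Reproduces the ordering shown above: later epochs sort last
ordered = sorted(["1!1.0", "2013.10", "1!2.0", "2014.04", "1!1.1"],
                 key=epoch_key)
```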

Normalization

In order to maintain better compatibility with existing versions there are a number of "alternative" syntaxes that MUST be taken into account when parsing versions. These syntaxes MUST be considered when parsing a version; however, they should be "normalized" to the standard syntax defined above.

Case sensitivity

All ASCII letters should be interpreted case insensitively within a version and the normal form is lowercase. This allows versions such as 1.1RC1 which would be normalized to 1.1rc1.

Integer Normalization

All integers are interpreted via the int() built-in and normalize to the string form of the output. This means that an integer version of 00 would normalize to 0 while 09000 would normalize to 9000. This does not hold true for integers inside of an alphanumeric segment of a local version such as 1.0+foo0100 which is already in its normalized form.

Pre-release separators

Pre-releases should allow a ., -, or _ separator between the release segment and the pre-release segment. The normal form for this is without a separator. This allows versions such as 1.1.a1 or 1.1-a1 which would be normalized to 1.1a1. It should also allow a separator to be used between the pre-release signifier and the numeral. This allows versions such as 1.0a.1 which would be normalized to 1.0a1.

Pre-release spelling

Pre-releases allow the additional spellings of alpha, beta, c, pre, and preview for a, b, rc, rc, and rc respectively. This allows versions such as 1.1alpha1, 1.1beta2, or 1.1c3 which normalize to 1.1a1, 1.1b2, and 1.1rc3. In every case the additional spellings should be considered equivalent to their normal forms.

Implicit pre-release number

Pre releases allow omitting the numeral in which case it is implicitly assumed to be 0. The normal form for this is to include the 0 explicitly. This allows versions such as 1.2a which is normalized to 1.2a0.

Post release separators

Post releases allow a ., -, or _ separator as well as omitting the separator altogether. The normal form of this is with the . separator. This allows versions such as 1.2-post2 or 1.2post2 which normalize to 1.2.post2. Like the pre-release separator, this also allows an optional separator between the post release signifier and the numeral. This allows versions like 1.2.post-2 which would normalize to 1.2.post2.

Post release spelling

Post-releases allow the additional spellings of rev and r. This allows versions such as 1.0-r4 which normalizes to 1.0.post4. As with the pre-releases the additional spellings should be considered equivalent to their normal forms.

Implicit post release number

Post releases allow omitting the numeral in which case it is implicitly assumed to be 0. The normal form for this is to include the 0 explicitly. This allows versions such as 1.2.post which is normalized to 1.2.post0.

Implicit post releases

Post releases allow omitting the post signifier all together. When using this form the separator MUST be - and no other form is allowed. This allows versions such as 1.0-1 to be normalized to 1.0.post1. This particular normalization MUST NOT be used in conjunction with the implicit post release number rule. In other words 1.0- is not a valid version and it does not normalize to 1.0.post0.

Development release separators

Development releases allow a ., -, or a _ separator as well as omitting the separator altogether. The normal form of this is with the . separator. This allows versions such as 1.2-dev2 or 1.2dev2 which normalize to 1.2.dev2.

Implicit development release number

Development releases allow omitting the numeral in which case it is implicitly assumed to be 0. The normal form for this is to include the 0 explicitly. This allows versions such as 1.2.dev which is normalized to 1.2.dev0.

Local version segments

With a local version, in addition to the use of . as a separator of segments, the use of - and _ is also acceptable. The normal form is using the . character. This allows versions such as 1.0+ubuntu-1 to be normalized to 1.0+ubuntu.1.

Preceding v character

In order to support the common version notation of v1.0 versions may be preceded by a single literal v character. This character MUST be ignored for all purposes and should be omitted from all normalized forms of the version. The same version with and without the v is considered equivalent.

Leading and Trailing Whitespace

Leading and trailing whitespace must be silently ignored and removed from all normalized forms of a version. This includes " ", \t, \n, \r, \f, and \v. This allows accidental whitespace to be handled sensibly, such as a version like 1.0\n which normalizes to 1.0.
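Several of the rules above can be combined into a single parse-and-rebuild step. The sketch below is partial and illustrative: it handles case folding, the alternative pre/post/dev spellings and separators, implicit numbers, integer normalization, the preceding v, local version separators and surrounding whitespace, but deliberately omits implicit post releases of the form 1.0-1.

```python
import re

# Alternative pre-release spellings mapped to their normal forms
_PRE = {"alpha": "a", "beta": "b", "c": "rc", "pre": "rc", "preview": "rc",
        "a": "a", "b": "b", "rc": "rc"}

_PARSE = re.compile(
    r"^(?:(?P<epoch>\d+)!)?"
    r"(?P<release>\d+(?:\.\d+)*)"
    r"(?:[-_.]?(?P<pre_l>alpha|beta|preview|pre|a|b|c|rc)[-_.]?(?P<pre_n>\d*))?"
    r"(?:[-_.]?(?:post|rev|r)[-_.]?(?P<post_n>\d*))?"
    r"(?:[-_.]?dev[-_.]?(?P<dev_n>\d*))?"
    r"(?:\+(?P<local>[a-z0-9]+(?:[-_.][a-z0-9]+)*))?$"
)

def normalize(version):
    """Partial sketch of the normalization rules in this section."""
    v = version.strip().lower()    # whitespace, case sensitivity
    if v.startswith("v"):          # a single preceding 'v' is ignored
        v = v[1:]
    m = _PARSE.match(v)
    if m is None:
        raise ValueError("unparseable version: %r" % (version,))
    out = []
    if m["epoch"]:
        out.append("%d!" % int(m["epoch"]))
    # integer normalization via int(), e.g. 09000 -> 9000
    out.append(".".join(str(int(n)) for n in m["release"].split(".")))
    if m["pre_l"]:
        out.append(_PRE[m["pre_l"]] + str(int(m["pre_n"] or "0")))
    if m["post_n"] is not None:
        out.append(".post" + str(int(m["post_n"] or "0")))
    if m["dev_n"] is not None:
        out.append(".dev" + str(int(m["dev_n"] or "0")))
    if m["local"]:
        out.append("+" + m["local"].replace("-", ".").replace("_", "."))
    return "".join(out)
```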

Examples of compliant version schemes

The standard version scheme is designed to encompass a wide range of identification practices across public and private Python projects. In practice, a single project attempting to use the full flexibility offered by the scheme would create a situation where human users had difficulty figuring out the relative order of versions, even though the rules above ensure all compliant tools will order them consistently.

The following examples illustrate a small selection of the different approaches projects may choose to identify their releases, while still ensuring that the "latest release" and the "latest stable release" can be easily determined, both by human users and automated tools.

Simple "major.minor" versioning:

0.1
0.2
0.3
1.0
1.1
...

Simple "major.minor.micro" versioning:

1.1.0
1.1.1
1.1.2
1.2.0
...

"major.minor" versioning with alpha, beta and candidate pre-releases:

0.9
1.0a1
1.0a2
1.0b1
1.0rc1
1.0
1.1a1
...

"major.minor" versioning with developmental releases, release candidates and post-releases for minor corrections:

0.9
1.0.dev1
1.0.dev2
1.0.dev3
1.0.dev4
1.0c1
1.0c2
1.0
1.0.post1
1.1.dev1
...

Date based releases, using an incrementing serial within each year, skipping zero:

2012.1
2012.2
2012.3
...
2012.15
2013.1
2013.2
...

Summary of permitted suffixes and relative ordering

Note

This section is intended primarily for authors of tools that automatically process distribution metadata, rather than developers of Python distributions deciding on a versioning scheme.

The epoch segment of version identifiers MUST be sorted according to the numeric value of the given epoch. If no epoch segment is present, the implicit numeric value is 0.

The release segment of version identifiers MUST be sorted in the same order as Python's tuple sorting when the normalized release segment is parsed as follows:

tuple(map(int, release_segment.split(".")))

All release segments involved in the comparison MUST be converted to a consistent length by padding shorter segments with zeros as needed.
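The parsing and padding rules above can be sketched as follows (the helper name is illustrative, not part of the specification):

```python
def release_key(release_segment, length):
    """Parse a normalized release segment into Python's tuple sort
    order, zero padding it to the requested common length (sketch)."""
    parts = list(map(int, release_segment.split(".")))
    return tuple(parts + [0] * (length - len(parts)))

# "1.1" and "1.1.0" compare equal once padded to the same length:
a, b = "1.1", "1.1.0"
n = max(a.count(".") + 1, b.count(".") + 1)
print(release_key(a, n) == release_key(b, n))  # True
```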

Within a numeric release (1.0, 2.7.3), the following suffixes are permitted and MUST be ordered as shown:

.devN, aN, bN, rcN, <no suffix>, .postN

Note that c is considered to be semantically equivalent to rc and must be sorted as if it were rc. Tools MAY reject as ambiguous the case where the same N appears for both a c and an rc suffix in the same release segment, and still remain in compliance with this PEP.

Within an alpha (1.0a1), beta (1.0b1), or release candidate (1.0rc1, 1.0c1), the following suffixes are permitted and MUST be ordered as shown:

.devN, <no suffix>, .postN

Within a post-release (1.0.post1), the following suffixes are permitted and MUST be ordered as shown:

.devN, <no suffix>

Note that devN and postN MUST always be preceded by a dot, even when used immediately following a numeric version (e.g. 1.0.dev456, 1.0.post1).

Within a pre-release, post-release or development release segment with a shared prefix, ordering MUST be by the value of the numeric component.

The following example covers many of the possible combinations:

1.0.dev456
1.0a1
1.0a2.dev456
1.0a12.dev456
1.0a12
1.0b1.dev456
1.0b2
1.0b2.post345.dev456
1.0b2.post345
1.0rc1.dev456
1.0rc1
1.0
1.0+abc.5
1.0+abc.7
1.0+5
1.0.post456.dev34
1.0.post456
1.1.dev1
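The ordering above can be reproduced for this subset of combinations with a simplified sort key. This is an illustrative sketch only, not a complete implementation of this PEP: epochs, most normalization rules, and zero padding of release segments are omitted, and the function and rank values are invented for the example:

```python
import re

_PRE_RANK = {"a": 0, "b": 1, "c": 2, "rc": 2}  # c sorts as rc

def sort_key(version):
    """Simplified PEP-style sort key (sketch, see caveats above)."""
    m = re.match(
        r"^(\d+(?:\.\d+)*)"          # release segment
        r"(?:(a|b|c|rc)(\d+))?"      # pre-release suffix
        r"(?:\.post(\d+))?"          # post-release suffix
        r"(?:\.dev(\d+))?"           # developmental release suffix
        r"(?:\+([a-z0-9.]+))?$",     # local version label
        version)
    release = tuple(map(int, m.group(1).split(".")))
    if m.group(2):
        pre = (_PRE_RANK[m.group(2)], int(m.group(3)))
    elif m.group(5) and not m.group(4):
        pre = (-1, 0)                # bare .devN sorts before any pre-release
    else:
        pre = (3, 0)                 # final and post releases sort after rc
    post = (1, int(m.group(4))) if m.group(4) else (0, 0)
    dev = (0, int(m.group(5))) if m.group(5) else (1, 0)
    local = tuple((1, int(s)) if s.isdigit() else (0, s)
                  for s in (m.group(6).split(".") if m.group(6) else ()))
    return (release, pre, post, dev, local)

examples = ["1.1.dev1", "1.0a12", "1.0.post456", "1.0rc1", "1.0.dev456",
            "1.0b2.post345", "1.0", "1.0+abc.5"]
print(sorted(examples, key=sort_key))  # matches the ordering shown above
```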

Version ordering across different metadata versions

Metadata v1.0 (PEP 241) and metadata v1.1 (PEP 314) do not specify a standard version identification or ordering scheme. However metadata v1.2 (PEP 345) does specify a scheme which is defined in PEP 386.

Due to the nature of the simple installer API it is not possible for an installer to be aware of which metadata version a particular distribution was using. Additionally, installers require the ability to create a reasonably prioritized list that includes all, or as many as possible, of the versions of a project in order to determine which versions should be installed. These requirements make it necessary to standardize on a single parsing mechanism to be used for all versions of a project.

Due to the above, this PEP MUST be used for all versions of metadata and supersedes PEP 386 even for metadata v1.2. Tools SHOULD ignore any versions which cannot be parsed by the rules in this PEP, but MAY fall back to implementation defined version parsing and ordering schemes if no versions complying with this PEP are available.

Distribution users may wish to explicitly remove non-compliant versions from any private package indexes they control.

Compatibility with other version schemes

Some projects may choose to use a version scheme which requires translation in order to comply with the public version scheme defined in this PEP. In such cases, the project specific version can be stored in the metadata while the translated public version is published in the version field.

This allows automated distribution tools to provide consistently correct ordering of published releases, while still allowing developers to use the internal versioning scheme they prefer for their projects.

Semantic versioning

Semantic versioning [10] is a popular version identification scheme that is more prescriptive than this PEP regarding the significance of different elements of a release number. Even if a project chooses not to abide by the details of semantic versioning, the scheme is worth understanding as it covers many of the issues that can arise when depending on other distributions, and when publishing a distribution that others rely on.

The "Major.Minor.Patch" (described in this PEP as "major.minor.micro") aspects of semantic versioning (clauses 1-9 in the 2.0.0-rc-1 specification) are fully compatible with the version scheme defined in this PEP, and abiding by these aspects is encouraged.

Semantic versions containing a hyphen (pre-releases - clause 10) or a plus sign (builds - clause 11) are not compatible with this PEP and are not permitted in the public version field.

One possible mechanism to translate such semantic versioning based source labels to compatible public versions is to use the .devN suffix to specify the appropriate version order.

Specific build information may also be included in local version labels.

DVCS based version labels

Many build tools integrate with distributed version control systems like Git and Mercurial in order to add an identifying hash to the version identifier. As hashes cannot be ordered reliably such versions are not permitted in the public version field.

As with semantic versioning, the public .devN suffix may be used to uniquely identify such releases for publication, while the original DVCS based label can be stored in the project metadata.

Identifying hash information may also be included in local version labels.

Olson database versioning

The pytz project inherits its versioning scheme from the corresponding Olson timezone database versioning scheme: the year followed by a lowercase character indicating the version of the database within that year.

This can be translated to a compliant public version identifier as <year>.<serial>, where the serial starts at zero or one (for the '<year>a' release) and is incremented with each subsequent database update within the year.

As with other translated version identifiers, the corresponding Olson database version could be recorded in the project metadata.
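A minimal sketch of such a translation follows. The helper name is invented, and it assumes the serial starts at one for the '<year>a' release (the other option mentioned above is starting at zero):

```python
import string

def olson_to_public(olson):
    """Translate an Olson-style version like '2013d' into the compliant
    '<year>.<serial>' form (sketch: serial starts at one for '<year>a')."""
    year, letter = olson[:-1], olson[-1]
    serial = string.ascii_lowercase.index(letter) + 1
    return "{0}.{1}".format(year, serial)

print(olson_to_public("2013a"))  # 2013.1
print(olson_to_public("2013d"))  # 2013.4
```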

Version specifiers

A version specifier consists of a series of version clauses, separated by commas. For example:

~= 0.9, >= 1.0, != 1.3.4.*, < 2.0

The comparison operator determines the kind of version clause:

The comma (",") is equivalent to a logical and operator: a candidate version must match all given version clauses in order to match the specifier as a whole.

Whitespace between a conditional operator and the following version identifier is optional, as is the whitespace around the commas.

When multiple candidate versions match a version specifier, the preferred version SHOULD be the latest version as determined by the consistent ordering defined by the standard Version scheme. Whether or not pre-releases are considered as candidate versions SHOULD be handled as described in Handling of pre-releases.

Except where specifically noted below, local version identifiers MUST NOT be permitted in version specifiers, and local version labels MUST be ignored entirely when checking if candidate versions match a given version specifier.
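Splitting a specifier into its individual clauses can be sketched as below (the helper name is illustrative; whitespace handling follows the rules above):

```python
import re

def parse_specifier(spec):
    """Split a version specifier into (operator, version) clauses (sketch).

    Commas separate clauses; whitespace around operators and commas
    is optional.
    """
    clauses = []
    for clause in spec.split(","):
        op, version = re.match(r"\s*(~=|===|==|!=|<=|>=|<|>)\s*(\S+)\s*$",
                               clause).groups()
        clauses.append((op, version))
    return clauses

print(parse_specifier("~= 0.9, >= 1.0, != 1.3.4.*, < 2.0"))
```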

Compatible release

A compatible release clause consists of the compatible release operator ~= and a version identifier. It matches any candidate version that is expected to be compatible with the specified version.

The specified version identifier must be in the standard format described in Version scheme. Local version identifiers are NOT permitted in this version specifier.

For a given release identifier V.N, the compatible release clause is approximately equivalent to the pair of comparison clauses:

>= V.N, == V.*

This operator MUST NOT be used with a single segment version number such as ~=1.

For example, the following groups of version clauses are equivalent:

~= 2.2
>= 2.2, == 2.*

~= 1.4.5
>= 1.4.5, == 1.4.*

If a pre-release, post-release or developmental release is named in a compatible release clause as V.N.suffix, then the suffix is ignored when determining the required prefix match:

~= 2.2.post3
>= 2.2.post3, == 2.*

~= 1.4.5a4
>= 1.4.5a4, == 1.4.*

The padding rules for release segment comparisons mean that the assumed degree of forward compatibility in a compatible release clause can be controlled by appending additional zeros to the version specifier:

~= 2.2.0
>= 2.2.0, == 2.2.*

~= 1.4.5.0
>= 1.4.5.0, == 1.4.5.*
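The expansion shown in the examples above can be sketched as a small helper (the function name is illustrative):

```python
import re

def expand_compatible(version):
    """Expand a compatible release clause ~=V into its approximately
    equivalent pair of comparison clauses (sketch).

    Any pre-, post- or developmental release suffix is ignored when
    determining the required prefix match.
    """
    release = re.match(r"\d+(?:\.\d+)*", version).group(0)
    parts = release.split(".")
    if len(parts) < 2:
        raise ValueError("~= must not be used with a single segment version")
    prefix = ".".join(parts[:-1])
    return ">= {0}, == {1}.*".format(version, prefix)

print(expand_compatible("2.2"))      # >= 2.2, == 2.*
print(expand_compatible("1.4.5a4"))  # >= 1.4.5a4, == 1.4.*
```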

Version matching

A version matching clause includes the version matching operator == and a version identifier.

The specified version identifier must be in the standard format described in Version scheme, but a trailing .* is permitted on public version identifiers as described below.

By default, the version matching operator is based on a strict equality comparison: the specified version must be exactly the same as the requested version. The only substitution performed is the zero padding of the release segment to ensure the release segments are compared with the same length.

Whether or not strict version matching is appropriate depends on the specific use case for the version specifier. Automated tools SHOULD at least issue warnings and MAY reject them entirely when strict version matches are used inappropriately.

Prefix matching may be requested instead of strict comparison, by appending a trailing .* to the version identifier in the version matching clause. This means that additional trailing segments will be ignored when determining whether or not a version identifier matches the clause. If the specified version includes only a release segment, then trailing components (or the lack thereof) in the release segment are also ignored.

For example, given the version 1.1.post1, the following clauses would match or not as shown:

== 1.1        # Not equal, so 1.1.post1 does not match clause
== 1.1.post1  # Equal, so 1.1.post1 matches clause
== 1.1.*      # Same prefix, so 1.1.post1 matches clause

For purposes of prefix matching, the pre-release segment is considered to have an implied preceding ., so given the version 1.1a1, the following clauses would match or not as shown:

== 1.1        # Not equal, so 1.1a1 does not match clause
== 1.1a1      # Equal, so 1.1a1 matches clause
== 1.1.*      # Same prefix, so 1.1a1 matches clause

An exact match is also considered a prefix match (this interpretation is implied by the usual zero padding rules for the release segment of version identifiers). Given the version 1.1, the following clauses would match or not as shown:

== 1.1        # Equal, so 1.1 matches clause
== 1.1.0      # Zero padding expands 1.1 to 1.1.0, so it matches clause
== 1.1.dev1   # Not equal (dev-release), so 1.1 does not match clause
== 1.1a1      # Not equal (pre-release), so 1.1 does not match clause
== 1.1.post1  # Not equal (post-release), so 1.1 does not match clause
== 1.1.*      # Same prefix, so 1.1 matches clause

It is invalid to have a prefix match containing a development or local release such as 1.0.dev1.* or 1.0+foo1.*. If present, the development release segment is always the final segment in the public version, and the local version is ignored for comparison purposes, so using either in a prefix match wouldn't make any sense.
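The prefix matching behaviour in the examples above can be sketched as follows. This is illustrative only: the candidate is assumed to already be normalized, local versions are not modeled, and zero padding is only applied to trailing components:

```python
import re

def _components(version):
    # Split into dot-separated components, treating a pre-release suffix
    # as having an implied preceding dot (so "1.1a1" -> ["1", "1", "a1"]).
    return re.findall(r"[a-z]+\d*|\d+", version)

def prefix_match(candidate, clause):
    """Check a '== V.*' style prefix match (sketch, see caveats above)."""
    prefix = _components(clause.rstrip("*").rstrip("."))
    parts = _components(candidate)
    parts += ["0"] * (len(prefix) - len(parts))  # zero pad short candidates
    return parts[:len(prefix)] == prefix

print(prefix_match("1.1.post1", "1.1.*"))  # True
print(prefix_match("1.1a1", "1.1.*"))      # True
print(prefix_match("1.2", "1.1.*"))        # False
```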

The use of == (without at least the wildcard suffix) when defining dependencies for published distributions is strongly discouraged as it greatly complicates the deployment of security fixes. The strict version comparison operator is intended primarily for use when defining dependencies for repeatable deployments of applications while using a shared distribution index.

If the specified version identifier is a public version identifier (no local version label), then the local version label of any candidate versions MUST be ignored when matching versions.

If the specified version identifier is a local version identifier, then the local version labels of candidate versions MUST be considered when matching versions, with the public version identifier being matched as described above, and the local version label being checked for equivalence using a strict string equality comparison.

Version exclusion

A version exclusion clause includes the version exclusion operator != and a version identifier.

The allowed version identifiers and comparison semantics are the same as those of the Version matching operator, except that the sense of any match is inverted.

For example, given the version 1.1.post1, the following clauses would match or not as shown:

!= 1.1        # Not equal, so 1.1.post1 matches clause
!= 1.1.post1  # Equal, so 1.1.post1 does not match clause
!= 1.1.*      # Same prefix, so 1.1.post1 does not match clause

Inclusive ordered comparison

An inclusive ordered comparison clause includes a comparison operator and a version identifier, and will match any version where the comparison is correct based on the relative position of the candidate version and the specified version given the consistent ordering defined by the standard Version scheme.

The inclusive ordered comparison operators are <= and >=.

As with version matching, the release segment is zero padded as necessary to ensure the release segments are compared with the same length.

Local version identifiers are NOT permitted in this version specifier.

Exclusive ordered comparison

The exclusive ordered comparisons > and < are similar to the inclusive ordered comparisons in that they rely on the relative position of the candidate version and the specified version given the consistent ordering defined by the standard Version scheme. However, they specifically exclude pre-releases, post-releases, and local versions of the specified version.

The exclusive ordered comparison >V MUST NOT allow a post-release of the given version unless V itself is a post release. You may mandate that releases are later than a particular post release, including additional post releases, by using >V.postN. For example, >1.7 will allow 1.7.1 but not 1.7.0.post1 and >1.7.post2 will allow 1.7.1 and 1.7.0.post3 but not 1.7.0.

The exclusive ordered comparison >V MUST NOT match a local version of the specified version.

The exclusive ordered comparison <V MUST NOT allow a pre-release of the specified version unless the specified version is itself a pre-release. Allowing pre-releases that are earlier than, but not equal to a specific pre-release may be accomplished by using <V.rc1 or similar.

As with version matching, the release segment is zero padded as necessary to ensure the release segments are compared with the same length.

Local version identifiers are NOT permitted in this version specifier.

Arbitrary equality

Arbitrary equality comparisons are simple string equality operations which do not take into account any of the semantic information such as zero padding or local versions. Unlike the == operator, this operator does not support prefix matching.

The primary use case for arbitrary equality is to allow for specifying a version which cannot otherwise be represented by this PEP. This operator is special and acts as an escape hatch to allow someone using a tool which implements this PEP to still install a legacy version which is otherwise incompatible with this PEP.

An example would be ===foobar which would match a version of foobar.

This operator may also be used to explicitly require an unpatched version of a project such as ===1.0 which would not match for a version 1.0+downstream1.

Use of this operator is heavily discouraged and tooling MAY display a warning when it is used.
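Because === is a plain string comparison, its behaviour is trivial to sketch (the function name is illustrative):

```python
def arbitrary_equality(candidate, specified):
    """'===' is a strict string comparison: no zero padding, no local
    version semantics, no prefix matching (sketch)."""
    return candidate == specified

print(arbitrary_equality("1.0", "1.0"))              # True
print(arbitrary_equality("1.0.0", "1.0"))            # False, unlike '== 1.0'
print(arbitrary_equality("1.0+downstream1", "1.0"))  # False
```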

Handling of pre-releases

Pre-releases of any kind, including developmental releases, are implicitly excluded from all version specifiers, unless they are already present on the system, explicitly requested by the user, or if the only available version that satisfies the version specifier is a pre-release.

By default, dependency resolution tools SHOULD:

  • accept already installed pre-releases for all version specifiers
  • accept remotely available pre-releases for version specifiers where there is no final or post release that satisfies the version specifier
  • exclude all other pre-releases from consideration

Dependency resolution tools MAY issue a warning if a pre-release is needed to satisfy a version specifier.

Dependency resolution tools SHOULD also allow users to request the following alternative behaviours:

  • accepting pre-releases for all version specifiers
  • excluding pre-releases for all version specifiers (reporting an error or warning if a pre-release is already installed locally, or if a pre-release is the only way to satisfy a particular specifier)

Dependency resolution tools MAY also allow the above behaviour to be controlled on a per-distribution basis.

Post-releases and final releases receive no special treatment in version specifiers - they are always included unless explicitly excluded.
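The default selection policy above can be sketched as a small helper. This is illustrative only: the inputs are assumed to be the versions already matching the specifier, sorted oldest to newest, and the already-installed case is not modeled:

```python
def select_version(matching, is_prerelease, allow_pre=False):
    """Pick the preferred version under the default pre-release policy
    (sketch, see caveats above)."""
    final = [v for v in matching if not is_prerelease(v)]
    # Pre-releases are only considered when explicitly requested, or when
    # no final or post release can satisfy the specifier.
    pool = matching if (allow_pre or not final) else final
    return pool[-1] if pool else None

is_pre = lambda v: "a" in v  # toy pre-release check for these examples
print(select_version(["1.0", "1.1a1", "1.1"], is_pre))  # 1.1
print(select_version(["1.2a1"], is_pre))                # 1.2a1
```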

Examples

  • ~=3.1: version 3.1 or later, but not version 4.0 or later.
  • ~=3.1.2: version 3.1.2 or later, but not version 3.2.0 or later.
  • ~=3.1a1: version 3.1a1 or later, but not version 4.0 or later.
  • == 3.1: specifically version 3.1 (or 3.1.0), excludes all pre-releases, post releases, developmental releases and any 3.1.x maintenance releases.
  • == 3.1.*: any version that starts with 3.1. Equivalent to the ~=3.1.0 compatible release clause.
  • ~=3.1.0, != 3.1.3: version 3.1.0 or later, but not version 3.1.3 and not version 3.2.0 or later.

Direct references

Some automated tools may permit the use of a direct reference as an alternative to a normal version specifier. A direct reference consists of the specifier @ and an explicit URL.

Whether or not direct references are appropriate depends on the specific use case for the version specifier. Automated tools SHOULD at least issue warnings and MAY reject them entirely when direct references are used inappropriately.

Public index servers SHOULD NOT allow the use of direct references in uploaded distributions. Direct references are intended as a tool for software integrators rather than publishers.

Depending on the use case, some appropriate targets for a direct URL reference may be a valid source_url entry (see PEP 426), an sdist, or a wheel binary archive. The exact URLs and targets supported will be tool dependent.

For example, a local source archive may be referenced directly:

pip @ file:///localbuilds/pip-1.3.1.zip

Alternatively, a prebuilt archive may also be referenced:

pip @ file:///localbuilds/pip-1.3.1-py33-none-any.whl

All direct references that do not refer to a local file URL SHOULD specify a secure transport mechanism (such as https) AND include an expected hash value in the URL for verification purposes. If a direct reference is specified without any hash information, with hash information that the tool doesn't understand, or with a selected hash algorithm that the tool considers too weak to trust, automated tools SHOULD at least emit a warning and MAY refuse to rely on the URL. If such a direct reference also uses an insecure transport, automated tools SHOULD NOT rely on the URL.

It is RECOMMENDED that only hashes which are unconditionally provided by the latest version of the standard library's hashlib module be used for source archive hashes. At time of writing, that list consists of 'md5', 'sha1', 'sha224', 'sha256', 'sha384', and 'sha512'.

For source archive and wheel references, an expected hash value may be specified by including a <hash-algorithm>=<expected-hash> entry as part of the URL fragment.
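Extracting such an expected hash from a direct reference URL can be sketched with the standard library (the helper name is illustrative; hashlib.algorithms_guaranteed is a superset of the list above on recent Python versions):

```python
import hashlib
from urllib.parse import urlparse

def expected_hash(url):
    """Extract the (algorithm, digest) pair from a direct reference URL
    fragment such as '#sha256=<hex>' (sketch)."""
    fragment = urlparse(url).fragment
    if "=" not in fragment:
        return None  # no hash information present
    algorithm, _, digest = fragment.partition("=")
    if algorithm not in hashlib.algorithms_guaranteed:
        raise ValueError("unsupported hash algorithm: " + algorithm)
    return algorithm, digest

url = ("https://github.com/pypa/pip/archive/1.3.1.zip"
       "#sha1=da9234ee9982d4bbb3c72346a6de940a148ea686")
print(expected_hash(url))
```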

For version control references, the VCS+protocol scheme SHOULD be used to identify both the version control system and the secure transport, and a version control system with hash based commit identifiers SHOULD be used. Automated tools MAY omit warnings about missing hashes for version control systems that do not provide hash based commit identifiers.

To handle version control systems that do not support including commit or tag references directly in the URL, that information may be appended to the end of the URL using the @<commit-hash> or the @<tag>#<commit-hash> notation.

Note

This isn't quite the same as the existing VCS reference notation supported by pip. Firstly, the distribution name is moved in front rather than embedded as part of the URL. Secondly, the commit hash is included even when retrieving based on a tag, in order to meet the requirement above that every link should include a hash to make things harder to forge (creating a malicious repo with a particular tag is easy, creating one with a specific hash, less so).

Remote URL examples:

pip @ https://github.com/pypa/pip/archive/1.3.1.zip#sha1=da9234ee9982d4bbb3c72346a6de940a148ea686
pip @ git+https://github.com/pypa/pip.git@7921be1537eac1e97bc40179a57f0349c2aee67d
pip @ git+https://github.com/pypa/pip.git@1.3.1#7921be1537eac1e97bc40179a57f0349c2aee67d

File URLs

File URLs take the form of file://<host>/<path>. If the <host> is omitted it is assumed to be localhost, and the third slash MUST still be present even when the <host> is omitted. The <path> defines the file path on the filesystem that is to be accessed.

On the various *nix operating systems the only allowed values for <host> are an omitted host, localhost, or another FQDN that the current machine believes matches its own hostname. In other words, on *nix the file:// scheme can only be used to access paths on the local machine.

On Windows the file format should include the drive letter, if applicable, as part of the <path> (e.g. file:///c:/path/to/a/file). Unlike *nix, on Windows the <host> parameter may be used to specify a file residing on a network share. In other words, translating \\machine\volume\file to a file:// URL yields file://machine/volume/file. For more information on file:// URLs on Windows see MSDN [4].
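The standard library's URL parsing illustrates the <host>/<path> split described above:

```python
from urllib.parse import urlparse

# Sketch: splitting file URLs into their <host> and <path> portions.
# An empty <host> implies localhost, but the third slash must still appear.
local = urlparse("file:///localbuilds/pip-1.3.1.zip")
print(local.netloc or "localhost")  # localhost
print(local.path)                   # /localbuilds/pip-1.3.1.zip

# A Windows network share reference carries the machine name as <host>:
share = urlparse("file://machine/volume/file")
print(share.netloc)                 # machine
print(share.path)                   # /volume/file
```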

Updating the versioning specification

The versioning specification may be updated with clarifications without requiring a new PEP or a change to the metadata version.

Any technical changes that impact the version identification and comparison syntax and semantics would require an updated versioning scheme to be defined in a new PEP.

Summary of differences from pkg_resources.parse_version

  • Local versions sort differently: this PEP requires that they sort as greater than the same version without a local version, whereas pkg_resources.parse_version considers the local version to be a pre-release marker.
  • This PEP purposely restricts the syntax which constitutes a valid version while pkg_resources.parse_version attempts to provide some meaning from any arbitrary string.
  • pkg_resources.parse_version allows arbitrarily deeply nested version signifiers like 1.0.dev1.post1.dev5. This PEP, however, allows only a single use of each type, and they must occur in a specific order.

Summary of differences from PEP 386

  • Moved the description of version specifiers into the versioning PEP
  • Added the "direct reference" concept as a standard notation for direct references to resources (rather than each tool needing to invent its own)
  • Added the "local version identifier" and "local version label" concepts to allow system integrators to indicate patched builds in a way that is supported by the upstream tools, as well as to allow the incorporation of build tags into the versioning of binary distributions.
  • Added the "compatible release" clause
  • Added the trailing wildcard syntax for prefix based version matching and exclusion
  • Changed the top level sort position of the .devN suffix
  • Allowed single value version numbers
  • Explicit exclusion of leading or trailing whitespace
  • Explicit support for date based versions
  • Explicit normalisation rules to improve compatibility with existing version metadata on PyPI where it doesn't introduce ambiguity
  • Implicitly exclude pre-releases unless they're already present or needed to satisfy a dependency
  • Treat post releases the same way as unqualified releases
  • Discuss ordering and dependencies across metadata versions
  • Switch from preferring c to rc.

The rationale for major changes is given in the following sections.

Changing the version scheme

One key change in the version scheme in this PEP relative to that in PEP 386 is to sort top level developmental releases like X.Y.devN ahead of alpha releases like X.Ya1. This is a far more logical sort order, as projects already using both development releases and alphas/betas/release candidates do not want their developmental releases sorted in between their release candidates and their final releases. There is no rationale for using dev releases in that position rather than merely creating additional release candidates.

The updated sort order also means the sorting of dev versions is now consistent between the metadata standard and the pre-existing behaviour of pkg_resources (and hence the behaviour of current installation tools).

Making this change should make it easier for affected existing projects to migrate to the latest version of the metadata standard.

Another change to the version scheme is to allow single number versions, similar to those used by non-Python projects like Mozilla Firefox, Google Chrome and the Fedora Linux distribution. This is actually expected to be more useful for version specifiers, but it is easier to allow it for both version specifiers and release numbers, rather than splitting the two definitions.

The exclusion of leading and trailing whitespace was made explicit after a couple of projects with version identifiers differing only in a trailing \n character were found on PyPI.

Various other normalisation rules were also added as described in the separate section on version normalisation below.

Appendix A shows detailed results of an analysis of PyPI distribution version information, as collected on 8th August, 2014. This analysis compares the behavior of the explicitly ordered version scheme defined in this PEP with the de facto standard defined by the behavior of setuptools. These metrics are useful, as the intent of this PEP is to follow existing setuptools behavior as closely as is feasible, while still throwing exceptions for unorderable versions (rather than trying to guess an appropriate order as setuptools does).

A more opinionated description of the versioning scheme

As in PEP 386, the primary focus is on codifying existing practices to make them more amenable to automation, rather than demanding that existing projects make non-trivial changes to their workflow. However, the standard scheme allows significantly more flexibility than is needed for the vast majority of simple Python packages (which often don't even need maintenance releases - many users are happy with needing to upgrade to a new feature release to get bug fixes).

For the benefit of novice developers, and for experienced developers wishing to better understand the various use cases, the specification now goes into much greater detail on the components of the defined version scheme, including examples of how each component may be used in practice.

The PEP also explicitly guides developers in the direction of semantic versioning (without requiring it), and discourages the use of several aspects of the full versioning scheme that have largely been included in order to cover esoteric corner cases in the practices of existing projects and in repackaging software for Linux distributions.

Describing version specifiers alongside the versioning scheme

The main reason to even have a standardised version scheme in the first place is to make it easier to do reliable automated dependency analysis. It makes more sense to describe the primary use case for version identifiers alongside their definition.

Changing the interpretation of version specifiers

The previous interpretation of version specifiers made it very easy to accidentally download a pre-release version of a dependency. This in turn made it difficult for developers to publish pre-release versions of software to the Python Package Index, as even marking the package as hidden wasn't enough to keep automated tools from downloading it, and also made it harder for users to obtain the test release manually through the main PyPI web interface.

The previous interpretation also excluded post-releases from some version specifiers for no adequately justified reason.

The updated interpretation is intended to make it difficult to accidentally accept a pre-release version as satisfying a dependency, while still allowing pre-release versions to be retrieved automatically when that's the only way to satisfy a dependency.

The "some forward compatibility assumed" version constraint is derived from the Ruby community's "pessimistic version constraint" operator [2] to allow projects to take a cautious approach to forward compatibility promises, while still easily setting a minimum required version for their dependencies. The spelling of the compatible release clause (~=) is inspired by the Ruby (~>) and PHP (~) equivalents.

Further improvements are also planned to the handling of parallel installation of multiple versions of the same library, but these will depend on updates to the installation database definition along with improved tools for dynamic path manipulation.

The trailing wildcard syntax to request prefix based version matching was added to make it possible to sensibly define compatible release clauses.

Support for date based version identifiers

Excluding date based versions caused significant problems in migrating pytz to the new metadata standards. It also caused concerns for the OpenStack developers, as they use a date based versioning scheme and would like to be able to migrate to the new metadata standards without changing it.

Adding version epochs

Version epochs are added for the same reason they are part of other versioning schemes, such as those of the Fedora and Debian Linux distributions: to allow projects to gracefully change their approach to numbering releases, without having a new release appear to have a lower version number than previous releases and without having to change the name of the project.

In particular, supporting version epochs allows a project that was previously using date based versioning to switch to semantic versioning by specifying a new version epoch.

The ! character was chosen to delimit an epoch version rather than the : character, which is commonly used in other systems, due to the fact that : is not a valid character in a Windows directory name.

Adding direct references

Direct references are added as an "escape clause" to handle messy real world situations that don't map neatly to the standard distribution model. This includes dependencies on unpublished software for internal use, as well as handling the more complex compatibility issues that may arise when wrapping third party libraries as C extensions (this is of especial concern to the scientific community).

Index servers are deliberately given a lot of freedom to disallow direct references, since they're intended primarily as a tool for integrators rather than publishers. PyPI in particular is currently going through the process of eliminating dependencies on external references, as unreliable external services have the effect of slowing down installation operations, as well as reducing PyPI's own apparent reliability.

Adding arbitrary equality

Arbitrary equality is added as an "escape clause" to handle the case where someone needs to install a project which uses a non compliant version. Although this PEP is able to attain ~97% compatibility with the versions that are already on PyPI there are still ~3% of versions which cannot be parsed. This operator gives a simple and effective way to still depend on them without having to "guess" at the semantics of what they mean (which would be required if anything other than strict string based equality was supported).

Adding local version identifiers

It's a fact of life that downstream integrators often need to backport upstream bug fixes to older versions. It's one of the services that gets Linux distro vendors paid, and application developers may also apply patches they need to bundled dependencies.

Historically, this practice has been invisible to cross-platform language specific distribution tools - the reported "version" in the upstream metadata is the same as for the unmodified code. This inaccuracy can then cause problems when attempting to work with a mixture of integrator provided code and unmodified upstream code, or even just attempting to identify exactly which version of the software is installed.

The introduction of local version identifiers and "local version labels" into the versioning scheme, with the corresponding python.integrator metadata extension allows this kind of activity to be represented accurately, which should improve interoperability between the upstream tools and various integrated platforms.

The exact scheme chosen is largely modeled on the existing behavior of pkg_resources.parse_version and pkg_resources.parse_requirements, with the main distinction being that where pkg_resources currently always takes the suffix into account when comparing versions for exact matches, the PEP requires that the local version label of the candidate version be ignored when no local version label is present in the version specifier clause. Furthermore, the PEP does not attempt to impose any structure on the local version labels (aside from limiting the set of permitted characters and defining their ordering).

This change is designed to ensure that an integrator provided version like pip 1.5+1 or pip 1.5+1.git.abc123de will still satisfy a version specifier like pip>=1.5.
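This rule can be sketched with a stdlib-only toy parser (not the full PEP 440 grammar; the helper names are hypothetical):

```python
def split_local(version: str):
    """Toy parser: separate the public version from any '+local' label."""
    public, _, local = version.partition("+")
    return tuple(int(part) for part in public.split(".")), local

def satisfies_ge(candidate: str, minimum: str) -> bool:
    """candidate >= minimum.  The specifier here carries no local label,
    so the candidate's local label is ignored, as the PEP requires."""
    candidate_release, _ = split_local(candidate)
    minimum_release, _ = split_local(minimum)
    return candidate_release >= minimum_release

ok = satisfies_ge("1.5+1.git.abc123de", "1.5")
```

Real tools implement the full grammar (pre/post/dev segments, local label ordering); the sketch only shows the public/local split.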

The plus is chosen primarily for readability of local version identifiers. It was chosen instead of the hyphen to prevent pkg_resources.parse_version from parsing it as a prerelease, which is important for enabling a successful migration to the new, more structured, versioning scheme. The plus was chosen instead of a tilde because of the significance of the tilde in Debian's version ordering algorithm.

Providing explicit version normalization rules

Historically, the de facto standard for parsing versions in Python has been the pkg_resources.parse_version function from the setuptools project. It does not attempt to reject any version and instead tries to make something meaningful, with varying levels of success, out of whatever it is given. It has a few simple rules, but otherwise relies largely on string comparison.

The normalization rules provided in this PEP exist primarily either to increase compatibility with pkg_resources.parse_version, particularly in documented use cases such as rev, r, pre, etc., or to do something more reasonable with versions that already exist on PyPI.

All possible normalization rules were weighed against whether or not they were likely to cause any ambiguity (e.g. while someone might devise a scheme where v1.0 and 1.0 are considered distinct releases, the likelihood of anyone actually doing that, much less on any scale that is noticeable, is fairly low). They were also weighed against how pkg_resources.parse_version treated a particular version string, especially with regards to how it was sorted. Finally, each rule was weighed against the kinds of additional versions it allowed, how "ugly" those versions looked, how hard they were to parse (both mentally and mechanically), and how much additional compatibility it would bring.

The breadth of possible normalizations was kept to things that could easily be implemented as part of the parsing of the version, rather than pre-parsing transformations applied to the versions. This was done to limit the side effects of each transformation, since simple search-and-replace style transforms increase the likelihood of ambiguous or "junk" versions.

For an extended discussion on the various types of normalizations that were considered, please see the proof of concept for PEP 440 within pip [5].
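A few of the rules discussed above can be sketched in stdlib Python. This is a deliberately partial toy covering only the leading "v", separator, rev/r, and "c" spellings; the real normalization rules are much broader:

```python
import re

def normalize(version: str) -> str:
    """Toy normalizer for a handful of the PEP's rules (hypothetical,
    not the full grammar)."""
    v = version.strip().lower()
    v = v.lstrip("v")                           # v1.0    -> 1.0
    v = v.replace("_", ".").replace("-", ".")   # unify separators
    v = re.sub(r"\.(rev|r)\.?(\d+)$", r".post\2", v)  # 1.0-rev2 -> 1.0.post2
    v = re.sub(r"(\d)c(\d+)$", r"\1rc\2", v)          # 1.0c1   -> 1.0rc1
    return v
```

For example, normalize("V1.0") gives "1.0" and normalize("1.0-rev2") gives "1.0.post2".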

Allowing Underscore in Normalization

There are not a lot of projects on PyPI which utilize a _ in the version string. However, this PEP allows its use anywhere that - is acceptable. The reason for this is that the Wheel normalization scheme specifies that - gets normalized to a _ to enable easier parsing of the filename.

Summary of changes to PEP 440

The following changes were made to this PEP based on feedback received after the initial reference implementation was released in setuptools 8.0 and pip 6.0:

  • The exclusive ordered comparisons were updated to no longer imply a !=V.* clause, which was deemed surprising behavior that was too hard to describe accurately. Instead the exclusive ordered comparisons will simply disallow matching pre-releases, post-releases, and local versions of the specified version (unless the specified version is itself a pre-release, post-release or local version). For an extended discussion see the threads on distutils-sig [6] [7].
  • The normalized form for release candidates was updated from 'c' to 'rc'. This change was based on user feedback received when setuptools 8.0 started applying normalisation to the release metadata generated when preparing packages for publication on PyPI [8].
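The first change above can be illustrated with a stdlib-only toy comparison. The tiny grammar (release segments plus an optional .postN suffix) and the helper names are hypothetical simplifications of the full specifier rules:

```python
import re

def parse(version):
    """Toy parser: release segments plus an optional '.postN' suffix."""
    m = re.fullmatch(r"(\d+(?:\.\d+)*)(?:\.post(\d+))?", version)
    release = tuple(int(x) for x in m.group(1).split("."))
    post = int(m.group(2)) if m.group(2) else None
    return release, post

def exclusive_gt(candidate, spec):
    """candidate > spec, where post-releases of spec itself are excluded
    (unless spec is itself a post-release)."""
    c_release, c_post = parse(candidate)
    s_release, s_post = parse(spec)
    if c_release == s_release and s_post is None and c_post is not None:
        return False    # e.g. 1.7.post1 does not satisfy >1.7
    return (c_release, c_post or 0) > (s_release, s_post or 0)
```

So exclusive_gt("1.7.post1", "1.7") is False, while exclusive_gt("1.7.post2", "1.7.post1") is True.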

References

The initial attempt at a standardised version scheme, along with the justifications for needing such a standard, can be found in PEP 386.

[1] Reference Implementation of PEP 440 Versions and Specifiers: https://github.com/pypa/packaging/pull/1
[2] Version compatibility analysis script: https://github.com/pypa/packaging/blob/master/tasks/check.py
[3] Pessimistic version constraint: http://guides.rubygems.org/patterns/
[4] File URIs in Windows: http://blogs.msdn.com/b/ie/archive/2006/12/06/file-uris-in-windows.aspx
[5] Proof of Concept: PEP 440 within pip: https://github.com/pypa/pip/pull/1894
[6] PEP440: foo-X.Y.Z does not satisfy "foo>X.Y": https://mail.python.org/pipermail/distutils-sig/2014-December/025451.html
[7] PEP440: >1.7 vs >=1.7: https://mail.python.org/pipermail/distutils-sig/2014-December/025507.html
[8] Amend PEP 440 with Wider Feedback on Release Candidates: https://mail.python.org/pipermail/distutils-sig/2014-December/025409.html
[9] Changing the status of PEP 440 to Provisional: https://mail.python.org/pipermail/distutils-sig/2014-December/025412.html
[10] Semantic Versioning: http://semver.org/

Appendix A

Metadata v2.0 guidelines versus setuptools:

$ invoke check.pep440
Total Version Compatibility:              245806/250521 (98.12%)
Total Sorting Compatibility (Unfiltered): 45441/47114 (96.45%)
Total Sorting Compatibility (Filtered):   47057/47114 (99.88%)
Projects with No Compatible Versions:     498/47114 (1.06%)
Projects with Differing Latest Version:   688/47114 (1.46%)

pep-0441 Improving Python ZIP Application Support

PEP:441
Title:Improving Python ZIP Application Support
Version:$Revision$
Last-Modified:$Date$
Author:Daniel Holth <dholth at gmail.com>, Paul Moore <p.f.moore at gmail.com>
Discussions-To:https://mail.python.org/pipermail/python-dev/2015-February/138277.html
Status:Final
Type:Standards Track
Content-Type:text/x-rst
Created:30 March 2013
Post-History:30 March 2013, 1 April 2013, 16 February 2015
Resolution:https://mail.python.org/pipermail/python-dev/2015-February/138578.html

Improving Python ZIP Application Support

Python has had the ability to execute directories or ZIP-format archives as scripts since version 2.6 [1]. When invoked with a zip file or directory as its first argument, the interpreter adds that directory to sys.path and executes the __main__ module. These archives provide a great way to publish software that needs to be distributed as a single-file script but is complex enough to need to be written as a collection of modules.

This feature is not as popular as it should be, mainly because it was not promoted as part of Python 2.6 [2] and so is relatively unknown, but also because the Windows installer does not register a file extension (other than .py) for this format of file, to associate with the launcher.

This PEP proposes to fix these problems by re-publicising the feature, defining the .pyz and .pyzw extensions as "Python ZIP Applications" and "Windowed Python ZIP Applications", and providing some simple tooling to manage the format.

A New Python ZIP Application Extension

"Python Zip Application" will be the formal term used for a zip-format archive that contains Python code in a form that can be directly executed by Python (specifically, it must have a __main__.py file in the root directory of the archive). The extension .pyz will be formally associated with such files.

The Python 3.5 installer will associate .pyz and .pyzw "Python Zip Applications" with the platform launcher so they can be executed. A .pyz archive is a console application and a .pyzw archive is a windowed application, indicating whether the console should appear when running the app.

On Unix, it would be ideal if the .pyz extension and the name "Python Zip Application" were registered (in the mime types database?). However, such an association is out of scope for this PEP.

Python Zip applications can be prefixed with a #! line pointing to the correct Python interpreter and an optional explanation:

#!/usr/bin/env python3
#  Python application packed with zipapp module
(binary contents of archive)

On Unix, this allows the OS to run the file with the correct interpreter, via the standard "shebang" support. On Windows, the Python launcher implements shebang support.

However, it is always possible to execute a .pyz application by supplying the filename to the Python interpreter directly.

As background, ZIP archives are defined with a footer containing relative offsets from the end of the file. They remain valid when concatenated to the end of any other file. This feature is completely standard and is how self-extracting ZIP archives and the bdist_wininst installer format work.
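This property is easy to demonstrate with the standard library's zipfile module (the archive content below is an arbitrary example):

```python
import io
import zipfile

# Build a small archive in memory.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("__main__.py", "print('hello from the archive')\n")

# Prepend a shebang line, exactly as a Python Zip Application does.
data = b"#!/usr/bin/env python3\n" + buf.getvalue()

# The archive directory is located relative to the end of the file,
# so the prefixed data is still a perfectly valid zip.
with zipfile.ZipFile(io.BytesIO(data)) as zf:
    names = zf.namelist()
```

After the prefix is added, names still lists "__main__.py"; any standard ZIP tool can likewise read the archive.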

Minimal Tooling: The zipapp Module

This PEP also proposes including a module for working with these archives. The module will contain functions for working with Python zip application archives, and a command line interface (via python -m zipapp) for their creation and manipulation.

More complete tools for managing Python Zip Applications are encouraged as 3rd party applications on PyPI. Currently, pyzzer [5] and pex [6] are two such tools.

Module Interface

The zipapp module will provide the following functions:

create_archive(source, target=None, interpreter=None, main=None)

Create an application archive from source. The source can be any of the following:

  • The name of a directory, in which case a new application archive will be created from the content of that directory.
  • The name of an existing application archive file, in which case the file is copied to the target. The file name should include the .pyz or .pyzw extension, if required.
  • A file object open for reading in bytes mode. The content of the file should be an application archive, and the file object is assumed to be positioned at the start of the archive.

The target argument determines where the resulting archive will be written:

  • If it is the name of a file, the archive will be written to that file.
  • If it is an open file object, the archive will be written to that file object, which must be open for writing in bytes mode.
  • If the target is omitted (or None), the source must be a directory and the target will be a file with the same name as the source, with a .pyz extension added.

The interpreter argument specifies the name of the Python interpreter with which the archive will be executed. It is written as a "shebang" line at the start of the archive. On Unix, this will be interpreted by the OS, and on Windows it will be handled by the Python launcher. Omitting the interpreter results in no shebang line being written. If an interpreter is specified, and the target is a filename, the executable bit of the target file will be set.

The main argument specifies the name of a callable which will be used as the main program for the archive. It can only be specified if the source is a directory, and the source does not already contain a __main__.py file. The main argument should take the form "pkg.module:callable" and the archive will be run by importing "pkg.module" and executing the given callable with no arguments. It is an error to omit main if the source is a directory and does not contain a __main__.py file, as otherwise the resulting archive would not be executable.

If a file object is specified for source or target, it is the caller's responsibility to close it after calling create_archive.

When copying an existing archive, the supplied file objects only need read and readline (for the source) or write (for the target) methods. When creating an archive from a directory, if the target is a file object it will be passed to the zipfile.ZipFile class, and must supply the methods needed by that class.

get_interpreter(archive)

Returns the interpreter specified in the shebang line of the archive. If there is no shebang, the function returns None. The archive argument can be a filename or a file-like object open for reading in bytes mode.
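A short usage sketch of the two functions together (the directory and file names here are arbitrary examples):

```python
import pathlib
import tempfile
import zipapp

with tempfile.TemporaryDirectory() as tmp:
    # Build a minimal application directory.
    src = pathlib.Path(tmp, "myapp")
    src.mkdir()
    (src / "__main__.py").write_text("print('running')\n")

    # Pack it, writing a shebang line for the given interpreter.
    target = pathlib.Path(tmp, "myapp.pyz")
    zipapp.create_archive(str(src), str(target),
                          interpreter="/usr/bin/env python3")

    # The interpreter is stored as a shebang line at the start of the file.
    first_line = target.read_bytes().split(b"\n", 1)[0]
    interpreter = zipapp.get_interpreter(str(target))
```

Here get_interpreter returns "/usr/bin/env python3", and the file itself begins with the corresponding "#!" line.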

Command Line Usage

The zipapp module can be run with the python -m flag. The command line interface is as follows:

python -m zipapp directory [options]

    Create an archive from the given directory.  An archive will
    be created from the contents of that directory.  The archive
    will have the same name as the source directory with a .pyz
    extension.

    The following options can be specified:

    -o archive / --output archive

        The destination archive will have the specified name.  The
        given name will be used as written, so should include the
        ".pyz" or ".pyzw" extension.

    -p interpreter / --python interpreter

        The given interpreter will be written to the shebang line
        of the archive.  If this option is not given, the archive
        will have no shebang line.

    -m pkg.mod:fn / --main pkg.mod:fn

        The source directory must not have a __main__.py file. The
        archiver will write a __main__.py file into the target
        which calls fn from the module pkg.mod.

The behaviour of the command line interface matches that of zipapp.create_archive().

In addition, it is possible to use the command line interface to work with an existing archive:

python -m zipapp app.pyz --show

    Displays the shebang line of an archive.  Output is of the
    form

        Interpreter: /usr/bin/env
    or
        Interpreter: <none>

    and is intended for diagnostic use, not for scripts.

python -m zipapp app.pyz -o newapp.pyz [-p interpreter]

    Copy app.pyz to newapp.pyz, modifying the shebang line based
    on the -p option (as for creating an archive, no -p option
    means remove the shebang line).  Specifying a destination is
    mandatory.

    In-place modification of an archive is *not* supported, as the
    risk of damaging archives is too great for a simple tool.

As noted, the archives are standard zip files, and so can be unpacked using any standard ZIP utility or Python's zipfile module. For this reason, no interfaces to list the contents of an archive, or unpack them, are provided or needed.

FAQ

Are you sure a standard ZIP utility can handle #! at the beginning?
Absolutely. The ZIP file specification allows for arbitrary data to be prepended to a zipfile. This feature is commonly used by "self-extracting zip" programs. If your archive program can't handle this, it is a bug in your archive program.
Isn't zipapp just a very thin wrapper over the zipfile module?
Yes. If you prefer to build your own Python zip application archives using other tools, they will work just as well. The zipapp module is a convenience, nothing more.
Why not just use a .zip or .py extension?
Users expect a .zip file to be opened with an archive tool, and expect a .py file to contain readable text. Both would be confusing for this use case.
How does this compete with existing package formats?
The sdist, bdist and wheel formats are designed for packaging of modules to be installed into an existing Python installation. They are not intended to be used without installing. The executable zip format is specifically designed for standalone use, without needing to be installed. Such archives are in effect a multi-file version of a standalone Python script.

Rejected Proposals

Convenience Values for Shebang Lines

Is it worth having "convenience" forms for any of the common interpreter values? For example, -p 3 meaning the same as -p "/usr/bin/env python3". It would save a lot of typing for the common cases, as well as giving cross-platform options for people who don't want or need to understand the intricacies of shebang handling on "other" platforms.

Downsides are that it's not obvious how to translate the abbreviations. For example, should "3" mean "/usr/bin/env python3", "/usr/bin/python3", "python3", or something else? Also, there is no obvious short form for the key case of "/usr/bin/env python" (any available version of Python), which could easily result in scripts being written with overly-restrictive shebang lines.

Overall, this seems like there are more problems than benefits, and as a result has been dropped from consideration.

Registering .pyz as a Media Type

It was suggested [3] that the .pyz extension should be registered in the Unix database of extensions. While it makes sense to do this as an equivalent of the Windows installer registering the extension, the .py extension is not listed in the media types database [4]. It doesn't seem reasonable to register .pyz without .py, so this idea has been omitted from this PEP. An interested party could arrange for both .py and .pyz to be registered at a future date.

Default Interpreter

The initial draft of this PEP proposed using /usr/bin/env python as the default interpreter. Unix users have problems with this behaviour, as the default for the python command on many distributions is Python 2, and it is felt that this PEP should prefer Python 3 by default. However, using a command of python3 can result in unexpected behaviour for Windows users, where the default behaviour of the launcher for the command python is commonly customised by users, but the behaviour of python3 may not be modified to match.

As a result, the principle "in the face of ambiguity, refuse to guess" has been invoked, and archives have no shebang line unless explicitly requested. On Windows, the archives will still be run (with the default Python) by the launcher, and on Unix, the archives can be run by explicitly invoking the desired Python interpreter.

Command Line Tool to Manage Shebang Lines

It is conceivable that users would want to modify the shebang line for an existing archive, or even just display the current shebang line. This is tricky to do with existing tools (zip programs typically ignore prepended data entirely, and text editors can have trouble editing files containing binary data).

The zipapp module provides functions to handle the shebang line, but does not include a command line interface to that functionality. This is because it is not clear how to provide one without the resulting interface being over-complex and potentially confusing. Changing the shebang line is expected to be an uncommon requirement.

References

[1] Allow interpreter to execute a zip file (http://bugs.python.org/issue1739468)
[2] Feature is not documented (http://bugs.python.org/issue17359)
[3] Discussion of adding a .pyz mime type on python-dev (https://mail.python.org/pipermail/python-dev/2015-February/138338.html)
[4] Register of media types (http://www.iana.org/assignments/media-types/media-types.xhtml)
[5] pyzzer - A tool for creating Python-executable archives (https://pypi.python.org/pypi/pyzzer)
[6] pex - The PEX packaging toolchain (https://pypi.python.org/pypi/pex)

The discussion of this PEP took place on the python-dev mailing list, in the thread starting at https://mail.python.org/pipermail/python-dev/2015-February/138277.html

pep-0442 Safe object finalization

PEP:442
Title:Safe object finalization
Version:$Revision$
Last-Modified:$Date$
Author:Antoine Pitrou <solipsis at pitrou.net>
BDFL-Delegate:Benjamin Peterson <benjamin@python.org>
Status:Final
Type:Standards Track
Content-Type:text/x-rst
Created:2013-05-18
Python-Version:3.4
Post-History:2013-05-18
Resolution:http://mail.python.org/pipermail/python-dev/2013-June/126746.html

Abstract

This PEP proposes to deal with the current limitations of object finalization. The goal is to be able to define and run finalizers for any object, regardless of their position in the object graph.

This PEP doesn't call for any change in Python code. Objects with existing finalizers will benefit automatically.

Definitions

Reference
A directional link from an object to another. The target of the reference is kept alive by the reference, as long as the source is itself alive and the reference isn't cleared.
Weak reference
A directional link from an object to another, which doesn't keep alive its target. This PEP focusses on non-weak references.
Reference cycle
A cyclic subgraph of directional links between objects, which keeps those objects from being collected in a pure reference-counting scheme.
Cyclic isolate (CI)
A standalone subgraph of objects in which no object is referenced from the outside, containing one or several reference cycles, and whose objects are still in a usable, non-broken state: they can access each other from their respective finalizers.
Cyclic garbage collector (GC)
A device able to detect cyclic isolates and turn them into cyclic trash. Objects in cyclic trash are eventually disposed of by the natural effect of the references being cleared and their reference counts dropping to zero.
Cyclic trash (CT)
A former cyclic isolate whose objects have started being cleared by the GC. Objects in cyclic trash are potential zombies; if they are accessed by Python code, the symptoms can vary from weird AttributeErrors to crashes.
Zombie / broken object
An object that is part of cyclic trash. The term stresses that the object is not safe: its outgoing references may have been cleared, or one of the objects it references may be a zombie. Therefore, it should not be accessed by arbitrary code (such as finalizers).
Finalizer
A function or method called when an object is intended to be disposed of. The finalizer can access the object and release any resource held by the object (for example mutexes or file descriptors). An example is a __del__ method.
Resurrection
The process by which a finalizer creates a new reference to an object in a CI. This can happen as a quirky but supported side-effect of __del__ methods.

Impact

While this PEP discusses CPython-specific implementation details, the change in finalization semantics is expected to affect the Python ecosystem as a whole. In particular, this PEP obsoletes the current guideline that "objects with a __del__ method should not be part of a reference cycle".

Benefits

The primary benefits of this PEP regard objects with finalizers, such as objects with a __del__ method and generators with a finally block. Those objects can now be reclaimed when they are part of a reference cycle.
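Under CPython 3.4+ this can be observed directly; the class below is an illustrative example:

```python
import gc

calls = []

class Node:
    """A finalizable object that can participate in a reference cycle."""
    def __init__(self):
        self.peer = None
    def __del__(self):
        calls.append("finalized")

a, b = Node(), Node()
a.peer, b.peer = b, a    # create a two-object reference cycle
del a, b                 # now unreachable, but kept alive by the cycle

gc.collect()             # with PEP 442, both finalizers run and the
                         # cycle is reclaimed (nothing lands in gc.garbage)
```

Before this PEP, such a cycle would have been left uncollected in gc.garbage.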

The PEP also paves the way for further benefits:

  • The module shutdown procedure may not need to set global variables to None anymore. This could solve a well-known class of irritating issues.

The PEP doesn't change the semantics of:

  • Weak references caught in reference cycles.
  • C extension types with a custom tp_dealloc function.

Description

Reference-counted disposal

In normal reference-counted disposal, an object's finalizer is called just before the object is deallocated. If the finalizer resurrects the object, deallocation is aborted.

However, if the object was already finalized, then the finalizer isn't called. This prevents us from finalizing zombies (see below).

Disposal of cyclic isolates

Cyclic isolates are first detected by the garbage collector, and then disposed of. The detection phase doesn't change and won't be described here. Disposal of a CI traditionally works in the following order:

  1. Weakrefs to CI objects are cleared, and their callbacks called. At this point, the objects are still safe to use.
  2. The CI becomes a CT as the GC systematically breaks all known references inside it (using the tp_clear function).
  3. Nothing. All CT objects should have been disposed of in step 2 (as a side-effect of clearing references); this collection is finished.

This PEP proposes to turn CI disposal into the following sequence (new steps are in bold):

  1. Weakrefs to CI objects are cleared, and their callbacks called. At this point, the objects are still safe to use.
  2. The finalizers of all CI objects are called.
  3. The CI is traversed again to determine if it is still isolated. If it is determined that at least one object in CI is now reachable from outside the CI, this collection is aborted and the whole CI is resurrected. Otherwise, proceed.
  4. The CI becomes a CT as the GC systematically breaks all known references inside it (using the tp_clear function).
  5. Nothing. All CT objects should have been disposed of in step 4 (as a side-effect of clearing references); this collection is finished.

Note

The GC doesn't recalculate the CI after step 2 above, hence the need for step 3 to check that the whole subgraph is still isolated.

C-level changes

Type objects get a new tp_finalize slot to which __del__ methods are mapped (and reciprocally). Generators are modified to use this slot, rather than tp_del. A tp_finalize function is a normal C function which will be called with a valid and alive PyObject as its only argument. It doesn't need to manipulate the object's reference count, as this will be done by the caller. However, it must ensure that the original exception state is restored before returning to the caller.

For compatibility, tp_del is kept in the type structure. Handling of objects with a non-NULL tp_del is unchanged: when part of a CI, they are not finalized and end up in gc.garbage. However, a non-NULL tp_del is not encountered anymore in the CPython source tree (except for testing purposes).

Two new C API functions are provided to ease calling of tp_finalize, especially from custom deallocators.

On the internal side, a bit is reserved in the GC header for GC-managed objects to signal that they were finalized. This helps avoid finalizing an object twice (and, especially, finalizing a CT object after it was broken by the GC).

Note

Objects which are not GC-enabled can also have a tp_finalize slot. They don't need the additional bit since their tp_finalize function can only be called from the deallocator: it therefore cannot be called twice, except when resurrected.

Discussion

Predictability

Following this scheme, an object's finalizer is always called exactly once, even if it was resurrected afterwards.

For CI objects, the order in which finalizers are called (step 2 above) is undefined.
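A small CPython 3.4+ illustration of the exactly-once guarantee, using resurrection from __del__ (the class and list names are arbitrary):

```python
calls = []
survivors = []

class Phoenix:
    def __del__(self):
        calls.append("finalized")
        if not survivors:
            survivors.append(self)   # resurrection: create a new reference

p = Phoenix()
del p                # the finalizer runs once and resurrects the object
survivors.clear()    # drop the new reference: the object is deallocated
                     # without its finalizer being called a second time
```

After both deletions, the finalizer has run exactly once.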

Safety

It is important to explain why the proposed change is safe. There are two aspects to be discussed:

  • Can a finalizer access zombie objects (including the object being finalized)?
  • What happens if a finalizer mutates the object graph so as to impact the CI?

Let's discuss the first issue. We will divide possible cases in two categories:

  • If the object being finalized is part of the CI: by construction, no objects in CI are zombies yet, since CI finalizers are called before any reference breaking is done. Therefore, the finalizer cannot access zombie objects, which don't exist.
  • If the object being finalized is not part of the CI/CT: by definition, objects in the CI/CT don't have any references pointing to them from outside the CI/CT. Therefore, the finalizer cannot reach any zombie object (that is, even if the object being finalized was itself referenced from a zombie object).

Now for the second issue. There are three potential cases:

  • The finalizer clears an existing reference to a CI object. The CI object may be disposed of before the GC tries to break it, which is fine (the GC simply has to be aware of this possibility).
  • The finalizer creates a new reference to a CI object. This can only happen from a CI object's finalizer (see above why). Therefore, the new reference will be detected by the GC after all CI finalizers are called (step 3 above), and collection will be aborted without any objects being broken.
  • The finalizer clears or creates a reference to a non-CI object. By construction, this is not a problem.

Implementation

An implementation is available in branch finalize of the repository at http://hg.python.org/features/finalize/.

Validation

Besides running the normal Python test suite, the implementation adds test cases for various finalization possibilities including reference cycles, object resurrection and legacy tp_del slots.

The implementation has also been checked to not produce any regressions on the following test suites:

References

Notes about reference cycle collection and weak reference callbacks: http://hg.python.org/cpython/file/4e687d53b645/Modules/gc_weakref.txt

Generator memory leak: http://bugs.python.org/issue17468

Allow objects to decide if they can be collected by GC: http://bugs.python.org/issue9141

Module shutdown procedure based on GC: http://bugs.python.org/issue812369

pep-0443 Single-dispatch generic functions

PEP:443
Title:Single-dispatch generic functions
Version:$Revision$
Last-Modified:$Date$
Author:Łukasz Langa <lukasz at langa.pl>
Discussions-To:Python-Dev <python-dev at python.org>
Status:Final
Type:Standards Track
Content-Type:text/x-rst
Created:22-May-2013
Post-History:22-May-2013, 25-May-2013, 31-May-2013
Replaces:245 246 3124

Abstract

This PEP proposes a new mechanism in the functools standard library module that provides a simple form of generic programming known as single-dispatch generic functions.

A generic function is composed of multiple functions implementing the same operation for different types. Which implementation should be used during a call is determined by the dispatch algorithm. When the implementation is chosen based on the type of a single argument, this is known as single dispatch.

Rationale and Goals

Python has always provided a variety of built-in and standard-library generic functions, such as len(), iter(), pprint.pprint(), copy.copy(), and most of the functions in the operator module. However, it currently:

  1. does not have a simple or straightforward way for developers to create new generic functions,
  2. does not have a standard way for methods to be added to existing generic functions (i.e., some are added using registration functions, others require defining __special__ methods, possibly by monkeypatching).

In addition, it is currently a common anti-pattern for Python code to inspect the types of received arguments, in order to decide what to do with the objects.

For example, code may wish to accept either an object of some type, or a sequence of objects of that type. Currently, the "obvious way" to do this is by type inspection, but this is brittle and closed to extension.

Abstract Base Classes make it easier to discover present behaviour, but don't help adding new behaviour. A developer using an already-written library may be unable to change how their objects are treated by such code, especially if the objects they are using were created by a third party.

Therefore, this PEP proposes a uniform API to address dynamic overloading using decorators.

User API

To define a generic function, decorate it with the @singledispatch decorator. Note that the dispatch happens on the type of the first argument. Create your function accordingly:

>>> from functools import singledispatch
>>> @singledispatch
... def fun(arg, verbose=False):
...     if verbose:
...         print("Let me just say,", end=" ")
...     print(arg)

To add overloaded implementations to the function, use the register() attribute of the generic function. This is a decorator, taking a type parameter and decorating a function implementing the operation for that type:

>>> @fun.register(int)
... def _(arg, verbose=False):
...     if verbose:
...         print("Strength in numbers, eh?", end=" ")
...     print(arg)
...
>>> @fun.register(list)
... def _(arg, verbose=False):
...     if verbose:
...         print("Enumerate this:")
...     for i, elem in enumerate(arg):
...         print(i, elem)

To enable registering lambdas and pre-existing functions, the register() attribute can be used in a functional form:

>>> def nothing(arg, verbose=False):
...     print("Nothing.")
...
>>> fun.register(type(None), nothing)

The register() attribute returns the undecorated function. This enables decorator stacking and pickling, as well as creating unit tests for each variant independently:

>>> @fun.register(float)
... @fun.register(Decimal)
... def fun_num(arg, verbose=False):
...     if verbose:
...         print("Half of your number:", end=" ")
...     print(arg / 2)
...
>>> fun_num is fun
False

When called, the generic function dispatches on the type of the first argument:

>>> fun("Hello, world.")
Hello, world.
>>> fun("test.", verbose=True)
Let me just say, test.
>>> fun(42, verbose=True)
Strength in numbers, eh? 42
>>> fun(['spam', 'spam', 'eggs', 'spam'], verbose=True)
Enumerate this:
0 spam
1 spam
2 eggs
3 spam
>>> fun(None)
Nothing.
>>> fun(1.23)
0.615

Where there is no registered implementation for a specific type, its method resolution order is used to find a more generic implementation. The original function decorated with @singledispatch is registered for the base object type, which means it is used if no better implementation is found.
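
For instance, since bool is a subclass of int, a bool argument finds the int implementation through its MRO (fmt is a hypothetical generic function):

```python
from functools import singledispatch

@singledispatch
def fmt(arg):
    # Registered for object: the fallback when nothing better matches.
    return "object: %r" % (arg,)

@fmt.register(int)
def _(arg):
    return "int: %d" % arg

# bool has no implementation of its own, so its MRO
# (bool -> int -> object) finds the int implementation:
print(fmt(True))   # -> "int: 1"
print(fmt("hi"))   # -> "object: 'hi'"
```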

To check which implementation the generic function will choose for a given type, use the dispatch() attribute:

>>> fun.dispatch(float)
<function fun_num at 0x104319058>
>>> fun.dispatch(dict)    # note: default implementation
<function fun at 0x103fe0000>

To access all registered implementations, use the read-only registry attribute:

>>> fun.registry.keys()
dict_keys([<class 'NoneType'>, <class 'int'>, <class 'object'>,
          <class 'decimal.Decimal'>, <class 'list'>,
          <class 'float'>])
>>> fun.registry[float]
<function fun_num at 0x1035a2840>
>>> fun.registry[object]
<function fun at 0x103fe0000>

The proposed API is intentionally limited and opinionated, so as to ensure that it is easy to explain and use, and that it remains consistent with existing members of the functools module.

Implementation Notes

The functionality described in this PEP is already implemented in the pkgutil standard library module as simplegeneric. Because this implementation is mature, the goal is to move it largely as-is. The reference implementation is available on hg.python.org [1].

The dispatch type is specified as a decorator argument. An alternative form using function annotations was considered but its inclusion has been rejected. As of May 2013, this usage pattern is out of scope for the standard library [2], and the best practices for annotation usage are still debated.

Based on the current pkgutil.simplegeneric implementation, and following the convention on registering virtual subclasses on Abstract Base Classes, the dispatch registry will not be thread-safe.
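
Code that must register implementations at runtime from several threads could therefore serialize registration with a lock of its own; a minimal sketch (register_safely is not part of the proposed API):

```python
import threading
from functools import singledispatch

_register_lock = threading.Lock()

@singledispatch
def process(arg):
    return "default"

def register_safely(cls, func):
    # register() mutates the shared registry and invalidates the
    # dispatch cache, so serialize concurrent registrations explicitly.
    with _register_lock:
        return process.register(cls, func)

register_safely(int, lambda arg: "int")
print(process(5))    # "int"
print(process("x"))  # "default"
```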

Abstract Base Classes

The pkgutil.simplegeneric implementation relied on several forms of method resolution order (MRO). @singledispatch removes special handling of old-style classes and Zope's ExtensionClasses. More importantly, it introduces support for Abstract Base Classes (ABC).

When a generic function implementation is registered for an ABC, the dispatch algorithm switches to an extended form of C3 linearization, which includes the relevant ABCs in the MRO of the provided argument. The algorithm inserts ABCs where their functionality is introduced, i.e. issubclass(cls, abc) returns True for the class itself but returns False for all its direct base classes. Implicit ABCs for a given class (either registered or inferred from the presence of a special method like __len__()) are inserted directly after the last ABC explicitly listed in the MRO of said class.

In its most basic form, this linearization returns the MRO for the given type:

>>> _compose_mro(dict, [])
[<class 'dict'>, <class 'object'>]

When the second argument contains ABCs that the specified type is a subclass of, they are inserted in a predictable order:

>>> _compose_mro(dict, [Sized, MutableMapping, str,
...                     Sequence, Iterable])
[<class 'dict'>, <class 'collections.abc.MutableMapping'>,
 <class 'collections.abc.Mapping'>, <class 'collections.abc.Sized'>,
 <class 'collections.abc.Iterable'>, <class 'collections.abc.Container'>,
 <class 'object'>]
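
This composed MRO is what allows an implementation registered for an ABC to serve concrete classes that were never registered explicitly; for example (probe is a hypothetical generic function):

```python
from collections.abc import Sized
from functools import singledispatch

@singledispatch
def probe(arg):
    return "unsized"

@probe.register(Sized)
def _(arg):
    # dict is never registered explicitly; it is reached through the
    # extended MRO because issubclass(dict, Sized) is True.
    return "sized: %d" % len(arg)

print(probe({'a': 1}))  # -> "sized: 1"
print(probe(object()))  # -> "unsized"
```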

While this mode of operation is significantly slower, all dispatch decisions are cached. The cache is invalidated on registering new implementations on the generic function or when user code calls register() on an ABC to implicitly subclass it. In the latter case, it is possible to create a situation with ambiguous dispatch, for instance:

>>> from collections.abc import Iterable, Container
>>> class P:
...     pass
>>> Iterable.register(P)
<class '__main__.P'>
>>> Container.register(P)
<class '__main__.P'>

Faced with ambiguity, @singledispatch refuses the temptation to guess:

>>> @singledispatch
... def g(arg):
...     return "base"
...
>>> g.register(Iterable, lambda arg: "iterable")
<function <lambda> at 0x108b49110>
>>> g.register(Container, lambda arg: "container")
<function <lambda> at 0x108b491c8>
>>> g(P())
Traceback (most recent call last):
...
RuntimeError: Ambiguous dispatch: <class 'collections.abc.Container'>
or <class 'collections.abc.Iterable'>

Note that this exception would not be raised if one or more ABCs had been provided explicitly as base classes during class definition. In this case dispatch happens in the MRO order:

>>> class Ten(Iterable, Container):
...     def __iter__(self):
...         for i in range(10):
...             yield i
...     def __contains__(self, value):
...         return value in range(10)
...
>>> g(Ten())
'iterable'

A similar conflict arises when subclassing an ABC is inferred from the presence of a special method like __len__() or __contains__():

>>> class Q:
...   def __contains__(self, value):
...     return False
...
>>> issubclass(Q, Container)
True
>>> Iterable.register(Q)
<class '__main__.Q'>
>>> g(Q())
Traceback (most recent call last):
...
RuntimeError: Ambiguous dispatch: <class 'collections.abc.Container'>
or <class 'collections.abc.Iterable'>

An early version of the PEP contained a custom approach that was simpler but created a number of edge cases with surprising results [3].

Usage Patterns

This PEP proposes extending behaviour only of functions specifically marked as generic. Just as a base class method may be overridden by a subclass, so too a function may be overloaded to provide custom functionality for a given type.

Universal overloading does not equal arbitrary overloading, in the sense that we need not expect people to randomly redefine the behavior of existing functions in unpredictable ways. To the contrary, generic function usage in actual programs tends to follow very predictable patterns and registered implementations are highly-discoverable in the common case.

If a module is defining a new generic operation, it will usually also define any required implementations for existing types in the same place. Likewise, if a module is defining a new type, then it will usually define implementations there for any generic functions that it knows or cares about. As a result, the vast majority of registered implementations can be found adjacent to either the function being overloaded, or to a newly-defined type for which the implementation is adding support.

It is only in rather infrequent cases that one will have implementations registered in a module that contains neither the function nor the type(s) for which the implementation is added. In the absence of incompetence or deliberate intention to be obscure, the few implementations that are not registered adjacent to the relevant type(s) or function(s), will generally not need to be understood or known about outside the scope where those implementations are defined. (Except in the "support modules" case, where best practice suggests naming them accordingly.)

As mentioned earlier, single-dispatch generics are already prolific throughout the standard library. A clean, standard way of doing them provides a way forward to refactor those custom implementations to use a common one, opening them up for user extensibility at the same time.

Alternative approaches

In PEP 3124 [4] Phillip J. Eby proposes a full-grown solution with overloading based on arbitrary rule sets (with the default implementation dispatching on argument types), as well as interfaces, adaptation and method combining. PEAK-Rules [5] is a reference implementation of the concepts described in PJE's PEP.

Such a broad approach is inherently complex, which makes reaching a consensus hard. In contrast, this PEP focuses on a single piece of functionality that is simple to reason about. It's important to note this does not preclude the use of other approaches now or in the future.

In a 2005 article on Artima [6] Guido van Rossum presents a generic function implementation that dispatches on the types of all arguments to a function. The same approach was chosen in Andrey Popp's generic package available on PyPI [7], as well as David Mertz's gnosis.magic.multimethods [8].

While this seems desirable at first, I agree with Fredrik Lundh's comment that "if you design APIs with pages of logic just to sort out what code a function should execute, you should probably hand over the API design to someone else". In other words, the single argument approach proposed in this PEP is not only easier to implement but also clearly communicates that dispatching on a more complex state is an anti-pattern. It also has the virtue of corresponding directly with the familiar method dispatch mechanism in object oriented programming. The only difference is whether the custom implementation is associated more closely with the data (object-oriented methods) or the algorithm (single-dispatch overloading).

PyPy's RPython offers extendabletype [9], a metaclass which enables classes to be externally extended. In combination with pairtype() and pair() factories, this offers a form of single-dispatch generics.

Acknowledgements

Apart from Phillip J. Eby's work on PEP 3124 [4] and PEAK-Rules, influences include Paul Moore's original issue [10] that proposed exposing pkgutil.simplegeneric as part of the functools API, Guido van Rossum's article on multimethods [6], and discussions with Raymond Hettinger on a general pprint rewrite. Huge thanks to Nick Coghlan for encouraging me to create this PEP and providing initial feedback.

pep-0444 Python Web3 Interface

PEP:444
Title:Python Web3 Interface
Version:$Revision$
Last-Modified:$Date$
Author:Chris McDonough <chrism at plope.com>, Armin Ronacher <armin.ronacher at active-4.com>
Discussions-To:Python Web-SIG <web-sig at python.org>
Status:Deferred
Type:Informational
Content-Type:text/x-rst
Created:19-Jul-2010

Abstract

This document specifies a proposed second-generation standard interface between web servers and Python web applications or frameworks.

PEP Deferral

Further exploration of the concepts covered in this PEP has been deferred for lack of a current champion interested in promoting the goals of the PEP and collecting and incorporating feedback, and with sufficient available time to do so effectively.

Note that since this PEP was first created, PEP 3333 was created as a more incremental update that permitted use of WSGI on Python 3.2+. However, an alternative specification that furthers the Python 3 goals of a cleaner separation of binary and text data may still be valuable.

Rationale and Goals

This protocol and specification is influenced heavily by the Web Server Gateway Interface (WSGI) 1.0 standard described in PEP 333 [1]. The high-level rationale for having any standard that allows Python-based web servers and applications to interoperate is outlined in PEP 333. This document essentially uses PEP 333 as a template, and changes its wording in various places for the purpose of forming a different standard.

Python currently boasts a wide variety of web application frameworks which use the WSGI 1.0 protocol. However, due to changes in the language, the WSGI 1.0 protocol is not compatible with Python 3. This specification describes a standardized WSGI-like protocol that lets Python 2.6, 2.7 and 3.1+ applications communicate with web servers. Web3 is clearly a WSGI derivative; it only uses a different name than "WSGI" in order to indicate that it is not in any way backwards compatible.

Applications and servers which are written to this specification are meant to work properly under Python 2.6.X, Python 2.7.X and Python 3.1+. Neither applications nor servers implementing the Web3 specification can easily be written to work under Python 2 versions earlier than 2.6 or Python 3 versions earlier than 3.1.

Note

The true minimum Python 3 version is whichever release fixed http://bugs.python.org/issue4006, so that os.environ['foo'] returns surrogates (ala PEP 383) when the value of 'foo' cannot be decoded using the current locale, instead of failing with a KeyError. In particular, Python 3.0 is not supported.

Note

Python 2.6 is the first Python version that supports bytes as an alias for str and the b"foo" literal syntax. This is why it is the minimum version supported by Web3.

Explicability and documentability are the main technical drivers for the decisions made within the standard.

Differences from WSGI

  • All protocol-specific environment names are prefixed with web3. rather than wsgi., e.g. web3.input rather than wsgi.input.
  • All values present as environment dictionary values are explicitly bytes instances instead of native strings. (Environment keys however are native strings, always str regardless of platform).
  • All values returned by an application must be bytes instances, including status code, header names and values, and the body.
  • Wherever WSGI 1.0 referred to an app_iter, this specification refers to a body.
  • No start_response() callback (and therefore no write() callable nor exc_info data).
  • The readline() function of web3.input must support a size hint parameter.
  • The read() function of web3.input must be length delimited. A call without a size argument must not read more than the content length header specifies. In case a content length header is absent the stream must not return anything on read. It must never request more data than specified from the client.
  • No requirement for middleware to yield an empty string if it needs more information from an application to produce output (e.g. no "Middleware Handling of Block Boundaries").
  • Filelike objects passed to a "file_wrapper" must have an __iter__ which returns bytes (never text).
  • wsgi.file_wrapper is not supported.
  • QUERY_STRING, SCRIPT_NAME, PATH_INFO values required to be placed in environ by server (each as the empty bytes instance if no associated value is received in the HTTP request).
  • web3.path_info and web3.script_name should be put into the Web3 environment, if possible, by the origin Web3 server. When available, each is the original, plain 7-bit ASCII, URL-encoded variant of its CGI equivalent derived directly from the request URI (with %2F segment markers and other meta-characters intact). If the server cannot provide one (or both) of these values, it must omit the value(s) it cannot provide from the environment.
  • This requirement was removed: "middleware components must not block iteration waiting for multiple values from an application iterable. If the middleware needs to accumulate more data from the application before it can produce any output, it must yield an empty string."
  • SERVER_PORT must be a bytes instance (not an integer).
  • The server must not inject an additional Content-Length header by guessing the length from the response iterable. This must be set by the application itself in all situations.
  • If the origin server advertises that it has the web3.async capability, a Web3 application callable used by the server is permitted to return a callable that accepts no arguments. When it does so, this callable is to be called periodically by the origin server until it returns a non-None response, which must be a normal Web3 response tuple.
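
The length-delimited read() rules above could be enforced by a wrapper along these lines (LimitedInput is a hypothetical helper, not part of the specification):

```python
import io

class LimitedInput(object):
    """Hypothetical wrapper sketching the web3.input read rules:
    never return more bytes than the Content-Length header specifies,
    and return nothing at all when no Content-Length header was sent."""

    def __init__(self, stream, content_length):
        self._stream = stream
        # No Content-Length header: the stream must not return anything.
        self._remaining = content_length if content_length is not None else 0

    def read(self, size=-1):
        if self._remaining <= 0:
            return b''
        if size is None or size < 0 or size > self._remaining:
            size = self._remaining
        data = self._stream.read(size)
        self._remaining -= len(data)
        return data

raw = io.BytesIO(b'hello world, plus trailing garbage')
inp = LimitedInput(raw, 11)
print(inp.read())     # reads exactly 11 bytes: b'hello world'
print(inp.read(100))  # already exhausted: b''
```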

Specification Overview

The Web3 interface has two sides: the "server" or "gateway" side, and the "application" or "framework" side. The server side invokes a callable object that is provided by the application side. The specifics of how that object is provided are up to the server or gateway. It is assumed that some servers or gateways will require an application's deployer to write a short script to create an instance of the server or gateway, and supply it with the application object. Other servers and gateways may use configuration files or other mechanisms to specify where an application object should be imported from, or otherwise obtained.

In addition to "pure" servers/gateways and applications/frameworks, it is also possible to create "middleware" components that implement both sides of this specification. Such components act as an application to their containing server, and as a server to a contained application, and can be used to provide extended APIs, content transformation, navigation, and other useful functions.

Throughout this specification, we will use the term "application callable" to mean "a function, a method, or an instance with a __call__ method". It is up to the server, gateway, or application implementing the application callable to choose the appropriate implementation technique for their needs. Conversely, a server, gateway, or application that is invoking a callable must not have any dependency on what kind of callable was provided to it. Application callables are only to be called, not introspected upon.

The Application/Framework Side

The application object is simply a callable object that accepts one argument. The term "object" should not be misconstrued as requiring an actual object instance: a function, method, or instance with a __call__ method are all acceptable for use as an application object. Application objects must be able to be invoked more than once, as virtually all servers/gateways (other than CGI) will make such repeated requests. If this cannot be guaranteed by the implementation of the actual application, it has to be wrapped in a function that creates a new instance on each call.

Note

Although we refer to it as an "application" object, this should not be construed to mean that application developers will use Web3 as a web programming API. It is assumed that application developers will continue to use existing, high-level framework services to develop their applications. Web3 is a tool for framework and server developers, and is not intended to directly support application developers.

An example of an application which is a function (simple_app):

def simple_app(environ):
    """Simplest possible application object"""
    status = b'200 OK'
    headers = [(b'Content-type', b'text/plain')]
    body = [b'Hello world!\n']
    return body, status, headers

An example of an application which is an instance (simple_app):

class AppClass(object):

    """Produce the same output, but using an instance.  An
    instance of this class must be instantiated before it is
    passed to the server.  """

    def __call__(self, environ):
        status = b'200 OK'
        headers = [(b'Content-type', b'text/plain')]
        body = [b'Hello world!\n']
        return body, status, headers

simple_app = AppClass()

Alternately, an application callable may return a callable instead of the tuple if the server supports asynchronous execution. See information concerning web3.async for more information.
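
Under a server that advertises web3.async, an asynchronous application might look like this sketch (async_app and its two-poll behaviour are purely illustrative):

```python
def async_app(environ):
    # Pretend the response needs two polls before it is ready.
    state = {'polls': 0}

    def poll():
        state['polls'] += 1
        if state['polls'] < 2:
            return None                      # not ready yet
        status = b'200 OK'
        headers = [(b'Content-type', b'text/plain')]
        body = [b'Hello (eventually)!\n']
        return body, status, headers
    return poll

# A server advertising web3.async would call the result periodically,
# interleaving the polls with other work:
rv = async_app({'web3.async': True})
result = None
while result is None:
    result = rv()
body, status, headers = result
print(status)   # b'200 OK'
```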

The Server/Gateway Side

The server or gateway invokes the application callable once for each request it receives from an HTTP client that is directed at the application. To illustrate, here is a simple CGI gateway, implemented as a function taking an application object. Note that this simple example has limited error handling: by default, an uncaught exception will be dumped to sys.stderr and logged by the web server.

import locale
import os
import sys

encoding = locale.getpreferredencoding()

stdout = sys.stdout

if hasattr(sys.stdout, 'buffer'):
    # Python 3 compatibility; we need to be able to push bytes out
    stdout = sys.stdout.buffer

def get_environ():
    d = {}
    for k, v in os.environ.items():
        # Python 3 compatibility
        if not isinstance(v, bytes):
            # We must explicitly encode the string to bytes under
            # Python 3.1+
            v = v.encode(encoding, 'surrogateescape')
        d[k] = v
    return d

def run_with_cgi(application):

    environ = get_environ()
    environ['web3.input']        = sys.stdin
    environ['web3.errors']       = sys.stderr
    environ['web3.version']      = (1, 0)
    environ['web3.multithread']  = False
    environ['web3.multiprocess'] = True
    environ['web3.run_once']     = True
    environ['web3.async']        = False

    if environ.get('HTTPS', b'off') in (b'on', b'1'):
        environ['web3.url_scheme'] = b'https'
    else:
        environ['web3.url_scheme'] = b'http'

    rv = application(environ)
    if hasattr(rv, '__call__'):
        raise TypeError('This webserver does not support asynchronous '
                        'responses.')
    body, status, headers = rv

    CRLF = b'\r\n'

    try:
        stdout.write(b'Status: ' + status + CRLF)
        for header_name, header_val in headers:
            stdout.write(header_name + b': ' + header_val + CRLF)
        stdout.write(CRLF)
        for chunk in body:
            stdout.write(chunk)
            stdout.flush()
    finally:
        if hasattr(body, 'close'):
            body.close()

Middleware: Components that Play Both Sides

A single object may play the role of a server with respect to some application(s), while also acting as an application with respect to some server(s). Such "middleware" components can perform such functions as:

  • Routing a request to different application objects based on the target URL, after rewriting the environ accordingly.
  • Allowing multiple applications or frameworks to run side-by-side in the same process.
  • Load balancing and remote processing, by forwarding requests and responses over a network.
  • Performing content postprocessing, such as applying XSL stylesheets.

The presence of middleware in general is transparent to both the "server/gateway" and the "application/framework" sides of the interface, and should require no special support. A user who desires to incorporate middleware into an application simply provides the middleware component to the server, as if it were an application, and configures the middleware component to invoke the application, as if the middleware component were a server. Of course, the "application" that the middleware wraps may in fact be another middleware component wrapping another application, and so on, creating what is referred to as a "middleware stack".

A middleware must support asynchronous execution if possible or fall back to disabling itself.

Here is a middleware component that changes the HTTP_HOST key if an X-Host header exists and adds a comment to all HTML responses:

import time

def apply_filter(app, environ, filter_func):
    """Helper function that passes the return value from an
    application to a filter function when the results are
    ready.
    """
    app_response = app(environ)

    # synchronous response, filter now
    if not hasattr(app_response, '__call__'):
        return filter_func(*app_response)

    # asynchronous response.  filter when results are ready
    def polling_function():
        rv = app_response()
        if rv is not None:
            return filter_func(*rv)
    return polling_function

def proxy_and_timing_support(app):
    def new_application(environ):
        def filter_func(body, status, headers):
            now = time.time()
            for key, value in headers:
                if key.lower() == b'content-type' and \
                   value.split(b';')[0] == b'text/html':
                    # assumes ascii compatible encoding in body,
                    # but the middleware should actually parse the
                    # content type header and figure out the
                    # encoding when doing that.
                    body += ('<!-- Execution time: %.2fsec -->' %
                             (now - then)).encode('ascii')
                    break
            return body, status, headers
        then = time.time()
        host = environ.get('HTTP_X_HOST')
        if host is not None:
            environ['HTTP_HOST'] = host

        # use the apply_filter function that applies a given filter
        # function for both async and sync responses.
        return apply_filter(app, environ, filter_func)
    return new_application

app = proxy_and_timing_support(app)  # wrap an existing Web3 application callable

Specification Details

The application callable must accept one positional argument. For the sake of illustration, we have named it environ, but it is not required to have this name. A server or gateway must invoke the application object using a positional (not keyword) argument. (E.g. by calling body, status, headers = application(environ) as shown above.)

The environ parameter is a dictionary object, containing CGI-style environment variables. This object must be a builtin Python dictionary (not a subclass, UserDict or other dictionary emulation), and the application is allowed to modify the dictionary in any way it desires. The dictionary must also include certain Web3-required variables (described in a later section), and may also include server-specific extension variables, named according to a convention that will be described below.

When called by the server, the application object must return a tuple containing three elements: status, headers and body, or, if supported by an async server, an argumentless callable which returns either None or a tuple of those three elements.

The status element is a bytes instance of the form b'999 Message here'.

headers is a Python list of (header_name, header_value) pairs describing the HTTP response header. The headers structure must be a literal Python list; it must yield two-tuples. Both header_name and header_value must be bytes values.

The body is an iterable yielding zero or more bytes instances. This can be accomplished in a variety of ways, such as by returning a list containing bytes instances as body, or by returning a generator function as body that yields bytes instances, or by the body being an instance of a class which is iterable. Regardless of how it is accomplished, the application object must always return a body iterable yielding zero or more bytes instances.
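
For example, the body may be produced lazily by a generator (gen_app is illustrative):

```python
def gen_app(environ):
    status = b'200 OK'
    headers = [(b'Content-type', b'text/plain')]

    def body():
        # Each yielded chunk must be a bytes instance; the server
        # transmits chunks as they are produced.
        yield b'Hello '
        yield b'world!\n'
    return body(), status, headers

body, status, headers = gen_app({})
print(b''.join(body))   # b'Hello world!\n'
```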

The server or gateway must transmit the yielded bytes to the client in an unbuffered fashion, completing the transmission of each set of bytes before requesting another one. (In other words, applications should perform their own buffering. See the Buffering and Streaming section below for more on how application output must be handled.)

The server or gateway should treat the yielded bytes as binary byte sequences: in particular, it should ensure that line endings are not altered. The application is responsible for ensuring that the string(s) to be written are in a format suitable for the client. (The server or gateway may apply HTTP transfer encodings, or perform other transformations for the purpose of implementing HTTP features such as byte-range transmission. See Other HTTP Features, below, for more details.)

If the body iterable returned by the application has a close() method, the server or gateway must call that method upon completion of the current request, whether the request was completed normally, or terminated early due to an error. This is to support resource release by the application and is intended to complement PEP 325's generator support, and other common iterables with close() methods.

Finally, servers and gateways must not directly use any other attributes of the body iterable returned by the application.

environ Variables

The environ dictionary is required to contain various CGI environment variables, as defined by the Common Gateway Interface specification [2].

The following CGI variables must be present. Each key is a native string. Each value is a bytes instance.

Note

In Python 3.1+, a "native string" is a str instance decoded using the surrogateescape error handler, as done by os.environ.__getitem__. In Python 2.6 and 2.7, a "native string" is a str instance representing a sequence of bytes.

REQUEST_METHOD
The HTTP request method, such as "GET" or "POST".
SCRIPT_NAME
The initial portion of the request URL's "path" that corresponds to the application object, so that the application knows its virtual "location". This may be the empty bytes instance if the application corresponds to the "root" of the server. SCRIPT_NAME will be a bytes instance representing a sequence of URL-encoded segments separated by the slash character (/). It is assumed that %2F characters will be decoded into literal slash characters within SCRIPT_NAME, as per CGI.
PATH_INFO
The remainder of the request URL's "path", designating the virtual "location" of the request's target within the application. This may be the empty bytes instance if the request URL targets the application root and does not have a trailing slash. PATH_INFO will be a bytes instance representing a sequence of URL-encoded segments separated by the slash character (/). It is assumed that %2F characters will be decoded into literal slash characters within PATH_INFO, as per CGI.
QUERY_STRING
The portion of the request URL (in bytes) that follows the "?", if any, or the empty bytes instance.
SERVER_NAME, SERVER_PORT
When combined with SCRIPT_NAME and PATH_INFO (or their raw equivalents), these variables can be used to complete the URL. Note, however, that HTTP_HOST, if present, should be used in preference to SERVER_NAME for reconstructing the request URL. See the URL Reconstruction section below for more detail. SERVER_PORT should be a bytes instance, not an integer.
SERVER_PROTOCOL
The version of the protocol the client used to send the request. Typically this will be something like "HTTP/1.0" or "HTTP/1.1" and may be used by the application to determine how to treat any HTTP request headers. (This variable should probably be called REQUEST_PROTOCOL, since it denotes the protocol used in the request, and is not necessarily the protocol that will be used in the server's response. However, for compatibility with CGI we have to keep the existing name.)

The following CGI values may be present in the Web3 environment. Each key is a native string. Each value is a bytes instance.

CONTENT_TYPE
The contents of any Content-Type fields in the HTTP request.
CONTENT_LENGTH
The contents of any Content-Length fields in the HTTP request.
HTTP_ Variables
Variables corresponding to the client-supplied HTTP request headers (i.e., variables whose names begin with "HTTP_"). The presence or absence of these variables should correspond with the presence or absence of the appropriate HTTP header in the request.

A server or gateway should attempt to provide as many other CGI variables as are applicable, each with a string for its key and a bytes instance for its value. In addition, if SSL is in use, the server or gateway should also provide as many of the Apache SSL environment variables [5] as are applicable, such as HTTPS=on and SSL_PROTOCOL. Note, however, that an application that uses any CGI variables other than the ones listed above is necessarily non-portable to web servers that do not support the relevant extensions. (For example, web servers that do not publish files will not be able to provide a meaningful DOCUMENT_ROOT or PATH_TRANSLATED.)

A Web3-compliant server or gateway should document what variables it provides, along with their definitions as appropriate. Applications should check for the presence of any variables they require, and have a fallback plan in the event such a variable is absent.

Note that CGI variable values must be bytes instances, if they are present at all. It is a violation of this specification for a CGI variable's value to be of any type other than bytes. On Python 2, this means they will be of type str. On Python 3, this means they will be of type bytes.

The keys of all CGI and non-CGI variables in the environ, however, must be "native strings" (on both Python 2 and Python 3, they will be of type str).

In addition to the CGI-defined variables, the environ dictionary may also contain arbitrary operating-system "environment variables", and must contain the following Web3-defined variables.

Variable Value
web3.version The tuple (1, 0), representing Web3 version 1.0.
web3.url_scheme A bytes value representing the "scheme" portion of the URL at which the application is being invoked. Normally, this will have the value b"http" or b"https", as appropriate.
web3.input An input stream (file-like object) from which bytes constituting the HTTP request body can be read. (The server or gateway may perform reads on-demand as requested by the application, or it may pre- read the client's request body and buffer it in-memory or on disk, or use any other technique for providing such an input stream, according to its preference.)
web3.errors

An output stream (file-like object) to which error output text can be written, for the purpose of recording program or other errors in a standardized and possibly centralized location. This should be a "text mode" stream; i.e., applications should use "\n" as a line ending, and assume that it will be converted to the correct line ending by the server/gateway. Applications may not send bytes to the 'write' method of this stream; they may only send text.

For many servers, web3.errors will be the server's main error log. Alternatively, this may be sys.stderr, or a log file of some sort. The server's documentation should include an explanation of how to configure this or where to find the recorded output. A server or gateway may supply different error streams to different applications, if this is desired.

web3.multithread This value should evaluate true if the application object may be simultaneously invoked by another thread in the same process, and should evaluate false otherwise.
web3.multiprocess This value should evaluate true if an equivalent application object may be simultaneously invoked by another process, and should evaluate false otherwise.
web3.run_once This value should evaluate true if the server or gateway expects (but does not guarantee!) that the application will only be invoked this one time during the life of its containing process. Normally, this will only be true for a gateway based on CGI (or something similar).
web3.script_name The non-URL-decoded SCRIPT_NAME value. Through a historical inequity, by virtue of the CGI specification, SCRIPT_NAME is present within the environment as an already URL-decoded string. This is the original URL-encoded value derived from the request URI. If the server cannot provide this value, it must omit it from the environ.
web3.path_info The non-URL-decoded PATH_INFO value. Through a historical inequity, by virtue of the CGI specification, PATH_INFO is present within the environment as an already URL-decoded string. This is the original URL-encoded value derived from the request URI. If the server cannot provide this value, it must omit it from the environ.
web3.async This is True if the webserver supports async invocation. In that case an application is allowed to return a callable instead of a tuple with the response. The exact semantics are not specified by this specification.

Finally, the environ dictionary may also contain server-defined variables. These variables should have names which are native strings, composed of only lower-case letters, numbers, dots, and underscores, and should be prefixed with a name that is unique to the defining server or gateway. For example, mod_web3 might define variables with names like mod_web3.some_variable.
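The environ rules above can be illustrated with a small sketch. This is not a normative example: the exact set of keys is server-dependent, and the mod_web3.some_variable extension key is hypothetical. It only shows the type discipline: native-string keys everywhere, bytes values for CGI variables, and the Web3-defined web3.* entries.

```python
import io
import sys

# A plausible Web3 environ under Python 3.  Every key is a native str;
# every CGI/server value is a bytes instance; web3.* keys carry the
# specification-defined values.
environ = {
    # CGI-style variables: native-string keys, bytes values
    'REQUEST_METHOD': b'GET',
    'SCRIPT_NAME': b'',
    'PATH_INFO': b'/hello',
    'QUERY_STRING': b'name=world',
    'SERVER_NAME': b'localhost',
    'SERVER_PORT': b'80',
    'SERVER_PROTOCOL': b'HTTP/1.1',
    'CONTENT_LENGTH': b'0',
    'HTTP_HOST': b'localhost',
    # Web3-defined variables
    'web3.version': (1, 0),
    'web3.url_scheme': b'http',
    'web3.input': io.BytesIO(b''),
    'web3.errors': sys.stderr,
    'web3.multithread': False,
    'web3.multiprocess': False,
    'web3.run_once': False,
    # A hypothetical server-defined extension variable
    'mod_web3.some_variable': b'extension-data',
}

# Every key is a native str; every CGI value is a bytes instance.
assert all(isinstance(key, str) for key in environ)
assert isinstance(environ['PATH_INFO'], bytes)
```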

Input Stream

The input stream (web3.input) provided by the server must support the following methods:

Method Notes
read(size) 1,4
readline([size]) 1,2,4
readlines([size]) 1,3,4
__iter__() 4

The semantics of each method are as documented in the Python Library Reference, except for these notes as listed in the table above:

  1. The server is not required to read past the client's specified Content-Length, and is allowed to simulate an end-of-file condition if the application attempts to read past that point. The application should not attempt to read more data than is specified by the CONTENT_LENGTH variable.
  2. The implementation must support the optional size argument to readline().
  3. The application is free to not supply a size argument to readlines(), and the server or gateway is free to ignore the value of any supplied size argument.
  4. The read, readline and __iter__ methods must return a bytes instance. The readlines method must return a sequence which contains instances of bytes.

The methods listed in the table above must be supported by all servers conforming to this specification. Applications conforming to this specification must not use any other methods or attributes of the input object. In particular, applications must not attempt to close this stream, even if it possesses a close() method.

The input stream should silently ignore attempts to read more than the content length of the request. If no content length is specified, the stream must behave as an empty stream, returning b'' from all reads.
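A server's web3.input might be sketched as a wrapper that enforces the Content-Length limit described above. This is an illustrative implementation only (the class name and construction are assumptions, not part of the specification); it shows the four required methods, the simulated end-of-file past Content-Length, and bytes-only return values.

```python
import io

class Web3Input:
    """Illustrative web3.input: a content-length-limited bytes stream."""

    def __init__(self, raw, content_length):
        self._raw = raw                      # underlying binary stream
        self._remaining = content_length     # bytes the app may still read

    def read(self, size=-1):
        if self._remaining <= 0:
            return b''                       # simulate EOF past Content-Length
        if size < 0 or size > self._remaining:
            size = self._remaining
        data = self._raw.read(size)
        self._remaining -= len(data)
        return data

    def readline(self, size=-1):
        if self._remaining <= 0:
            return b''
        if size < 0 or size > self._remaining:
            size = self._remaining           # the optional size arg is honored
        line = self._raw.readline(size)
        self._remaining -= len(line)
        return line

    def readlines(self, size=None):
        # The size argument may be ignored, per the specification.
        lines = []
        while True:
            line = self.readline()
            if not line:
                break
            lines.append(line)
        return lines

    def __iter__(self):
        return iter(self.readlines())

# Trailing bytes beyond Content-Length are never exposed to the app.
stream = Web3Input(io.BytesIO(b'first\nsecond\nTRAILING-GARBAGE'), 13)
assert stream.readline() == b'first\n'
assert stream.read() == b'second\n'
assert stream.read() == b''
```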

Error Stream

The error stream (web3.errors) provided by the server must support the following methods:

Method Stream Notes
flush() errors 1
write(str) errors 2
writelines(seq) errors 2

The semantics of each method are as documented in the Python Library Reference, except for these notes as listed in the table above:

  1. Since the errors stream may not be rewound, servers and gateways are free to forward write operations immediately, without buffering. In this case, the flush() method may be a no-op. Portable applications, however, cannot assume that output is unbuffered or that flush() is a no-op. They must call flush() if they need to ensure that output has in fact been written. (For example, to minimize intermingling of data from multiple processes writing to the same error log.)
  2. The write() method must accept a string argument, but needn't necessarily accept a bytes argument. The writelines() method must accept a sequence argument that consists entirely of strings, but needn't necessarily accept any bytes instance as a member of the sequence.

The methods listed in the table above must be supported by all servers conforming to this specification. Applications conforming to this specification must not use any other methods or attributes of the errors object. In particular, applications must not attempt to close this stream, even if it possesses a close() method.
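The error-stream contract above can be sketched from the application's side. The log_error helper below is hypothetical, and a StringIO stands in for the server-supplied stream; the point is simply that applications write text (never bytes), use "\n" line endings, and call flush() when ordering matters.

```python
import io

# A StringIO stands in for the server-supplied web3.errors stream.
errors = io.StringIO()

def log_error(environ, message):
    """Hypothetical helper: record an error via web3.errors."""
    stream = environ['web3.errors']
    stream.write('app error: %s\n' % message)            # text, never bytes
    stream.writelines(['detail one\n', 'detail two\n'])  # sequence of str
    stream.flush()  # may be a no-op, but portable code calls it anyway

log_error({'web3.errors': errors}, 'something failed')
assert errors.getvalue().startswith('app error: something failed\n')
```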

Values Returned by A Web3 Application

Web3 applications return a tuple in the form (status, headers, body). If the server supports asynchronous applications (web3.async), the response may be a callable object (which accepts no arguments).

The status value is assumed by a gateway or server to be an HTTP "status" bytes instance like b'200 OK' or b'404 Not Found'. That is, it is a bytes value consisting of a Status-Code and a Reason-Phrase, in that order and separated by a single space, with no surrounding whitespace or other characters. (See RFC 2616, Section 6.1.1 for more information.) The value must not contain control characters, and must not be terminated with a carriage return, linefeed, or combination thereof.

The headers value is assumed by a gateway or server to be a literal Python list of (header_name, header_value) tuples. Each header_name must be a bytes instance representing a valid HTTP header field-name (as defined by RFC 2616, Section 4.2), without a trailing colon or other punctuation. Each header_value must be a bytes instance and must not include any control characters, including carriage returns or linefeeds, either embedded or at the end. (These requirements are to minimize the complexity of any parsing that must be performed by servers, gateways, and intermediate response processors that need to inspect or modify response headers.)

In general, the server or gateway is responsible for ensuring that correct headers are sent to the client: if the application omits a header required by HTTP (or other relevant specifications that are in effect), the server or gateway must add it. For example, the HTTP Date: and Server: headers would normally be supplied by the server or gateway. The gateway must however not override values with the same name if they are emitted by the application.

(A reminder for server/gateway authors: HTTP header names are case-insensitive, so be sure to take that into consideration when examining application-supplied headers!)

Applications and middleware are forbidden from using HTTP/1.1 "hop-by-hop" features or headers, any equivalent features in HTTP/1.0, or any headers that would affect the persistence of the client's connection to the web server. These features are the exclusive province of the actual web server, and a server or gateway should consider it a fatal error for an application to attempt sending them, and raise an error if they are supplied as return values from an application in the headers structure. (For more specifics on "hop-by-hop" features and headers, please see the Other HTTP Features section below.)
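Putting the rules of this section together, a minimal Web3 application might look like the sketch below. The application name and header choices are illustrative, not mandated: what matters is the (status, headers, body) shape, with every element bytes and the body an iterable of bytes.

```python
def application(environ):
    """Minimal illustrative Web3 application."""
    path = environ.get('PATH_INFO', b'/')
    body = b'Hello from ' + path
    status = b'200 OK'
    headers = [
        (b'Content-Type', b'text/plain; charset=utf-8'),
        # Header values are bytes; lengths must be encoded explicitly.
        (b'Content-Length', str(len(body)).encode('ascii')),
    ]
    # The body is an iterable of bytes -- here, a one-element list.
    return status, headers, [body]

status, headers, body = application({'PATH_INFO': b'/demo'})
assert status == b'200 OK'
assert b''.join(body) == b'Hello from /demo'
```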

Dealing with Compatibility Across Python Versions

Creating Web3 code that runs under both Python 2.6/2.7 and Python 3.1+ requires some care on the part of the developer. In general, the Web3 specification assumes a certain level of equivalence between the Python 2 str type and the Python 3 bytes type. For example, under Python 2, the values present in the Web3 environ will be instances of the str type; in Python 3, these will be instances of the bytes type. The Python 3 bytes type does not possess all the methods of the Python 2 str type, and some methods which it does possess behave differently than the Python 2 str type. Effectively, to ensure that Web3 middleware and applications work across Python versions, developers must do these things:

  1. Do not assume comparison equivalence between text values and bytes values. If you do so, your code may work under Python 2, but it will not work properly under Python 3. For example, don't write somebytes == 'abc'. This will sometimes be true on Python 2 but it will never be true on Python 3, because a sequence of bytes never compares equal to a string under Python 3. Instead, always compare a bytes value with a bytes value, e.g. "somebytes == b'abc'". Code which does this is compatible with and works the same in Python 2.6, 2.7, and 3.1. The b in front of 'abc' signals to Python 3 that the value is a literal bytes instance; under Python 2 it's a forward compatibility placebo.
  2. Don't use the __contains__ method (directly or indirectly) of items that are meant to be byteslike without ensuring that the tested value is also a bytes instance. If you do so, your code may work under Python 2, but it will not work properly under Python 3. For example, 'abc' in somebytes will raise a TypeError under Python 3, but it will return True under Python 2.6 and 2.7. However, b'abc' in somebytes will work the same on both versions.
  3. Don't use __getitem__ (indexing) on byteslike items. Indexing a bytes instance returns an integer under Python 3 but a one-character string under Python 2, so code that relies on indexing will behave differently across versions.
  4. Don't try to use the format method or the __mod__ method of instances of bytes (directly or indirectly). In Python 2, the str type (which we treat as equivalent to Python 3's bytes) supports these methods, but Python 3's bytes instances don't. If you use these methods, your code will work under Python 2, but not under Python 3. In Python 3.2, this restriction may be partially removed, as it's rumored that bytes types may obtain a __mod__ implementation.
  5. Do not try to concatenate a bytes value with a string value. This may work under Python 2, but it will not work under Python 3. For example, doing 'abc' + somebytes will work under Python 2, but it will result in a TypeError under Python 3. Instead, always make sure you're concatenating two items of the same type, e.g. b'abc' + somebytes.

Web3 expects byte values in other places, such as in all the values returned by an application.

In short, to ensure compatibility of Web3 application code between Python 2 and Python 3, in Python 2, treat CGI and server variable values in the environment as if they had the Python 3 bytes API even though they actually have a more capable API. Likewise for all stringlike values returned by a Web3 application.
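The rules above can be exercised directly. The sketch below (Python 3 syntax; the same idioms run unchanged on Python 2.6 and 2.7) contrasts each portable idiom with its non-portable counterpart, shown as a comment.

```python
somebytes = b'abc'

# Rule 1: compare bytes with bytes, never bytes with text.
assert somebytes == b'abc'        # portable
# somebytes == 'abc'              # True on Python 2, always False on Python 3

# Rule 2: membership tests need a bytes operand.
assert b'ab' in somebytes         # portable
# 'ab' in somebytes               # TypeError on Python 3

# Rule 4: avoid bytes.__mod__ / bytes.format; build values another way.
greeting = b'Hello, ' + somebytes  # concatenation is portable
# b'Hello, %s' % somebytes         # no __mod__ on Python 3.1 bytes

# Rule 5: concatenate like with like.
assert greeting == b'Hello, abc'
# 'Hello, ' + somebytes            # TypeError on Python 3
```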

Buffering and Streaming

Generally speaking, applications will achieve the best throughput by buffering their (modestly-sized) output and sending it all at once. This is a common approach in existing frameworks: the output is buffered in a StringIO or similar object, then transmitted all at once, along with the response headers.

The corresponding approach in Web3 is for the application to simply return a single-element body iterable (such as a list) containing the response body as a single bytes instance. This is the recommended approach for the vast majority of application functions that render HTML pages whose text easily fits in memory.

For large files, however, or for specialized uses of HTTP streaming (such as multipart "server push"), an application may need to provide output in smaller blocks (e.g. to avoid loading a large file into memory). It's also sometimes the case that part of a response may be time-consuming to produce, but it would be useful to send ahead the portion of the response that precedes it.

In these cases, applications will usually return a body iterator (often a generator-iterator) that produces the output in a block-by-block fashion. These blocks may be broken to coincide with multipart boundaries (for "server push"), or just before time-consuming tasks (such as reading another block of an on-disk file).
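Such a block-by-block body might be written as a simple generator. The function name and chunk size below are illustrative; an application would return something like file_body(open(path, 'rb')) as the body element of its (status, headers, body) tuple so the server can begin transmission before the whole payload exists in memory.

```python
import io

def file_body(fileobj, block_size=4096):
    """Illustrative body iterator: yield a binary file in fixed-size blocks."""
    while True:
        block = fileobj.read(block_size)
        if not block:
            break
        yield block

# A BytesIO stands in for an on-disk file.
payload = io.BytesIO(b'x' * 10000)
chunks = list(file_body(payload, block_size=4096))
assert [len(c) for c in chunks] == [4096, 4096, 1808]
assert b''.join(chunks) == b'x' * 10000
```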

Web3 servers, gateways, and middleware must not delay the transmission of any block; they must either fully transmit the block to the client, or guarantee that they will continue transmission even while the application is producing its next block. A server/gateway or middleware may provide this guarantee in one of three ways:

  1. Send the entire block to the operating system (and request that any O/S buffers be flushed) before returning control to the application, OR
  2. Use a different thread to ensure that the block continues to be transmitted while the application produces the next block.
  3. (Middleware only) send the entire block to its parent gateway/server.

By providing this guarantee, Web3 allows applications to ensure that transmission will not become stalled at an arbitrary point in their output data. This is critical for proper functioning of e.g. multipart "server push" streaming, where data between multipart boundaries should be transmitted in full to the client.

Unicode Issues

HTTP does not directly support Unicode, and neither does this interface. All encoding/decoding must be handled by the application; all values passed to or from the server must be of the Python 3 type bytes or instances of the Python 2 type str, not Python 2 unicode or Python 3 str objects.

All "bytes instances" referred to in this specification must:

  • On Python 2, be of type str.
  • On Python 3, be of type bytes.

All "bytes instances" must not :

  • On Python 2, be of type unicode.
  • On Python 3, be of type str.

The result of using a textlike object where a byteslike object is required is undefined.

Values returned from a Web3 app as a status or as response headers must follow RFC 2616 with respect to encoding. That is, the bytes returned must contain a character stream of ISO-8859-1 characters, or the character stream should use RFC 2047 MIME encoding.

On Python platforms which do not have a native bytes-like type (e.g. IronPython, etc.), but instead which generally use textlike strings to represent bytes data, the definition of "bytes instance" can be changed: their "bytes instances" must be native strings that contain only code points representable in ISO-8859-1 encoding (\u0000 through \u00FF, inclusive). It is a fatal error for an application on such a platform to supply strings containing any other Unicode character or code point. Similarly, servers and gateways on those platforms must not supply strings to an application containing any other Unicode characters.

HTTP 1.1 Expect/Continue

Servers and gateways that implement HTTP 1.1 must provide transparent support for HTTP 1.1's "expect/continue" mechanism. This may be done in any of several ways:

  1. Respond to requests containing an Expect: 100-continue header with an immediate "100 Continue" response, and proceed normally.
  2. Proceed with the request normally, but provide the application with a web3.input stream that will send the "100 Continue" response if/when the application first attempts to read from the input stream. The read request must then remain blocked until the client responds.
  3. Wait until the client decides that the server does not support expect/continue, and sends the request body on its own. (This is suboptimal, and is not recommended.)

Note that these behavior restrictions do not apply for HTTP 1.0 requests, or for requests that are not directed to an application object. For more information on HTTP 1.1 Expect/Continue, see RFC 2616, sections 8.2.3 and 10.1.1.

Other HTTP Features

In general, servers and gateways should "play dumb" and allow the application complete control over its output. They should only make changes that do not alter the effective semantics of the application's response. It is always possible for the application developer to add middleware components to supply additional features, so server/gateway developers should be conservative in their implementation. In a sense, a server should consider itself to be like an HTTP "gateway server", with the application being an HTTP "origin server". (See RFC 2616, section 1.3, for the definition of these terms.)

However, because Web3 servers and applications do not communicate via HTTP, what RFC 2616 calls "hop-by-hop" headers do not apply to Web3 internal communications. Web3 applications must not generate any "hop-by-hop" headers [4], attempt to use HTTP features that would require them to generate such headers, or rely on the content of any incoming "hop-by-hop" headers in the environ dictionary. Web3 servers must handle any supported inbound "hop-by-hop" headers on their own, such as by decoding any inbound Transfer-Encoding, including chunked encoding if applicable.

Applying these principles to a variety of HTTP features, it should be clear that a server may handle cache validation via the If-None-Match and If-Modified-Since request headers and the Last-Modified and ETag response headers. However, it is not required to do this, and the application should perform its own cache validation if it wants to support that feature, since the server/gateway is not required to do such validation.
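Since the server is not required to perform cache validation, an application that wants conditional responses can do so itself. The sketch below is an assumed, minimal application-level version: it compares the request's If-None-Match value (exposed as the HTTP_IF_NONE_MATCH environ key, a bytes value) against a fixed illustrative ETag and short-circuits with a 304.

```python
def cached_app(environ):
    """Illustrative application-level cache validation via ETag."""
    etag = b'"v1"'  # in real code, derived from the resource's state
    if environ.get('HTTP_IF_NONE_MATCH') == etag:
        # Client's cached copy is current: no body needed.
        return b'304 Not Modified', [(b'ETag', etag)], [b'']
    body = b'fresh content'
    headers = [(b'Content-Type', b'text/plain'), (b'ETag', etag)]
    return b'200 OK', headers, [body]

status, _, _ = cached_app({'HTTP_IF_NONE_MATCH': b'"v1"'})
assert status == b'304 Not Modified'
status, _, _ = cached_app({})
assert status == b'200 OK'
```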

Similarly, a server may re-encode or transport-encode an application's response, but the application should use a suitable content encoding on its own, and must not apply a transport encoding. A server may transmit byte ranges of the application's response if requested by the client and the application doesn't natively support byte ranges. Again, however, the application should perform this function on its own if desired.

Note that these restrictions on applications do not necessarily mean that every application must reimplement every HTTP feature; many HTTP features can be partially or fully implemented by middleware components, thus freeing both server and application authors from implementing the same features over and over again.

Thread Support

Thread support, or lack thereof, is also server-dependent. Servers that can run multiple requests in parallel should also provide the option of running an application in a single-threaded fashion, so that applications or frameworks that are not thread-safe may still be used with that server.

Implementation/Application Notes

Server Extension APIs

Some server authors may wish to expose more advanced APIs that application or framework authors can use for specialized purposes. For example, a gateway based on mod_python might wish to expose part of the Apache API as a Web3 extension.

In the simplest case, this requires nothing more than defining an environ variable, such as mod_python.some_api. But, in many cases, the possible presence of middleware can make this difficult. For example, an API that offers access to the same HTTP headers that are found in environ variables, might return different data if environ has been modified by middleware.

In general, any extension API that duplicates, supplants, or bypasses some portion of Web3 functionality runs the risk of being incompatible with middleware components. Server/gateway developers should not assume that nobody will use middleware, because some framework developers specifically organize their frameworks to function almost entirely as middleware of various kinds.

So, to provide maximum compatibility, servers and gateways that provide extension APIs that replace some Web3 functionality, must design those APIs so that they are invoked using the portion of the API that they replace. For example, an extension API to access HTTP request headers must require the application to pass in its current environ, so that the server/gateway may verify that HTTP headers accessible via the API have not been altered by middleware. If the extension API cannot guarantee that it will always agree with environ about the contents of HTTP headers, it must refuse service to the application, e.g. by raising an error, returning None instead of a header collection, or whatever is appropriate to the API.

These guidelines also apply to middleware that adds information such as parsed cookies, form variables, sessions, and the like to environ. Specifically, such middleware should provide these features as functions which operate on environ, rather than simply stuffing values into environ. This helps ensure that information is calculated from environ after any middleware has done any URL rewrites or other environ modifications.
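The "functions which operate on environ" guideline can be sketched as follows. The parse_cookies helper is hypothetical; the point is that it reads HTTP_COOKIE from the environ as it exists at call time, so any rewrites performed by earlier middleware are reflected, instead of the middleware stuffing a pre-parsed value into environ.

```python
def parse_cookies(environ):
    """Hypothetical middleware-provided helper: parse HTTP_COOKIE from
    the environ *as given now*, after any middleware modifications."""
    header = environ.get('HTTP_COOKIE', b'')
    cookies = {}
    for part in header.split(b';'):
        if b'=' in part:
            name, _, value = part.strip().partition(b'=')
            cookies[name] = value
    return cookies

environ = {'HTTP_COOKIE': b'session=abc123; theme=dark'}
assert parse_cookies(environ) == {b'session': b'abc123', b'theme': b'dark'}
```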

It is very important that these "safe extension" rules be followed by both server/gateway and middleware developers, in order to avoid a future in which middleware developers are forced to delete any and all extension APIs from environ to ensure that their mediation isn't being bypassed by applications using those extensions!

Application Configuration

This specification does not define how a server selects or obtains an application to invoke. These and other configuration options are highly server-specific matters. It is expected that server/gateway authors will document how to configure the server to execute a particular application object, and with what options (such as threading options).

Framework authors, on the other hand, should document how to create an application object that wraps their framework's functionality. The user, who has chosen both the server and the application framework, must connect the two together. However, since both the framework and the server have a common interface, this should be merely a mechanical matter, rather than a significant engineering effort for each new server/framework pair.

Finally, some applications, frameworks, and middleware may wish to use the environ dictionary to receive simple string configuration options. Servers and gateways should support this by allowing an application's deployer to specify name-value pairs to be placed in environ. In the simplest case, this support can consist merely of copying all operating system-supplied environment variables from os.environ into the environ dictionary, since the deployer in principle can configure these externally to the server, or in the CGI case they may be able to be set via the server's configuration files.

Applications should try to keep such required variables to a minimum, since not all servers will support easy configuration of them. Of course, even in the worst case, persons deploying an application can create a script to supply the necessary configuration values:

from the_app import application

def new_app(environ):
    environ['the_app.configval1'] = b'something'
    return application(environ)

But, most existing applications and frameworks will probably only need a single configuration value from environ, to indicate the location of their application or framework-specific configuration file(s). (Of course, applications should cache such configuration, to avoid having to re-read it upon each invocation.)

URL Reconstruction

If an application wishes to reconstruct a request's complete URL (as a bytes object), it may do so using the following algorithm. (Here url_quote is assumed to be a helper that URL-quotes a bytes path and returns bytes; on Python 3, for example, something along the lines of urllib.parse.quote followed by an ASCII encode.)

host = environ.get('HTTP_HOST')

scheme = environ['web3.url_scheme']
port = environ['SERVER_PORT']
query = environ['QUERY_STRING']

url = scheme + b'://'

if host:
    url += host
else:
    url += environ['SERVER_NAME']

    if scheme == b'https':
        if port != b'443':
           url += b':' + port
    else:
        if port != b'80':
           url += b':' + port

if 'web3.script_name' in environ:
    url += url_quote(environ['web3.script_name'])
else:
    url += environ['SCRIPT_NAME']
if 'web3.path_info' in environ:
    url += url_quote(environ['web3.path_info'])
else:
    url += environ['PATH_INFO']
if query:
    url += b'?' + query

Note that such a reconstructed URL may not be precisely the same URI as requested by the client. Server rewrite rules, for example, may have modified the client's originally requested URL to place it in a canonical form.

Open Questions

  • file_wrapper replacement. Currently nothing is specified here, but it is clear that the old system of in-band signalling is broken: it provides no way for middleware in the process to determine whether the response is a file wrapper.

Points of Contention

Outlined below are potential points of contention regarding this specification.

WSGI 1.0 Compatibility

Components written using the WSGI 1.0 specification will not transparently interoperate with components written using this specification. That's because the goals of this proposal and the goals of WSGI 1.0 are not directly aligned.

WSGI 1.0 is obliged to provide specification-level backwards compatibility with versions of Python between 2.2 and 2.7. This specification, however, drops compatibility with Python 2.5 and earlier in order to provide compatibility between relatively recent versions of Python 2 (2.6 and 2.7) as well as relatively recent versions of Python 3 (3.1).

It is currently impossible to write components which work reliably under both Python 2 and Python 3 using the WSGI 1.0 specification, because the specification implicitly posits that CGI and server variable values in the environ and values returned via start_response represent a sequence of bytes that can be addressed using the Python 2 string API. It posits such a thing because that sort of data type was the sensible way to represent bytes in all Python 2 versions, and WSGI 1.0 was conceived before Python 3 existed.

Python 3's str type supports the full API provided by the Python 2 str type, but Python 3's str type does not represent a sequence of bytes, it instead represents text. Therefore, using it to represent environ values also requires that the environ byte sequence be decoded to text via some encoding. We cannot decode these bytes to text (at least in any way where the decoding has any meaning other than as a tunnelling mechanism) without widening the scope of WSGI to include server and gateway knowledge of decoding policies and mechanics. WSGI 1.0 never concerned itself with encoding and decoding. It made statements about allowable transport values, and suggested that various values might be best decoded as one encoding or another, but it never required a server to perform any decoding before placing values into the environ.

Python 3 does not have a stringlike type that can be used instead to represent bytes: it has a bytes type. A bytes type operates quite a bit like a Python 2 str in Python 3.1+, but it lacks behavior equivalent to str.__mod__ and its iteration protocol, and containment, sequence treatment, and equivalence comparisons are different.

In either case, there is no type in Python 3 that behaves just like the Python 2 str type, and a way to create such a type doesn't exist because there is no such thing as a "String ABC" which would allow a suitable type to be built. Due to this design incompatibility, existing WSGI 1.0 servers, middleware, and applications will not work under Python 3, even after they are run through 2to3.

Existing Web-SIG discussions about updating the WSGI specification so that it is possible to write a WSGI application that runs in both Python 2 and Python 3 tend to revolve around creating a specification-level equivalence between the Python 2 str type (which represents a sequence of bytes) and the Python 3 str type (which represents text). Such an equivalence becomes strained in various areas, given the different roles of these types. An arguably more straightforward equivalence exists between the Python 3 bytes type API and a subset of the Python 2 str type API. This specification exploits this subset equivalence.

In the meantime, aside from any Python 2 vs. Python 3 compatibility issue, as various discussions on Web-SIG have pointed out, the WSGI 1.0 specification is too general, providing support (via .write) for asynchronous applications at the expense of implementation complexity. This specification uses the fundamental incompatibility between WSGI 1.0 and Python 3 as a natural divergence point to create a specification with reduced complexity, by removing WSGI 1.0's specialized support for asynchronous applications.

To provide backwards compatibility for older WSGI 1.0 applications, so that they may run on a Web3 stack, it is presumed that Web3 middleware will be created which can be used "in front" of existing WSGI 1.0 applications, allowing those existing WSGI 1.0 applications to run under a Web3 stack. This middleware will require, under Python 3, that an equivalence be drawn between Python 3 str values and the bytes values of the HTTP request, with all the attendant encoding-guessing (or configuration) that implies.

Note

Such middleware might in the future, instead of drawing an equivalence between Python 3 str and HTTP byte values, make use of a yet-to-be-created "ebytes" type (aka "bytes-with-benefits"), particularly if a String ABC proposal is accepted into the Python core and implemented.

Conversely, it is presumed that WSGI 1.0 middleware will be created which will allow a Web3 application to run behind a WSGI 1.0 stack on the Python 2 platform.

Environ and Response Values as Bytes

Casual middleware and application writers may consider the use of bytes as environment values and response values inconvenient. In particular, they won't be able to use common string formatting functions such as ('%s' % bytes_val) or bytes_val.format('123') because bytes don't have the same API as strings on platforms such as Python 3 where the two types differ. Likewise, on such platforms, stdlib HTTP-related API support for using bytes interchangeably with text can be spotty. In places where bytes are inconvenient or incompatible with library APIs, middleware and application writers will have to decode such bytes to text explicitly. This is particularly inconvenient for middleware writers: to work with environment values as strings, they'll have to decode them from an implied encoding and if they need to mutate an environ value, they'll then need to encode the value into a byte stream before placing it into the environ. While the use of bytes by the specification as environ values might be inconvenient for casual developers, it provides several benefits.

Using bytes types to represent HTTP and server values to an application most closely matches reality because HTTP is fundamentally a bytes-oriented protocol. If the environ values are mandated to be strings, each server will need to use heuristics to guess about the encoding of various values provided by the HTTP environment. Using all strings might increase casual middleware writer convenience, but will also lead to ambiguity and confusion when a value cannot be decoded to a meaningful non-surrogate string.

Use of bytes as environ values avoids any potential for the need for the specification to mandate that a participating server be informed of encoding configuration parameters. If environ values are treated as strings, and so must be decoded from bytes, configuration parameters may eventually become necessary as policy clues from the application deployer. Such a policy would be used to guess an appropriate decoding strategy in various circumstances, effectively placing the burden for enforcing a particular application encoding policy upon the server. If the server must serve more than one application, such configuration would quickly become complex. Many policies would also be impossible to express declaratively.

In reality, HTTP is a complicated and legacy-fraught protocol which requires a complex set of heuristics to make sense of. It would be nice if we could allow this protocol to protect us from this complexity, but we cannot do so reliably while still providing to application writers a level of control commensurate with reality. Python applications must often deal with data embedded in the environment which not only must be parsed by legacy heuristics, but does not conform even to any existing HTTP specification. While these eventualities are unpleasant, they crop up with regularity, making it impossible and undesirable to hide them from application developers, as application developers are the only people who are able to decide upon an appropriate action when an HTTP specification violation is detected.

Some have argued for mixed use of bytes and string values as environ values. This proposal avoids that strategy. Sole use of bytes as environ values makes it possible to fit this specification entirely in one's head; you won't need to guess about which values are strings and which are bytes.

This protocol would also fit in a developer's head if all environ values were strings, but this specification doesn't use that strategy. This will likely be the point of greatest contention regarding the use of bytes. In defense of bytes: developers often prefer protocols with consistent contracts, even if the contracts themselves are suboptimal. If we hide encoding issues from a developer until a value that contains surrogates causes problems after it has already reached beyond the I/O boundary of their application, they will need to do a lot more work to fix assumptions made by their application than if we were to just present the problem much earlier in terms of "here's some bytes, you decode them". This is also a counter-argument to the "bytes are inconvenient" assumption: while presenting bytes to an application developer may be inconvenient for a casual application developer who doesn't care about edge cases, they are extremely convenient for the application developer who needs to deal with complex, dirty eventualities, because use of bytes allows him the appropriate level of control with a clear separation of responsibility.

If the protocol uses bytes, it is presumed that libraries will be created to make working with bytes-only in the environ and within return values more pleasant; for example, analogues of the WSGI 1.0 libraries named "WebOb" and "Werkzeug". Such libraries will fill the gap between convenience and control, allowing the spec to remain simple and regular while still allowing casual authors a convenient way to create Web3 middleware and application components. This seems to be a reasonable alternative to baking encoding policy into the protocol, because many such libraries can be created independently from the protocol, and application developers can choose the one that provides them the appropriate levels of control and convenience for a particular job.

Here are some alternatives to using all bytes:

  • Have the server decode all CGI and server environ values into strings using the latin-1 encoding, which is lossless: every byte maps to a code point, so raw bytes are effectively smuggled within the resulting string.
  • Decode all CGI and server environ values to strings using the utf-8 encoding with the surrogateescape error handler. This does not work under any existing Python 2.
  • Encode some values into bytes and other values into strings, as decided by their typical usages.
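The first two alternatives can be illustrated concretely. This is a sketch under Python 3 semantics; both round-trips are lossless, which is the property each alternative relies on.

```python
raw = b'/caf\xe9'  # a request path containing a non-UTF-8 byte

# Alternative 1: latin-1 decoding is total and lossless; the 0xE9 byte
# comes through as the code point U+00E9 and encodes back unchanged.
s1 = raw.decode('latin-1')

# Alternative 2: utf-8 with surrogateescape maps the undecodable byte
# to the lone surrogate U+DCE9, which encodes back to the same byte.
# (surrogateescape does not exist on Python 2.)
s2 = raw.decode('utf-8', 'surrogateescape')
```

The surrogate in `s2` is exactly the kind of value that "cannot be decoded to a meaningful non-surrogate string" discussed earlier: it round-trips, but any attempt to encode it as strict UTF-8 for output will fail.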

Applications Should be Allowed to Read web3.input Past CONTENT_LENGTH

At [6], Graham Dumpleton makes the assertion that wsgi.input should be required to return the empty string as a signifier of out-of-data, and that applications should be allowed to read past the number of bytes specified in CONTENT_LENGTH, depending only upon the empty string as an EOF marker. WSGI relies on an application "being well behaved and once all data specified by CONTENT_LENGTH is read, that it processes the data and returns any response. That same socket connection could then be used for a subsequent request." Graham would like WSGI adapters to be required to wrap raw socket connections: "this wrapper object will need to count how much data has been read, and when the amount of data reaches that as defined by CONTENT_LENGTH, any subsequent reads should return an empty string instead." This may be useful to support chunked encoding and input filters.

web3.input Unknown Length

There's no documented way to indicate that there is content in environ['web3.input'], but the content length is unknown.

read() of web3.input Should Support No-Size Calling Convention

At [6], Graham Dumpleton makes the assertion that the read() method of wsgi.input should be callable without arguments, and that the result should be "all available request content". Needs discussion.

Comment Armin: I changed the spec to require that from an implementation. I had too much pain with that in the past already. Open for discussions though.

Input Filters should set environ CONTENT_LENGTH to -1

At [6], Graham Dumpleton suggests that an input filter might set environ['CONTENT_LENGTH'] to -1 to indicate that it mutated the input.

headers as Literal List of Two-Tuples

Why do we make applications return a headers structure that is a literal list of two-tuples? I think the iterability of headers needs to be maintained while it moves up the stack, but I don't think we need to be able to mutate it in place at all times. Could we loosen that requirement?

Comment Armin: Strong yes
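Loosening the requirement would let middleware rebuild the header list rather than mutate it in place. A sketch, assuming the three-tuple (status, headers, body) return shape of this specification and using hypothetical names:

```python
def add_header(app, name, value):
    """Hypothetical middleware that appends one response header."""
    def wrapped(environ):
        status, headers, body = app(environ)
        # Rebuild the header list instead of mutating it in place; under
        # the loosened contract any iterable of two-tuples would do here.
        return status, list(headers) + [(name, value)], body
    return wrapped

def app(environ):
    return (b'200 OK', [(b'Content-Type', b'text/plain')], [b'hi'])

status, headers, body = add_header(app, b'X-Example', b'1')({})
```

Iterability is preserved as the headers move up the stack, and no component ever needs the literal-list guarantee.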

Removed Requirement that Middleware Not Block

This requirement was removed: "middleware components must not block iteration waiting for multiple values from an application iterable. If the middleware needs to accumulate more data from the application before it can produce any output, it must yield an empty string." This requirement existed to support asynchronous applications and servers (see PEP 333's "Middleware Handling of Block Boundaries"). Asynchronous applications are now serviced explicitly by the web3.async-capable protocol (a Web3 application callable may itself return a callable).

web3.script_name and web3.path_info

These values are required to be placed into the environment by an origin server under this specification. Unlike SCRIPT_NAME and PATH_INFO, these must be the original URL-encoded variants derived from the request URI. We probably need to figure out how these should be computed originally, and what their values should be if the server performs URL rewriting.

Long Response Headers

Bob Brewer notes on Web-SIG [7]:

Each header_value must not include any control characters, including carriage returns or linefeeds, either embedded or at the end. (These requirements are to minimize the complexity of any parsing that must be performed by servers, gateways, and intermediate response processors that need to inspect or modify response headers.) [1]

That's understandable, but HTTP headers are defined as (mostly) *TEXT, and "words of *TEXT MAY contain characters from character sets other than ISO-8859-1 only when encoded according to the rules of RFC 2047." [2] And RFC 2047 specifies that "an 'encoded-word' may not be more than 75 characters long... If it is desirable to encode more text than will fit in an 'encoded-word' of 75 characters, multiple 'encoded-word's (separated by CRLF SPACE) may be used." [3] This satisfies HTTP header folding rules, as well: "Header fields can be extended over multiple lines by preceding each extra line with at least one SP or HT." [1]

So in my reading of HTTP, some code somewhere should introduce newlines in longish, encoded response header values. I see three options:

  1. Keep things as they are and disallow response header values if they contain words over 75 chars that are outside the ISO-8859-1 character set.
  2. Allow newline characters in WSGI response headers.
  3. Require/strongly suggest WSGI servers to do the encoding and folding before sending the value over HTTP.
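Option 3 is roughly what the stdlib already does for mail headers: email.header folds a long non-ISO-8859-1 value into multiple RFC 2047 encoded-words separated by a newline and continuation whitespace. A sketch of the folding (not of a WSGI server):

```python
from email.header import Header

# A long non-Latin-1 value must be split into several encoded-words.
value = 'caf\u00e9 ' * 20
folded = Header(value, charset='utf-8', maxlinelen=76).encode()
# 'folded' contains multiple '=?utf-8?...?=' encoded-words separated
# by newline + space, satisfying both RFC 2047 and HTTP header folding.
```

A server or gateway could apply the same treatment to response header values before sending them over HTTP.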

Request Trailers and Chunked Transfer Encoding

When using chunked transfer encoding on request content, the RFCs allow there to be request trailers. These are like request headers but come after the final null data chunk. These trailers are only available when the chunked data stream is finite length and when it has all been read in. Neither WSGI nor Web3 currently supports them.

References

[1] PEP 333: Python Web Server Gateway Interface (http://www.python.org/dev/peps/pep-0333/)
[2] The Common Gateway Interface Specification, v 1.1, 3rd Draft (http://cgi-spec.golux.com/draft-coar-cgi-v11-03.txt)
[3] "Chunked Transfer Coding" -- HTTP/1.1, section 3.6.1 (http://www.w3.org/Protocols/rfc2616/rfc2616-sec3.html#sec3.6.1)
[4] "End-to-end and Hop-by-hop Headers" -- HTTP/1.1, Section 13.5.1 (http://www.w3.org/Protocols/rfc2616/rfc2616-sec13.html#sec13.5.1)
[5] mod_ssl Reference, "Environment Variables" (http://www.modssl.org/docs/2.8/ssl_reference.html#ToC25)
[6] Details on WSGI 1.0 amendments/clarifications (http://blog.dscpl.com.au/2009/10/details-on-wsgi-10-amendmentsclarificat.html)
[7] [Web-SIG] WSGI and long response header values (http://mail.python.org/pipermail/web-sig/2006-September/002244.html)

pep-0445 Add new APIs to customize Python memory allocators

PEP:445
Title:Add new APIs to customize Python memory allocators
Version:$Revision$
Last-Modified:$Date$
Author:Victor Stinner <victor.stinner at gmail.com>
BDFL-Delegate:Antoine Pitrou <solipsis@pitrou.net>
Status:Final
Type:Standards Track
Content-Type:text/x-rst
Created:15-Jun-2013
Python-Version:3.4
Resolution:http://mail.python.org/pipermail/python-dev/2013-July/127222.html

Abstract

This PEP proposes new Application Programming Interfaces (API) to customize Python memory allocators. The only implementation required to conform to this PEP is CPython, but other implementations may choose to be compatible, or to re-use a similar scheme.

Rationale

Use cases:

  • Applications embedding Python which want to isolate Python memory from the memory of the application, or want to use a different memory allocator optimized for their Python usage
  • Python running on embedded devices with low memory and slow CPU. A custom memory allocator can be used for efficiency and/or to get access to all the memory of the device.
  • Debug tools for memory allocators:
    • track the memory usage (find memory leaks)
    • get the location of a memory allocation: Python filename and line number, and the size of a memory block
    • detect buffer underflow, buffer overflow and misuse of Python allocator APIs (see Redesign Debug Checks on Memory Block Allocators as Hooks)
    • force memory allocations to fail to test handling of the MemoryError exception

Proposal

New Functions and Structures

  • Add a new GIL-free (no need to hold the GIL) memory allocator:

    • void* PyMem_RawMalloc(size_t size)
    • void* PyMem_RawRealloc(void *ptr, size_t new_size)
    • void PyMem_RawFree(void *ptr)
    • The newly allocated memory will not have been initialized in any way.
    • Requesting zero bytes returns a distinct non-NULL pointer if possible, as if PyMem_Malloc(1) had been called instead.
  • Add a new PyMemAllocator structure:

    typedef struct {
        /* user context passed as the first argument to the 3 functions */
        void *ctx;
    
        /* allocate a memory block */
        void* (*malloc) (void *ctx, size_t size);
    
        /* allocate or resize a memory block */
        void* (*realloc) (void *ctx, void *ptr, size_t new_size);
    
        /* release a memory block */
        void (*free) (void *ctx, void *ptr);
    } PyMemAllocator;
    
  • Add a new PyMemAllocatorDomain enum to choose the Python allocator domain. Domains:

    • PYMEM_DOMAIN_RAW: PyMem_RawMalloc(), PyMem_RawRealloc() and PyMem_RawFree()
    • PYMEM_DOMAIN_MEM: PyMem_Malloc(), PyMem_Realloc() and PyMem_Free()
    • PYMEM_DOMAIN_OBJ: PyObject_Malloc(), PyObject_Realloc() and PyObject_Free()
  • Add new functions to get and set memory block allocators:

    • void PyMem_GetAllocator(PyMemAllocatorDomain domain, PyMemAllocator *allocator)
    • void PyMem_SetAllocator(PyMemAllocatorDomain domain, PyMemAllocator *allocator)
    • The new allocator must return a distinct non-NULL pointer when requesting zero bytes
    • For the PYMEM_DOMAIN_RAW domain, the allocator must be thread-safe: the GIL is not held when the allocator is called.
  • Add a new PyObjectArenaAllocator structure:

    typedef struct {
        /* user context passed as the first argument to the 2 functions */
        void *ctx;
    
        /* allocate an arena */
        void* (*alloc) (void *ctx, size_t size);
    
        /* release an arena */
        void (*free) (void *ctx, void *ptr, size_t size);
    } PyObjectArenaAllocator;
    
  • Add new functions to get and set the arena allocator used by pymalloc:

    • void PyObject_GetArenaAllocator(PyObjectArenaAllocator *allocator)
    • void PyObject_SetArenaAllocator(PyObjectArenaAllocator *allocator)
  • Add a new function to reinstall the debug checks on memory allocators when a memory allocator is replaced with PyMem_SetAllocator():

    • void PyMem_SetupDebugHooks(void)
    • Install the debug hooks on all memory block allocators. The function can be called more than once; hooks are only installed once.
    • The function does nothing if Python is not compiled in debug mode.
  • Memory block allocators always return NULL if size is greater than PY_SSIZE_T_MAX. The check is done before calling the inner function.
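As a quick sanity check, the raw allocator proposed above can be exercised from Python code via ctypes. This is CPython-specific (ctypes.pythonapi and these symbols only exist on CPython 3.4+) and is offered as an illustration, not part of the proposal.

```python
import ctypes

api = ctypes.pythonapi
# Declare the signatures; the default int return type would truncate
# 64-bit pointers.
api.PyMem_RawMalloc.restype = ctypes.c_void_p
api.PyMem_RawMalloc.argtypes = [ctypes.c_size_t]
api.PyMem_RawFree.restype = None
api.PyMem_RawFree.argtypes = [ctypes.c_void_p]

ptr = api.PyMem_RawMalloc(16)
assert ptr is not None          # non-NULL on success
api.PyMem_RawFree(ptr)

# Requesting zero bytes still returns a distinct non-NULL pointer.
ptr0 = api.PyMem_RawMalloc(0)
assert ptr0 is not None
api.PyMem_RawFree(ptr0)
```

Note that, unlike PyMem_Malloc(), these calls are valid even without holding the GIL, which is the point of the PYMEM_DOMAIN_RAW domain.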

Note

The pymalloc allocator is optimized for objects smaller than 512 bytes with a short lifetime. It uses memory mappings with a fixed size of 256 KB called "arenas".

Here is how the allocators are set up by default:

  • PYMEM_DOMAIN_RAW, PYMEM_DOMAIN_MEM: malloc(), realloc() and free(); call malloc(1) when requesting zero bytes
  • PYMEM_DOMAIN_OBJ: pymalloc allocator which falls back on PyMem_Malloc() for allocations larger than 512 bytes
  • pymalloc arena allocator: VirtualAlloc() and VirtualFree() on Windows, mmap() and munmap() when available, or malloc() and free()

Redesign Debug Checks on Memory Block Allocators as Hooks

Since Python 2.3, Python implements different checks on memory allocators in debug mode:

  • Newly allocated memory is filled with the byte 0xCB, freed memory is filled with the byte 0xDB.
  • Detect API violations, ex: PyObject_Free() called on a memory block allocated by PyMem_Malloc()
  • Detect write before the start of the buffer (buffer underflow)
  • Detect write after the end of the buffer (buffer overflow)

In Python 3.3, the checks are installed by replacing PyMem_Malloc(), PyMem_Realloc(), PyMem_Free(), PyObject_Malloc(), PyObject_Realloc() and PyObject_Free() using macros. The new allocator allocates a larger buffer and writes a pattern to detect buffer underflow, buffer overflow and use after free (by filling the buffer with the byte 0xDB). It uses the original PyObject_Malloc() function to allocate memory. So PyMem_Malloc() and PyMem_Realloc() indirectly call PyObject_Malloc() and PyObject_Realloc().

This PEP redesigns the debug checks as hooks on the existing allocators in debug mode. Examples of call traces without the hooks:

  • PyMem_RawMalloc() => _PyMem_RawMalloc() => malloc()
  • PyMem_Realloc() => _PyMem_RawRealloc() => realloc()
  • PyObject_Free() => _PyObject_Free()

Call traces when the hooks are installed (debug mode):

  • PyMem_RawMalloc() => _PyMem_DebugMalloc() => _PyMem_RawMalloc() => malloc()
  • PyMem_Realloc() => _PyMem_DebugRealloc() => _PyMem_RawRealloc() => realloc()
  • PyObject_Free() => _PyMem_DebugFree() => _PyObject_Free()

As a result, PyMem_Malloc() and PyMem_Realloc() now call malloc() and realloc() in both release mode and debug mode, instead of calling PyObject_Malloc() and PyObject_Realloc() in debug mode.

When at least one memory allocator is replaced with PyMem_SetAllocator(), the PyMem_SetupDebugHooks() function must be called to reinstall the debug hooks on top of the new allocator.

Don't call malloc() directly anymore

PyObject_Malloc() falls back on PyMem_Malloc() instead of malloc() if size is greater than or equal to 512 bytes, and PyObject_Realloc() falls back on PyMem_Realloc() instead of realloc().

Direct calls to malloc() are replaced with PyMem_Malloc(), or PyMem_RawMalloc() if the GIL is not held.

External libraries like zlib or OpenSSL can be configured to allocate memory using PyMem_Malloc() or PyMem_RawMalloc(). If the allocator of a library can only be replaced globally (rather than on an object-by-object basis), it shouldn't be replaced when Python is embedded in an application.

For the "track memory usage" use case, it is important to track memory allocated in external libraries to have accurate reports, because these allocations can be large (e.g. they can raise a MemoryError exception) and would otherwise be missed in memory usage reports.

Examples

Use case 1: Replace Memory Allocators, keep pymalloc

Dummy example wasting 2 bytes per memory block, and 10 bytes per pymalloc arena:

#include <stdlib.h>

size_t alloc_padding = 2;
size_t arena_padding = 10;

void* my_malloc(void *ctx, size_t size)
{
    int padding = *(int *)ctx;
    return malloc(size + padding);
}

void* my_realloc(void *ctx, void *ptr, size_t new_size)
{
    int padding = *(int *)ctx;
    return realloc(ptr, new_size + padding);
}

void my_free(void *ctx, void *ptr)
{
    free(ptr);
}

void* my_alloc_arena(void *ctx, size_t size)
{
    int padding = *(int *)ctx;
    return malloc(size + padding);
}

void my_free_arena(void *ctx, void *ptr, size_t size)
{
    free(ptr);
}

void setup_custom_allocator(void)
{
    PyMemAllocator alloc;
    PyObjectArenaAllocator arena;

    alloc.ctx = &alloc_padding;
    alloc.malloc = my_malloc;
    alloc.realloc = my_realloc;
    alloc.free = my_free;

    PyMem_SetAllocator(PYMEM_DOMAIN_RAW, &alloc);
    PyMem_SetAllocator(PYMEM_DOMAIN_MEM, &alloc);
    /* leave PYMEM_DOMAIN_OBJ unchanged, use pymalloc */

    arena.ctx = &arena_padding;
    arena.alloc = my_alloc_arena;
    arena.free = my_free_arena;
    PyObject_SetArenaAllocator(&arena);

    PyMem_SetupDebugHooks();
}

Use case 2: Replace Memory Allocators, override pymalloc

If you have a dedicated allocator optimized for allocations of objects smaller than 512 bytes with a short lifetime, pymalloc can be overridden (replace PyObject_Malloc()).

Dummy example wasting 2 bytes per memory block:

#include <stdlib.h>

size_t padding = 2;

void* my_malloc(void *ctx, size_t size)
{
    int padding = *(int *)ctx;
    return malloc(size + padding);
}

void* my_realloc(void *ctx, void *ptr, size_t new_size)
{
    int padding = *(int *)ctx;
    return realloc(ptr, new_size + padding);
}

void my_free(void *ctx, void *ptr)
{
    free(ptr);
}

void setup_custom_allocator(void)
{
    PyMemAllocator alloc;
    alloc.ctx = &padding;
    alloc.malloc = my_malloc;
    alloc.realloc = my_realloc;
    alloc.free = my_free;

    PyMem_SetAllocator(PYMEM_DOMAIN_RAW, &alloc);
    PyMem_SetAllocator(PYMEM_DOMAIN_MEM, &alloc);
    PyMem_SetAllocator(PYMEM_DOMAIN_OBJ, &alloc);

    PyMem_SetupDebugHooks();
}

The pymalloc arena allocator does not need to be replaced, because it is no longer used once pymalloc itself has been replaced by the new allocator.

Use case 3: Setup Hooks On Memory Block Allocators

Example to setup hooks on all memory block allocators:

struct {
    PyMemAllocator raw;
    PyMemAllocator mem;
    PyMemAllocator obj;
    /* ... */
} hook;

static void* hook_malloc(void *ctx, size_t size)
{
    PyMemAllocator *alloc = (PyMemAllocator *)ctx;
    void *ptr;
    /* ... */
    ptr = alloc->malloc(alloc->ctx, size);
    /* ... */
    return ptr;
}

static void* hook_realloc(void *ctx, void *ptr, size_t new_size)
{
    PyMemAllocator *alloc = (PyMemAllocator *)ctx;
    void *ptr2;
    /* ... */
    ptr2 = alloc->realloc(alloc->ctx, ptr, new_size);
    /* ... */
    return ptr2;
}

static void hook_free(void *ctx, void *ptr)
{
    PyMemAllocator *alloc = (PyMemAllocator *)ctx;
    /* ... */
    alloc->free(alloc->ctx, ptr);
    /* ... */
}

void setup_hooks(void)
{
    PyMemAllocator alloc;
    static int installed = 0;

    if (installed)
        return;
    installed = 1;

    alloc.malloc = hook_malloc;
    alloc.realloc = hook_realloc;
    alloc.free = hook_free;
    PyMem_GetAllocator(PYMEM_DOMAIN_RAW, &hook.raw);
    PyMem_GetAllocator(PYMEM_DOMAIN_MEM, &hook.mem);
    PyMem_GetAllocator(PYMEM_DOMAIN_OBJ, &hook.obj);

    alloc.ctx = &hook.raw;
    PyMem_SetAllocator(PYMEM_DOMAIN_RAW, &alloc);

    alloc.ctx = &hook.mem;
    PyMem_SetAllocator(PYMEM_DOMAIN_MEM, &alloc);

    alloc.ctx = &hook.obj;
    PyMem_SetAllocator(PYMEM_DOMAIN_OBJ, &alloc);
}

Note

PyMem_SetupDebugHooks() does not need to be called because the memory allocators are not replaced: the debug checks on memory block allocators are installed automatically at startup.

Performance

The implementation of this PEP (issue #3329) has no visible overhead on the Python benchmark suite.

Results of the Python benchmark suite (-b 2n3): some tests are 1.04x faster, some tests are 1.04x slower. Results of the pybench microbenchmark: +0.1% slower globally (diff between -4.9% and +5.6%).

The full output of benchmarks is attached to the issue #3329.

Rejected Alternatives

More specific functions to get/set memory allocators

A larger set of C API functions was originally proposed, with one pair of functions for each allocator domain:

  • void PyMem_GetRawAllocator(PyMemAllocator *allocator)
  • void PyMem_GetAllocator(PyMemAllocator *allocator)
  • void PyObject_GetAllocator(PyMemAllocator *allocator)
  • void PyMem_SetRawAllocator(PyMemAllocator *allocator)
  • void PyMem_SetAllocator(PyMemAllocator *allocator)
  • void PyObject_SetAllocator(PyMemAllocator *allocator)

This alternative was rejected because it is not possible to write generic code with more specific functions: code must be duplicated for each memory allocator domain.

Make PyMem_Malloc() reuse PyMem_RawMalloc() by default

If PyMem_Malloc() called PyMem_RawMalloc() by default, calling PyMem_SetAllocator(PYMEM_DOMAIN_RAW, alloc) would also patch PyMem_Malloc() indirectly.

This alternative was rejected because PyMem_SetAllocator() would have a different behaviour depending on the domain. Always having the same behaviour is less error-prone.

Add a new PYDEBUGMALLOC environment variable

It was proposed to add a new PYDEBUGMALLOC environment variable to enable debug checks on memory block allocators. It would have had the same effect as calling PyMem_SetupDebugHooks(), without the need to write any C code. Another advantage is that it would allow enabling debug checks even in release mode: debug checks would always be compiled in, but only enabled when the environment variable is present and non-empty.

This alternative was rejected because a new environment variable would make Python initialization even more complex. PEP 432 tries to simplify the CPython startup sequence.

Use macros to get customizable allocators

To have no overhead in the default configuration, customizable allocators would be an optional feature enabled by a configuration option or by macros.

This alternative was rejected because the use of macros implies having to recompile extension modules to use the new allocator and allocator hooks. Not having to recompile Python nor extension modules makes debug hooks easier to use in practice.

Pass the C filename and line number

Define allocator functions as macros using __FILE__ and __LINE__ to get the C filename and line number of a memory allocation.

Example of PyMem_Malloc macro with the modified PyMemAllocator structure:

typedef struct {
    /* user context passed as the first argument
       to the 3 functions */
    void *ctx;

    /* allocate a memory block */
    void* (*malloc) (void *ctx, const char *filename, int lineno,
                     size_t size);

    /* allocate or resize a memory block */
    void* (*realloc) (void *ctx, const char *filename, int lineno,
                      void *ptr, size_t new_size);

    /* release a memory block */
    void (*free) (void *ctx, const char *filename, int lineno,
                  void *ptr);
} PyMemAllocator;

void* _PyMem_MallocTrace(const char *filename, int lineno,
                         size_t size);

/* the function is still needed for the Python stable ABI */
void* PyMem_Malloc(size_t size);

#define PyMem_Malloc(size) \
        _PyMem_MallocTrace(__FILE__, __LINE__, size)

The GC allocator functions would also have to be patched. For example, _PyObject_GC_Malloc() is used in many C functions and so objects of different types would have the same allocation location.

This alternative was rejected because passing a filename and a line number to each allocator makes the API more complex: pass 3 new arguments (ctx, filename, lineno) to each allocator function, instead of just a context argument (ctx). Having to also modify GC allocator functions adds too much complexity for a little gain.

GIL-free PyMem_Malloc()

In Python 3.3, when Python is compiled in debug mode, PyMem_Malloc() indirectly calls PyObject_Malloc() which requires the GIL to be held (it isn't thread-safe). That's why PyMem_Malloc() must be called with the GIL held.

This PEP changes PyMem_Malloc(): it now always calls malloc() rather than PyObject_Malloc(). The "GIL must be held" restriction could therefore be removed from PyMem_Malloc().

This alternative was rejected because allowing PyMem_Malloc() to be called without holding the GIL can break applications which set up their own allocators or allocator hooks. Holding the GIL is convenient to develop a custom allocator: no need to care about other threads. It is also convenient for a debug allocator hook: Python objects can be safely inspected, and the C API may be used for reporting.

Moreover, calling PyGILState_Ensure() in a memory allocator has unexpected behaviour, especially at Python startup and when creating a new Python thread state. It is better to relieve custom allocators of the responsibility of acquiring the GIL.

Don't add PyMem_RawMalloc()

Replace malloc() with PyMem_Malloc(), but only if the GIL is held. Otherwise, keep malloc() unchanged.

PyMem_Malloc() is used without the GIL held in some Python functions. For example, the main() and Py_Main() functions of Python call PyMem_Malloc() whereas the GIL does not exist yet. In this case, PyMem_Malloc() would have to be replaced with malloc() (or PyMem_RawMalloc()).

This alternative was rejected because PyMem_RawMalloc() is required for accurate reports of the memory usage. When a debug hook is used to track the memory usage, the memory allocated by direct calls to malloc() cannot be tracked. PyMem_RawMalloc() can be hooked and so all the memory allocated by Python can be tracked, including memory allocated without holding the GIL.

Use existing debug tools to analyze memory use

There are many existing debug tools to analyze memory use. Some examples: Valgrind, Purify, Clang AddressSanitizer, failmalloc, etc.

The problem is to retrieve the Python object related to a memory pointer, to read its type and/or its content. Another issue is to retrieve the source of the memory allocation: the C backtrace is usually useless (same reasoning as for macros using __FILE__ and __LINE__, see Pass the C filename and line number), whereas the Python filename and line number (or even the Python traceback) are more useful.

This alternative was rejected because classic tools are unable to introspect Python internals to collect such information. Being able to set up a hook on allocators called with the GIL held makes it possible to collect a lot of useful data from Python internals.

Add a msize() function

Add another function to PyMemAllocator and PyObjectArenaAllocator structures:

size_t msize(void *ptr);

This function returns the size of a memory block or a memory mapping. Return (size_t)-1 if the function is not implemented or if the pointer is unknown (ex: NULL pointer).

On Windows, this function can be implemented using _msize() and VirtualQuery().

The function can be used to implement a hook tracking the memory usage. The free() method of an allocator only gets the address of a memory block, whereas the size of the memory block is required to update the memory usage.

The additional msize() function was rejected because only a few platforms implement it. For example, Linux with the GNU libc does not provide a function to get the size of a memory block. msize() is not currently used in the Python source code. The function would only be used to track memory use, and would make the API more complex. A debug hook can implement the function internally; there is no need to add it to the PyMemAllocator and PyObjectArenaAllocator structures.

No context argument

Simplify the signature of allocator functions, remove the context argument:

  • void* malloc(size_t size)
  • void* realloc(void *ptr, size_t new_size)
  • void free(void *ptr)

It is likely for an allocator hook to be reused for PyMem_SetAllocator() and PyObject_SetAllocator(), or even PyMem_SetRawAllocator(), but the hook must call a different function depending on the allocator. The context is a convenient way to reuse the same custom allocator or hook for different Python allocators.

In C++, the context can be used to pass this.

External Libraries

Examples of APIs used to customize memory allocators.

The new ctx parameter of this PEP was inspired by the API of zlib and Oracle's OCI libraries.

See also the GNU libc: Memory Allocation Hooks which uses a different approach to hook memory allocators.

Memory Allocators

The C standard library provides the well known malloc() function. Its implementation depends on the platform and on the C library. The GNU C library uses a modified ptmalloc2, based on "Doug Lea's Malloc" (dlmalloc). FreeBSD uses jemalloc. Google provides tcmalloc, which is part of gperftools.

malloc() uses two kinds of memory: heap and memory mappings. Memory mappings are usually used for large allocations (ex: larger than 256 KB), whereas the heap is used for small allocations.

On UNIX, the heap is handled by brk() and sbrk() system calls, and it is contiguous. On Windows, the heap is handled by HeapAlloc() and can be discontiguous. Memory mappings are handled by mmap() on UNIX and VirtualAlloc() on Windows, they can be discontiguous.

Releasing a memory mapping immediately returns the memory to the system. On UNIX, heap memory is only given back to the system if the released block is located at the end of the heap. Otherwise, the memory is only given back to the system when all the memory located after the released memory is also released.

To allocate memory on the heap, an allocator tries to reuse free space. If there is no contiguous space big enough, the heap must be enlarged, even if there is more total free space than the requested size. This issue is called "memory fragmentation": the memory usage seen by the system is higher than the real usage. On Windows, HeapAlloc() creates a new memory mapping with VirtualAlloc() if there is not enough free contiguous memory.

CPython has a pymalloc allocator for allocations smaller than 512 bytes. This allocator is optimized for small objects with a short lifetime. It uses memory mappings called "arenas" with a fixed size of 256 KB.

Other allocators:

This PEP makes it possible to choose exactly which memory allocator is used for your application, depending on how it uses memory (number of allocations, size of allocations, lifetime of objects, etc.).

pep-0446 Make newly created file descriptors non-inheritable

PEP:446
Title:Make newly created file descriptors non-inheritable
Version:$Revision$
Last-Modified:$Date$
Author:Victor Stinner <victor.stinner at gmail.com>
Status:Final
Type:Standards Track
Content-Type:text/x-rst
Created:5-August-2013
Python-Version:3.4

Abstract

Leaking file descriptors in child processes causes various annoying issues and is a known major security vulnerability. Using the subprocess module with the close_fds parameter set to True is not possible in all cases.

This PEP proposes to make all file descriptors created by Python non-inheritable by default to reduce the risk of these issues. This PEP also fixes a race condition in multi-threaded applications on operating systems supporting atomic flags to create non-inheritable file descriptors.

We are aware of the code breakage this is likely to cause, and doing it anyway for the good of mankind. (Details in the section "Backward Compatibility" below.)

Rationale

Inheritance of File Descriptors

Each operating system handles the inheritance of file descriptors differently. Windows creates non-inheritable handles by default, whereas UNIX and the POSIX API on Windows create inheritable file descriptors by default. Python prefers the POSIX API over the native Windows API, to have a single code base and to use the same type for file descriptors, and so it creates inheritable file descriptors.

There is one exception: os.pipe() creates non-inheritable pipes on Windows, whereas it creates inheritable pipes on UNIX. The reason is an implementation artifact: os.pipe() calls CreatePipe() on Windows (native API), whereas it calls pipe() on UNIX (POSIX API). The call to CreatePipe() was added in Python in 1994, before the introduction of pipe() in the POSIX API in Windows 98. The issue #4708 proposes to change os.pipe() on Windows to create inheritable pipes.

Inheritance of File Descriptors on Windows

On Windows, the native type of file objects is handles (C type HANDLE). These handles have a HANDLE_FLAG_INHERIT flag which defines if a handle can be inherited in a child process or not. For the POSIX API, the C runtime (CRT) also provides file descriptors (C type int). The handle of a file descriptor can be retrieved using the function _get_osfhandle(fd). A file descriptor can be created from a handle using the function _open_osfhandle(handle).

Using CreateProcess(), handles are only inherited if their inheritable flag (HANDLE_FLAG_INHERIT) is set and the bInheritHandles parameter of CreateProcess() is TRUE; all file descriptors except standard streams (0, 1, 2) are closed in the child process, even if bInheritHandles is TRUE. Using the spawnv() function, all inheritable handles and all inheritable file descriptors are inherited in the child process. This function uses the undocumented fields cbReserved2 and lpReserved2 of the STARTUPINFO structure to pass an array of file descriptors.

To replace standard streams (stdin, stdout, stderr) using CreateProcess(), the STARTF_USESTDHANDLES flag must be set in the dwFlags field of the STARTUPINFO structure and the bInheritHandles parameter of CreateProcess() must be set to TRUE. So when at least one standard stream is replaced, all inheritable handles are inherited by the child process.

The default value of the close_fds parameter of the subprocess module is True (bInheritHandles=FALSE) if the stdin, stdout and stderr parameters are None, and False (bInheritHandles=TRUE) otherwise.

See also:

Only Inherit Some Handles on Windows

Since Windows Vista, CreateProcess() supports an extension of the STARTUPINFO structure: the STARTUPINFOEX structure. Using this new structure, it is possible to specify a list of handles to inherit: PROC_THREAD_ATTRIBUTE_HANDLE_LIST. Read Programmatically controlling which handles are inherited by new processes in Win32 (Raymond Chen, Dec 2011) for more information.

Before Windows Vista, it is possible to make handles inheritable and call CreateProcess() with bInheritHandles=TRUE. This option works if all other handles are non-inheritable. There is a race condition: if another thread calls CreateProcess() with bInheritHandles=TRUE, handles will also be inherited in the second process.

Microsoft suggests using a lock to avoid the race condition: read Q315939: PRB: Child Inherits Unintended Handles During CreateProcess Call (last review: November 2006). The Python issue #16500 "Add an atfork module" proposes adding such a lock; it could be used to make handles non-inheritable without the race condition. Such a lock only protects against a race condition between Python threads; C threads are not protected.

Another option is to duplicate handles that must be inherited, passing the values of the duplicated handles to the child process, so the child process can steal duplicated handles using DuplicateHandle() with DUPLICATE_CLOSE_SOURCE. Handle values change between the parent and the child process because the handles are duplicated (twice); the parent and/or the child process must be adapted to handle this change. If the child program cannot be modified, an intermediate program can be used to steal handles from the parent process before spawning the final child program. The intermediate program has to pass the handle from the child process to the parent process. The parent may have to close duplicated handles if all handles were not stolen, for example if the intermediate process fails. If the command line is used to pass the handle values, the command line must be modified when handles are duplicated, because their values are modified.

This PEP does not include a solution to this problem because there is no perfect solution working on all Windows versions. This point is deferred until use cases relying on handle or file descriptor inheritance on Windows are well known, so we can choose the best solution and carefully test its implementation.

Inheritance of File Descriptors on UNIX

POSIX provides a close-on-exec flag on file descriptors to automatically close a file descriptor when the C function execv() is called. File descriptors with the close-on-exec flag cleared are inherited by the child process, whereas file descriptors with the flag set are closed in the child process.

The flag can be set using fcntl(), in two syscalls (one to get the current flags, a second to set the new flags):

int flags, res;
flags = fcntl(fd, F_GETFD);
if (flags == -1) { /* handle the error */ }
flags |= FD_CLOEXEC;
/* or "flags &= ~FD_CLOEXEC;" to clear the flag */
res = fcntl(fd, F_SETFD, flags);
if (res == -1) { /* handle the error */ }
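The same two-syscall dance is available from Python through the fcntl module. A small UNIX-only sketch (note that since Python 3.4 the pipe descriptors already have the flag set, so setting it again is a no-op):

```python
import fcntl
import os

# UNIX-only sketch: set the close-on-exec flag on a descriptor with the
# same F_GETFD/F_SETFD pair as the C snippet above.
r, w = os.pipe()

flags = fcntl.fcntl(r, fcntl.F_GETFD)                    # syscall 1: get flags
fcntl.fcntl(r, fcntl.F_SETFD, flags | fcntl.FD_CLOEXEC)  # syscall 2: set flags

cloexec_set = bool(fcntl.fcntl(r, fcntl.F_GETFD) & fcntl.FD_CLOEXEC)

os.close(r)
os.close(w)
```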

FreeBSD, Linux, Mac OS X, NetBSD, OpenBSD and QNX also support setting the flag in a single syscall using ioctl():

int res;
res = ioctl(fd, FIOCLEX, 0);
if (res == -1) { /* handle the error */ }

NOTE: The close-on-exec flag has no effect on fork(): all file descriptors are inherited by the child process. The Python issue #16500 "Add an atfork module" proposes to add a new atfork module to execute code at fork, which may be used to automatically close file descriptors.

Issues with Inheritable File Descriptors

Most of the time, inheritable file descriptors "leaked" to child processes are not noticed, because they don't cause major bugs. That does not mean these bugs should not be fixed.

Two common issues with inherited file descriptors:

  • On Windows, a directory cannot be removed before all file handles open in the directory are closed. The same issue can be seen with files, except if the file was created with the FILE_SHARE_DELETE flag (O_TEMPORARY mode for open()).
  • If a listening socket is leaked to a child process, the socket address cannot be reused before both the parent and child processes terminate. For example, if a web server spawns a new program to handle a process, and the server restarts while the program is not done, the server cannot restart because the TCP port is still in use.

Example of issues in open source projects:

  • Mozilla (Firefox): open since 2002-05
  • dbus library: fixed in 2008-05 (dbus commit), close file descriptors in the child process
  • autofs: fixed in 2009-02, set the CLOEXEC flag
  • qemu: fixed in 2009-12 (qemu commit), set CLOEXEC flag
  • Tor: fixed in 2010-12, set CLOEXEC flag
  • OCaml: open since 2011-04, "PR#5256: Processes opened using Unix.open_process* inherit all opened file descriptors (including sockets)"
  • ØMQ: open since 2012-08
  • Squid: open since 2012-07

See also: Excuse me son, but your code is leaking !!! (Dan Walsh, March 2012) for SELinux issues with leaked file descriptors.

Security Vulnerability

Leaking sensitive file handles and file descriptors can lead to security vulnerabilities. An untrusted child process might read sensitive data like passwords or take control of the parent process through a leaked file descriptor. With a leaked listening socket, a child process can accept new connections to read sensitive data.

Example of vulnerabilities:

Read also the CERT Secure Coding Standards: FIO42-C. Ensure files are properly closed when they are no longer needed.

Issues fixed in the subprocess module

Inherited file descriptors caused 4 issues in the subprocess module:

These issues were fixed in Python 3.2 by 4 different changes in the subprocess module:

  • Pipes are now non-inheritable;
  • The default value of the close_fds parameter is now True, with one exception on Windows: the default value is False if at least one standard stream is replaced;
  • A new pass_fds parameter has been added;
  • Creation of a _posixsubprocess module implemented in C.

Atomic Creation of non-inheritable File Descriptors

In a multi-threaded application, an inheritable file descriptor may be created just before a new program is spawned, before the file descriptor is made non-inheritable. In this case, the file descriptor is leaked to the child process. This race condition could be avoided if the file descriptor is created directly non-inheritable.

FreeBSD, Linux, Mac OS X, Windows and many other operating systems support creating non-inheritable file descriptors with the inheritable flag cleared atomically at the creation of the file descriptor.

A new WSA_FLAG_NO_HANDLE_INHERIT flag for WSASocket() was added in Windows 7 SP1 and Windows Server 2008 R2 SP1 to create non-inheritable sockets. If this flag is used on an older Windows version (ex: Windows XP SP3), WSASocket() fails with WSAEPROTOTYPE.

On UNIX, new flags were added for files and sockets:

  • O_CLOEXEC: available on Linux (2.6.23), FreeBSD (8.3), Mac OS 10.8, OpenBSD 5.0, Solaris 11, QNX, BeOS, next NetBSD release (6.1?). This flag is part of POSIX.1-2008.
  • SOCK_CLOEXEC flag for socket() and socketpair(), available on Linux 2.6.27, OpenBSD 5.2, NetBSD 6.0.
  • fcntl(): F_DUPFD_CLOEXEC flag, available on Linux 2.6.24, OpenBSD 5.0, FreeBSD 9.1, NetBSD 6.0, Solaris 11. This flag is part of POSIX.1-2008.
  • fcntl(): F_DUP2FD_CLOEXEC flag, available on FreeBSD 9.1 and Solaris 11.
  • recvmsg(): MSG_CMSG_CLOEXEC, available on Linux 2.6.23, NetBSD 6.0.

On Linux older than 2.6.23, the O_CLOEXEC flag is simply ignored. So fcntl() must be called to check if the file descriptor is really non-inheritable: O_CLOEXEC is not supported if the FD_CLOEXEC flag is missing. On Linux older than 2.6.27, socket() or socketpair() fail with errno set to EINVAL if the SOCK_CLOEXEC flag is set in the socket type.
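The fallback check described above can be sketched in Python; opening /dev/null here is only an illustration of verifying the flag after the fact:

```python
import fcntl
import os

# Sketch of the fallback check: open with O_CLOEXEC, then verify with
# fcntl() that FD_CLOEXEC was really set, since Linux kernels older than
# 2.6.23 silently ignore the flag.
fd = os.open("/dev/null", os.O_RDONLY | os.O_CLOEXEC)
o_cloexec_works = bool(fcntl.fcntl(fd, fcntl.F_GETFD) & fcntl.FD_CLOEXEC)
os.close(fd)
```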

New functions:

  • dup3(): available on Linux 2.6.27 (and glibc 2.9)
  • pipe2(): available on Linux 2.6.27 (and glibc 2.9)
  • accept4(): available on Linux 2.6.28 (and glibc 2.10)

On Linux older than 2.6.28, accept4() fails with errno set to ENOSYS.

Summary:

Operating System    Atomic File       Atomic Socket
----------------    -------------     ------------------------------------
FreeBSD             8.3 (2012)        X
Linux               2.6.23 (2007)     2.6.27 (2008)
Mac OS X            10.8 (2012)       X
NetBSD              6.1 (?)           6.0 (2012)
OpenBSD             5.0 (2011)        5.2 (2012)
Solaris             11 (2011)         X
Windows             XP (2001)         Seven SP1 (2011), 2008 R2 SP1 (2011)

Legend:

  • "Atomic File": first version of the operating system that supports atomically creating a non-inheritable file descriptor using open()
  • "Atomic Socket": first version of the operating system that supports atomically creating a non-inheritable socket
  • "X": not supported yet

See also:

Status of Python 3.3

Python 3.3 creates inheritable file descriptors on all platforms, except os.pipe() which creates non-inheritable file descriptors on Windows.

New constants and functions related to the atomic creation of non-inheritable file descriptors were added to Python 3.3: os.O_CLOEXEC, os.pipe2() and socket.SOCK_CLOEXEC.

On UNIX, the subprocess module closes all file descriptors in the child process by default, except standard streams (0, 1, 2) and file descriptors of the pass_fds parameter. If the close_fds parameter is set to False, all inheritable file descriptors are inherited in the child process.
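For example, on UNIX a descriptor listed in pass_fds stays open in the child while all other non-standard descriptors are closed (a minimal sketch):

```python
import os
import subprocess
import sys

# Minimal UNIX sketch: the write end of the pipe survives in the child
# only because it is listed in pass_fds; close_fds defaults to True.
r, w = os.pipe()
child = "import os, sys; os.write(int(sys.argv[1]), b'ok')"
proc = subprocess.Popen([sys.executable, "-c", child, str(w)],
                        pass_fds=(w,))
proc.wait()
os.close(w)
data = os.read(r, 2)   # b'ok' written by the child through the passed fd
os.close(r)
```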

On Windows, the subprocess module closes all handles and file descriptors in the child process by default. If at least one standard stream (stdin, stdout or stderr) is replaced (ex: redirected into a pipe), all inheritable handles and file descriptors 0, 1 and 2 are inherited in the child process.

Using the functions of the os.execv*() and os.spawn*() families, all inheritable handles and all inheritable file descriptors are inherited by the child process.

On UNIX, the multiprocessing module uses os.fork() and so all file descriptors are inherited by child processes.

On Windows, when using the multiprocessing module, all inheritable handles and file descriptors 0, 1 and 2 are inherited by the child process; all other file descriptors are closed.

Summary:

Module                        FD on UNIX       Handles on Windows   FD on Windows
--------------------------    --------------   ------------------   -------------
subprocess, default           STD, pass_fds    none                 STD
subprocess, replace stdout    STD, pass_fds    all                  STD
subprocess, close_fds=False   all              all                  STD
multiprocessing               not applicable   all                  STD
os.execv(), os.spawn()        all              all                  all

Legend:

  • "all": all inheritable file descriptors or handles are inherited in the child process
  • "none": all handles are closed in the child process
  • "STD": only file descriptors 0 (stdin), 1 (stdout) and 2 (stderr) are inherited in the child process
  • "pass_fds": file descriptors of the pass_fds parameter of the subprocess are inherited
  • "not applicable": on UNIX, the multiprocessing module uses fork(), so this case is not affected by this PEP.

Closing All Open File Descriptors

On UNIX, the subprocess module closes almost all file descriptors in the child process. This operation requires MAXFD system calls, where MAXFD is the maximum number of file descriptors, even if there are only a few open file descriptors. This maximum can be read using: os.sysconf("SC_OPEN_MAX").

The operation can be slow if MAXFD is large. For example, on a FreeBSD buildbot with MAXFD=655,000, the operation took 300 ms: see issue #11284: slow close file descriptors.

On Linux, Python 3.3 gets the list of all open file descriptors from /proc/<PID>/fd/, and so performance depends on the number of open file descriptors, not on MAXFD.
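A Linux-only sketch of that strategy, compared with the sysconf() upper bound:

```python
import os

# Linux-only sketch: list the actually-open descriptors from /proc/self/fd
# instead of iterating over the full 0..MAXFD range.
open_fds = sorted(int(name) for name in os.listdir("/proc/self/fd"))
maxfd = os.sysconf("SC_OPEN_MAX")
```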

See also:

  • Python issue #1663329: subprocess close_fds perform poor if SC_OPEN_MAX is high
  • Squid Bug #837033: Squid should set CLOEXEC on opened FDs. "32k+ close() calls in each child process take a long time ([12-56] seconds) in Xen PV guests."

Proposal

Non-inheritable File Descriptors

The following functions are modified to make newly created file descriptors non-inheritable by default:

  • asyncore.dispatcher.create_socket()
  • io.FileIO
  • io.open()
  • open()
  • os.dup()
  • os.fdopen()
  • os.open()
  • os.openpty()
  • os.pipe()
  • select.devpoll()
  • select.epoll()
  • select.kqueue()
  • socket.socket()
  • socket.socket.accept()
  • socket.socket.dup()
  • socket.socket.fromfd()
  • socket.socketpair()

os.dup2() still creates inheritable file descriptors by default, see below.

When available, atomic flags are used to make file descriptors non-inheritable. The atomicity is not guaranteed because a fallback is required when atomic flags are not available.

New Functions And Methods

New functions available on all platforms:

  • os.get_inheritable(fd: int): return True if the file descriptor can be inherited by child processes, False otherwise.
  • os.set_inheritable(fd: int, inheritable: bool): set the inheritable flag of the specified file descriptor.

New functions only available on Windows:

  • os.get_handle_inheritable(handle: int): return True if the handle can be inherited by child processes, False otherwise.
  • os.set_handle_inheritable(handle: int, inheritable: bool): set the inheritable flag of the specified handle.

New methods:

  • socket.socket.get_inheritable(): return True if the socket can be inherited by child processes, False otherwise.
  • socket.socket.set_inheritable(inheritable: bool): set the inheritable flag of the specified socket.
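A small sketch of the new API described above (Python 3.4+): descriptors and sockets start non-inheritable, and the flag can be flipped explicitly.

```python
import os
import socket

# File descriptors are non-inheritable by default under this PEP.
r, w = os.pipe()
before = os.get_inheritable(r)      # False by default
os.set_inheritable(r, True)
after = os.get_inheritable(r)       # True after opting in

# The same applies to sockets, via methods on the socket object.
s = socket.socket()
sock_before = s.get_inheritable()   # False by default
s.set_inheritable(True)
sock_after = s.get_inheritable()    # True after opting in

s.close()
os.close(r)
os.close(w)
```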

Other Changes

On UNIX, subprocess makes file descriptors of the pass_fds parameter inheritable. The file descriptor is made inheritable in the child process after the fork() and before execv(), so the inheritable flag of file descriptors is unchanged in the parent process.

os.dup2() has a new optional inheritable parameter: os.dup2(fd, fd2, inheritable=True). fd2 is created inheritable by default, but non-inheritable if inheritable is False.

os.dup2() behaves differently than os.dup() because the most common use case of os.dup2() is to replace the file descriptors of the standard streams: stdin (0), stdout (1) and stderr (2). Standard streams are expected to be inherited by child processes.
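The asymmetry between os.dup() and os.dup2() can be sketched as:

```python
import os

# os.dup() follows the new default (non-inheritable), while os.dup2()
# keeps the duplicate inheritable unless inheritable=False is passed.
r, w = os.pipe()

dup_fd = os.dup(r)
dup_inheritable = os.get_inheritable(dup_fd)      # non-inheritable

target = os.dup(r)                                # just an fd number to overwrite
os.dup2(r, target)                                # inheritable=True by default
dup2_inheritable = os.get_inheritable(target)     # inheritable

for fd in (r, w, dup_fd, target):
    os.close(fd)
```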

Backward Compatibility

This PEP breaks applications relying on the inheritance of file descriptors. Developers are encouraged to use the high-level subprocess module, which handles the inheritance of file descriptors in a portable way.

Applications using the subprocess module with the pass_fds parameter or using only os.dup2() to redirect standard streams should not be affected.

Python no longer conforms to POSIX, since file descriptors are now made non-inheritable by default. Python was not designed to conform to POSIX, but to make it possible to develop portable applications.

Rejected Alternatives

Add a new open_noinherit() function

In June 2007, Henning von Bargen proposed on the python-dev mailing list to add a new open_noinherit() function to fix issues of inherited file descriptors in child processes. At this time, the default value of the close_fds parameter of the subprocess module was False.

Read the mail thread: [Python-Dev] Proposal for a new function "open_noinherit" to avoid problems with subprocesses and security risks.

PEP 433

PEP 433, "Easier suppression of file descriptor inheritance", was a previous attempt proposing various other alternatives, but no consensus could be reached.

pep-0447 Add __getdescriptor__ method to metaclass

PEP:447
Title:Add __getdescriptor__ method to metaclass
Version:$Revision$
Last-Modified:$Date$
Author:Ronald Oussoren <ronaldoussoren at mac.com>
Status:Draft
Type:Standards Track
Content-Type:text/x-rst
Created:12-Jun-2013
Post-History:2-Jul-2013, 15-Jul-2013, 29-Jul-2013

Abstract

Currently object.__getattribute__ and super.__getattribute__ peek in the __dict__ of classes on the MRO for a class when looking for an attribute. This PEP adds an optional __getdescriptor__ method to a metaclass that can be used to override this behavior.

That is, the MRO walking loop in _PyType_Lookup and super.__getattribute__ gets changed from:

def lookup(mro_list, name):
    for cls in mro_list:
        if name in cls.__dict__:
            return cls.__dict__[name]

    return NotFound

to:

def lookup(mro_list, name):
    for cls in mro_list:
        try:
            return cls.__getdescriptor__(name)
        except AttributeError:
            pass

    return NotFound

Rationale

It is currently not possible to influence how the super class [2] looks up attributes (that is, super.__getattribute__ unconditionally peeks in the class __dict__), and that can be problematic for dynamic classes that can grow new methods on demand.

The __getdescriptor__ method makes it possible to dynamically add attributes even when looking them up using the super class [2].

The new method affects object.__getattribute__ (and PyObject_GenericGetAttr [3]) as well for consistency and to have a single place to implement dynamic attribute resolution for classes.
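The limitation can be demonstrated with today's metaclass __getattr__ hook (DynamicMeta below is a hypothetical example, not part of this PEP): it is consulted for attribute access on the class object itself, but instance lookup and super lookup peek directly in the __dict__ of the classes on the MRO and never reach it.

```python
class DynamicMeta(type):
    # Today's only hook: called for attributes missing on the *class*.
    def __getattr__(cls, name):
        if name == "answer":
            return 42
        raise AttributeError(name)

class Base(metaclass=DynamicMeta):
    pass

class Sub(Base):
    def answer_via_super(self):
        return super().answer        # peeks in __dict__ along the MRO

dynamic_on_class = Base.answer       # works: metaclass __getattr__ is used

try:
    Base().answer                    # instance lookup bypasses the hook
    missed_on_instance = False
except AttributeError:
    missed_on_instance = True

try:
    Sub().answer_via_super()         # super() bypasses it as well
    missed_via_super = False
except AttributeError:
    missed_via_super = True
```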

Background

The current behavior of super.__getattribute__ causes problems for classes that are dynamic proxies for other (non-Python) classes or types, an example of which is PyObjC [6]. PyObjC creates a Python class for every class in the Objective-C runtime, and looks up methods in the Objective-C runtime when they are used. This works fine for normal access, but doesn't work for access with super objects. Because of this PyObjC currently includes a custom super that must be used with its classes.

The API in this PEP makes it possible to remove the custom super and simplifies the implementation because the custom lookup behavior can be added in a central location.

The superclass attribute lookup hook

Both super.__getattribute__ and object.__getattribute__ (or PyObject_GenericGetAttr [3] and in particular _PyType_Lookup in C code) walk an object's MRO and currently peek in the class' __dict__ to look up attributes.

With this proposal both lookup methods no longer peek in the class __dict__ but call the special method __getdescriptor__, which is a slot defined on the metaclass. The default implementation of that method looks up the name in the class __dict__, which means that attribute lookup is unchanged unless a metatype actually defines the new special method.

Aside: Attribute resolution algorithm in Python

The attribute resolution process as implemented by object.__getattribute__ (or PyObject_GenericGetAttr in CPython's implementation) is fairly straightforward, but not entirely so without reading C code.

The current CPython implementation of object.__getattribute__ is basically equivalent to the following (pseudo-) Python code (excluding some housekeeping and speed tricks):

def _PyType_Lookup(tp, name):
    mro = tp.mro()
    assert isinstance(mro, tuple)

    for base in mro:
       assert isinstance(base, type)

       # PEP 447 will change these lines:
       try:
           return base.__dict__[name]
       except KeyError:
           pass

    return None


class object:
    def __getattribute__(self, name):
        assert isinstance(name, str)

        tp = type(self)
        descr = _PyType_Lookup(tp, name)

        f = None
        if descr is not None:
            f = descr.__get__
            if f is not None and descr.__set__ is not None:
                # Data descriptor
                return f(descr, self, type(self))

        dict = self.__dict__
        if dict is not None:
            try:
                return self.__dict__[name]
            except KeyError:
                pass

        if f is not None:
            # Non-data descriptor
            return f(descr, self, type(self))

        if descr is not None:
            # Regular class attribute
            return descr

        raise AttributeError(name)


class super:
    def __getattribute__(self, name):
       assert isinstance(name, str)

       if name != '__class__':
           starttype = self.__self_type__
           mro = starttype.mro()

           try:
               idx = mro.index(self.__thisclass__)

           except ValueError:
               pass

           else:
               for base in mro[idx+1:]:
                   # PEP 447 will change these lines:
                   try:
                       descr = base.__dict__[name]
                   except KeyError:
                       continue

                   f = descr.__get__
                   if f is not None:
                       return f(descr,
                           None if (self.__self__ is self.__self_type__) else self.__self__,
                           starttype)

                   else:
                       return descr

       return object.__getattribute__(self, name)

This PEP should change the dict lookup at the lines starting at "# PEP 447" into a method call to perform the actual lookup, making it possible to affect that lookup both for normal attribute access and access through the super proxy [2].

Note that specific classes can already completely override the default behaviour by implementing their own __getattribute__ slot (with or without calling the super class implementation).

In Python code

A meta type can define a method __getdescriptor__ that is called during attribute resolution by both super.__getattribute__ and object.__getattribute__:

class MetaType(type):
    def __getdescriptor__(cls, name):
        try:
            return cls.__dict__[name]
        except KeyError:
            raise AttributeError(name) from None

The __getdescriptor__ method has as its arguments a class (which is an instance of the meta type) and the name of the attribute that is looked up. It should return the value of the attribute without invoking descriptors, and should raise AttributeError [5] when the name cannot be found.

The type [4] class provides a default implementation for __getdescriptor__, which looks up the name in the class dictionary.

Example usage

The code below implements a silly metaclass that redirects attribute lookup to uppercase versions of names:

class UpperCaseAccess (type):
    def __getdescriptor__(cls, name):
        try:
            return cls.__dict__[name.upper()]
        except KeyError:
            raise AttributeError(name) from None

class SillyObject (metaclass=UpperCaseAccess):
    def m(self):
        return 42

    def M(self):
        return "fourtytwo"

obj = SillyObject()
assert obj.m() == "fourtytwo"

As mentioned earlier in this PEP, a more realistic use case of this functionality is a __getdescriptor__ method that dynamically populates the class __dict__ based on attribute access, primarily when it is not possible to reliably keep the class dict in sync with its source, for example because the source used to populate __dict__ is dynamic as well and does not have triggers that can be used to detect changes to that source.

An example of this is the class bridges in PyObjC: the class bridge is a Python object (class) that represents an Objective-C class and conceptually has a Python method for every Objective-C method in the Objective-C class. As with Python, it is possible to add new methods to an Objective-C class, or replace existing ones, and there are no callbacks that can be used to detect this.

In C code

A new slot tp_getdescriptor is added to the PyTypeObject struct; this slot corresponds to the __getdescriptor__ method on type [4].

The slot has the following prototype:

PyObject* (*getdescriptorfunc)(PyTypeObject* cls, PyObject* name);

This method should look up name in the namespace of cls, without looking at superclasses, and should not invoke descriptors. The method returns NULL without setting an exception when the name cannot be found, and returns a new reference otherwise (not a borrowed reference).

Use of this hook by the interpreter

The new method is required for metatypes and as such is defined on type. Both super.__getattribute__ and object.__getattribute__/PyObject_GenericGetAttr [3] (through _PyType_Lookup) use this __getdescriptor__ method when walking the MRO.

Other changes to the implementation

The change for PyObject_GenericGetAttr [3] will be done by changing the private function _PyType_Lookup. This currently returns a borrowed reference, but must return a new reference when the __getdescriptor__ method is present. Because of this _PyType_Lookup will be renamed to _PyType_LookupName, this will cause compile-time errors for all out-of-tree users of this private API.

The attribute lookup cache in Objects/typeobject.c is disabled for classes that have a metaclass that overrides __getdescriptor__, because using the cache might not be valid for such classes.

Impact of this PEP on introspection

Use of the method introduced in this PEP can affect introspection of classes with a metaclass that uses a custom __getdescriptor__ method. This section lists those changes.

  • dir might not show all attributes

    As with a custom __getattribute__ method, dir() might not see all (instance) attributes when using the __getdescriptor__() method to dynamically resolve attributes.

    The solution for that is quite simple: classes using __getdescriptor__ should also implement __dir__ if they want full support for the builtin dir function.

  • inspect.getattr_static might not show all attributes

    The function inspect.getattr_static intentionally does not invoke __getattribute__ and descriptors to avoid invoking user code during introspection with this function. The __getdescriptor__ method will also be ignored and is another way in which the result of inspect.getattr_static can be different from that of the builtin getattr.

  • inspect.getmembers and inspect.get_class_attrs

    Both of these functions directly access the class __dict__ of classes along the MRO, and hence can be affected by a custom __getdescriptor__ method.

    TODO: I haven't fully worked out what the impact of this is, and if there are mitigations for those using either updates to these functions, or additional methods that users should implement to be fully compatible with these functions.

Performance impact

The pybench output below compares an implementation of this PEP with the regular source tree, both based on changeset a5681f50bae2, run on an idle machine with a Core i7 processor running CentOS 6.4.

Even though the machine was idle, there were clear differences between runs: I've seen the difference in "minimum time" vary from -0.1% to +1.5%, with similar (but slightly smaller) differences in "average time".

-------------------------------------------------------------------------------
PYBENCH 2.1
-------------------------------------------------------------------------------
* using CPython 3.4.0a0 (default, Jul 29 2013, 13:01:34) [GCC 4.4.7 20120313 (Red Hat 4.4.7-3)]
* disabled garbage collection
* system check interval set to maximum: 2147483647
* using timer: time.perf_counter
* timer: resolution=1e-09, implementation=clock_gettime(CLOCK_MONOTONIC)

-------------------------------------------------------------------------------
Benchmark: pep447.pybench
-------------------------------------------------------------------------------

    Rounds: 10
    Warp:   10
    Timer:  time.perf_counter

    Machine Details:
       Platform ID:    Linux-2.6.32-358.114.1.openstack.el6.x86_64-x86_64-with-centos-6.4-Final
       Processor:      x86_64

    Python:
       Implementation: CPython
       Executable:     /tmp/default-pep447/bin/python3
       Version:        3.4.0a0
       Compiler:       GCC 4.4.7 20120313 (Red Hat 4.4.7-3)
       Bits:           64bit
       Build:          Jul 29 2013 14:09:12 (#default)
       Unicode:        UCS4


-------------------------------------------------------------------------------
Comparing with: default.pybench
-------------------------------------------------------------------------------

    Rounds: 10
    Warp:   10
    Timer:  time.perf_counter

    Machine Details:
       Platform ID:    Linux-2.6.32-358.114.1.openstack.el6.x86_64-x86_64-with-centos-6.4-Final
       Processor:      x86_64

    Python:
       Implementation: CPython
       Executable:     /tmp/default/bin/python3
       Version:        3.4.0a0
       Compiler:       GCC 4.4.7 20120313 (Red Hat 4.4.7-3)
       Bits:           64bit
       Build:          Jul 29 2013 13:01:34 (#default)
       Unicode:        UCS4


Test                             minimum run-time        average  run-time
                                 this    other   diff    this    other   diff
-------------------------------------------------------------------------------
          BuiltinFunctionCalls:    45ms    44ms   +1.3%    45ms    44ms   +1.3%
           BuiltinMethodLookup:    26ms    27ms   -2.4%    27ms    27ms   -2.2%
                 CompareFloats:    33ms    34ms   -0.7%    33ms    34ms   -1.1%
         CompareFloatsIntegers:    66ms    67ms   -0.9%    66ms    67ms   -0.8%
               CompareIntegers:    51ms    50ms   +0.9%    51ms    50ms   +0.8%
        CompareInternedStrings:    34ms    33ms   +0.4%    34ms    34ms   -0.4%
                  CompareLongs:    29ms    29ms   -0.1%    29ms    29ms   -0.0%
                CompareStrings:    43ms    44ms   -1.8%    44ms    44ms   -1.8%
    ComplexPythonFunctionCalls:    44ms    42ms   +3.9%    44ms    42ms   +4.1%
                 ConcatStrings:    33ms    33ms   -0.4%    33ms    33ms   -1.0%
               CreateInstances:    47ms    48ms   -2.9%    47ms    49ms   -3.4%
            CreateNewInstances:    35ms    36ms   -2.5%    36ms    36ms   -2.5%
       CreateStringsWithConcat:    69ms    70ms   -0.7%    69ms    70ms   -0.9%
                  DictCreation:    52ms    50ms   +3.1%    52ms    50ms   +3.0%
             DictWithFloatKeys:    40ms    44ms  -10.1%    43ms    45ms   -5.8%
           DictWithIntegerKeys:    32ms    36ms  -11.2%    35ms    37ms   -4.6%
            DictWithStringKeys:    29ms    34ms  -15.7%    35ms    40ms  -11.0%
                      ForLoops:    30ms    29ms   +2.2%    30ms    29ms   +2.2%
                    IfThenElse:    38ms    41ms   -6.7%    38ms    41ms   -6.9%
                   ListSlicing:    36ms    36ms   -0.7%    36ms    37ms   -1.3%
                NestedForLoops:    43ms    45ms   -3.1%    43ms    45ms   -3.2%
      NestedListComprehensions:    39ms    40ms   -1.7%    39ms    40ms   -2.1%
          NormalClassAttribute:    86ms    82ms   +5.1%    86ms    82ms   +5.0%
       NormalInstanceAttribute:    42ms    42ms   +0.3%    42ms    42ms   +0.0%
           PythonFunctionCalls:    39ms    38ms   +3.5%    39ms    38ms   +2.8%
             PythonMethodCalls:    51ms    49ms   +3.0%    51ms    50ms   +2.8%
                     Recursion:    67ms    68ms   -1.4%    67ms    68ms   -1.4%
                  SecondImport:    41ms    36ms  +12.5%    41ms    36ms  +12.6%
           SecondPackageImport:    45ms    40ms  +13.1%    45ms    40ms  +13.2%
         SecondSubmoduleImport:    92ms    95ms   -2.4%    95ms    98ms   -3.6%
       SimpleComplexArithmetic:    28ms    28ms   -0.1%    28ms    28ms   -0.2%
        SimpleDictManipulation:    57ms    57ms   -1.0%    57ms    58ms   -1.0%
         SimpleFloatArithmetic:    29ms    28ms   +4.7%    29ms    28ms   +4.9%
      SimpleIntFloatArithmetic:    37ms    41ms   -8.5%    37ms    41ms   -8.7%
       SimpleIntegerArithmetic:    37ms    41ms   -9.4%    37ms    42ms  -10.2%
      SimpleListComprehensions:    33ms    33ms   -1.9%    33ms    34ms   -2.9%
        SimpleListManipulation:    28ms    30ms   -4.3%    29ms    30ms   -4.1%
          SimpleLongArithmetic:    26ms    26ms   +0.5%    26ms    26ms   +0.5%
                    SmallLists:    40ms    40ms   +0.1%    40ms    40ms   +0.1%
                   SmallTuples:    46ms    47ms   -2.4%    46ms    48ms   -3.0%
         SpecialClassAttribute:   126ms   120ms   +4.7%   126ms   121ms   +4.4%
      SpecialInstanceAttribute:    42ms    42ms   +0.6%    42ms    42ms   +0.8%
                StringMappings:    94ms    91ms   +3.9%    94ms    91ms   +3.8%
              StringPredicates:    48ms    49ms   -1.7%    48ms    49ms   -2.1%
                 StringSlicing:    45ms    45ms   +1.4%    46ms    45ms   +1.5%
                     TryExcept:    23ms    22ms   +4.9%    23ms    22ms   +4.8%
                    TryFinally:    32ms    32ms   -0.1%    32ms    32ms   +0.1%
                TryRaiseExcept:    17ms    17ms   +0.9%    17ms    17ms   +0.5%
                  TupleSlicing:    49ms    48ms   +1.1%    49ms    49ms   +1.0%
                   WithFinally:    48ms    47ms   +2.3%    48ms    47ms   +2.4%
               WithRaiseExcept:    45ms    44ms   +0.8%    45ms    45ms   +0.5%
-------------------------------------------------------------------------------
Totals:                          2284ms  2287ms   -0.1%  2306ms  2308ms   -0.1%

(this=pep447.pybench, other=default.pybench)

A run of the benchmark suite (with option "-b 2n3") also seems to indicate that the performance impact is minimal:

Report on Linux fangorn.local 2.6.32-358.114.1.openstack.el6.x86_64 #1 SMP Wed Jul 3 02:11:25 EDT 2013 x86_64 x86_64
Total CPU cores: 8

### call_method_slots ###
Min: 0.304120 -> 0.282791: 1.08x faster
Avg: 0.304394 -> 0.282906: 1.08x faster
Significant (t=2329.92)
Stddev: 0.00016 -> 0.00004: 4.1814x smaller

### call_simple ###
Min: 0.249268 -> 0.221175: 1.13x faster
Avg: 0.249789 -> 0.221387: 1.13x faster
Significant (t=2770.11)
Stddev: 0.00012 -> 0.00013: 1.1101x larger

### django_v2 ###
Min: 0.632590 -> 0.601519: 1.05x faster
Avg: 0.635085 -> 0.602653: 1.05x faster
Significant (t=321.32)
Stddev: 0.00087 -> 0.00051: 1.6933x smaller

### fannkuch ###
Min: 1.033181 -> 0.999779: 1.03x faster
Avg: 1.036457 -> 1.001840: 1.03x faster
Significant (t=260.31)
Stddev: 0.00113 -> 0.00070: 1.6112x smaller

### go ###
Min: 0.526714 -> 0.544428: 1.03x slower
Avg: 0.529649 -> 0.547626: 1.03x slower
Significant (t=-93.32)
Stddev: 0.00136 -> 0.00136: 1.0028x smaller

### iterative_count ###
Min: 0.109748 -> 0.116513: 1.06x slower
Avg: 0.109816 -> 0.117202: 1.07x slower
Significant (t=-357.08)
Stddev: 0.00008 -> 0.00019: 2.3664x larger

### json_dump_v2 ###
Min: 2.554462 -> 2.609141: 1.02x slower
Avg: 2.564472 -> 2.620013: 1.02x slower
Significant (t=-76.93)
Stddev: 0.00538 -> 0.00481: 1.1194x smaller

### meteor_contest ###
Min: 0.196336 -> 0.191925: 1.02x faster
Avg: 0.196878 -> 0.192698: 1.02x faster
Significant (t=61.86)
Stddev: 0.00053 -> 0.00041: 1.2925x smaller

### nbody ###
Min: 0.228039 -> 0.235551: 1.03x slower
Avg: 0.228857 -> 0.236052: 1.03x slower
Significant (t=-54.15)
Stddev: 0.00130 -> 0.00029: 4.4810x smaller

### pathlib ###
Min: 0.108501 -> 0.105339: 1.03x faster
Avg: 0.109084 -> 0.105619: 1.03x faster
Significant (t=311.08)
Stddev: 0.00022 -> 0.00011: 1.9314x smaller

### regex_effbot ###
Min: 0.057905 -> 0.056447: 1.03x faster
Avg: 0.058055 -> 0.056760: 1.02x faster
Significant (t=79.22)
Stddev: 0.00006 -> 0.00015: 2.7741x larger

### silent_logging ###
Min: 0.070810 -> 0.072436: 1.02x slower
Avg: 0.070899 -> 0.072609: 1.02x slower
Significant (t=-191.59)
Stddev: 0.00004 -> 0.00008: 2.2640x larger

### spectral_norm ###
Min: 0.290255 -> 0.299286: 1.03x slower
Avg: 0.290335 -> 0.299541: 1.03x slower
Significant (t=-572.10)
Stddev: 0.00005 -> 0.00015: 2.8547x larger

### threaded_count ###
Min: 0.107215 -> 0.115206: 1.07x slower
Avg: 0.107488 -> 0.115996: 1.08x slower
Significant (t=-109.39)
Stddev: 0.00016 -> 0.00076: 4.8665x larger

The following not significant results are hidden, use -v to show them:
call_method, call_method_unknown, chaos, fastpickle, fastunpickle, float, formatted_logging, hexiom2, json_load, normal_startup, nqueens, pidigits, raytrace, regex_compile, regex_v8, richards, simple_logging, startup_nosite, telco, unpack_sequence.

Alternative proposals

__getattribute_super__

An earlier version of this PEP used the following static method on classes:

def __getattribute_super__(cls, name, object, owner): pass

This method performed name lookup as well as invoking descriptors and was necessarily limited to working only with super.__getattribute__.

Reuse tp_getattro

It would be nice to avoid adding a new slot, keeping the API simpler and easier to understand. A comment on Issue 18181 [1] asked about reusing the tp_getattro slot; that is, super could call the tp_getattro slot of all classes along the MRO.

That won't work because tp_getattro looks in the instance __dict__ before it tries to resolve attributes using the classes in the MRO. This means that using tp_getattro instead of peeking in the class dictionaries would change the semantics of super [2].
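The semantic difference can be demonstrated in pure Python (a small illustrative example, not taken from the PEP):

```python
class A:
    def greet(self):
        return "A"

class B(A):
    def greet(self):
        # Zero-argument super() consults only the class dictionaries of
        # the types after B in the MRO, never the instance __dict__.
        return "super saw " + super().greet()

b = B()
# Shadow the method in the instance __dict__:
b.__dict__["greet"] = lambda: "instance"

# Normal attribute lookup (tp_getattro) finds the instance attribute...
print(b.greet())   # "instance"
# ...but super-based lookup inside B.greet ignores the instance __dict__:
print(B.greet(b))  # "super saw A"
```

If super were routed through tp_getattro, both calls would see the instance attribute, silently changing existing programs.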

pep-0448 Additional Unpacking Generalizations

PEP:448
Title:Additional Unpacking Generalizations
Version:$Revision$
Last-Modified:$Date$
Author:Joshua Landau <joshua at landau.ws>
Discussions-To:python-ideas at python.org
Status:Final
Type:Standards Track
Content-Type:text/x-rst
Created:29-Jun-2013
Python-Version:3.5
Post-History:

Abstract

This PEP proposes extended usages of the * iterable unpacking operator and the ** dictionary unpacking operator to allow unpacking in more positions, an arbitrary number of times, and in additional circumstances. Specifically, in function calls, in comprehensions and generator expressions, and in displays.

Function calls are proposed to support an arbitrary number of unpackings rather than just one:

>>> print(*[1], *[2], 3)
1 2 3
>>> dict(**{'x': 1}, y=2, **{'z': 3})
{'x': 1, 'y': 2, 'z': 3}

Unpacking is proposed to be allowed inside tuple, list, set, and dictionary displays:

>>> *range(4), 4
(0, 1, 2, 3, 4)
>>> [*range(4), 4]
[0, 1, 2, 3, 4]
>>> {*range(4), 4}
{0, 1, 2, 3, 4}
>>> {'x': 1, **{'y': 2}}
{'x': 1, 'y': 2}

In dictionaries, later values will always override earlier ones:

>>> {'x': 1, **{'x': 2}}
{'x': 2}

>>> {**{'x': 2}, 'x': 1}
{'x': 1}

This PEP does not include unpacking operators inside list, set and dictionary comprehensions although this has not been ruled out for future proposals.

Rationale

Current usage of the * iterable unpacking operator features unnecessary restrictions that can harm readability.

Unpacking multiple times has an obvious rationale. When you want to unpack several iterables into a function call or follow an unpack with more positional arguments, the most natural way is to write:

function(**kw_arguments, **more_arguments)

function(*arguments, argument)

Simple examples where this is useful are print and str.format. Instead, you could be forced to write:

kwargs = dict(kw_arguments)
kwargs.update(more_arguments)
function(**kwargs)

args = list(arguments)
args.append(argument)
function(*args)

or, if you know to do so:

from collections import ChainMap
function(**ChainMap(more_arguments, kw_arguments))

from itertools import chain
function(*chain(arguments, [argument]))

which add unnecessary line noise and, in the case of the first methods, cause duplication of work.

There are two primary rationales for unpacking inside of containers. Firstly there is a symmetry of assignment, where fst, *other, lst = elems and elems = fst, *other, lst are approximate inverses, ignoring the specifics of types. This, in effect, simplifies the language by removing special cases.

Secondly, it vastly simplifies types of "addition" such as combining dictionaries, and does so in an unambiguous and well-defined way:

combination = {**first_dictionary, "x": 1, "y": 2}

instead of:

combination = first_dictionary.copy()
combination.update({"x": 1, "y": 2})

which is especially important in contexts where expressions are preferred. This is also useful as a more readable way of summing iterables into a list, such as my_list + list(my_tuple) + list(my_range) which is now equivalent to just [*my_list, *my_tuple, *my_range].
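The concatenation in the paragraph above can be written as a single display (a sketch runnable on Python 3.5+; the variable names are illustrative):

```python
my_list = [1, 2]
my_tuple = (3, 4)
my_range = range(5, 7)

# Each unpacking inserts its elements in order at the unpacking site:
summed = [*my_list, *my_tuple, *my_range]
assert summed == my_list + list(my_tuple) + list(my_range)
print(summed)  # [1, 2, 3, 4, 5, 6]
```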

Specification

Function calls may accept an unbounded number of * and ** unpackings. There will be no restriction of the order of positional arguments with relation to * unpackings nor any restriction of the order of keyword arguments with relation to ** unpackings.

Function calls continue to have the restriction that keyword arguments must follow positional arguments and ** unpackings must additionally follow * unpackings.

Currently, if an argument is given multiple times — such as a positional argument given both positionally and by keyword — a TypeError is raised. This remains true for duplicate arguments provided through multiple ** unpackings, e.g. f(**{'x': 2}, **{'x': 3}), except that the error will be detected at runtime.
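A minimal illustration of the runtime check (f is a hypothetical function):

```python
def f(**kwargs):
    return kwargs

# The call is syntactically valid; the duplicate key can only be
# detected when the ** unpackings are evaluated, so the TypeError is
# raised at call time rather than at compile time.
try:
    f(**{'x': 2}, **{'x': 3})
    raised = False
except TypeError:
    raised = True
assert raised
```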

A function looks like this:

function(
    argument or *args, argument or *args, ...,
    kwargument or *args, kwargument or *args, ...,
    kwargument or **kwargs, kwargument or **kwargs, ...
)
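Concretely, the ordering rules permit calls such as the following (runnable on Python 3.5+; f is a hypothetical function):

```python
def f(*args, **kwargs):
    return args, kwargs

# Positional arguments may interleave with * unpackings, and keyword
# arguments with ** unpackings, but all keywords follow all positionals:
args, kwargs = f(1, *[2, 3], 4, x=1, **{'y': 2}, z=3)
assert args == (1, 2, 3, 4)
assert kwargs == {'x': 1, 'y': 2, 'z': 3}
```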

Tuples, lists, sets and dictionaries will allow unpacking. This will act as if the elements from unpacked items were inserted in order at the site of unpacking, much as happens in unpacking in a function-call. Dictionaries require ** unpacking; all the others require * unpacking.

The keys in a dictionary remain in a right-to-left priority order, so {**{'a': 1}, 'a': 2, **{'a': 3}} evaluates to {'a': 3}. There is no restriction on the number or position of unpackings.
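The priority rule can be checked directly (Python 3.5+):

```python
# The rightmost occurrence of a key wins, regardless of whether it
# comes from a ** unpacking or a literal entry:
merged = {**{'a': 1}, 'a': 2, **{'a': 3}}
assert merged == {'a': 3}
```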

Disadvantages

The allowable orders for arguments in a function call are more complicated than before. The simplest explanation for the rules may be "positional arguments precede keyword arguments and ** unpacking; * unpacking precedes ** unpacking".

Whilst *elements, = iterable causes elements to be a list, elements = *iterable, causes elements to be a tuple. The reason for this may confuse people unfamiliar with the construct.
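The asymmetry described above, shown directly (a small sketch):

```python
iterable = range(3)

# Starred name in an assignment-target list: the result is a list.
*elements, = iterable
assert elements == [0, 1, 2]

# Starred expression in a tuple display: the result is a tuple.
elements = *iterable,
assert elements == (0, 1, 2)
```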

Concerns have been raised about the unexpected difference between duplicate keys in dictionaries being allowed but duplicate keys in function call syntax raising an error. Although this is already the case with current syntax, this proposal might exacerbate the issue. It remains to be seen how much of an issue this is in practice.

Variations

The PEP originally considered whether the ordering of argument types in a function call (positional, keyword, * or **) could become less strict. This met little support so the idea was shelved.

Earlier iterations of this PEP allowed unpacking operators inside list, set, and dictionary comprehensions as a flattening operator over iterables of containers:

>>> ranges = [range(i) for i in range(5)]
>>> [*item for item in ranges]
[0, 0, 1, 0, 1, 2, 0, 1, 2, 3]

>>> {*item for item in ranges}
{0, 1, 2, 3}

This was met with a mix of strong concerns about readability and mild support. In order not to disadvantage the less controversial aspects of the PEP, this was not accepted with the rest of the proposal.

Unbracketed comprehensions in function calls, such as f(x for x in it), are already valid. These could be extended to:

f(*x for x in it) == f((*x for x in it))
f(**x for x in it) == f({**x for x in it})

However, it wasn't clear if this was the best behaviour or if it should unpack into the arguments of the call to f. Since this is likely to be confusing and is of only very marginal utility, it is not included in this PEP. Instead, these will throw a SyntaxError and comprehensions with explicit brackets should be used instead.

Approval

This PEP was accepted by Guido on February 25, 2015 [1].

Implementation

An implementation for Python 3.5 can be found at Issue 2292 on the bug tracker [2]. This currently includes support for unpacking inside comprehensions, which should be removed.

References

[1]PEP accepted, "PEP 448 review", Guido van Rossum (https://mail.python.org/pipermail/python-dev/2015-February/138564.html)
[2]Issue 2292, "Missing *-unpacking generalizations", Thomas Wouters (http://bugs.python.org/issue2292)
[3]Discussion on Python-ideas list, "list / array comprehensions extension", Alexander Heger (http://mail.python.org/pipermail/python-ideas/2011-December/013097.html)

pep-0449 Removal of the PyPI Mirror Auto Discovery and Naming Scheme

PEP:449
Title:Removal of the PyPI Mirror Auto Discovery and Naming Scheme
Version:$Revision$
Last-Modified:$Date$
Author:Donald Stufft <donald at stufft.io>
BDFL-Delegate:Richard Jones <richard@python.org>
Discussions-To:distutils-sig at python.org
Status:Accepted
Type:Process
Content-Type:text/x-rst
Created:04-Aug-2013
Post-History:04-Aug-2013
Replaces:381
Resolution:http://mail.python.org/pipermail/distutils-sig/2013-August/022518.html

Abstract

This PEP provides a path to deprecate and ultimately remove the auto discovery of PyPI mirrors as well as the hard coded naming scheme which requires delegating a domain name under pypi.python.org to a third party.

Rationale

The PyPI mirroring infrastructure (defined in PEP381 [1]) provides a means to mirror the content of PyPI used by the automatic installers. It also provides a method for auto discovery of mirrors and a consistent naming scheme.

There are a number of problems with the auto discovery protocol and the naming scheme:

  • They give control over a *.python.org domain name to a third party, allowing that third party to set or read cookies on the pypi.python.org and python.org domain names.
  • The use of a subdomain of pypi.python.org means that the mirror operators will never be able to get an SSL certificate of their own, and giving them one for a python.org domain name is unlikely to happen.
  • The auto discovery uses an unauthenticated protocol (DNS).
  • The lack of a TLS certificate on these domains means that clients can not be sure that they have not been a victim of DNS poisoning or a MITM attack.
  • The auto discovery protocol was designed to enable a client to automatically select a mirror for use. This is no longer a requirement because the CDN that PyPI is now using is a globally distributed network of servers which will automatically select one close to the client without any effort on the client's part.
  • The auto discovery protocol and use of the consistent naming scheme has only ever been implemented by one installer (pip), and its implementation, besides being insecure, has serious issues with performance and is slated for removal with its next release (1.5).
  • While there are provisions in PEP381 [1] that would solve some of these issues for a dedicated client, they would not solve the issues that affect a user's browser. Additionally, these provisions have not been implemented by any installer to date.

Due to the number of issues, some of them very serious, and the fact that the CDN provides most of the benefit of the auto discovery and consistent naming scheme, this PEP proposes to first deprecate and then remove the [a..z].pypi.python.org names for mirrors and the last.pypi.python.org name for the auto discovery protocol. The ability to mirror and the method of mirroring will not be affected and will continue to exist as written in PEP381 [1]. Operators of existing mirrors are encouraged to acquire their own domains and certificates to use for their mirrors if they wish to continue hosting them.

Plan for Deprecation & Removal

Immediately upon acceptance of this PEP documentation on PyPI will be updated to reflect the deprecated nature of the official public mirrors and will direct users to external resources like http://www.pypi-mirrors.org/ to discover unofficial public mirrors if they wish to use one.

Mirror operators, if they wish to continue operating their mirror, should acquire a domain name to represent their mirror and, if they are able, a TLS certificate. Once they have acquired a domain they should redirect their assigned N.pypi.python.org domain name to their new domain. On Feb 15th, 2014 the DNS entries for [a..z].pypi.python.org and last.pypi.python.org will be removed. At any time prior to Feb 15th, 2014 a mirror operator may request that their domain name be reclaimed by PyPI and pointed back at the master.

Why Feb 15th, 2014

The most critical decision of this PEP is the final cut off date. If the date is too soon then it needlessly punishes people by forcing them to drop everything to update their deployment scripts. If the date is too far away then the extended period of time does not help with the migration effort and merely puts off the migration until a later date.

The date of Feb 15th, 2014 has been chosen because it is roughly 6 months from the date of the PEP. This should ensure a lengthy period of time to enable people to update their deployment procedures to point to the new domain names without merely padding the cut off date.

Why the DNS entries must be removed

While it would be possible to simply reclaim the domain names used by mirrors and point them back at PyPI, sparing users the need to update configurations away from those domains, doing so has a number of issues.

  • Anyone who currently has these names hard coded in their configuration has them hard coded as HTTP. This means that by allowing these names to continue resolving we make it simple for a MITM operator to attack users by rewriting the redirect to HTTPS prior to giving it to the client.
  • The overhead of maintaining several domains pointing at PyPI has proved troublesome for the small number of N.pypi.python.org domains that have already been reclaimed. They often get misconfigured when things change on the service, which can leave them broken for months at a time until somebody notices. By leaving them in we leave users of these domains open to random breakages which are less likely to get caught or noticed.
  • People using these domains have explicitly chosen to use them for one reason or another. One such reason may be because they do not wish to deploy from a host located in a particular country. If these domains continue to resolve but do not point at their existing locations we have silently removed this choice from the existing users of those domains.

That being said, removing the entries will require users who have modified their configuration to either point back at the master (PyPI) or select a new mirror name to point at. This is regarded as a regrettable requirement to protect PyPI itself and the users of the mirrors from the attacks outlined above or, at the very least, require them to make an informed decision about the insecurity.

Public or Private Mirrors

The mirroring protocol will continue to exist as defined in PEP381 [1] and people are encouraged to host public and private mirrors if they so desire. The recommended mirroring client is Bandersnatch [2].

pep-0450 Adding A Statistics Module To The Standard Library

PEP: 450
Title: Adding A Statistics Module To The Standard Library
Version: $Revision$
Last-Modified: $Date$
Author: Steven D'Aprano <steve at pearwood.info>
Status: Final
Type: Standards Track
Content-Type: text/plain
Created: 01-Aug-2013
Python-Version: 3.4
Post-History: 13-Sep-2013

Abstract

    This PEP proposes the addition of a module for common statistics functions
    such as mean, median, variance and standard deviation to the Python
    standard library. See also http://bugs.python.org/issue18606


Rationale

    The proposed statistics module is motivated by the "batteries included"
    philosophy towards the Python standard library.  Raymond Hettinger and
    other senior developers have requested a quality statistics library that
    falls somewhere in between high-end statistics libraries and ad hoc
    code.[1]  Statistical functions such as mean, standard deviation and others
    are obvious and useful batteries, familiar to any Secondary School student.
    Even cheap scientific calculators typically include multiple statistical
    functions such as:

    - mean
    - population and sample variance
    - population and sample standard deviation
    - linear regression
    - correlation coefficient

    Graphing calculators aimed at Secondary School students typically
    include all of the above, plus some or all of:

    - median
    - mode
    - functions for calculating the probability of random variables
      from the normal, t, chi-squared, and F distributions
    - inference on the mean

    and others[2].  Likewise spreadsheet applications such as Microsoft Excel,
    LibreOffice and Gnumeric include rich collections of statistical
    functions[3].

    In contrast, Python currently has no standard way to calculate even the
    simplest and most obvious statistical functions such as mean.  For those
    who need statistical functions in Python, there are two obvious solutions:

    - install numpy and/or scipy[4];

    - or use a Do It Yourself solution.

    Numpy is perhaps the most full-featured solution, but it has a few
    disadvantages:

    - It may be overkill for many purposes.  The documentation for numpy even
      warns

          "It can be hard to know what functions are available in
          numpy.  This is not a complete list, but it does cover
          most of them."[5]

      and then goes on to list over 270 functions, only a small number of
      which are related to statistics.

    - Numpy is aimed at those doing heavy numerical work, and may be
      intimidating to those who don't have a background in computational
      mathematics and computer science.  For example, numpy.mean takes four
      arguments:

        mean(a, axis=None, dtype=None, out=None)

      although fortunately for the beginner or casual numpy user, three are
      optional and numpy.mean does the right thing in simple cases:

          >>>  numpy.mean([1, 2, 3, 4])
          2.5

    - For many people, installing numpy may be difficult or impossible.  For
      example, people in corporate environments may have to go through a
      difficult, time-consuming process before being permitted to install
      third-party software.  For the casual Python user, having to learn about
      installing third-party packages in order to average a list of numbers is
      unfortunate.

    This leads to option number 2, DIY statistics functions.  At first glance,
    this appears to be an attractive option, due to the apparent simplicity of
    common statistical functions.  For example:

        import math

        def mean(data):
            return sum(data)/len(data)

        def variance(data):
            # Use the Computational Formula for Variance.
            n = len(data)
            ss = sum(x**2 for x in data) - (sum(data)**2)/n
            return ss/(n-1)

        def standard_deviation(data):
            return math.sqrt(variance(data))

    The above appears to be correct with a casual test:

        >>> data = [1, 2, 4, 5, 8]
        >>> variance(data)
        7.5

    But adding a constant to every data point should not change the variance:

        >>> data = [x+1e12 for x in data]
        >>> variance(data)
        0.0

    And variance should *never* be negative:

        >>> variance(data*100)
        -1239429440.1282566

    By contrast, the proposed reference implementation gets the exactly correct
    answer 7.5 for the first two examples, and a reasonably close answer for
    the third: 6.012. numpy does no better[6].

    Even simple statistical calculations contain traps for the unwary, starting
    with the Computational Formula itself.  Despite the name, it is numerically
    unstable and can be extremely inaccurate, as can be seen above.  It is
    completely unsuitable for computation by computer[7].  This problem plagues
    users of many programming languages, not just Python[8], as coders reinvent
    the same numerically inaccurate code over and over again[9], or advise
    others to do so[10].
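    The instability can be avoided with a two-pass algorithm that
    subtracts the mean before squaring.  The following sketch is not the
    PEP's reference implementation (the helper name is illustrative), but
    it handles the shifted data from the earlier example correctly:

```python
import math

def variance_twopass(data):
    # Pass 1: compute the mean.  Pass 2: sum squared deviations from it.
    # Subtracting the mean first avoids the catastrophic cancellation
    # that breaks the Computational Formula on shifted data.
    data = list(data)
    n = len(data)
    mu = math.fsum(data) / n
    return math.fsum((x - mu) ** 2 for x in data) / (n - 1)

shifted = [x + 1e12 for x in [1, 2, 4, 5, 8]]
print(variance_twopass(shifted))  # 7.5, where the naive formula gives 0.0
```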

    It isn't just the variance and standard deviation. Even the mean is not
    quite as straight-forward as it might appear.  The above implementation
    seems too simple to have problems, but it does:

    - The built-in sum can lose accuracy when dealing with floats of wildly
      differing magnitude.  Consequently, the above naive mean fails this
      "torture test":

          assert mean([1e30, 1, 3, -1e30]) == 1

      returning 0 instead of 1, a purely computational error of 100%.

    - Using math.fsum inside mean will make it more accurate with float data,
      but it also has the side-effect of converting any arguments to float
      even when unnecessary.  E.g. we should expect the mean of a list of
      Fractions to be a Fraction, not a float.
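    Both points can be checked directly (an illustrative sketch):

```python
import math
from fractions import Fraction

data = [1e30, 1, 3, -1e30]
assert sum(data) / len(data) == 0.0        # naive mean: the 1 and 3 vanish
assert math.fsum(data) / len(data) == 1.0  # fsum tracks the lost digits

# But fsum coerces exact numeric types to float:
fracs = [Fraction(1, 3), Fraction(2, 3)]
assert sum(fracs) / len(fracs) == Fraction(1, 2)         # stays exact
assert isinstance(math.fsum(fracs) / len(fracs), float)  # becomes float
```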

    While the above mean implementation does not fail quite as catastrophically
    as the naive variance does, a standard library function can do much better
    than the DIY versions.

    The example above involves an especially bad set of data, but even for
    more realistic data sets accuracy is important.  The first step in
    interpreting variation in data (including dealing with ill-conditioned
    data) is often to standardize it to a series with variance 1 (and often
    mean 0).  This standardization requires accurate computation of the mean
    and variance of the raw series.  Naive computation of mean and variance
    can lose precision very quickly.  Because precision bounds accuracy, it is
    important to use the most precise algorithms for computing mean and
    variance that are practical, or the results of standardization are
    themselves useless.


Comparison To Other Languages/Packages

    The proposed statistics library is not intended to be a competitor to such
    third-party libraries as numpy/scipy, or of proprietary full-featured
    statistics packages aimed at professional statisticians such as Minitab,
    SAS and Matlab.  It is aimed at the level of graphing and scientific
    calculators.

    Most programming languages have little or no built-in support for
    statistics functions.  Some exceptions:

    R
        R (and its proprietary cousin, S) is a programming language designed
        for statistics work. It is extremely popular with statisticians and
        is extremely feature-rich[11].

    C#

        The C# LINQ package includes extension methods to calculate the
        average of enumerables[12].

    Ruby

        Ruby does not ship with a standard statistics module, despite some
        apparent demand[13].  Statsample appears to be a feature-rich third-
        party library, aiming to compete with R[14].

    PHP

        PHP has an extremely feature-rich (although mostly undocumented) set
        of advanced statistical functions[15].

    Delphi

        Delphi includes standard statistical functions including Mean, Sum,
        Variance, TotalVariance, MomentSkewKurtosis in its Math library[16].

    GNU Scientific Library

        The GNU Scientific Library includes standard statistical functions,
        percentiles, median and others[17].  One innovation I have borrowed
        from the GSL is to allow the caller to optionally specify the pre-
        calculated mean of the sample (or an a priori known population mean)
        when calculating the variance and standard deviation[18].
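    That GSL-inspired option might look like this (a sketch only; the
    parameter name ``mu`` is illustrative, not the module's final API):

```python
import math

def variance(data, mu=None):
    # If the caller already knows the mean (or an a priori population
    # mean), it can be supplied to skip the first pass over the data.
    data = list(data)
    if mu is None:
        mu = math.fsum(data) / len(data)
    return math.fsum((x - mu) ** 2 for x in data) / (len(data) - 1)

print(variance([1, 2, 4, 5, 8]))        # 7.5
print(variance([1, 2, 4, 5, 8], mu=4))  # 7.5, mean supplied by the caller
```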


Design Decisions Of The Module

    My intention is to start small and grow the library as needed, rather than
    try to include everything from the start. Consequently, the current
    reference implementation includes only a small number of functions: mean,
    variance, standard deviation, median, mode. (See the reference
    implementation for a full list.)

    I have aimed for the following design features:

    - Correctness over speed.  It is easier to speed up a correct but slow
      function than to correct a fast but buggy one.

    - Concentrate on data in sequences, allowing two-passes over the data,
      rather than potentially compromise on accuracy for the sake of a one-pass
      algorithm.  Functions expect data will be passed as a list or other
      sequence; if given an iterator, they may internally convert to a list.

    - Functions should, as much as possible, honour any type of numeric data.
      E.g. the mean of a list of Decimals should be a Decimal, not a float.
      When this is not possible, treat float as the "lowest common data type".

    - Although functions support data sets of floats, Decimals or Fractions,
      there is no guarantee that *mixed* data sets will be supported. (But on
      the other hand, they aren't explicitly rejected either.)

    - Plenty of documentation, aimed at readers who understand the basic
      concepts but may not know (for example) which variance they should use
      (population or sample?). Mathematicians and statisticians have a terrible
      habit of being inconsistent with both notation and terminology[19], and
      having spent many hours making sense of the contradictory/confusing
      definitions in use, it is only fair that I do my best to clarify rather
      than obfuscate the topic.

    - But avoid going into tedious[20] mathematical detail.
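
    The preference for two passes over one can be illustrated with a naive
    one-pass "computational formula" for variance, which suffers
    catastrophic cancellation on shifted data.  (This is an illustrative
    sketch, not the module's code; the function names are hypothetical.)

```python
def one_pass_variance(data):
    # Naive E[x^2] - E[x]^2 formula: a single pass, but cancellation-prone.
    n = sx = sx2 = 0
    for x in data:
        n += 1
        sx += x
        sx2 += x * x
    return (sx2 - sx * sx / n) / (n - 1)

def two_pass_variance(data):
    data = list(data)                 # listify, so iterators work too
    xbar = sum(data) / len(data)      # first pass: the mean
    return sum((x - xbar) ** 2 for x in data) / (len(data) - 1)

# Shifting well-behaved data by a large constant should not change the
# variance, but cancellation wrecks the one-pass result:
shifted = [x + 1e9 for x in (4.0, 7.0, 13.0, 16.0)]
assert two_pass_variance(shifted) == 30.0
assert one_pass_variance(shifted) != 30.0
```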


API

    The initial version of the library will provide univariate (single
    variable) statistics functions.  The general API will be based on a
    functional model ``function(data, ...) -> result``, where ``data``
    is a mandatory iterable of (usually) numeric data.

    The author expects that lists will be the most common data type used,
    but any iterable type should be acceptable.  Where necessary, functions
    may convert to lists internally.  Where possible, functions are
    expected to conserve the type of the data values, for example, the mean
    of a list of Decimals should be a Decimal rather than float.


    Calculating mean, median and mode

        The ``mean``, ``median`` (and its variants) and ``mode`` functions
        take a single mandatory argument and return the appropriate
        statistic, e.g.:

            >>> mean([1, 2, 3])
            2.0

        Functions provided are:

            * mean(data) -> arithmetic mean of data.

            * median(data) -> median (middle value) of data, taking the
              average of the two middle values when there are an even
              number of values.

            * median_high(data) -> high median of data, taking the
              larger of the two middle values when the number of items
              is even.

            * median_low(data) -> low median of data, taking the smaller
              of the two middle values when the number of items is even.

            * median_grouped(data, interval=1) -> 50th percentile of
              grouped data, using interpolation.

            * mode(data) -> most common data point.

        ``mode`` is the sole exception to the rule that the data argument
        must be numeric.  It will also accept an iterable of nominal data,
        such as strings.
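
        As a sketch of how these functions might behave (assuming the
        module as it eventually shipped in Python 3.4's ``statistics``):

```python
from statistics import mean, median, median_low, median_high, mode

data = [1, 2, 3, 4]
assert mean(data) == 2.5
assert median(data) == 2.5        # average of the two middle values
assert median_low(data) == 2      # smaller of the two middle values
assert median_high(data) == 3     # larger of the two middle values

# mode is the one function that also accepts nominal (non-numeric) data:
assert mode(["red", "blue", "red"]) == "red"
```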


    Calculating variance and standard deviation

        In order to be similar to scientific calculators, the statistics
        module will include separate functions for population and sample
        variance and standard deviation.  All four functions have similar
        signatures, with a single mandatory argument, an iterable of
        numeric data, e.g.:

            >>> variance([1, 2, 2, 2, 3])
            0.5

        All four functions also accept a second, optional, argument, the
        mean of the data.  This is modelled on a similar API provided by
        the GNU Scientific Library[18].  There are three use-cases for
        using this argument, in no particular order:

            1)  The value of the mean is known *a priori*.

            2)  You have already calculated the mean, and wish to avoid
                calculating it again.

            3)  You wish to (ab)use the variance functions to calculate
                the second moment about some given point other than the
                mean.

        In each case, it is the caller's responsibility to ensure that
        the given argument is meaningful.

        Functions provided are:

            * variance(data, xbar=None) -> sample variance of data,
              optionally using xbar as the sample mean.

            * stdev(data, xbar=None) -> sample standard deviation of
              data, optionally using xbar as the sample mean.

            * pvariance(data, mu=None) -> population variance of data,
              optionally using mu as the population mean.

            * pstdev(data, mu=None) -> population standard deviation of
              data, optionally using mu as the population mean.
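
        A short sketch of these four functions, including the optional
        precomputed-mean argument (again assuming the module as shipped
        in Python 3.4):

```python
import math
from statistics import mean, variance, pvariance, stdev

data = [1, 2, 2, 2, 3]
assert variance(data) == 0.5       # sample variance (n - 1 divisor)
assert pvariance(data) == 0.4      # population variance (n divisor)

# Use-case 2 above: reuse an already-calculated mean via the second argument.
m = mean(data)
assert variance(data, m) == 0.5
assert math.isclose(stdev(data), math.sqrt(0.5))
```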

    Other functions

        There is one other public function:

            * sum(data, start=0) -> high-precision sum of numeric data.


Specification

    As the proposed reference implementation is in pure Python,
    other Python implementations can easily make use of the module
    unchanged, or adapt it as they see fit.


What Should Be The Name Of The Module?

    This will be a top-level module "statistics".

    There was some interest in turning math into a package and making this a
    sub-module of math, but consensus eventually settled on a top-level
    module.  Other potential names were rejected: "stats" (too much risk of
    confusion with the existing "stat" module) and "statslib" (described as
    "too C-like").


Discussion And Resolved Issues

    This proposal has been previously discussed here[21].
 
    A number of design issues were resolved during the discussion on
    Python-Ideas and the initial code review.  There was a lot of concern
    about the addition of yet another ``sum`` function to the standard
    library, see the FAQs below for more details.  In addition, the
    initial implementation of ``sum`` suffered from some rounding issues
    and other design problems when dealing with Decimals.  Oscar
    Benjamin's assistance in resolving this was invaluable.

    Another issue was the handling of data in the form of iterators.  The
    first implementation of variance silently swapped between a one- and
    two-pass algorithm, depending on whether the data was in the form of
    an iterator or sequence.  This proved to be a design mistake, as the
    calculated variance could differ slightly depending on the algorithm
    used, and ``variance`` etc. were changed to internally generate a list
    and always use the more accurate two-pass implementation.

    One controversial design involved the functions to calculate median,
    which were implemented as attributes on the ``median`` callable, e.g.
    ``median``, ``median.low``, ``median.high`` etc.  Although there is
    at least one existing use of this style in the standard library, in
    ``unittest.mock``, the code reviewers felt that this was too unusual
    for the standard library.  Consequently, the design has been changed
    to a more traditional design of separate functions with a pseudo-
    namespace naming convention, ``median_low``, ``median_high``, etc.

    Another issue that was of concern to code reviewers was the existence
    of a function calculating the sample mode of continuous data, with
    some people questioning the choice of algorithm, and whether it was
    a sufficiently common need to be included.  So it was dropped from
    the API, and ``mode`` now implements only the basic schoolbook
    algorithm based on counting unique values.

    Another significant point of discussion was calculating statistics of
    timedelta objects.  Although the statistics module will not directly
    support timedelta objects, it is possible to support this use-case by
    converting them to numbers first using the ``timedelta.total_seconds``
    method.
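
    For example, the average of a list of timedeltas might be computed
    like this (a sketch assuming the module's final ``mean`` API):

```python
from datetime import timedelta
from statistics import mean

durations = [timedelta(minutes=1), timedelta(minutes=3)]

# Convert to seconds, average, then convert back to a timedelta.
avg = timedelta(seconds=mean(t.total_seconds() for t in durations))
assert avg == timedelta(minutes=2)
```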


Frequently Asked Questions

    Q: Shouldn't this module spend time on PyPI before being considered for
       the standard library?

    A: Older versions of this module have been available on PyPI[22] since
       2010. Being much simpler than numpy, it does not require many years of
       external development.

    Q: Does the standard library really need yet another version of ``sum``?

    A: This proved to be the most controversial part of the reference
       implementation.  In one sense, clearly three sums is two too many.  But
       in another sense, yes.  The reasons why the two existing versions are
       unsuitable are described here[23] but the short summary is:

       - the built-in sum can lose precision with floats;

       - the built-in sum accepts any non-numeric data type that supports
         the + operator, apart from strings and bytes;

       - math.fsum is high-precision, but coerces all arguments to float.

       There was some interest in "fixing" one or the other of the existing
       sums. If this occurs before 3.4 feature-freeze, the decision to keep
       statistics.sum can be re-considered.
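
       Both shortcomings are easy to demonstrate with standard library
       code (this illustrates the problem; it is not the proposed
       ``statistics.sum`` implementation):

```python
import math
from decimal import Decimal

data = [1e100, 1.0, -1e100, 1.0]
assert sum(data) == 1.0        # built-in sum: one 1.0 is lost to rounding
assert math.fsum(data) == 2.0  # fsum tracks the rounding error

# math.fsum coerces everything to float; the built-in preserves Decimal:
assert sum([Decimal("0.1")] * 3) == Decimal("0.3")
assert isinstance(math.fsum([Decimal("0.1")] * 3), float)
```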

    Q: Will this module be backported to older versions of Python?

    A: The module currently targets 3.3, and I will make it available on PyPI
       for 3.3 for the foreseeable future. Backporting to older versions of
       the 3.x series is likely (but not yet decided). Backporting to 2.7 is
       less likely but not ruled out.

    Q: Is this supposed to replace numpy?

    A: No. While it is likely to grow over the years (see open issues below)
       it is not aimed to replace, or even compete directly with, numpy. Numpy
       is a full-featured numeric library aimed at professionals, the nuclear
       reactor of numeric libraries in the Python ecosystem. This is just a
       battery, as in "batteries included", and is aimed at an intermediate
       level somewhere between "use numpy" and "roll your own version".


Future Work

    - At this stage, I am unsure of the best API for multivariate statistical
      functions such as linear regression, correlation coefficient, and
      covariance. Possible APIs include:

        * Separate arguments for x and y data:
          function([x0, x1, ...], [y0, y1, ...])

        * A single argument for (x, y) data:
          function([(x0, y0), (x1, y1), ...])

          This API is preferred by GvR[24].

        * Selecting arbitrary columns from a 2D array:
          function([[a0, x0, y0, z0], [a1, x1, y1, z1], ...], x=1, y=2)

        * Some combination of the above.

      In the absence of consensus on a preferred API for multivariate stats,
      I will defer including such multivariate functions until Python 3.5.

    - Likewise, functions for calculating probability of random variables and
      inference testing (e.g. Student's t-test) will be deferred until 3.5.

    - There is considerable interest in including one-pass functions that can
      calculate multiple statistics from data in iterator form, without having
      to convert to a list. The experimental "stats" package on PyPI includes
      co-routine versions of statistics functions. Including these will be
      deferred to 3.5.


References

    [1] http://mail.python.org/pipermail/python-dev/2010-October/104721.html

    [2] http://support.casio.com/pdf/004/CP330PLUSver310_Soft_E.pdf

    [3] Gnumeric:
            https://projects.gnome.org/gnumeric/functions.shtml

        LibreOffice:
            https://help.libreoffice.org/Calc/Statistical_Functions_Part_One
            https://help.libreoffice.org/Calc/Statistical_Functions_Part_Two
            https://help.libreoffice.org/Calc/Statistical_Functions_Part_Three
            https://help.libreoffice.org/Calc/Statistical_Functions_Part_Four
            https://help.libreoffice.org/Calc/Statistical_Functions_Part_Five

    [4] Scipy: http://scipy-central.org/
        Numpy: http://www.numpy.org/

    [5] http://wiki.scipy.org/Numpy_Functions_by_Category

    [6] Tested with numpy 1.6.1 and Python 2.7.

    [7] http://www.johndcook.com/blog/2008/09/26/comparing-three-methods-of-computing-standard-deviation/

    [8] http://rosettacode.org/wiki/Standard_deviation

    [9] https://bitbucket.org/larsyencken/simplestats/src/c42e048a6625/src/basic.py

    [10] http://stackoverflow.com/questions/2341340/calculate-mean-and-variance-with-one-iteration

    [11] http://www.r-project.org/

    [12] http://msdn.microsoft.com/en-us/library/system.linq.enumerable.average.aspx

    [13] https://www.bcg.wisc.edu/webteam/support/ruby/standard_deviation

    [14] http://ruby-statsample.rubyforge.org/

    [15] http://www.php.net/manual/en/ref.stats.php

    [16] http://www.ayton.id.au/gary/it/Delphi/D_maths.htm#Delphi%20Statistical%20functions.

    [17] http://www.gnu.org/software/gsl/manual/html_node/Statistics.html

    [18] http://www.gnu.org/software/gsl/manual/html_node/Mean-and-standard-deviation-and-variance.html

    [19] http://mathworld.wolfram.com/Skewness.html

    [20] At least, tedious to those who don't like this sort of thing.

    [21] http://mail.python.org/pipermail/python-ideas/2011-September/011524.html

    [22] https://pypi.python.org/pypi/stats/

    [23] http://mail.python.org/pipermail/python-ideas/2013-August/022630.html

    [24] https://mail.python.org/pipermail/python-dev/2013-September/128429.html


Copyright

    This document has been placed in the public domain.



pep-0451 A ModuleSpec Type for the Import System

PEP:451
Title:A ModuleSpec Type for the Import System
Version:$Revision$
Last-Modified:$Date$
Author:Eric Snow <ericsnowcurrently at gmail.com>
BDFL-Delegate:Brett Cannon <brett@python.org>, Nick Coghlan <ncoghlan@gmail.com>
Discussions-To:import-sig at python.org
Status:Final
Type:Standards Track
Content-Type:text/x-rst
Created:8-Aug-2013
Python-Version:3.4
Post-History:8-Aug-2013, 28-Aug-2013, 18-Sep-2013, 24-Sep-2013, 4-Oct-2013
Resolution:https://mail.python.org/pipermail/python-dev/2013-November/130104.html

Abstract

This PEP proposes to add a new class to importlib.machinery called "ModuleSpec". It will provide all the import-related information used to load a module and will be available without needing to load the module first. Finders will directly provide a module's spec instead of a loader (which they will continue to provide indirectly). The import machinery will be adjusted to take advantage of module specs, including using them to load modules.

Terms and Concepts

The changes in this proposal are an opportunity to make several existing terms and concepts more clear, whereas currently they are (unfortunately) ambiguous. New concepts are also introduced in this proposal. Finally, it's worth explaining a few other existing terms with which people may not be so familiar. For the sake of context, here is a brief summary of all three groups of terms and concepts. A more detailed explanation of the import system is found at [2].

name

In this proposal, a module's "name" refers to its fully-qualified name, meaning the fully-qualified name of the module's parent (if any) joined to the simple name of the module by a period.

finder

A "finder" is an object that identifies the loader that the import system should use to load a module. Currently this is accomplished by calling the finder's find_module() method, which returns the loader.

Finders are strictly responsible for providing the loader, which they do through their find_module() method. The import system then uses that loader to load the module.

loader

A "loader" is an object that is used to load a module during import. Currently this is done by calling the loader's load_module() method. A loader may also provide APIs for getting information about the modules it can load, as well as about data from sources associated with such a module.

Right now loaders (via load_module()) are responsible for certain boilerplate, import-related operations. These are:

  1. Perform some (module-related) validation
  2. Create the module object
  3. Set import-related attributes on the module
  4. "Register" the module to sys.modules
  5. Exec the module
  6. Clean up in the event of failure while loading the module

This all takes place during the import system's call to Loader.load_module().

origin

This is a new term and concept. The idea of it exists subtly in the import system already, but this proposal makes the concept explicit.

"origin" in an import context means the system (or resource within a system) from which a module originates. For the purposes of this proposal, "origin" is also a string which identifies such a resource or system. "origin" is applicable to all modules.

For example, the origin for built-in and frozen modules is the interpreter itself. The import system already identifies this origin as "built-in" and "frozen", respectively. This is demonstrated in the following module repr: "<module 'sys' (built-in)>".

In fact, the module repr is already a relatively reliable, though implicit, indicator of a module's origin. Other modules also indicate their origin through other means, as described in the entry for "location".

It is up to the loader to decide on how to interpret and use a module's origin, if at all.

location

This is a new term. However the concept already exists clearly in the import system, as associated with the __file__ and __path__ attributes of modules, as well as the name/term "path" elsewhere.

A "location" is a resource or "place", rather than a system at large, from which a module is loaded. It qualifies as an "origin". Examples of locations include filesystem paths and URLs. A location is identified by the name of the resource, but may not necessarily identify the system to which the resource pertains. In such cases the loader would have to identify the system itself.

In contrast to other kinds of module origin, a location cannot be inferred by the loader from the module name alone. Instead, the loader must be provided with a string that identifies the location, usually by the finder that generates the loader. The loader then uses this information to locate the resource from which it will load the module. In theory you could load the module at a given location under various names.

The most common example of locations in the import system are the files from which source and extension modules are loaded. For these modules the location is identified by the string in the __file__ attribute. Although __file__ isn't particularly accurate for some modules (e.g. zipped), it is currently the only way that the import system indicates that a module has a location.

A module that has a location may be called "locatable".

cache

The import system stores compiled modules in the __pycache__ directory as an optimization. This module cache that we use today was provided by PEP 3147. For this proposal, the relevant API for module caching is the __cached__ attribute of modules and the cache_from_source() function in importlib.util. Loaders are responsible for putting modules into the cache (and loading out of the cache). Currently the cache is only used for compiled source modules. However, loaders may take advantage of the module cache for other kinds of modules.
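
For example, cache_from_source() maps a source path to its __pycache__ location (the exact filename embeds an interpreter-specific tag such as "cpython-34"):

```python
import os
from importlib.util import cache_from_source

# Map a hypothetical source path to its PEP 3147 cache location.
cached = cache_from_source(os.path.join("pkg", "mod.py"))
assert os.path.dirname(cached) == os.path.join("pkg", "__pycache__")
assert os.path.basename(cached).startswith("mod.")
assert cached.endswith(".pyc")
```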

package

The concept does not change, nor does the term. However, the distinction between modules and packages is mostly superficial. Packages are modules. They simply have a __path__ attribute and import may add attributes bound to submodules. The typically perceived difference is a source of confusion. This proposal explicitly de-emphasizes the distinction between packages and modules where it makes sense to do so.
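
The superficiality of the distinction is easy to verify with any stdlib package (email is used here purely as an example):

```python
import types
import email            # a package
import email.message    # a plain submodule

# Both are ordinary module objects; only the package carries __path__.
assert type(email) is type(email.message) is types.ModuleType
assert hasattr(email, "__path__")
assert not hasattr(email.message, "__path__")
```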

Motivation

The import system has evolved over the lifetime of Python. In late 2002 PEP 302 introduced standardized import hooks via finders and loaders and sys.meta_path. The importlib module, introduced with Python 3.1, now exposes a pure Python implementation of the APIs described by PEP 302, as well as of the full import system. It is now much easier to understand and extend the import system. While a benefit to the Python community, this greater accessibility also presents a challenge.

As more developers come to understand and customize the import system, any weaknesses in the finder and loader APIs will be more impactful. So the sooner we can address any such weaknesses in the import system, the better...and there are a couple we hope to take care of with this proposal.

Firstly, any time the import system needs to save information about a module we end up with more attributes on module objects that are generally only meaningful to the import system. It would be nice to have a per-module namespace in which to put future import-related information and to pass around within the import system. Secondly, there's an API void between finders and loaders that causes undue complexity when encountered. The PEP 420 (namespace packages) implementation had to work around this. The complexity surfaced again during recent efforts on a separate proposal. [1]

The finder and loader sections above detail current responsibility of both. Notably, loaders are not required to provide any of the functionality of their load_module() method through other methods. Thus, though the import-related information about a module is likely available without loading the module, it is not otherwise exposed.

Furthermore, the requirements associated with load_module() are common to all loaders and mostly are implemented in exactly the same way. This means every loader has to duplicate the same boilerplate code. importlib.util provides some tools that help with this, but it would be more helpful if the import system simply took charge of these responsibilities. The trouble is that this would limit the degree of customization that load_module() could easily continue to facilitate.

More importantly, while a finder could provide the information that the loader's load_module() would need, it currently has no consistent way to get it to the loader. This is a gap between finders and loaders which this proposal aims to fill.

Finally, when the import system calls a finder's find_module(), the finder makes use of a variety of information about the module that is useful outside the context of the method. Currently the options for persisting that per-module information past the method call are limited, since find_module() returns only the loader. Popular workarounds are to store the information in a module-to-info mapping somewhere on the finder itself, or to store it on the loader.

Unfortunately, loaders are not required to be module-specific. On top of that, some of the useful information finders could provide is common to all finders, so ideally the import system could take care of those details. This is the same gap as before between finders and loaders.

As an example of complexity attributable to this flaw, the implementation of namespace packages in Python 3.3 (see PEP 420) added FileFinder.find_loader() because there was no good way for find_module() to provide the namespace search locations.

The answer to this gap is a ModuleSpec object that contains the per-module information and takes care of the boilerplate functionality involved with loading the module.

Specification

The goal is to address the gap between finders and loaders while changing as little of their semantics as possible. Though some functionality and information is moved to the new ModuleSpec type, their behavior should remain the same. However, for the sake of clarity the finder and loader semantics will be explicitly identified.

Here is a high-level summary of the changes described by this PEP. More detail is available in later sections.

importlib.machinery.ModuleSpec (new)

An encapsulation of a module's import-system-related state during import. See the ModuleSpec section below for a more detailed description.

  • ModuleSpec(name, loader, *, origin=None, loader_state=None, is_package=None)

Attributes:

  • name - a string for the fully-qualified name of the module.
  • loader - the loader to use for loading.
  • origin - the name of the place from which the module is loaded, e.g. "builtin" for built-in modules and the filename for modules loaded from source.
  • submodule_search_locations - list of strings for where to find submodules, if a package (None otherwise).
  • loader_state - a container of extra module-specific data for use during loading.
  • cached (property) - a string for where the compiled module should be stored.
  • parent (RO-property) - the fully-qualified name of the package to which the module belongs as a submodule (or None).
  • has_location (RO-property) - a flag indicating whether or not the module's "origin" attribute refers to a location.
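
A spec can be constructed and inspected directly (the name "pkg.mod" is purely illustrative):

```python
from importlib.machinery import ModuleSpec

spec = ModuleSpec("pkg.mod", loader=None, origin="built-in")
assert spec.name == "pkg.mod"
assert spec.parent == "pkg"          # derived from the fully-qualified name
assert spec.has_location is False    # "built-in" is an origin, not a location
assert spec.submodule_search_locations is None   # not a package
```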

importlib.util Additions

These are ModuleSpec factory functions, meant as a convenience for finders. See the Factory Functions section below for more detail.

  • spec_from_file_location(name, location, *, loader=None, submodule_search_locations=None) - build a spec from file-oriented information and loader APIs.
  • spec_from_loader(name, loader, *, origin=None, is_package=None) - build a spec with missing information filled in by using loader APIs.

Other API Additions

  • importlib.find_spec(name, path=None, target=None) will work exactly the same as importlib.find_loader() (which it replaces), but return a spec instead of a loader.

For finders:

  • importlib.abc.MetaPathFinder.find_spec(name, path, target) and importlib.abc.PathEntryFinder.find_spec(name, target) will return a module spec to use during import.

For loaders:

  • importlib.abc.Loader.exec_module(module) will execute a module in its own namespace. It replaces importlib.abc.Loader.load_module(), taking over its module execution functionality.
  • importlib.abc.Loader.create_module(spec) (optional) will return the module to use for loading.

For modules:

  • Module objects will have a new attribute: __spec__.
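
These additions can be exercised together; note that in current releases the function lives in importlib.util as find_spec() (importlib.find_spec() was later deprecated in its favor):

```python
import json
from importlib.util import find_spec

spec = find_spec("json")
assert spec is not None
assert spec.name == "json"
assert spec.loader is not None

# Every imported module carries its spec in the new __spec__ attribute:
assert json.__spec__.name == "json"
assert json.__spec__.origin == json.__file__
```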

API Changes

  • InspectLoader.is_package() will become optional.

Deprecations

  • importlib.abc.MetaPathFinder.find_module()
  • importlib.abc.PathEntryFinder.find_module()
  • importlib.abc.PathEntryFinder.find_loader()
  • importlib.abc.Loader.load_module()
  • importlib.abc.Loader.module_repr()
  • importlib.util.set_package()
  • importlib.util.set_loader()
  • importlib.find_loader()

Removals

These were introduced prior to Python 3.4's release, so they can simply be removed.

  • importlib.abc.Loader.init_module_attrs()
  • importlib.util.module_to_load()

Other Changes

  • The import system implementation in importlib will be changed to make use of ModuleSpec.
  • importlib.reload() will make use of ModuleSpec.
  • A module's import-related attributes (other than __spec__) will no longer be used directly by the import system during that module's import. However, this does not impact use of those attributes (e.g. __path__) when loading other modules (e.g. submodules).
  • Import-related attributes should no longer be added to modules directly, except by the import system.
  • The module type's __repr__() will be a thin wrapper around a pure Python implementation which will leverage ModuleSpec.
  • The spec for the __main__ module will reflect the appropriate name and origin.

Backward-Compatibility

  • If a finder does not define find_spec(), a spec is derived from the loader returned by find_module().
  • PathEntryFinder.find_loader() still takes priority over find_module().
  • Loader.load_module() is used if exec_module() is not defined.

What Will Not Change?

  • The syntax and semantics of the import statement.
  • Existing finders and loaders will continue to work normally.
  • The import-related module attributes will still be initialized with the same information.
  • Finders will still create loaders (now storing them in specs).
  • Loader.load_module(), if a module defines it, will have all the same requirements and may still be called directly.
  • Loaders will still be responsible for module data APIs.
  • importlib.reload() will still overwrite the import-related attributes.

Responsibilities

Here's a quick breakdown of where responsibilities lie after this PEP.

finders:

  • create/identify a loader that can load the module.
  • create the spec for the module.

loaders:

  • create the module (optional).
  • execute the module.

ModuleSpec:

  • orchestrate module loading
  • boilerplate for module loading, including managing sys.modules and setting import-related attributes
  • create module if loader doesn't
  • call loader.exec_module(), passing in the module in which to exec
  • contain all the information the loader needs to exec the module
  • provide the repr for modules

What Will Existing Finders and Loaders Have to Do Differently?

Immediately? Nothing. The status quo will be deprecated, but will continue working. However, here are the things that the authors of finders and loaders should change relative to this PEP:

  • Implement find_spec() on finders.
  • Implement exec_module() on loaders, if possible.

The ModuleSpec factory functions in importlib.util are intended to be helpful for converting existing finders. spec_from_loader() and spec_from_file_location() are both straight-forward utilities in this regard.

For existing loaders, exec_module() should be a relatively direct conversion from the non-boilerplate portion of load_module(). In some uncommon cases the loader should also implement create_module().
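
A minimal sketch of new-style hooks, using spec_from_loader() as suggested above. (StringLoader, StringFinder, and the "demo" module are hypothetical names invented for this illustration, not part of the PEP.)

```python
import sys
from importlib.abc import Loader, MetaPathFinder
from importlib.util import spec_from_loader

class StringLoader(Loader):
    """Hypothetical loader: modules from an in-memory name -> source mapping."""
    def __init__(self, sources):
        self.sources = sources

    def exec_module(self, module):
        # Only the non-boilerplate part of the old load_module(): the import
        # machinery has already created and registered the module object.
        exec(self.sources[module.__spec__.name], module.__dict__)

class StringFinder(MetaPathFinder):
    def __init__(self, loader):
        self.loader = loader

    def find_spec(self, name, path=None, target=None):
        if name in self.loader.sources:
            # Let the factory function fill in the spec from loader APIs.
            return spec_from_loader(name, self.loader)
        return None

sys.meta_path.append(StringFinder(StringLoader({"demo": "answer = 42"})))
import demo
assert demo.answer == 42
```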

ModuleSpec Users

ModuleSpec objects have 3 distinct target audiences: Python itself, import hooks, and normal Python users.

Python will use specs in the import machinery, in interpreter startup, and in various standard library modules. Some modules are import-oriented, like pkgutil, and others are not, like pickle and pydoc. In all cases, the full ModuleSpec API will get used.

Import hooks (finders and loaders) will make use of the spec in specific ways. First of all, finders may use the spec factory functions in importlib.util to create spec objects. They may also directly adjust the spec attributes after the spec is created. Secondly, the finder may bind additional information to the spec (in finder_extras) for the loader to consume during module creation/execution. Finally, loaders will make use of the attributes on a spec when creating and/or executing a module.

Python users will be able to inspect a module's __spec__ to get import-related information about the module. Generally, Python applications and interactive users will not be using the ModuleSpec factory functions nor any of the instance methods.

How Loading Will Work

Here is an outline of what the import machinery does during loading, adjusted to take advantage of the module's spec and the new loader API:

module = None
if spec.loader is not None and hasattr(spec.loader, 'create_module'):
    module = spec.loader.create_module(spec)
if module is None:
    module = ModuleType(spec.name)
# The import-related module attributes get set here:
_init_module_attrs(spec, module)

if spec.loader is None and spec.submodule_search_locations is not None:
    # Namespace package
    sys.modules[spec.name] = module
elif not hasattr(spec.loader, 'exec_module'):
    spec.loader.load_module(spec.name)
    # __loader__ and __package__ would be explicitly set here for
    # backwards-compatibility.
else:
    sys.modules[spec.name] = module
    try:
        spec.loader.exec_module(module)
    except BaseException:
        try:
            del sys.modules[spec.name]
        except KeyError:
            pass
        raise
module_to_return = sys.modules[spec.name]

These steps are exactly what Loader.load_module() is already expected to do. Loaders will thus be simplified since they will only need to implement exec_module().

Note that we must return the module from sys.modules. During loading the module may have replaced itself in sys.modules. Since we don't have a post-import hook API to accommodate the use case, we have to deal with it. However, in the replacement case we do not worry about setting the import-related module attributes on the object. The module writer is on their own if they are doing this.

How Reloading Will Work

Here is the corresponding outline for reload():

_RELOADING = {}

def reload(module):
    try:
        name = module.__spec__.name
    except AttributeError:
        name = module.__name__
    spec = find_spec(name, target=module)

    if sys.modules.get(name) is not module:
        raise ImportError
    if name in _RELOADING:
        return _RELOADING[name]
    _RELOADING[name] = module
    try:
        if spec.loader is None:
            # Namespace loader
            _init_module_attrs(spec, module)
            return module
        if spec.parent and spec.parent not in sys.modules:
            raise ImportError

        _init_module_attrs(spec, module)
        # Ignoring backwards-compatibility call to load_module()
        # for simplicity.
        spec.loader.exec_module(module)
        return sys.modules[name]
    finally:
        del _RELOADING[name]

A key point here is that the switch to Loader.exec_module() means that loaders will no longer have an easy way to know at execution time whether it is a reload or not. Before this proposal, they could simply check to see if the module was already in sys.modules. Now, by the time exec_module() is called during load (not reload) the import machinery would already have placed the module in sys.modules. This is part of the reason why find_spec() has the "target" parameter.

The semantics of reload will remain essentially the same as they exist already [5]. The impact of this PEP on some kinds of lazy loading modules was a point of discussion. [4]
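The reload semantics can be observed with importlib.reload() directly. This throwaway sketch (the module name is invented) shows that reload re-executes the module in place rather than creating a new module object:

```python
import importlib
import os
import sys
import tempfile

sys.dont_write_bytecode = True   # keep the sketch free of stale .pyc files

# Write a throwaway module, import it, change it on disk, then reload.
tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "reload_demo.py")
with open(path, "w") as f:
    f.write("VALUE = 1\n")

sys.path.insert(0, tmpdir)
import reload_demo
print(reload_demo.VALUE)          # 1

with open(path, "w") as f:
    f.write("VALUE = 2\n")

# reload() re-executes the module via the loader's exec_module();
# the module object (and its __spec__) is reused, not recreated.
reloaded = importlib.reload(reload_demo)
print(reloaded.VALUE)             # 2
print(reloaded is reload_demo)    # True: same module object
```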

ModuleSpec

Attributes

Each of the following names is an attribute on ModuleSpec objects. A value of None indicates "not set". This contrasts with module objects where the attribute simply doesn't exist. Most of the attributes correspond to the import-related attributes of modules. Here is the mapping. The reverse of this mapping describes how the import machinery sets the module attributes right before calling exec_module().

On ModuleSpec                On Modules
---------------------------  ----------------
name                         __name__
loader                       __loader__
parent                       __package__
origin                       __file__ *
cached                       __cached__ *, **
submodule_search_locations   __path__ **
loader_state                 -
has_location                 -

*  Set on the module only if spec.has_location is true.
** Set on the module only if the spec attribute is not None.

While parent and has_location are read-only properties, the remaining attributes can be replaced after the module spec is created and even after import is complete. This allows for unusual cases where directly modifying the spec is the best option. However, typical use should not involve changing the state of a module's spec.
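The mapping can be checked against any imported stdlib package; here json serves as the example:

```python
import importlib.util

# Compare a package's spec with the corresponding import-related
# module attributes described in the mapping above.
spec = importlib.util.find_spec("json")
import json

print(spec.name == json.__name__ == "json")    # True
print(spec.loader is json.__loader__)          # True
print(spec.parent == json.__package__)         # True: a package is its own parent
print(spec.origin == json.__file__)            # True: has_location is set
print(spec.has_location)                       # True
```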

origin

"origin" is a string for the name of the place from which the module originates. See origin above. Aside from the informational value, it is also used in the module's repr. In the case of a spec where "has_location" is true, __file__ is set to the value of "origin". For built-in modules "origin" would be set to "built-in".

has_location

As explained in the location section above, many modules are "locatable", meaning there is a corresponding resource from which the module will be loaded and that resource can be described by a string. In contrast, non-locatable modules can't be loaded in this fashion, e.g. builtin modules and modules dynamically created in code. For these, the name is the only way to access them, so they have an "origin" but not a "location".

"has_location" is true if the module is locatable. In that case the spec's origin is used as the location and __file__ is set to spec.origin. If additional location information is required (e.g. zipimport), that information may be stored in spec.loader_state.

"has_location" may be implied from the existence of a get_data() method on the loader.

Incidentally, not all locatable modules will be cache-able, but most will.

submodule_search_locations

The list of location strings, typically directory paths, in which to search for submodules. If the module is a package this will be set to a list (even an empty one). Otherwise it is None.

The name of the corresponding module attribute, __path__, is relatively ambiguous. Instead of mirroring it, we use a more explicit attribute name that makes the purpose clear.
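For example, comparing a package's spec with a plain submodule's spec:

```python
import importlib.util

# A package's spec carries a list of locations in which to search for
# submodules; a plain module's spec has None instead.
pkg_spec = importlib.util.find_spec("json")          # a package
mod_spec = importlib.util.find_spec("json.decoder")  # a plain submodule

print(pkg_spec.submodule_search_locations is not None)  # True (a list)
print(mod_spec.submodule_search_locations)              # None
```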

loader_state

A finder may set loader_state to any value to provide additional data for the loader to use during loading. A value of None is the default and indicates that there is no additional data. Otherwise it can be set to any object, such as a dict, list, or types.SimpleNamespace, containing the relevant extra information.

For example, zipimporter could use it to pass the zip archive name to the loader directly, rather than needing to derive it from origin or create a custom loader for each find operation.

loader_state is meant for use by the finder and corresponding loader. It is not guaranteed to be a stable resource for any other use.
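A minimal sketch of the zipimport-style use case described above; the archive path and key names are purely illustrative, not a real zipimporter API:

```python
from importlib.machinery import ModuleSpec

# Sketch: a finder for an archive-based import scheme records the
# archive path in loader_state so its loader need not re-derive it
# from origin. All names here are hypothetical.
spec = ModuleSpec("example.mod", loader=None,
                  origin="archive.zip/example/mod.py")
spec.loader_state = {"archive": "archive.zip",
                     "inner_path": "example/mod.py"}

print(spec.loader_state["archive"])
```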

Factory Functions

spec_from_file_location(name, location, *, loader=None, submodule_search_locations=None)

Build a spec from file-oriented information and loader APIs.

  • "origin" will be set to the location.
  • "has_location" will be set to True.
  • "cached" will be set to the result of calling cache_from_source().
  • "origin" can be deduced from loader.get_filename() (if "location" is not passed in).
  • "loader" can be deduced from suffix if the location is a filename.
  • "submodule_search_locations" can be deduced from loader.is_package() and from os.path.dirname(location) if location is a filename.
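A short example of building and using such a spec (module_from_spec, used here to create the module from the spec, was added to importlib.util in Python 3.5; the module name is invented):

```python
import importlib.util
import os
import tempfile

# Build a spec, and then a module, directly from a file path.
tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "demo_mod.py")
with open(path, "w") as f:
    f.write("ANSWER = 42\n")

spec = importlib.util.spec_from_file_location("demo_mod", path)
print(spec.origin == path)      # True: origin is the location
print(spec.has_location)        # True

module = importlib.util.module_from_spec(spec)
spec.loader.exec_module(module)
print(module.ANSWER)            # 42
```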

spec_from_loader(name, loader, *, origin=None, is_package=None)

Build a spec with missing information filled in by using loader APIs.

  • "has_location" can be deduced from loader.get_data.
  • "origin" can be deduced from loader.get_filename().
  • "submodule_search_locations" can be deduced from loader.is_package() and from os.path.dirname(origin) if the origin is a filename.
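A sketch with a minimal hand-written loader (the loader class and module name are invented for illustration):

```python
import importlib.util

# A minimal loader that serves source from a string.
class StringLoader:
    def __init__(self, source):
        self.source = source

    def is_package(self, name):       # consulted by spec_from_loader()
        return False

    def create_module(self, spec):
        return None                   # use the default module creation

    def exec_module(self, module):
        exec(self.source, module.__dict__)

loader = StringLoader("X = 'hello'")
spec = importlib.util.spec_from_loader("strmod", loader)

print(spec.loader is loader)             # True
print(spec.submodule_search_locations)   # None: is_package() returned False

module = importlib.util.module_from_spec(spec)
spec.loader.exec_module(module)
print(module.X)
```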

Backward Compatibility

ModuleSpec doesn't have any. This would be a different story if Finder.find_module() were to return a module spec instead of a loader. In that case, specs would have to act like the loader that would have been returned instead. Doing so would be relatively simple, but is an unnecessary complication. It was part of earlier versions of this PEP.

Subclassing

Subclasses of ModuleSpec are allowed, but should not be necessary. Simply setting loader_state or adding functionality to a custom finder or loader will likely be a better fit and should be tried first. However, as long as a subclass still fulfills the requirements of the import system, objects of that type are completely fine as the return value of Finder.find_spec(). The same points apply to duck-typing.

Existing Types

Module Objects

Other than adding __spec__, none of the import-related module attributes will be changed or deprecated, though some of them could be; any such deprecation can wait until Python 4.

A module's spec will not be kept in sync with the corresponding import- related attributes. Though they may differ, in practice they will typically be the same.

One notable exception is the case where a module is run as a script by using the -m flag. In that case module.__spec__.name will reflect the actual module name while module.__name__ will be __main__.

A module's spec is not guaranteed to be identical between two modules with the same name. Likewise there is no guarantee that successive calls to importlib.find_spec() will return the same object or even an equivalent object, though at least the latter is likely.
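The "equivalent but not identical" behavior can be demonstrated with successive find_spec() calls ("colorsys" is just a small stdlib module chosen for the illustration):

```python
import importlib.util
import sys

# Each call to find_spec() for a not-yet-imported module builds a
# fresh spec; the specs are equivalent but not the same object.
name = "colorsys"
sys.modules.pop(name, None)            # make sure it is not cached
a = importlib.util.find_spec(name)
b = importlib.util.find_spec(name)
print(a is b)                          # False: two distinct spec objects
print(a.name == b.name == name)        # True
print(a.origin == b.origin)            # True

# Once imported, the module carries the spec actually used to load it.
import colorsys
print(colorsys.__spec__.name)          # 'colorsys'
```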

Finders

Finders are still responsible for identifying, and typically creating, the loader that should be used to load a module. That loader will now be stored in the module spec returned by find_spec() rather than returned directly. As is currently the case without the PEP, if a loader would be costly to create, that loader can be designed to defer the cost until later.

MetaPathFinder.find_spec(name, path=None, target=None)

PathEntryFinder.find_spec(name, target=None)

Finders must return ModuleSpec objects when find_spec() is called. This new method replaces find_module() and find_loader() (in the PathEntryFinder case). If a finder does not provide find_spec(), find_module() and find_loader() are used instead, for backward compatibility.

Adding yet another similar method to finders is a case of practicality. find_module() could be changed to return specs instead of loaders. This is tempting because the import APIs have suffered enough, especially considering PathEntryFinder.find_loader() was just added in Python 3.3. However, the extra complexity and a less-than-explicit method name aren't worth it.
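A minimal find_spec()-based meta path finder might look like this sketch (the finder, loader, and module names are invented; sources are served from an in-memory dict):

```python
import sys
import importlib.abc
import importlib.util

# Hypothetical source registry for this sketch.
SOURCES = {"memmod": "GREETING = 'hi'"}

class MemLoader(importlib.abc.Loader):
    def __init__(self, source):
        self.source = source

    def create_module(self, spec):
        return None                # use the default module creation

    def exec_module(self, module):
        exec(self.source, module.__dict__)

class MemFinder(importlib.abc.MetaPathFinder):
    def find_spec(self, name, path=None, target=None):
        if name not in SOURCES:
            return None            # let the other finders try
        return importlib.util.spec_from_loader(
            name, MemLoader(SOURCES[name]))

sys.meta_path.insert(0, MemFinder())
import memmod
print(memmod.GREETING)
```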

The "target" parameter of find_spec()

A call to find_spec() may optionally include a "target" argument. This is the module object that will be used subsequently as the target of loading. During normal import (and by default) "target" is None, meaning the target module has yet to be created. During reloading the module passed in to reload() is passed through to find_spec() as the target. This argument allows the finder to build the module spec with more information than is otherwise available. Doing so is particularly relevant in identifying the loader to use.

Through find_spec() the finder will always identify the loader it will return in the spec (or return None). At the point the loader is identified, the finder should also decide whether or not the loader supports loading into the target module, in the case that "target" is passed in. This decision may entail consulting with the loader.

If the finder determines that the loader does not support loading into the target module, it should either find another loader or raise ImportError (completely stopping import of the module). This determination is especially important during reload since, as noted in How Reloading Will Work, loaders will no longer be able to trivially identify a reload situation on their own.

Two alternatives were presented to the "target" parameter: Loader.supports_reload() and adding "target" to Loader.exec_module() instead of find_spec(). supports_reload() was the initial approach to the reload situation. [6] However, there was some opposition to the loader-specific, reload-centric approach. [7]

As to "target" on exec_module(), the loader may need other information from the target module (or spec) during reload, more than just "does this loader support reloading this module", that is no longer available with the move away from load_module(). A proposal on the table was to add something like "target" to exec_module(). [8] However, putting "target" on find_spec() instead is more in line with the goals of this PEP. Furthermore, it obviates the need for supports_reload().

Namespace Packages

Currently a path entry finder may return (None, portions) from find_loader() to indicate it found part of a possible namespace package. To achieve the same effect, find_spec() must return a spec with "loader" set to None (a.k.a. not set) and with submodule_search_locations set to the same portions as would have been provided by find_loader(). It's up to PathFinder how to handle such specs.

Loaders

Loader.exec_module(module)

Loaders will have a new method, exec_module(). Its only job is to "exec" the module and consequently populate the module's namespace. It is not responsible for creating or preparing the module object, nor for any cleanup afterward. It has no return value. exec_module() will be used during both loading and reloading.

exec_module() should properly handle the case where it is called more than once. For some kinds of modules this may mean raising ImportError every time after the first time the method is called. This is particularly relevant for reloading, where some kinds of modules do not support in-place reloading.

Loader.create_module(spec)

Loaders may also implement create_module(), which will return a new module to exec. It may return None to indicate that the default module creation code should be used. One use case, though atypical, for create_module() is to provide a module that is a subclass of the builtin module type. Most loaders will not need to implement create_module().

create_module() should properly handle the case where it is called more than once for the same spec/module. This may include returning None or raising ImportError.
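A sketch of the atypical create_module() use case described above, returning a ModuleType subclass (all names here are illustrative):

```python
import importlib.util
from types import ModuleType

# A module subclass with a customized repr.
class VerboseModule(ModuleType):
    def __repr__(self):
        return "<verbose module {!r}>".format(self.__name__)

class SubclassLoader:
    def create_module(self, spec):
        # Non-None: the machinery uses this object instead of the default.
        return VerboseModule(spec.name)

    def exec_module(self, module):
        module.loaded = True       # only populate the namespace

spec = importlib.util.spec_from_loader("verbosemod", SubclassLoader())
module = importlib.util.module_from_spec(spec)
spec.loader.exec_module(module)
print(repr(module))     # <verbose module 'verbosemod'>
print(module.loaded)    # True
```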

Note

exec_module() and create_module() should not set any import-related module attributes. The fact that load_module() does is a design flaw that this proposal aims to correct.

Other changes:

PEP 420 introduced the optional module_repr() loader method to limit the amount of special-casing in the module type's __repr__(). Since this method is part of ModuleSpec, it will be deprecated on loaders. However, if it exists on a loader it will be used exclusively.

The Loader.init_module_attrs() method, added prior to Python 3.4's release, will be removed in favor of the same method on ModuleSpec.

However, InspectLoader.is_package() will not be deprecated even though the same information is found on ModuleSpec. ModuleSpec can use it to populate its own is_package if that information is not otherwise available. Still, it will be made optional.

In addition to executing a module during loading, loaders will still be directly responsible for providing APIs concerning module-related data.

Other Changes

  • The various finders and loaders provided by importlib will be updated to comply with this proposal.
  • Any other implementations of or dependencies on the import-related APIs (particularly finders and loaders) in the stdlib will likewise be adjusted to comply with this PEP. While they should continue to work, any such changes that get missed should be considered bugs for the Python 3.4.x series.
  • The spec for the __main__ module will reflect how the interpreter was started. For instance, with -m the spec's name will be that of the module used, while __main__.__name__ will still be "__main__".
  • We will add importlib.find_spec() to mirror importlib.find_loader() (which becomes deprecated).
  • importlib.reload() is changed to use ModuleSpec.
  • importlib.reload() will now make use of the per-module import lock.

Reference Implementation

A reference implementation is available at http://bugs.python.org/issue18864.

Implementation Notes

  • The implementation of this PEP needs to be cognizant of its impact on pkgutil (and setuptools). pkgutil has some generic function-based extensions to PEP 302 which may break if importlib starts wrapping loaders without the tools' knowledge.
  • Other modules to look at: runpy (and pythonrun.c), pickle, pydoc, inspect.

For instance, pickle should be updated in the __main__ case to look at module.__spec__.name.

Rejected Additions to the PEP

There were a few proposed additions to this proposal that did not fit well enough into its scope.

There is no "PathModuleSpec" subclass of ModuleSpec that separates out has_location, cached, and submodule_search_locations. While that might make the separation cleaner, module objects don't have that distinction. ModuleSpec will support both cases equally well.

While "ModuleSpec.is_package" would be a simple additional attribute (aliasing self.submodule_search_locations is not None), it perpetuates the artificial (and mostly erroneous) distinction between modules and packages.

The module spec Factory Functions could be classmethods on ModuleSpec. However that would expose them on all modules via __spec__, which has the potential to unnecessarily confuse non-advanced Python users. The factory functions have a specific use case, to support finder authors. See ModuleSpec Users.

Likewise, several other methods could be added to ModuleSpec that expose the specific uses of module specs by the import machinery:

  • create() - a wrapper around Loader.create_module().
  • exec(module) - a wrapper around Loader.exec_module().
  • load() - an analogue to the deprecated Loader.load_module().

As with the factory functions, exposing these methods via module.__spec__ is less than desirable. They would end up being an attractive nuisance, even if only exposed as "private" attributes (as they were in previous versions of this PEP). If someone finds a need for these methods later, we can expose them via an appropriate API (separate from ModuleSpec) at that point, perhaps relative to PEP 406 (import engine).

Conceivably, the load() method could optionally take a list of modules with which to interact instead of sys.modules. Also, load() could be leveraged to implement multi-version imports. Both are interesting ideas, but definitely outside the scope of this proposal.

Others left out:

  • Add ModuleSpec.submodules (RO-property) - returns possible submodules relative to the spec.
  • Add ModuleSpec.loaded (RO-property) - the module in sys.modules, if any.
  • Add ModuleSpec.data - a descriptor that wraps the data API of the spec's loader.
  • Also see [3].

pep-0452 API for Cryptographic Hash Functions v2.0

PEP: 452
Title: API for Cryptographic Hash Functions v2.0
Version: $Revision$
Last-Modified: $Date$
Author: A.M. Kuchling <amk at amk.ca>, Christian Heimes <christian at python.org>
Status: Draft
Type: Informational
Created: 15-Aug-2013
Post-History: 
Replaces: 247

Abstract

    There are several different modules available that implement
    cryptographic hashing algorithms such as MD5 or SHA.  This
    document specifies a standard API for such algorithms, to make it
    easier to switch between different implementations.


Specification

    All hashing modules should present the same interface.  Additional
    methods or variables can be added, but those described in this
    document should always be present.

    Hash function modules define one function:

    new([string])            (unkeyed hashes)
    new(key, [string], [digestmod])    (keyed hashes)

        Create a new hashing object and return it.  The first form is
        for hashes that are unkeyed, such as MD5 or SHA.  For keyed
        hashes such as HMAC, 'key' is a required parameter containing
        a string giving the key to use.  In both cases, the optional
        'string' parameter, if supplied, will be immediately hashed
        into the object's starting state, as if obj.update(string) was
        called.

        After creating a hashing object, arbitrary bytes can be fed
        into the object using its update() method, and the hash value
        can be obtained at any time by calling the object's digest()
        method.

        Although the parameter is called 'string', hashing objects operate
        on 8-bit data only. Both 'key' and 'string' must be a bytes-like
        object (bytes, bytearray...). A hashing object may support
        one-dimensional, contiguous buffers as argument, too. Text
        (unicode) is no longer supported in Python 3.x. Python 2.x
        implementations may take ASCII-only unicode as argument, but
        portable code should not rely on the feature.

        Arbitrary additional keyword arguments can be added to this
        function, but if they're not supplied, sensible default values
        should be used.  For example, 'rounds' and 'digest_size'
        keywords could be added for a hash function which supports a
        variable number of rounds and several different output sizes,
        and they should default to values believed to be secure.

    Hash function modules define one variable:

    digest_size

        An integer value; the size of the digest produced by the
        hashing objects created by this module, measured in bytes.
        You could also obtain this value by creating a sample object
        and accessing its 'digest_size' attribute, but it can be
        convenient to have this value available from the module.
        Hashes with a variable output size will set this variable to
        None.

    Hashing objects require the following attribute:

    digest_size

        This attribute is identical to the module-level digest_size
        variable, measuring the size of the digest produced by the
        hashing object, measured in bytes.  If the hash has a variable
        output size, this output size must be chosen when the hashing
        object is created, and this attribute must contain the
        selected size.  Therefore None is *not* a legal value for this
        attribute.

    block_size

        An integer value or ``NotImplemented``; the internal block size
        of the hash algorithm in bytes. The block size is used by the
        HMAC module to pad the secret key to the block size, or to hash
        the secret key first if it is longer than the block size. If no
        HMAC algorithm is standardized for the hash algorithm, return
        ``NotImplemented`` instead.

    name

        A text string value; the canonical, lowercase name of the hashing
        algorithm. The name should be a suitable parameter for
        :func:`hashlib.new`.

    Hashing objects require the following methods:

    copy()

        Return a separate copy of this hashing object.  An update to
        this copy won't affect the original object.

    digest()

        Return the hash value of this hashing object as a bytes object
        containing 8-bit data.  The object is not altered in any way
        by this function; you can continue updating the object after
        calling this function.

    hexdigest()

        Return the hash value of this hashing object as a string
        containing hexadecimal digits.  Lowercase letters should be used
        for the digits 'a' through 'f'.  Like the .digest() method, this
        method mustn't alter the object.

    update(string)

        Hash bytes-like 'string' into the current state of the hashing
        object. update() can be called any number of times during a
        hashing object's lifetime.

    Hashing modules can define additional module-level functions or
    object methods and still be compliant with this specification.

    Here's an example, using a module named 'MD5':

        >>> import hashlib
        >>> from Crypto.Hash import MD5
        >>> m = MD5.new()
        >>> isinstance(m, hashlib.CryptoHash)
        True
        >>> m.name
        'md5'
        >>> m.digest_size
        16
        >>> m.block_size
        64
        >>> m.update(b'abc')
        >>> m.digest()
        b'\x90\x01P\x98<\xd2O\xb0\xd6\x96?}(\xe1\x7fr'
        >>> m.hexdigest()
        '900150983cd24fb0d6963f7d28e17f72'
        >>> MD5.new(b'abc').digest()
        b'\x90\x01P\x98<\xd2O\xb0\xd6\x96?}(\xe1\x7fr'


Rationale

    The digest size is measured in bytes, not bits, even though hash
    algorithm sizes are usually quoted in bits; MD5 is a 128-bit
    algorithm and not a 16-byte one, for example.  This is because, in
    the sample code I looked at, the length in bytes is often needed
    (to seek ahead or behind in a file; to compute the length of an
    output string) while the length in bits is rarely used.
    Therefore, the burden will fall on the few people actually needing
    the size in bits, who will have to multiply digest_size by 8.
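    The bytes-to-bits relationship in question is simply:

```python
import hashlib

# digest_size is measured in bytes; multiply by 8 for the bit count
# conventionally quoted for the algorithm.
h = hashlib.md5()
print(h.digest_size)       # 16 (bytes)
print(h.digest_size * 8)   # 128: "MD5 is a 128-bit algorithm"
print(len(h.digest()) == h.digest_size)  # True
```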

    It's been suggested that the update() method would be better named
    append().  However, that method is really causing the current
    state of the hashing object to be updated, and update() is already
    used by the md5 and sha modules included with Python, so it seems
    simplest to leave the name update() alone.

    The order of the constructor's arguments for keyed hashes was a
    sticky issue.  It wasn't clear whether the key should come first
    or second.  It's a required parameter, and the usual convention is
    to place required parameters first, but that also means that the
    'string' parameter moves from the first position to the second.
    It would be possible to get confused and pass a single argument to
    a keyed hash, thinking that you're passing an initial string to an
    unkeyed hash, but it doesn't seem worth making the interface
    for keyed hashes more obscure to avoid this potential error.


Changes from Version 1.0 to Version 2.0

    Version 2.0 of the API for Cryptographic Hash Functions clarifies
    some aspects of the API and brings it up to date. It also formalizes
    aspects that were already de facto standards provided by most
    implementations.

    Version 2.0 introduces the following new attributes:

    name

        The name property was made mandatory by :issue:`18532`.

    block_size

        The new version also specifies that the return value
        ``NotImplemented`` prevents HMAC support.

    Version 2.0 takes the separation of binary and text data in Python
    3.0 into account. The 'string' argument to new() and update() as
    well as the 'key' argument must be bytes-like objects. On Python
    2.x a hashing object may also support ASCII-only unicode. The actual
    name of the argument is not changed, as it is part of the public API.
    Code may depend on the fact that the argument is called 'string'.


Recommended names for common hashing algorithms

    algorithm       variant         recommended name
    ----------      ---------       ----------------
    MD5                             md5
    RIPEMD-160                      ripemd160
    SHA-1                           sha1
    SHA-2           SHA-224         sha224
                    SHA-256         sha256
                    SHA-384         sha384
                    SHA-512         sha512
    SHA-3           SHA-3-224       sha3_224
                    SHA-3-256       sha3_256
                    SHA-3-384       sha3_384
                    SHA-3-512       sha3_512
    WHIRLPOOL                       whirlpool


Changes

    2001-09-17: Renamed clear() to reset(); added digest_size attribute
                to objects; added .hexdigest() method.
    2001-09-20: Removed reset() method completely.
    2001-09-28: Set digest_size to None for variable-size hashes.
    2013-08-15: Added block_size and name attributes; clarified that
                'string' actually refers to bytes-like objects.


Acknowledgements

    Thanks to Aahz, Andrew Archibald, Rich Salz, Itamar
    Shtull-Trauring, and the readers of the python-crypto list for
    their comments on this PEP.


Copyright

    This document has been placed in the public domain.



pep-0453 Explicit bootstrapping of pip in Python installations

PEP:453
Title:Explicit bootstrapping of pip in Python installations
Version:$Revision$
Last-Modified:$Date$
Author:Donald Stufft <donald at stufft.io>, Nick Coghlan <ncoghlan at gmail.com>
BDFL-Delegate:Martin von Löwis
Status:Final
Type:Standards Track
Content-Type:text/x-rst
Created:10-Aug-2013
Post-History:30-Aug-2013, 15-Sep-2013, 18-Sep-2013, 19-Sep-2013, 23-Sep-2013, 29-Sep-2013, 13-Oct-2013, 20-Oct-2013
Resolution:https://mail.python.org/pipermail/python-dev/2013-October/129810.html

Abstract

This PEP proposes that the Installing Python Modules guide in Python 2.7, 3.3 and 3.4 be updated to officially recommend the use of pip as the default installer for Python packages, and that appropriate technical changes be made in Python 3.4 to provide pip by default in support of that recommendation.

PEP Acceptance

This PEP was accepted for inclusion in Python 3.4 by Martin von Löwis on Tuesday 22nd October, 2013.

Issue 19347 has been created to track the implementation of this PEP.

Rationale

There are two related, but distinct rationales for the proposal in this PEP. The first relates to the experience of new users, while the second relates to better enabling the evolution of the broader Python packaging ecosystem.

Improving the new user experience

Currently, on systems without a platform package manager and repository, installing a third-party Python package into a freshly installed Python requires first identifying an appropriate package manager and then installing it.

Even on systems that do have a platform package manager, it is unlikely to include every package that is available on the Python Package Index, and even when a desired third-party package is available, the correct name in the platform package manager may not be clear.

This means that, to work effectively with the Python Package Index ecosystem, users must know which package manager to install, where to get it, and how to install it. The effect of this is that third-party Python projects are currently required to choose from a variety of undesirable alternatives:

  • Assume the user already has a suitable cross-platform package manager installed.
  • Duplicate the instructions and tell their users how to install the package manager.
  • Completely forgo the use of dependencies to ease installation concerns for their users.

All of these available options have significant drawbacks.

If a project simply assumes a user already has the tooling, then beginning users may get a confusing error message when the installation command doesn't work. Some operating systems may ease this pain by providing a global hook that looks for commands that don't exist and suggests an OS package they can install to make the command work, but that only works on systems with platform package managers that include a package providing the relevant cross-platform installer command (such as many major Linux distributions). No such assistance is available for Windows and Mac OS X users, or for more conservative Linux distributions. The challenges of dealing with this problem for beginners (who are often also completely new to programming, the use of command line tools and editing system environment variables) are a regular feature of feedback the core Python developers receive from professional educators and others introducing new users to Python.

If a project chooses to duplicate the installation instructions and tell their users how to install the package manager before telling them how to install their own project, then whenever these instructions need updates they must be updated by every project that has duplicated them. This is particularly problematic when there are multiple competing installation tools available, and different projects recommend different tools.

This specific problem can be partially alleviated by strongly promoting pip as the default installer and recommending that other projects reference pip's own bootstrapping instructions rather than duplicating them. However the user experience created by this approach still isn't particularly good (although there is an effort under way to create a combined Windows installer for pip and its dependencies that should improve matters on that platform, and Mac OS X and *nix platforms generally have wget and hence the ability to easily download and run the bootstrap scripts from the command line).

The projects that have decided to forgo dependencies altogether are forced to either duplicate the efforts of other projects by inventing their own solutions to problems or are required to simply include the other projects in their own source trees. Both of these options present their own problems either in duplicating maintenance work across the ecosystem or potentially leaving users vulnerable to security issues because the included code or duplicated efforts are not automatically updated when upstream releases a new version.

By officially recommending and providing by default a specific cross-platform package manager it will be easier for users trying to install these third-party packages as well as easier for the people distributing them as they should now be able to safely assume that most users will have the appropriate installation tools available (or access to clear instructions on how to obtain them). This is expected to become more important in the future as the Wheel [17] package format (deliberately) does not have a built in "installer" in the form of setup.py so users wishing to install from a wheel file will want an installer even in the simplest cases.

Reducing the burden of actually installing a third-party package should also decrease the pressure to add every useful module to the standard library. This will allow additions to the standard library to focus more on why Python should have a particular tool out of the box, and why it is reasonable for that package to adopt the standard library's 18-24 month feature release cycle, instead of using the general difficulty of installing third-party packages as justification for inclusion.

Providing a standard installation system also helps with bootstrapping alternate build and installer systems, such as zc.buildout, hashdist and conda. So long as pip install <tool> works, then a standard Python-specific installer provides a reasonably secure, cross platform mechanism to get access to these utilities.

Enabling the evolution of the broader Python packaging ecosystem

As no new packaging standard can achieve widespread adoption without a transition strategy that covers the versions of Python that are in widespread current use (rather than merely future versions, like most language features), the change proposed in this PEP is considered a necessary step in the evolution of the Python packaging ecosystem.

The broader community has embraced the Python Package Index as a mechanism for distributing and installing Python software, but the different concerns of language evolution and secure software distribution mean that a faster feature release cycle that encompasses older versions is needed to properly support the latter.

In addition, the core CPython development team have the luxury of dropping support for earlier Python versions well before the rest of the community, as downstream commercial redistributors pick up the task of providing support for those versions to users that still need it, while many third party libraries maintain compatibility with those versions as long as they remain in widespread use.

This means that the current setup.py install based model for package installation poses serious difficulties for the development and adoption of new packaging standards, as, depending on how a project writes their setup.py file, the installation command (along with other operations) may end up invoking the standard library's distutils package.

As an indicator of how this may cause problems for the broader ecosystem, consider that the feature set of distutils in Python 2.6 was frozen in June 2008 (with the release of Python 2.6b1), while the feature set of distutils in Python 2.7 was frozen in April 2010 (with the release of Python 2.7b1).

By contrast, using a separate installer application like pip (which ensures that even setup.py files that invoke distutils directly still support the new packaging standards) makes it possible to support new packaging standards in older versions of Python, just by upgrading pip (which receives new feature releases roughly every 6 months). The situation on older versions of Python is further improved by making it easier for end users to install and upgrade newer build systems like setuptools or improved PyPI upload utilities like twine.

It is not coincidental that this proposed model of using a separate installer program with more metadata-heavy and less active distribution formats matches the model used by most operating systems (including Windows, since the introduction of the installer service and the MSI file format), as well as by many other language-specific installers.

For Python 2.6, this compatibility issue is largely limited to various enterprise Linux distributions (and their downstream derivatives). These distributions often have even slower update cycles than CPython, so they offer full support for versions of Python that are considered "security fix only" versions upstream (and sometimes even versions that the core development team no longer supports at all - you can still get commercial support for Python 2.3 if you really need it!).

In practice, the fact that tools like wget and curl are readily available on Linux systems, that most users of Python on Linux are already familiar with the command line, and that most Linux distributions ship with a default configuration that makes running Python scripts easy, means that the existing pip bootstrapping instructions for any *nix system are already quite straightforward. Even if pip isn't provided by the system package manager, then using wget or curl to retrieve the bootstrap script from www.pip-installer.org and then running it is just a couple of shell commands that can easily be copied and pasted as necessary.

Accordingly, for any version of Python on any *nix system, the need to bootstrap pip in older versions isn't considered a major barrier to adoption of new packaging standards, since it's just one more small speedbump encountered by users of these long term stable releases. For *nix systems, this PEP's formal endorsement of pip as the preferred default packaging tool is seen as more important than the underlying technical details involved in making pip available by default, since it shifts the nature of the conversation between the developers of pip and downstream repackagers of both pip and CPython.

For Python 2.7, on the other hand, the compatibility issue for adopting new metadata standards is far more widespread, as it affects the python.org binary installers for Windows and Mac OS X, as well as even relatively fast moving *nix platforms.

Firstly, and unlike Python 2.6, Python 2.7 is still a fully supported upstream version, and will remain so until the release of Python 2.7.9 (currently scheduled for May 2015), at which time it is expected to enter the usual "security fix only" mode. That means there are at least another 19 months where Python 2.7 is a deployment target for Python applications that enjoys full upstream support. Even after the core development team switches 2.7 to security release only mode in 2015, Python 2.7 will likely remain a commercially supported legacy target out beyond 2020.

While Python 3 already presents a compelling alternative over Python 2 for new Python applications and deployments without an existing investment in Python 2 and without a dependency on specific Python 2 only third party modules (a set which is getting ever smaller over time), it is going to take longer to create compelling business cases to update existing Python 2.7 based infrastructure to Python 3, especially in situations where the culture of automated testing is weak (or nonexistent), making it difficult to effectively use the available migration utilities.

While this PEP only proposes documentation changes for Python 2.7, once pip has a Windows installer available, a separate PEP will be created and submitted proposing the creation and distribution of aggregate installers for future CPython 2.7 maintenance releases that combine the CPython, pip and Python Launcher for Windows installers into a single download (the separate downloads would still remain available - the aggregate installers would be provided as a convenience, and as a clear indication of the recommended operating environment for Python in Windows systems).

Why pip?

pip has been chosen as the preferred default installer, as it is an already popular tool that addresses several design and user experience issues with its predecessor easy_install (these issues can't readily be fixed in easy_install itself due to backwards compatibility concerns). pip is also well suited to working within the bounds of a single Python runtime installation (including associated virtual environments), which is a desirable feature for a tool bundled with CPython.

Other tools like zc.buildout and conda are more ambitious in their aims (and hence substantially better than pip at handling external binary dependencies), so it makes sense for the Python ecosystem to treat them more like platform package managers to interoperate with rather than as the default cross-platform installation tool. This relationship is similar to that between pip and platform package management systems like apt and yum (which are also designed to handle arbitrary binary dependencies).

Proposal Overview

This PEP proposes that the Installing Python Modules guide be updated to officially recommend the use of pip as the default installer for Python packages, rather than the current approach of recommending the direct invocation of the setup.py install command.

However, to avoid recommending a tool that CPython does not provide, it is further proposed that the pip [18] package manager be made available by default when installing CPython 3.4 or later and when creating virtual environments using the standard library's venv module via the pyvenv command line utility.

To support that end, this PEP proposes the inclusion of an ensurepip bootstrapping module in Python 3.4, as well as automatic invocation of that module from pyvenv and changes to the way Python installed scripts are handled on Windows. Using a bootstrap module rather than providing pip directly helps to clearly demarcate development responsibilities, and to avoid inadvertently downgrading pip when updating CPython.

To provide clear guidance for new users of Python that may not be starting with the latest release, this PEP also proposes that the "Installing Python Modules" guides in Python 2.7 and 3.3 be updated to recommend installing and using pip, rather than invoking distutils directly. It does not propose backporting any of the code changes that are being proposed for Python 3.4.

Finally, the PEP also strongly recommends that CPython redistributors and other Python implementations ensure that pip is available by default, or at the very least, explicitly document the fact that it is not included.

This PEP does not propose making pip (or any dependencies) directly available as part of the standard library. Instead, pip will be a bundled application provided along with CPython for the convenience of Python users, but subject to its own development life cycle and able to be upgraded independently of the core interpreter and standard library.

Explicit bootstrapping mechanism

An additional module called ensurepip will be added to the standard library whose purpose is to install pip and any of its dependencies into the appropriate location (most commonly site-packages). It will expose a callable named bootstrap() as well as offer direct execution via python -m ensurepip.

The bootstrap will not contact PyPI, but instead rely on a private copy of pip stored inside the standard library. Accordingly, only options related to the installation location will be supported (--user, --root, etc).

It is considered desirable that users be strongly encouraged to use the latest available version of pip, in order to take advantage of the ongoing efforts to improve the security of the PyPI based ecosystem, as well as benefiting from the efforts to improve the speed, reliability and flexibility of that ecosystem.

In order to satisfy this goal of providing the most recent version of pip by default, the private copy of pip will be updated in CPython maintenance releases, which should align well with the 6-month cycle used for new pip releases.

Security considerations

The design in this PEP has been deliberately chosen to avoid making any significant changes to the trust model of CPython for end users that do not subsequently run the command pip install --upgrade pip.

The installers will contain all the components of a fully functioning version of Python, including the pip installer. The installation process will not require network access, and will not rely on trusting the security of the network connection established between pip and the Python package index.

Only users that choose to use pip to communicate with PyPI will need to pay attention to the additional security considerations that come with doing so.

However, the core CPython team will still assist with reviewing and resolving at least the certificate update management issue currently affecting the requests project (and hence pip), and may also be able to offer assistance in resolving other identified security concerns [6].

Reliability considerations

By including the bootstrap as part of the standard library (rather than solely as a feature of the binary installers), the correct operation of the bootstrap command can be easily tested using the existing CPython buildbot infrastructure rather than adding significantly to the testing burden for the installers themselves.

Implementation strategy

To ensure there is no need for network access when installing Python or creating virtual environments, the ensurepip module will, as an implementation detail, include a complete private copy of pip and its dependencies which will be used to extract pip and install it into the target environment. It is important to stress that this private copy of pip is only an implementation detail and it should not be relied on or assumed to exist beyond the public capabilities exposed through the ensurepip module (and indirectly through venv).

There is not yet a reference ensurepip implementation. The existing get-pip.py bootstrap script demonstrates an earlier variation of the general concept, but the standard library version would take advantage of the improved distribution capabilities offered by the CPython installers to include private copies of pip and setuptools as wheel files (rather than as embedded base64 encoded data), and would not try to contact PyPI (instead installing directly from the private wheel files).

Rather than including separate code to handle the bootstrapping, the ensurepip module will manipulate sys.path appropriately to allow the wheel files to be used to install themselves, either into the current Python installation or into a virtual environment (as determined by the options passed to the bootstrap command).
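
The self-installation trick described above relies on the fact that a wheel is a zip archive laid out as importable packages, so placing it on sys.path makes its contents importable without any prior installation. A minimal sketch of that underlying mechanism, using a synthetic archive (with a hypothetical name) rather than a real pip wheel:

```python
import os
import sys
import tempfile
import zipfile

# Build a tiny stand-in for a wheel: a zip archive whose contents are
# importable Python packages. The filename here is hypothetical.
tmp = tempfile.mkdtemp()
archive = os.path.join(tmp, "demopkg-0.1-py3-none-any.whl")
with zipfile.ZipFile(archive, "w") as zf:
    zf.writestr("demopkg/__init__.py", "VERSION = '0.1'\n")

# Adding the archive to sys.path makes its packages importable directly,
# which is how ensurepip can run the bundled pip before pip is installed.
sys.path.insert(0, archive)
import demopkg

print(demopkg.VERSION)  # -> 0.1
```

ensurepip itself does the same kind of sys.path manipulation with the bundled pip and setuptools wheels, then asks the in-archive pip to install those very wheel files into the target environment.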

It is proposed that the implementation be carried out in five separate steps (all steps after the first two are independent of each other and can be carried out in any order):

  • the first step would update the "Installing Python Modules" documentation to recommend the use of pip and reference the pip team's instructions for downloading and installing it. This change would be applied to Python 2.7, 3.3, and 3.4.
  • the ensurepip module and the private copies of the most recently released versions of pip and setuptools would be added to Python 3.4 and the 3.4 "Installing Python Modules" documentation updated accordingly.
  • the CPython Windows installer would be updated to offer the new pip installation option for Python 3.4.
  • the CPython Mac OS X installer would be updated to offer the new pip installation option for Python 3.4.
  • the venv module and pyvenv command would be updated to make use of ensurepip in Python 3.4
  • the PATH handling on Windows would be updated for Python 3.4+

Integration timeline

If this PEP is accepted, the proposed time frame for integration of pip into the CPython release is as follows:

  • as soon as possible after the release of 3.4.0 alpha 4
    • Documentation updated and ensurepip implemented based on a pre-release version of pip 1.5.
    • All other proposed functional changes for Python 3.4 implemented, including the installer updates to invoke ensurepip.
  • by November 20th (3 days prior to the scheduled date of 3.4.0 beta 1)
    • ensurepip updated to use a pip 1.5 release candidate.
    • PEP 101 updated to cover ensuring the bundled version of pip is up to date.
  • by November 24th (scheduled date of 3.4.0 beta 1)
    • As with any other new feature, all proposed functional changes for Python 3.4 must be implemented prior to the beta feature freeze.
  • by December 29th (1 week prior to the scheduled date of 3.4.0 beta 2)
    • requests certificate management issue resolved
    • ensurepip updated to the final release of pip 1.5, or a subsequent maintenance release (including a suitably updated vendored copy of requests)

(See PEP 429 for the current official scheduled dates of each release. Dates listed above are accurate as of October 20th, 2013.)

If there is no final or maintenance release of pip 1.5 with a suitable updated version of requests available by one week before the scheduled Python 3.4 beta 2 release, then implementation of this PEP will be deferred to Python 3.5. Note that this scenario is considered unlikely - the tentative date for the pip 1.5 release is currently December 1st.

In future CPython releases, this kind of coordinated scheduling shouldn't be needed: the CPython release manager will be able to just update to the latest released version of pip. However, in this case, some fixes are needed in pip in order to allow the bundling to work correctly, and the certificate update mechanism for requests needs to be improved, so the pip 1.5 release cycle needs to be properly aligned with the CPython 3.4 beta releases.

Proposed CLI

The proposed CLI is based on a subset of the existing pip install options:

Usage:
  python -m ensurepip [options]

General Options:
  -h, --help          Show help.
  -v, --verbose       Give more output. Option is additive, and can be used up to 3 times.
  -V, --version       Show the pip version that would be extracted and exit.
  -q, --quiet         Give less output.

Installation Options:
  -U, --upgrade       Upgrade pip and dependencies, even if already installed
  --user              Install using the user scheme.
  --root <dir>        Install everything relative to this alternate root directory.

In most cases, end users won't need to use this CLI directly, as pip should have been installed automatically when installing Python or when creating a virtual environment. However, it is formally documented as a public interface to support at least these known use cases:

  • Windows and Mac OS X installations where the "Install pip" option was not chosen during installation
  • any installation where the user previously ran "pip uninstall pip"

Users that want to retrieve the latest version from PyPI, or otherwise need more flexibility, can then invoke the extracted pip appropriately.

Proposed module API

The proposed ensurepip module API consists of the following two functions:

def version():
    """
    Returns a string specifying the bundled version of pip.
    """

def bootstrap(root=None, upgrade=False, user=False, verbosity=0):
    """
    Bootstrap pip into the current Python installation (or the given root
    directory).
    """

Invocation from the CPython installers

The CPython Windows and Mac OS X installers will each gain a new option:

  • Install pip (the default Python package management utility)?

This option will be checked by default.

If the option is checked, then the installer will invoke the following command with the just installed Python:

python -m ensurepip --upgrade

This ensures that, by default, installing or updating CPython will ensure that the installed version of pip is at least as recent as the one included with that version of CPython. If a newer version of pip has already been installed then python -m ensurepip --upgrade will simply return without doing anything.
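
The "no downgrade" behaviour described above amounts to a version comparison before installing. A rough sketch of that decision (the helper name is hypothetical, and the naive numeric split stands in for the full PEP 440 version parsing that real tools use):

```python
def needs_bootstrap(bundled, installed):
    """Decide whether the bundled pip should be installed over the
    currently installed one (None means pip is absent entirely)."""
    if installed is None:
        return True
    as_tuple = lambda v: tuple(int(part) for part in v.split("."))
    # Only install when the bundled copy is strictly newer.
    return as_tuple(bundled) > as_tuple(installed)

print(needs_bootstrap("1.5", None))     # True: no pip yet, so install
print(needs_bootstrap("1.5", "1.4.1"))  # True: older pip, so upgrade
print(needs_bootstrap("1.5", "6.0"))    # False: newer pip, leave it alone
```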

Installing from source

Just as the prebuilt binary installers will be updated to run python -m ensurepip by default, a similar change will be made to the make install and make altinstall commands of the source distribution. The directory settings in the sysconfig module should ensure the pip components are automatically installed to the expected locations.

ensurepip itself (including the private copy of pip and its dependencies) will always be installed normally (as it is a regular part of the standard library), but an option will be provided to skip the invocation of ensurepip.

This means that even installing from source will provide pip by default, but redistributors providing pip by other means (or choosing not to provide it at all) will still be able to opt out of installing it using ensurepip.

Changes to virtual environments

Python 3.3 included a standard library approach to virtual Python environments through the venv module. Since its release, it has become clear that very few users have been willing to use this feature directly, in part due to the lack of an installer present by default inside the virtual environments it creates. They have instead opted to continue using the virtualenv package, which does install pip by default.

To make venv more useful, it will be modified to invoke the pip bootstrap by default inside the new environment while creating it. This will give people the same convenience inside the virtual environment as this PEP provides outside of it, as well as bringing the venv module closer to feature parity with the external virtualenv package, making it a more suitable replacement.

To handle cases where a user does not wish to have pip bootstrapped into their virtual environment a --without-pip option will be added.

The venv.EnvBuilder and venv.create APIs will be updated to accept one new parameter: with_pip (defaulting to False).

The new default for the module API is chosen for backwards compatibility with the current behaviour (as it is assumed that most invocations of the venv module happen through third-party tools that likely will not want pip installed without explicitly requesting it), while the default for the command line interface is chosen to try to ensure pip is available in most virtual environments without additional action on the part of the end user.
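
With the API change above, creating an environment from the module looks like this (with_pip=False mirrors the proposed backwards-compatible module default; passing True triggers the ensurepip bootstrap during creation, matching the pyvenv CLI default):

```python
import os
import tempfile
import venv

target = os.path.join(tempfile.mkdtemp(), "demo-env")

# with_pip=False is the module-level default proposed by this PEP;
# with_pip=True would bootstrap pip into the new environment.
venv.create(target, with_pip=False)

# The created environment carries the standard configuration file.
print(os.path.exists(os.path.join(target, "pyvenv.cfg")))  # -> True
```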

As this change will only benefit Python 3.4 and later versions, the third-party virtualenv project will still be needed to obtain a consistent cross-version experience in Python 3.3 and 2.7.

Documentation

The "Installing Python Modules" section of the standard library documentation in Python 2.7, 3.3 and 3.4 will be updated to recommend the use of the pip installer, either provided by default in Python 3.4 or retrieved and installed by the user in Python 2.7 or 3.3. It will give a brief description of the most common commands and options, but delegate to the externally maintained pip documentation for the full details.

In Python 3.4, the pyvenv and venv documentation will also be updated to reference the revised module installation guide.

The existing content of the module installation guide will be retained in all versions, but under a new "Invoking distutils directly" subsection.

Bundling CA certificates with CPython

The ensurepip implementation will include the pip CA bundle along with the rest of pip. This means CPython effectively includes a CA bundle that is used solely by pip after it has been extracted.

This is considered preferable to relying solely on the system certificate stores, as it ensures that pip will behave the same across all supported versions of Python, even those prior to Python 3.4 that cannot access the system certificate store on Windows.

Automatic installation of setuptools

pip currently depends on setuptools to handle metadata generation during the build process, along with some other features. While work is ongoing to reduce or eliminate this dependency, it is not clear if that work will be complete for pip 1.5 (which is the version likely to be current when Python 3.4.0 is released).

This PEP proposes that, if pip still requires it as a dependency, ensurepip will include a private copy of setuptools (in addition to the private copy of pip). python -m ensurepip will then install the private copy of setuptools in addition to installing pip itself.

However, this behavior is officially considered an implementation detail. Other projects which explicitly require setuptools must still provide an appropriate dependency declaration, rather than assuming setuptools will always be installed alongside pip.

Once pip is able to run pip install --upgrade pip without needing setuptools installed first, then the private copy of setuptools will be removed from ensurepip in subsequent CPython releases.

As long as setuptools is needed, it will be a completely unmodified copy of the latest upstream setuptools release, including the easy_install script if the upstream setuptools continues to include it. The installation of easy_install along with pip isn't considered desirable, but installing a broken setuptools would be worse. This problem will naturally resolve itself once the pip developers have managed to eliminate their dependency on setuptools and the private copy of setuptools can be removed entirely from CPython.

Updating the private copy of pip

In order to keep up with evolutions in packaging, as well as to provide users with as recent a version of pip as possible, the ensurepip module will be regularly updated to the latest versions of everything it bootstraps.

After each new pip release, and again during the preparation for any release of Python (including feature releases), a script, provided as part of the implementation for this PEP, will be run to ensure the private copies stored in the CPython source repository have been updated to the latest versions.

Updating the ensurepip module API and CLI

Like venv and pyvenv, the ensurepip module API and CLI will be governed by the normal rules for the standard library: no new features are permitted in maintenance releases.

However, the embedded components may be updated as noted above, so the extracted pip may offer additional functionality in maintenance releases.

Uninstallation

No changes are proposed to the CPython uninstallation process by this PEP. The bootstrapped pip will be installed the same way as any other pip installed packages, and will be handled in the same way as any other post-install additions to the Python environment.

At least on Windows, that means the bootstrapped files will be left behind after uninstallation, since those files won't be associated with the Python MSI installer.

While the case can be made for the CPython installers clearing out these directories automatically, changing that behaviour is considered outside the scope of this PEP.

Script Execution on Windows

While the Windows installer was updated in Python 3.3 to optionally make python available on the PATH, no such change was made to include the script installation directory returned by sysconfig.get_path("scripts").

Accordingly, in addition to adding the option to extract and install pip during installation, this PEP proposes that the Windows installer in Python 3.4 and later be updated to also add the path returned by sysconfig.get_path("scripts") to the Windows PATH when the PATH modification option is enabled during installation.

Note that this change will only be available in Python 3.4 and later.

This means that, for Python 3.3, the most reliable way to invoke pip globally on Windows (without tinkering manually with PATH) will still remain py -m pip (or py -3 -m pip to select the Python 3 version if both Python 2 and 3 are installed) rather than simply calling pip. This works because Python 3.3 provides the Python Launcher for Windows (and the associated py command) by default.

For Python 2.7 and 3.2, the most reliable mechanism will be to install the Python Launcher for Windows using the standalone installer and then use py -m pip as noted above.

Adding the scripts directory to the system PATH will mean that pip works reliably in the "only one Python installation on the system PATH" case, with py -m pip, pipX, or pipX.Y needed only to select a non-default version in the parallel installation case (and outside a virtual environment). This change should also make the pyvenv command substantially easier to invoke on Windows, along with all scripts installed by pip, easy_install and similar tools.

While the script invocations on recent versions of Python will run through the Python launcher for Windows, this shouldn't cause any issues, as long as the Python files in the Scripts directory correctly specify a Python version in their shebang line or have an adjacent Windows executable (as easy_install and pip do).

Recommendations for Downstream Distributors

A common source of Python installations is downstream distributors such as the various Linux distributions [8] [9] [10], OS X package managers [11] [12] [13], and commercial Python redistributors [14] [15] [16]. In order to provide a consistent, user-friendly experience to all users of Python, regardless of how they obtained it, this PEP recommends and asks that downstream distributors:

  • Ensure that whenever Python is installed pip is either installed or is otherwise made readily available to end users.
    • For redistributors using binary installers, this may take the form of optionally executing the ensurepip bootstrap during installation, similar to the CPython installers.
    • For redistributors using package management systems, it may take the form of separate packages with dependencies on each other so that installing the Python package installs the pip package and installing the pip package installs the Python package.
    • Another reasonable way to implement this is to package pip separately but ensure that there is some sort of global hook that will recommend installing the separate pip package when a user executes pip without it being installed. Systems that choose this option should ensure that the ensurepip module still installs pip directly when invoked inside a virtual environment, but may modify the module in the system Python installation to redirect to the platform provided mechanism when installing pip globally.
  • Even if pip is made available globally by other means, do not remove the ensurepip module in Python 3.4 or later.
    • ensurepip will be required for automatic installation of pip into virtual environments by the venv module.
    • This is similar to the existing virtualenv package for which many downstream distributors have already made exception to the common "debundling" policy.
    • This does mean that if pip needs to be updated due to a security issue, so does the private copy in the ensurepip bootstrap module.
    • However, altering the private copy of pip to remove the embedded CA certificate bundle and rely on the system CA bundle instead is a reasonable change.
  • Ensure that all features of this PEP continue to work with any modifications made to the redistributed version of Python.
    • Checking the version of pip that will be bootstrapped using python -m ensurepip --version or ensurepip.version().
    • Installation of pip into a global or virtual python environment using python -m ensurepip or ensurepip.bootstrap().
    • pip install --upgrade pip in a global installation should not affect any already created virtual environments (but is permitted to affect future virtual environments, even though it will not do so when using the standard implementation of ensurepip).
    • pip install --upgrade pip in a virtual environment should not affect the global installation.
  • Migrate build systems to utilize pip [18] and Wheel [17] wherever feasible and avoid directly invoking setup.py.
    • This will help ensure a smoother and more timely migration to improved metadata formats as the Python packaging ecosystem continues to evolve.
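
A redistributor could script the first two checks in the list above as a quick smoke test (a subprocess invocation is shown for the CLI form; the exact wording of the --version report is pip's own):

```python
import subprocess
import sys

# Check 1: the bootstrap CLI reports the pip version it would install.
report = subprocess.run(
    [sys.executable, "-m", "ensurepip", "--version"],
    capture_output=True, text=True, check=True,
).stdout.strip()
print(report)  # e.g. "pip <bundled version>"

# Check 2: the module API exposes the same information programmatically,
# and the two should agree.
import ensurepip
assert ensurepip.version() in report
```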

In the event that a Python redistributor chooses not to follow these recommendations, we request that they explicitly document this fact and provide their users with suitable guidance on translating upstream pip based installation instructions into something appropriate for the platform.

Other Python implementations are also encouraged to follow these guidelines where applicable.

Policies & Governance

The maintainers of the bootstrapped software and the CPython core team will work together in order to address the needs of both. The bootstrapped software will still remain external to CPython and this PEP does not include CPython subsuming the development responsibilities or design decisions of the bootstrapped software. This PEP aims to decrease the burden on end users wanting to use third-party packages and the decisions inside it are pragmatic ones that represent the trust that the Python community has already placed in the Python Packaging Authority as the authors and maintainers of pip, setuptools, PyPI, virtualenv and other related projects.

Backwards Compatibility

The public API and CLI of the ensurepip module itself will fall under the typical backwards compatibility policy of Python for its standard library. The externally developed software that this PEP bundles does not.

Most importantly, this means that the bootstrapped version of pip may gain new features in CPython maintenance releases, and pip continues to operate on its own 6 month release cycle rather than CPython's 18-24 month cycle.

Security Releases

Any security update that affects the ensurepip module will be shared prior to release with the Python Security Response Team (security@python.org). The PSRT will then decide if the reported issue warrants a security release of CPython with an updated private copy of pip.

Licensing

pip is currently licensed as 1 Clause BSD, and it contains code taken from other projects. Additionally this PEP will include setuptools until such time as pip no longer requires it. The licenses for these appear in the table below.

Project License
requests Apache 2.0
six 1 Clause BSD
html5lib 1 Clause BSD
distlib PSF
colorama 3 Clause BSD
Mozilla CA Bundle LGPL
setuptools PSF

All of these licenses should be compatible with the PSF license. Additionally it is unclear if a CA Bundle is copyrightable material and thus if it needs or can be licensed at all.

Appendix: Rejected Proposals

Changing the name of the scripts directory on Windows

Earlier versions of this PEP proposed changing the name of the script installation directory on Windows from "Scripts" to "bin" in order to improve the cross-platform consistency of the virtual environments created by pyvenv.

However, Paul Moore determined that this change was likely backwards incompatible with cross-version Windows installers created with previous versions of Python, so the change has been removed from this PEP [7].

Including ensurepip in Python 2.7, and 3.3

Earlier versions of this PEP made the case that the challenges of getting pip bootstrapped for new users posed a significant enough barrier to Python's future growth that it justified adding ensurepip as a new feature in the upcoming Python 2.7 and 3.3 maintenance releases.

While the proposal to provide pip with Python 3.4 was universally popular, this part of the proposal was highly controversial and ultimately rejected by MvL as BDFL-Delegate.

Accordingly, the proposal to backport ensurepip to Python 2.7 and 3.3 has been removed from this PEP in favour of creating a Windows installer for pip and a possible future PEP suggesting creation of an aggregate installer for Python 2.7 that combines CPython 2.7, pip and the Python Launcher for Windows.

Automatically contacting PyPI when bootstrapping pip

Earlier versions of this PEP called the bootstrapping module getpip and defaulted to downloading and installing pip from PyPI, with the private copy used only as a fallback option or when explicitly requested.

This resulted in several complex edge cases, along with difficulties in defining a clean API and CLI for the bootstrap module. It also significantly altered the default trust model for the binary installers published on python.org, as end users would need to explicitly opt-out of trusting the security of the PyPI ecosystem (rather than opting in to it by explicitly invoking pip following installation).

As a result, the PEP was simplified to the current design, where the bootstrapping always uses the private copy of pip. Contacting PyPI is now always an explicit separate step, with direct access to the full pip interface.

Removing the implicit attempt to access PyPI also made it feasible to invoke ensurepip by default when installing from a custom source build.

Implicit bootstrap

PEP 439 [19], the predecessor of this PEP, proposed its own solution: shipping a fake pip command that, when executed, would implicitly bootstrap and install pip if it was not already present. This was rejected as too "magical": it hides from the end user when exactly the pip command will be installed, or that it is being installed at all. It also does not provide any recommendations or considerations towards downstream packagers who wish to manage the globally installed pip through the mechanisms typical for their system.

The implicit bootstrap mechanism also ran into possible permissions issues, if a user inadvertently attempted to bootstrap pip without write access to the appropriate installation directories.

Including pip directly in the standard library

Similar to this PEP is the proposal of simply including pip in the standard library. This would ensure that Python always includes pip and would fix all of the end-user-facing problems of not having pip present by default. It has been rejected because we have learned, through the inclusion and history of distutils in the standard library, that losing the ability to update the packaging tools independently can leave the tooling in a state of constant limbo, unable to evolve in a time frame that actually affects users, since any new features will not be available to the general population for years.

Allowing the packaging tools to progress separately from the Python release and adoption schedules allows the improvements to be used by all members of the Python community and not just those able to live on the bleeding edge of Python releases.

There have also been issues in the past with the "dual maintenance" problem when a project continues to be maintained externally while also having a fork maintained in the standard library. Since external maintenance of pip will always be needed to support earlier Python versions, the proposed bootstrapping mechanism will become the explicit responsibility of the CPython core developers (assisted by the pip developers), while pip issues reported to the CPython tracker will be migrated to the pip issue tracker. There will no doubt still be some user confusion over which tracker to use, but hopefully less than has been seen historically when including complete public copies of third-party projects in the standard library.

The approach described in this PEP also avoids some technical issues related to handling CPython maintenance updates when pip has been independently updated to a more recent version. The proposed pip-based bootstrapping mechanism handles that automatically, since pip and the system installer never get into a fight about who owns the pip installation (it is always managed through pip, either directly, or indirectly via the ensurepip bootstrap module).

Finally, the separate bootstrapping step means it is also easy to avoid installing pip at all if end users so desire. This is often the case if integrators are using system packages to handle installation of components written in multiple languages using a common set of tools.

Defaulting to --user installation

Some consideration was given to bootstrapping pip into the per-user site-packages directory by default. However, this behavior would be surprising (as it differs from the default behavior of pip itself) and is also not currently considered reliable (there are some edge cases which are not handled correctly when pip is installed into the user site-packages directory rather than the system site-packages).

References

[1] Discussion thread 1 (distutils-sig) (https://mail.python.org/pipermail/distutils-sig/2013-August/022529.html)
[2] Discussion thread 2 (distutils-sig) (https://mail.python.org/pipermail/distutils-sig/2013-September/022702.html)
[3] Discussion thread 3 (python-dev) (https://mail.python.org/pipermail/python-dev/2013-September/128723.html)
[4] Discussion thread 4 (python-dev) (https://mail.python.org/pipermail/python-dev/2013-September/128780.html)
[5] Discussion thread 5 (python-dev) (https://mail.python.org/pipermail/python-dev/2013-September/128894.html)
[6] pip/requests certificate management concerns (https://mail.python.org/pipermail/python-dev/2013-October/129755.html)
[7] Windows installer compatibility concerns (https://mail.python.org/pipermail/distutils-sig/2013-October/022855.html)
[8] Ubuntu (http://www.ubuntu.com/)
[9] Debian (http://www.debian.org)
[10] Fedora (https://fedoraproject.org/)
[11] Homebrew (http://brew.sh/)
[12] MacPorts (http://macports.org)
[13] Fink (http://finkproject.org)
[14] Anaconda (https://store.continuum.io/cshop/anaconda/)
[15] ActivePython (http://www.activestate.com/activepython)
[16] Enthought Canopy (https://www.enthought.com/products/canopy/)
[17] Wheel, PEP 427 (http://www.python.org/dev/peps/pep-0427/)
[18] pip (http://www.pip-installer.org)
[19] PEP 439 (http://www.python.org/dev/peps/pep-0439/)

pep-0454 Add a new tracemalloc module to trace Python memory allocations

PEP:454
Title:Add a new tracemalloc module to trace Python memory allocations
Version:$Revision$
Last-Modified:$Date$
Author:Victor Stinner <victor.stinner at gmail.com>
BDFL-Delegate:Charles-François Natali <cf.natali@gmail.com>
Status:Final
Type:Standards Track
Content-Type:text/x-rst
Created:3-September-2013
Python-Version:3.4
Resolution:https://mail.python.org/pipermail/python-dev/2013-November/130491.html

Abstract

This PEP proposes to add a new tracemalloc module to trace memory blocks allocated by Python.

Rationale

Classic generic tools like Valgrind can get the C traceback where a memory block was allocated. Using such tools to analyze Python memory allocations does not help because most memory blocks are allocated in the same C function, in PyMem_Malloc() for example. Moreover, Python has an allocator for small objects called "pymalloc" which keeps free blocks for efficiency. This is not well handled by these tools.

There are debug tools dedicated to the Python language, like Heapy, Pympler and Meliae, which list all live objects using the garbage collector module (functions like gc.get_objects(), gc.get_referrers() and gc.get_referents()), compute their size (e.g. using sys.getsizeof()) and group objects by type. These tools provide a better estimation of the memory usage of an application. They are useful when most memory leaks are instances of the same type and this type is only instantiated in a few functions. Problems arise when the object type is very common, like str or tuple, and it is hard to identify where these objects are instantiated.

Finding reference cycles is also a difficult problem. There are different tools to draw a diagram of all references. These tools cannot be used on large applications with thousands of objects because the diagram is too huge to be analyzed manually.

Proposal

Using the customized allocation API from PEP 445, it becomes easy to set up a hook on Python memory allocators. A hook can inspect Python internals to retrieve Python tracebacks. The idea of getting the current traceback comes from the faulthandler module. faulthandler dumps the traceback of all Python threads on a crash; here, the idea is to get the traceback of the current Python thread when a memory block is allocated by Python.

This PEP proposes to add a new tracemalloc module, a debug tool to trace memory blocks allocated by Python. The module provides the following information:

  • Traceback where an object was allocated
  • Statistics on allocated memory blocks per filename and per line number: total size, number and average size of allocated memory blocks
  • Computed differences between two snapshots to detect memory leaks

The API of the tracemalloc module is similar to the API of the faulthandler module: enable() / start(), disable() / stop() and is_enabled() / is_tracing() functions, an environment variable (PYTHONFAULTHANDLER and PYTHONTRACEMALLOC), and a -X command line option (-X faulthandler and -X tracemalloc). See the documentation of the faulthandler module.

The idea of tracing memory allocations is not new. It was first implemented in the PySizer project in 2005. PySizer was implemented differently: the traceback was stored in frame objects, and some Python types linked the trace with the name of the object type. The PySizer patch on CPython added a performance and memory footprint overhead even when PySizer was not in use, whereas tracemalloc attaches a traceback to the underlying layer, to memory blocks, and has no overhead when the module is not tracing memory allocations.

The tracemalloc module has been written for CPython. Other implementations of Python may not be able to provide it.

API

To trace most memory blocks allocated by Python, the module should be started as early as possible: set the PYTHONTRACEMALLOC environment variable to 1, or use the -X tracemalloc command line option. The tracemalloc.start() function can also be called at runtime to start tracing Python memory allocations.

By default, a trace of an allocated memory block stores only the most recent frame (1 frame). To store 25 frames at startup: set the PYTHONTRACEMALLOC environment variable to 25, or use the -X tracemalloc=25 command line option. At runtime, the limit is set with the nframe parameter of the start() function.
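As a concrete illustration, the startup and snapshot workflow described above can be exercised like this (a sketch; the exact statistics printed depend on the script itself):

```python
# Sketch: start tracing, allocate some memory, and display the top
# allocation sites grouped by line number.
import tracemalloc

tracemalloc.start(10)   # store up to 10 frames per trace

data = [bytes(1000) for _ in range(100)]   # allocate ~100 KB

snapshot = tracemalloc.take_snapshot()
for stat in snapshot.statistics('lineno')[:3]:
    print(stat)         # biggest allocation sites first

tracemalloc.stop()
```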

Functions

clear_traces() function:

Clear traces of memory blocks allocated by Python.

See also stop().

get_object_traceback(obj) function:

Get the traceback where the Python object obj was allocated. Return a Traceback instance, or None if the tracemalloc module is not tracing memory allocations or did not trace the allocation of the object.

See also gc.get_referrers() and sys.getsizeof() functions.
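A minimal usage sketch (a bytearray is used here because, unlike lists or tuples, it is not served from an interpreter free list, so its allocation is reliably traced):

```python
# Sketch: retrieve the traceback of a traced object's allocation.
import tracemalloc

tracemalloc.start()
obj = bytearray(100000)                  # allocation is traced from here
tb = tracemalloc.get_object_traceback(obj)
if tb is not None:                       # None if the allocation was not traced
    frame = tb[0]                        # most recent frame
    print(frame.filename, frame.lineno)
tracemalloc.stop()
```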

get_traceback_limit() function:

Get the maximum number of frames stored in the traceback of a trace.

The tracemalloc module must be tracing memory allocations to get the limit, otherwise an exception is raised.

The limit is set by the start() function.

get_traced_memory() function:

Get the current size and maximum size of memory blocks traced by the tracemalloc module as a tuple: (size: int, max_size: int).

get_tracemalloc_memory() function:

Get the memory usage in bytes of the tracemalloc module used to store traces of memory blocks. Return an int.

is_tracing() function:

True if the tracemalloc module is tracing Python memory allocations, False otherwise.

See also start() and stop() functions.

start(nframe: int=1) function:

Start tracing Python memory allocations: install hooks on Python memory allocators. Collected tracebacks of traces will be limited to nframe frames. By default, a trace of a memory block only stores the most recent frame: the limit is 1. nframe must be greater than or equal to 1.

Storing more than 1 frame is only useful to compute statistics grouped by 'traceback' or to compute cumulative statistics: see the Snapshot.compare_to() and Snapshot.statistics() methods.

Storing more frames increases the memory and CPU overhead of the tracemalloc module. Use the get_tracemalloc_memory() function to measure how much memory is used by the tracemalloc module.

The PYTHONTRACEMALLOC environment variable (PYTHONTRACEMALLOC=NFRAME) and the -X tracemalloc=NFRAME command line option can be used to start tracing at startup.

See also stop(), is_tracing() and get_traceback_limit() functions.

stop() function:

Stop tracing Python memory allocations: uninstall hooks on Python memory allocators. This also clears the traces of memory blocks allocated by Python.

Call take_snapshot() function to take a snapshot of traces before clearing them.

See also start() and is_tracing() functions.

take_snapshot() function:

Take a snapshot of traces of memory blocks allocated by Python. Return a new Snapshot instance.

The snapshot does not include memory blocks allocated before the tracemalloc module started to trace memory allocations.

Tracebacks of traces are limited to get_traceback_limit() frames. Use the nframe parameter of the start() function to store more frames.

The tracemalloc module must be tracing memory allocations to take a snapshot; see the start() function.

See also the get_object_traceback() function.

Filter

Filter(inclusive: bool, filename_pattern: str, lineno: int=None, all_frames: bool=False) class:

Filter on traces of memory blocks.

See the fnmatch.fnmatch() function for the syntax of filename_pattern. The '.pyc' and '.pyo' file extensions are replaced with '.py'.

Examples:

  • Filter(True, subprocess.__file__) only includes traces of the subprocess module
  • Filter(False, tracemalloc.__file__) excludes traces of the tracemalloc module
  • Filter(False, "<unknown>") excludes empty tracebacks

inclusive attribute:

If inclusive is True (include), only trace memory blocks allocated in a file with a name matching filename_pattern at line number lineno.

If inclusive is False (exclude), ignore memory blocks allocated in a file with a name matching filename_pattern at line number lineno.

lineno attribute:

Line number (int) of the filter. If lineno is None, the filter matches any line number.

filename_pattern attribute:

Filename pattern of the filter (str).

all_frames attribute:

If all_frames is True, all frames of the traceback are checked. If all_frames is False, only the most recent frame is checked.

This attribute is ignored if the traceback limit is less than 2. See the get_traceback_limit() function and Snapshot.traceback_limit attribute.

Frame

Frame class:

Frame of a traceback.

The Traceback class is a sequence of Frame instances.

filename attribute:

Filename (str).

lineno attribute:

Line number (int).

Snapshot

Snapshot class:

Snapshot of traces of memory blocks allocated by Python.

The take_snapshot() function creates a snapshot instance.

compare_to(old_snapshot: Snapshot, group_by: str, cumulative: bool=False) method:

Compute the differences with an old snapshot. Get statistics as a sorted list of StatisticDiff instances grouped by group_by.

See the statistics() method for group_by and cumulative parameters.

The result is sorted from the biggest to the smallest by: absolute value of StatisticDiff.size_diff, StatisticDiff.size, absolute value of StatisticDiff.count_diff, StatisticDiff.count and then by StatisticDiff.traceback.
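The comparison workflow can be sketched as follows (the statistics printed depend on the script):

```python
# Sketch: compare two snapshots to spot where memory grew.
import tracemalloc

tracemalloc.start()
snap1 = tracemalloc.take_snapshot()

leaked = [bytearray(10000) for _ in range(50)]   # simulate a leak (~500 KB)

snap2 = tracemalloc.take_snapshot()
for diff in snap2.compare_to(snap1, 'lineno')[:3]:
    print(diff)          # sorted by absolute size_diff, biggest first

tracemalloc.stop()
```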

dump(filename) method:

Write the snapshot into a file.

Use load() to reload the snapshot.

filter_traces(filters) method:

Create a new Snapshot instance with a filtered traces sequence; filters is a list of Filter instances. If filters is an empty list, return a new Snapshot instance with a copy of the traces.

All inclusive filters are applied at once: a trace is ignored if no inclusive filter matches it. A trace is also ignored if at least one exclusive filter matches it.
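These semantics can be sketched with exclusive filters like the ones shown in the Filter examples above:

```python
# Sketch: drop tracemalloc's own allocations and empty tracebacks
# from a snapshot before computing statistics.
import tracemalloc

tracemalloc.start()
data = [dict.fromkeys(range(100)) for _ in range(100)]
snapshot = tracemalloc.take_snapshot()
tracemalloc.stop()

filtered = snapshot.filter_traces([
    tracemalloc.Filter(False, tracemalloc.__file__),   # exclude tracemalloc itself
    tracemalloc.Filter(False, "<unknown>"),            # exclude empty tracebacks
])
print(len(filtered.traces), "of", len(snapshot.traces), "traces kept")
```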

load(filename) classmethod:

Load a snapshot from a file.

See also dump().
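A round-trip sketch (the on-disk format is an implementation detail, so a snapshot should be reloaded with the same Python version that dumped it):

```python
# Sketch: save a snapshot to disk and reload it later for analysis.
import os
import tempfile
import tracemalloc

tracemalloc.start()
data = [bytes(1000) for _ in range(10)]
snap = tracemalloc.take_snapshot()
tracemalloc.stop()

path = os.path.join(tempfile.mkdtemp(), "snapshot.dump")
snap.dump(path)
reloaded = tracemalloc.Snapshot.load(path)
print(len(reloaded.traces) == len(snap.traces))   # True: same traces round-trip
```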

statistics(group_by: str, cumulative: bool=False) method:

Get statistics as a sorted list of Statistic instances grouped by group_by:

group_by description
'filename' filename
'lineno' filename and line number
'traceback' traceback

If cumulative is True, sum the size and count of memory blocks over all frames of the traceback of a trace, not only the most recent frame. The cumulative mode can only be used with group_by set to 'filename' or 'lineno', and with traceback_limit greater than 1.

The result is sorted from the biggest to the smallest by: Statistic.size, Statistic.count and then by Statistic.traceback.

traceback_limit attribute:

Maximum number of frames stored in the traceback of traces: the result of get_traceback_limit() when the snapshot was taken.

traces attribute:

Traces of all memory blocks allocated by Python: sequence of Trace instances.

The sequence has an undefined order. Use the Snapshot.statistics() method to get a sorted list of statistics.

Statistic

Statistic class:

Statistic on memory allocations.

Snapshot.statistics() returns a list of Statistic instances.

See also the StatisticDiff class.

count attribute:

Number of memory blocks (int).

size attribute:

Total size of memory blocks in bytes (int).

traceback attribute:

Traceback where the memory block was allocated, Traceback instance.

StatisticDiff

StatisticDiff class:

Statistic difference on memory allocations between an old and a new Snapshot instance.

Snapshot.compare_to() returns a list of StatisticDiff instances. See also the Statistic class.

count attribute:

Number of memory blocks in the new snapshot (int): 0 if the memory blocks have been released in the new snapshot.

count_diff attribute:

Difference of number of memory blocks between the old and the new snapshots (int): 0 if the memory blocks have been allocated in the new snapshot.

size attribute:

Total size of memory blocks in bytes in the new snapshot (int): 0 if the memory blocks have been released in the new snapshot.

size_diff attribute:

Difference of total size of memory blocks in bytes between the old and the new snapshots (int): 0 if the memory blocks have been allocated in the new snapshot.

traceback attribute:

Traceback where the memory blocks were allocated, Traceback instance.

Trace

Trace class:

Trace of a memory block.

The Snapshot.traces attribute is a sequence of Trace instances.

size attribute:

Size of the memory block in bytes (int).

traceback attribute:

Traceback where the memory block was allocated, Traceback instance.

Traceback

Traceback class:

Sequence of Frame instances sorted from the most recent frame to the oldest frame.

A traceback contains at least 1 frame. If the tracemalloc module failed to get a frame, the filename "<unknown>" at line number 0 is used.

When a snapshot is taken, tracebacks of traces are limited to get_traceback_limit() frames. See the take_snapshot() function.

The Trace.traceback attribute is an instance of the Traceback class.

Rejected Alternatives

Log calls to the memory allocator

A different approach is to log calls to the malloc(), realloc() and free() functions. Calls can be logged into a file or sent to another computer through the network. Example of a log entry: name of the function, size of the memory block, address of the memory block, Python traceback where the allocation occurred, and a timestamp.

Logs cannot be used directly: getting the current status of the memory requires parsing all previous logs. For example, it is not possible to directly get the traceback of a Python object, as get_object_traceback(obj) does with traces.

Python uses objects with very short lifetimes and so makes extensive use of memory allocators. It has an allocator optimized for small objects (less than 512 bytes) with short lifetimes. For example, the Python test suite calls malloc(), realloc() or free() 270,000 times per second on average. If the size of a log entry is 32 bytes, logging produces 8.2 MB per second, or 29.0 GB per hour.
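The quoted figures can be checked directly (binary units assumed, i.e. MiB and GiB):

```python
# Rough check of the logging-volume estimate above.
calls_per_sec = 270_000            # allocator calls per second (quoted)
entry_size = 32                    # bytes per log entry (quoted)

bytes_per_sec = calls_per_sec * entry_size
print(round(bytes_per_sec / 2**20, 1))          # 8.2 (MB per second)
print(round(bytes_per_sec * 3600 / 2**30, 1))   # 29.0 (GB per hour)
```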

The alternative was rejected because it is less efficient and has fewer features. Parsing logs in a different process or on a different computer is slower than maintaining traces on allocated memory blocks in the same process.

Prior Work

  • Python Memory Validator (2005-2013): commercial Python memory validator developed by Software Verification. It uses the Python Reflection API.
  • PySizer: Google Summer of Code 2005 project by Nick Smallbone.
  • Heapy (2006-2013): part of the Guppy-PE project written by Sverker Nilsson.
  • Draft PEP: Support Tracking Low-Level Memory Usage in CPython (Brett Cannon, 2006)
  • Muppy: project developed in 2008 by Robert Schuppenies.
  • asizeof: a pure Python module to estimate the size of objects by Jean Brouwers (2008).
  • Heapmonitor: It provides facilities to size individual objects and can track all objects of certain classes. It was developed in 2008 by Ludwig Haehne.
  • Pympler (2008-2011): project based on asizeof, muppy and HeapMonitor
  • objgraph (2008-2012)
  • Dozer: WSGI Middleware version of the CherryPy memory leak debugger, written by Marius Gedminas (2008-2013)
  • Meliae: Python Memory Usage Analyzer developed by John A Meinel since 2009
  • gdb-heap: gdb script written in Python by Dave Malcolm (2010-2011) to analyze heap memory usage
  • memory_profiler: written by Fabian Pedregosa (2011-2013)
  • caulk: written by Ben Timby in 2012

See also Pympler Related Work.

pep-0455 Adding a key-transforming dictionary to collections

PEP:455
Title:Adding a key-transforming dictionary to collections
Version:$Revision$
Last-Modified:$Date$
Author:Antoine Pitrou <solipsis at pitrou.net>
BDFL-Delegate:Raymond Hettinger
Status:Rejected
Type:Standards Track
Content-Type:text/x-rst
Created:13-Sep-2013
Python-Version:3.5
Post-History:

Abstract

This PEP proposes a new data structure for the collections module, called "TransformDict" in this PEP. This structure is a mutable mapping which transforms the key using a given function when doing a lookup, but retains the original key when reading.

Rationale

Numerous specialized versions of this pattern exist. The most common is a case-insensitive case-preserving dict, i.e. a dict-like container which matches keys in a case-insensitive fashion but retains the original casing. It is a very common need in network programming, as many protocols feature arrays of "key / value" properties in their messages, where the keys are textual strings whose case is specified to be ignored on receipt but, by either specification or custom, is to be preserved or non-trivially canonicalized when retransmitted.

Another common request is an identity dict, where keys are matched according to their respective id()s instead of normal matching.

Both are instances of a more general pattern, where a given transformation function is applied to keys when looking them up: that function being str.lower or str.casefold in the former example and the built-in id function in the latter.

(It could be said that the pattern projects keys from the user-visible set onto the internal lookup set.)

Semantics

TransformDict is a MutableMapping implementation: it faithfully implements the well-known API of mutable mappings, like dict itself and other dict-like classes in the standard library. Therefore, this PEP won't rehash the semantics of most TransformDict methods.

The transformation function needn't be bijective; it can be strictly surjective, as in the case-insensitive example (in other words, different keys can look up the same value):

>>> d = TransformDict(str.casefold)
>>> d['SomeKey'] = 5
>>> d['somekey']
5
>>> d['SOMEKEY']
5

TransformDict retains the first key used when creating an entry:

>>> d = TransformDict(str.casefold)
>>> d['SomeKey'] = 1
>>> d['somekey'] = 2
>>> list(d.items())
[('SomeKey', 2)]

The original keys needn't be hashable, as long as the transformation function returns a hashable one:

>>> d = TransformDict(id)
>>> l = [None]
>>> d[l] = 5
>>> l in d
True

Constructor

As shown in the examples above, creating a TransformDict requires passing the key transformation function as the first argument (much like creating a defaultdict requires passing the factory function as first argument).

The constructor also takes other optional arguments which can be used to initialize the TransformDict with certain key-value pairs. Those optional arguments are the same as in the dict and defaultdict constructors:

>>> d = TransformDict(str.casefold, [('Foo', 1)], Bar=2)
>>> sorted(d.items())
[('Bar', 2), ('Foo', 1)]

Getting the original key

TransformDict also features a lookup method returning the stored key together with the corresponding value:

>>> d = TransformDict(str.casefold, {'Foo': 1})
>>> d.getitem('FOO')
('Foo', 1)
>>> d.getitem('bar')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: 'bar'

The method name getitem() follows the standard popitem() method on mutable mappings.

Getting the transformation function

TransformDict has a simple read-only property transform_func which gives back the transformation function.
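Since TransformDict was ultimately rejected and never added to collections, the behaviour described above can be sketched as a pure-Python class (a hypothetical implementation, not the patch attached to this PEP):

```python
from collections.abc import MutableMapping

class TransformDict(MutableMapping):
    """Sketch of the proposed mapping: keys are transformed for lookup,
    but the first user-supplied key is retained for reading back."""

    def __init__(self, transform_func, *args, **kwargs):
        self._transform = transform_func
        self._data = {}   # transformed key -> (original key, value)
        self.update(*args, **kwargs)

    @property
    def transform_func(self):
        return self._transform

    def getitem(self, key):
        """Return the stored (original key, value) pair."""
        return self._data[self._transform(key)]

    def __getitem__(self, key):
        return self._data[self._transform(key)][1]

    def __setitem__(self, key, value):
        tkey = self._transform(key)
        stored = self._data.get(tkey)
        # Retain the first key used when creating the entry.
        self._data[tkey] = (key if stored is None else stored[0], value)

    def __delitem__(self, key):
        del self._data[self._transform(key)]

    def __iter__(self):
        return (original for original, _ in self._data.values())

    def __len__(self):
        return len(self._data)

d = TransformDict(str.casefold)
d['SomeKey'] = 1
d['somekey'] = 2
print(list(d.items()))   # [('SomeKey', 2)]
```

The MutableMapping base class supplies update(), items(), pop() and friends from the five special methods above, which keeps the sketch short.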

Alternative proposals and questions

Retaining the last original key

Most python-dev respondents found retaining the first user-supplied key more intuitive than retaining the last. Also, it matches the dict object's own behaviour when using different but equal keys:

>>> d = {}
>>> d[1] = 'hello'
>>> d[1.0] = 'world'
>>> d
{1: 'world'}

Furthermore, explicitly retaining the last key in a first-key-retaining scheme is still possible using the following approach:

d.pop(key, None)
d[key] = value

while the converse (retaining the first key in a last-key-retaining scheme) doesn't look possible without rewriting part of the container's code.

Using an encoder / decoder pair

Using a function pair isn't necessary, since the original key is retained by the container. Moreover, an encoder / decoder pair would require the transformation to be bijective, which prevents important use cases like case-insensitive matching.

Providing a transformation function for values

Dictionary values are not used for lookup, their semantics are totally irrelevant to the container's operation. Therefore, there is no point in having both an "original" and a "transformed" value: the transformed value wouldn't be used for anything.

Providing a specialized container, not generic

It was asked why we would provide the generic TransformDict construct rather than a specialized case-insensitive dict variant. The answer is that it's nearly as cheap (code-wise and performance-wise) to provide the generic construct, and it can fill more use cases.

Even case-insensitive dicts can actually elicit different transformation functions: str.lower, str.casefold or, in some cases, bytes.lower when working with text encoded in an ASCII-compatible encoding.

Other constructor patterns

Two other constructor patterns were proposed by Serhiy Storchaka:

  • A type factory scheme:

    d = TransformDict(str.casefold)(Foo=1)
    
  • A subclassing scheme:

    class CaseInsensitiveDict(TransformDict):
        __transform__ = str.casefold
    
    d = CaseInsensitiveDict(Foo=1)
    

While both approaches can be defended, they don't follow established practices in the standard library, and therefore were rejected.

Implementation

A patch for the collections module is tracked on the bug tracker at http://bugs.python.org/issue18986.

Existing work

Case-insensitive dicts are a popular request.

Identity dicts have been requested too.

Several modules in the standard library use identity lookups for object memoization, for example pickle, json, copy, cProfile, doctest and _threading_local.

Other languages

C# / .Net

.Net has a generic Dictionary class where you can specify a custom IEqualityComparer: http://msdn.microsoft.com/en-us/library/xfhwa508.aspx

Using it is the recommended way to write case-insensitive dictionaries: http://stackoverflow.com/questions/13230414/case-insensitive-access-for-generic-dictionary

C++

The C++ Standard Template Library features an unordered_map with customizable hash and equality functions: http://www.cplusplus.com/reference/unordered_map/unordered_map/

pep-0456 Secure and interchangeable hash algorithm

PEP:456
Title:Secure and interchangeable hash algorithm
Version:$Revision$
Last-Modified:$Date$
Author:Christian Heimes <christian at python.org>
BDFL-Delegate:Nick Coghlan
Status:Final
Type:Standards Track
Content-Type:text/x-rst
Created:27-Sep-2013
Python-Version:3.4
Post-History:06-Oct-2013, 14-Nov-2013, 20-Nov-2013
Resolution:https://mail.python.org/pipermail/python-dev/2013-November/130400.html

Abstract

This PEP proposes SipHash as default string and bytes hash algorithm to properly fix hash randomization once and for all. It also proposes modifications to Python's C code in order to unify the hash code and to make it easily interchangeable.

Rationale

Despite the last attempt [issue13703], CPython is still vulnerable to hash collision DoS attacks [29c3] [issue14621]. The current hash algorithm and its randomization are not resilient against attacks. Only a proper cryptographic hash function prevents the extraction of secret randomization keys. Although no practical attack against a Python-based service has been seen yet, the weakness has to be fixed. Jean-Philippe Aumasson and Daniel J. Bernstein have already shown how the seed for the current implementation can be recovered [poc].

Furthermore, the current hash algorithm is hard-coded and implemented multiple times, for bytes and for the three different Unicode representations (UCS1, UCS2 and UCS4). This makes it impossible for embedders to replace it with a different implementation without patching and recompiling large parts of the interpreter. Embedders may want to choose a more suitable hash function.

Finally, the current implementation does not perform well. In the common case it processes only one or two bytes per cycle. On a modern 64-bit processor the code can easily be adjusted to deal with eight bytes at once.

This PEP proposes the following major changes to the hash code for strings and bytes:

  • SipHash [sip] is introduced as the default hash algorithm. It is fast and small despite its cryptographic properties. Because it was designed by well-known security and crypto experts, it is safe to assume that it will remain secure for the near future.
  • The existing FNV code is kept for platforms without a 64-bit data type. The algorithm is optimized to process larger chunks per cycle.
  • Calculation of the hash of strings and bytes is moved into a single API function instead of multiple specialized implementations in Objects/object.c and Objects/unicodeobject.c. The function takes a void pointer plus length and returns the hash for it.
  • The algorithm can be selected at compile time. FNV is guaranteed to exist on all platforms. SipHash is available on the majority of modern systems.

Requirements for a hash function

  • It MUST be able to hash arbitrarily large blocks of memory from 1 byte up to the maximum ssize_t value.
  • It MUST produce at least 32 bits on 32-bit platforms and at least 64 bits on 64-bit platforms. (Note: Larger outputs can be compressed with e.g. v ^ (v >> 32).)
  • It MUST support hashing of unaligned memory in order to support hash(memoryview).
  • It is highly RECOMMENDED that the length of the input influences the outcome, so that hash(b'\00') != hash(b'\x00\x00').

The internal interface code between the hash function and the tp_hash slots implements special cases for zero length input and a return value of -1. An input of length 0 is mapped to hash value 0. The output -1 is mapped to -2.
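These special cases can be illustrated with a small Python wrapper (the names here are hypothetical; the real logic lives in the C interface code between the hash function and the tp_hash slots):

```python
def wrapped_hash(raw_hash, data: bytes) -> int:
    """Apply the tp_hash special cases around an arbitrary hash callable."""
    # Zero-length input never reaches the hash function; mapping it to a
    # constant 0 avoids leaking any bits of the secret through hash(b"").
    if len(data) == 0:
        return 0
    h = raw_hash(data)
    # -1 signals an error in CPython's tp_hash protocol, so it is remapped.
    return -2 if h == -1 else h
```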

Current implementation with modified FNV

CPython currently uses a variant of the Fowler-Noll-Vo hash function [fnv]. The variant has been modified to reduce the amount and cost of hash collisions for common strings. The first character of the string is added twice, the first time with a bit shift of 7. The length of the input string is XOR-ed into the final value. Both deviations from the original FNV algorithm reduce the number of hash collisions for short strings.

Recently [issue13703] a random prefix and suffix were added as an attempt to randomize the hash values. In order to protect the hash secret the code still returns 0 for zero length input.

C code:

Py_uhash_t x;
Py_ssize_t len;
/* p is either 1, 2 or 4 byte type */
unsigned char *p;
Py_UCS2 *p;
Py_UCS4 *p;

if (len == 0)
    return 0;
x = (Py_uhash_t) _Py_HashSecret.prefix;
x ^= (Py_uhash_t) *p << 7;
for (i = 0; i < len; i++)
    x = (1000003 * x) ^ (Py_uhash_t) *p++;
x ^= (Py_uhash_t) len;
x ^= (Py_uhash_t) _Py_HashSecret.suffix;
return x;

Which roughly translates to Python:

import sys

def fnv(p):
    # hashsecret stands in for the C-level _Py_HashSecret
    if len(p) == 0:
        return 0

    # bit mask, 2**32-1 or 2**64-1
    mask = 2 * sys.maxsize + 1

    x = hashsecret.prefix
    x = (x ^ (ord(p[0]) << 7)) & mask
    for c in p:
        x = ((1000003 * x) ^ ord(c)) & mask
    x = (x ^ len(p)) & mask
    x = (x ^ hashsecret.suffix) & mask

    # reinterpret the unsigned value as a signed Py_hash_t;
    # otherwise the check for the reserved value -1 could never trigger
    if x > sys.maxsize:
        x -= mask + 1
    if x == -1:
        x = -2

    return x

FNV is a simple multiply and XOR algorithm with no cryptographic properties. The randomization was not part of the initial hash code, but was added as a countermeasure against hash collision attacks as explained in oCERT-2011-003 [ocert]. Because FNV is not a cryptographic hash algorithm and the dict implementation is not fortified against side channel analysis, the randomization secrets can be calculated by a remote attacker. The author of this PEP strongly believes that the nature of a non-cryptographic hash function makes it impossible to conceal the secrets.

Examined hashing algorithms

The author of this PEP has researched several hashing algorithms that are considered modern, fast and state-of-the-art.

SipHash

SipHash [sip] is a cryptographic pseudo random function with a 128-bit seed and 64-bit output. It was designed by Jean-Philippe Aumasson and Daniel J. Bernstein as a fast and secure keyed hash algorithm. It's used by Ruby, Perl, OpenDNS, Rust, Redis, FreeBSD and more. The C reference implementation has been released under CC0 license (public domain).

Quote from SipHash's site:

SipHash is a family of pseudorandom functions (a.k.a. keyed hash functions) optimized for speed on short messages. Target applications include network traffic authentication and defense against hash-flooding DoS attacks.

siphash24 is the recommended variant with the best performance. It uses 2 rounds per message block and 4 finalization rounds. Besides the reference implementation several other implementations are available. Some are single-shot functions, others use a Merkle–Damgård construction-like approach with init, update and finalize functions. Marek Majkowski's C implementation csiphash [csiphash] defines the prototype of the function. (Note: k is split up into two uint64_t):

uint64_t siphash24(const void *src, unsigned long src_sz, const char k[16])

SipHash requires a 64-bit data type and is not compatible with pure C89 platforms.
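For reference, the SipHash-2-4 round structure is compact enough to sketch in pure Python (this follows the published reference algorithm; CPython's actual implementation is in C and uses the randomized _Py_HashSecret as the key):

```python
MASK64 = 0xFFFFFFFFFFFFFFFF

def _rotl(x: int, b: int) -> int:
    return ((x << b) | (x >> (64 - b))) & MASK64

def siphash24(key: bytes, data: bytes) -> int:
    """SipHash-2-4: 128-bit key, 64-bit output."""
    k0 = int.from_bytes(key[:8], "little")
    k1 = int.from_bytes(key[8:16], "little")
    v0 = k0 ^ 0x736F6D6570736575
    v1 = k1 ^ 0x646F72616E646F6D
    v2 = k0 ^ 0x6C7967656E657261
    v3 = k1 ^ 0x7465646279746573

    def sipround():
        nonlocal v0, v1, v2, v3
        v0 = (v0 + v1) & MASK64; v1 = _rotl(v1, 13); v1 ^= v0; v0 = _rotl(v0, 32)
        v2 = (v2 + v3) & MASK64; v3 = _rotl(v3, 16); v3 ^= v2
        v0 = (v0 + v3) & MASK64; v3 = _rotl(v3, 21); v3 ^= v0
        v2 = (v2 + v1) & MASK64; v1 = _rotl(v1, 17); v1 ^= v2; v2 = _rotl(v2, 32)

    i = 0
    while i + 8 <= len(data):          # two compression rounds per 8-byte block
        m = int.from_bytes(data[i:i + 8], "little")
        v3 ^= m; sipround(); sipround(); v0 ^= m
        i += 8
    # last block: remaining bytes plus the input length in the top byte
    m = ((len(data) & 0xFF) << 56) | int.from_bytes(data[i:], "little")
    v3 ^= m; sipround(); sipround(); v0 ^= m
    v2 ^= 0xFF                          # four finalization rounds
    for _ in range(4):
        sipround()
    return v0 ^ v1 ^ v2 ^ v3
```

Note how the input length is folded into the last block, which satisfies the length requirement stated above.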

MurmurHash

MurmurHash [murmur] is a family of non-cryptographic keyed hash functions developed by Austin Appleby. Murmur3 is the latest and fastest variant of MurmurHash. The C++ reference implementation has been released into the public domain. It features 32- or 128-bit output with a 32-bit seed. (Note: The out parameter is a buffer of either 4 or 16 bytes.)

Murmur3's function prototypes are:

void MurmurHash3_x86_32(const void *key, int len, uint32_t seed, void *out)

void MurmurHash3_x86_128(const void *key, int len, uint32_t seed, void *out)

void MurmurHash3_x64_128(const void *key, int len, uint32_t seed, void *out)

The 128-bit variants require a 64-bit data type and are not compatible with pure C89 platforms. The 32-bit variant is fully C89-compatible.

Aumasson, Bernstein and Boßlet have shown [sip] [ocert-2012-001] that Murmur3 is not resilient against hash collision attacks. Therefore Murmur3 can no longer be considered a secure algorithm. It may still be an alternative if hash collision attacks are of no concern.

CityHash

CityHash [city] is a family of non-cryptographic hash functions developed by Geoff Pike and Jyrki Alakuijala for Google. The C++ reference implementation has been released under the MIT license. The algorithm is partly based on MurmurHash and claims to be faster. It supports 64- and 128-bit output with a 128-bit seed as well as 32-bit output without a seed.

The relevant function prototype for 64-bit CityHash with 128-bit seed is:

uint64 CityHash64WithSeeds(const char *buf, size_t len, uint64 seed0,
                           uint64 seed1)

CityHash also offers SSE 4.2 optimizations with CRC32 intrinsic for long inputs. All variants except CityHash32 require 64-bit data types. CityHash32 uses only 32-bit data types but it doesn't support seeding.

As with MurmurHash, Aumasson, Bernstein and Boßlet have shown [sip] a similar weakness in CityHash.

DJBX33A

DJBX33A is a very simple multiplication and addition algorithm by Daniel J. Bernstein. It is fast and has low setup costs but it's not secure against hash collision attacks. Its properties make it a viable choice for small string hashing optimization.
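A minimal unseeded sketch of the algorithm (the variant behind CPython's Py_HASH_CUTOFF optimization additionally mixes in a secret suffix, as described in the hash secret section below):

```python
def djbx33a(data: bytes) -> int:
    """Bernstein's multiply-by-33-and-add hash, truncated to 64 bits."""
    h = 5381  # traditional DJB starting value
    for byte in data:
        h = (h * 33 + byte) & 0xFFFFFFFFFFFFFFFF
    return h
```

The setup cost is a single constant load, which is what makes it attractive for very short strings.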

Other

Crypto algorithms such as HMAC, MD5, SHA-1 or SHA-2 are too slow and have high setup and finalization costs. For these reasons they are not considered fit for this purpose. Modern AMD and Intel CPUs have AES-NI (AES instruction set) [aes-ni] to speed up AES encryption. CMAC with AES-NI might be a viable option but it's probably too slow for daily operation. (testing required)

Conclusion

SipHash provides the best combination of speed and security. Developers of other prominent projects have come to the same conclusion.

Small string optimization

Hash functions like SipHash24 have costly initialization and finalization code that can dominate the speed of the algorithm for very short strings. On the other hand Python calculates the hash value of short strings quite often. A simple and fast function especially for hashing small strings can make a measurable impact on performance. For example, these measurements were taken during a run of Python's regression tests. Additional measurements of other code have shown a similar distribution.

bytes hash() calls cumulative portion
1 18709 0.2%
2 737480 9.5%
3 636178 17.6%
4 1518313 36.7%
5 643022 44.9%
6 770478 54.6%
7 525150 61.2%
8 304873 65.1%
9 297272 68.8%
10 68191 69.7%
11 1388484 87.2%
12 480786 93.3%
13 52730 93.9%
14 65309 94.8%
15 44245 95.3%
16 85643 96.4%
Total 7921678  

However, a fast function like DJBX33A is not as secure as SipHash24. A cutoff at about 5 to 7 bytes should provide a decent safety margin and a speedup at the same time. The PEP's reference implementation provides such a cutoff with Py_HASH_CUTOFF. The optimization is disabled by default for several reasons. For one, the security implications are not yet clear and should be thoroughly studied before the optimization is enabled by default. Secondly, the performance benefits vary. On a 64-bit Linux system with an Intel Core i7, multiple runs of Python's benchmark suite [pybench] show average speedups between 3% and 5% for benchmarks such as django_v2, mako and etree with a cutoff of 7. Benchmarks with X86 binaries and Windows X86_64 builds on the same machine are a bit slower with small string optimization.

The state of small string optimization will be assessed during the beta phase of Python 3.4. The feature will either be enabled with appropriate values or the code will be removed before beta 2 is released.
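The cutoff dispatch described above can be sketched with stand-in hash functions (illustrative only; the real code selects between the C implementations of the seeded DJBX33A variant and SipHash24, and blake2b here is merely a stand-in for a strong keyed hash):

```python
import hashlib

HASH_CUTOFF = 7  # mirrors Py_HASH_CUTOFF
MASK64 = 0xFFFFFFFFFFFFFFFF

def small_string_hash(data: bytes) -> int:
    # DJBX33A-style fast path for short inputs (seeding omitted)
    h = 5381
    for byte in data:
        h = (h * 33 + byte) & MASK64
    return h

def full_hash(data: bytes) -> int:
    # stand-in for SipHash24: any strong 64-bit keyed hash fits here
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "little")

def hash_bytes(data: bytes) -> int:
    if len(data) == 0:
        return 0  # zero-length special case, see above
    if len(data) <= HASH_CUTOFF:
        return small_string_hash(data)
    return full_hash(data)
```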

C API additions

None of the C API modifications are part of the stable API.

hash secret

The _Py_HashSecret_t type of Python 2.6 to 3.3 has two members with either 32- or 64-bit length each. SipHash requires two 64-bit unsigned integers as keys. The typedef will be changed to a union with a guaranteed size of 24 bytes on all architectures. The union provides a 128-bit random key for SipHash24 and FNV as well as an additional 64-bit value for the optional small string optimization and the pyexpat seed. The additional 64-bit seed ensures that pyexpat or the small string optimization cannot reveal bits of the SipHash24 seed.

memory layout on 64 bit systems:

cccccccc cccccccc cccccccc  uc -- unsigned char[24]
pppppppp ssssssss ........  fnv -- two Py_hash_t
k0k0k0k0 k1k1k1k1 ........  siphash -- two PY_UINT64_T
........ ........ ssssssss  djbx33a -- 16 bytes padding + one Py_hash_t
........ ........ eeeeeeee  pyexpat XML hash salt

memory layout on 32 bit systems:

cccccccc cccccccc cccccccc  uc -- unsigned char[24]
ppppssss ........ ........  fnv -- two Py_hash_t
k0k0k0k0 k1k1k1k1 ........  siphash -- two PY_UINT64_T (if available)
........ ........ ssss....  djbx33a -- 16 bytes padding + one Py_hash_t
........ ........ eeee....  pyexpat XML hash salt

new type definition:

typedef union {
    /* ensure 24 bytes */
    unsigned char uc[24];
    /* two Py_hash_t for FNV */
    struct {
        Py_hash_t prefix;
        Py_hash_t suffix;
    } fnv;
#ifdef PY_UINT64_T
    /* two uint64 for SipHash24 */
    struct {
        PY_UINT64_T k0;
        PY_UINT64_T k1;
    } siphash;
#endif
    /* a different (!) Py_hash_t for small string optimization */
    struct {
        unsigned char padding[16];
        Py_hash_t suffix;
    } djbx33a;
    struct {
        unsigned char padding[16];
        Py_hash_t hashsalt;
    } expat;
} _Py_HashSecret_t;
PyAPI_DATA(_Py_HashSecret_t) _Py_HashSecret;

_Py_HashSecret_t is initialized in Python/random.c:_PyRandom_Init() exactly once at startup.
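The 24-byte guarantee and the overlapping members can be modelled with ctypes (an illustration of the layout only, not the actual header):

```python
import ctypes

class Fnv(ctypes.Structure):
    _fields_ = [("prefix", ctypes.c_ssize_t), ("suffix", ctypes.c_ssize_t)]

class SipHash(ctypes.Structure):
    _fields_ = [("k0", ctypes.c_uint64), ("k1", ctypes.c_uint64)]

class Djbx33a(ctypes.Structure):
    _fields_ = [("padding", ctypes.c_ubyte * 16), ("suffix", ctypes.c_ssize_t)]

class HashSecret(ctypes.Union):
    _fields_ = [
        ("uc", ctypes.c_ubyte * 24),  # forces the 24-byte size
        ("fnv", Fnv),
        ("siphash", SipHash),
        ("djbx33a", Djbx33a),
    ]

# all members alias the same block of random secret bytes
assert ctypes.sizeof(HashSecret) == 24
```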

hash function definition

Implementation:

typedef struct {
    /* function pointer to hash function, e.g. fnv or siphash24 */
    Py_hash_t (*const hash)(const void *, Py_ssize_t);
    const char *name;       /* name of the hash algorithm and variant */
    const int hash_bits;    /* internal size of hash value */
    const int seed_bits;    /* size of seed input */
} PyHash_FuncDef;

PyAPI_FUNC(PyHash_FuncDef*) PyHash_GetFuncDef(void);

autoconf

A new test is added to the configure script. The test sets HAVE_ALIGNED_REQUIRED when it detects a platform that requires aligned memory access for integers. Most current platforms such as X86, X86_64 and modern ARM don't need aligned data.

A new option --with-hash-algorithm enables the user to select a hash algorithm in the configure step.

hash function selection

The value of the macro Py_HASH_ALGORITHM defines which hash algorithm is used internally. It may be set to any of the three values Py_HASH_SIPHASH24, Py_HASH_FNV or Py_HASH_EXTERNAL. If Py_HASH_ALGORITHM is not defined at all, then the best available algorithm is selected. On platforms which don't require aligned memory access (HAVE_ALIGNED_REQUIRED not defined) and which provide an unsigned 64-bit integer type PY_UINT64_T, SipHash24 is used. On strict C89 platforms without a 64-bit data type, or on architectures such as SPARC, FNV is selected as the fallback. A hash algorithm can be selected with an autoconf option, for example ./configure --with-hash-algorithm=fnv.

The value Py_HASH_EXTERNAL allows 3rd parties to provide their own implementation at compile time.

Implementation:

#if Py_HASH_ALGORITHM == Py_HASH_EXTERNAL
extern PyHash_FuncDef PyHash_Func;
#elif Py_HASH_ALGORITHM == Py_HASH_SIPHASH24
static PyHash_FuncDef PyHash_Func = {siphash24, "siphash24", 64, 128};
#elif Py_HASH_ALGORITHM == Py_HASH_FNV
static PyHash_FuncDef PyHash_Func = {fnv, "fnv", 8 * sizeof(Py_hash_t),
                                     16 * sizeof(Py_hash_t)};
#endif

Python API addition

sys module

The sys module already has a hash_info struct sequence. More fields are added to the object to reflect the active hash algorithm and its properties.

sys.hash_info(width=64,
              modulus=2305843009213693951,
              inf=314159,
              nan=0,
              imag=1000003,
              # new fields:
              algorithm='siphash24',
              hash_bits=64,
              seed_bits=128,
              cutoff=0)
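Since Python 3.4 the active configuration can be inspected at runtime (the reported name varies with the build and the interpreter version; newer CPython releases report e.g. 'siphash13'):

```python
import sys

info = sys.hash_info
# e.g. "siphash24 64 128 0" on a 64-bit CPython 3.4 build
print(info.algorithm, info.hash_bits, info.seed_bits, info.cutoff)
```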

Necessary modifications to C code

_Py_HashBytes() (Objects/object.c)

_Py_HashBytes is an internal helper function that provides the hashing code for bytes, memoryview and datetime classes. It currently implements FNV for unsigned char *.

The function is moved to Python/pyhash.c and modified to use the hash function through PyHash_Func.hash(). The function signature is altered to take a const void * as first argument. _Py_HashBytes also takes care of the special cases: it maps zero-length input to 0 and a return value of -1 to -2.

bytes_hash() (Objects/bytesobject.c)

bytes_hash uses _Py_HashBytes to provide the tp_hash slot function for bytes objects. The function will continue to use _Py_HashBytes but without a type cast.

memory_hash() (Objects/memoryobject.c)

memory_hash provides the tp_hash slot function for read-only memory views if the original object is hashable, too. It's the only function that has to support hashing of unaligned memory segments in the future. The function will continue to use _Py_HashBytes but without a type cast.

unicode_hash() (Objects/unicodeobject.c)

unicode_hash provides the tp_hash slot function for unicode. Right now it implements the FNV algorithm three times for unsigned char*, Py_UCS2 and Py_UCS4. A reimplementation of the function must take care to use the correct length. Since the macro PyUnicode_GET_LENGTH returns the length of the unicode string and not its size in octets, the length must be multiplied by the size of the internal unicode kind:

if (PyUnicode_READY(u) == -1)
    return -1;
x = _Py_HashBytes(PyUnicode_DATA(u),
                  PyUnicode_GET_LENGTH(u) * PyUnicode_KIND(u));

generic_hash() (Modules/_datetimemodule.c)

generic_hash acts as a wrapper around _Py_HashBytes for the tp_hash slots of the date, time and datetime types. timedelta objects are hashed by their state (days, seconds, microseconds) and tzinfo objects are not hashable. The data members of the date, time and datetime types' structs are not void* aligned. This can easily be fixed by memcpy()ing four to ten bytes into an aligned buffer.

Performance

In general the PEP 456 code with SipHash24 is about as fast as the old code with FNV. SipHash24 seems to make better use of modern compilers, CPUs and large L1 caches. Several benchmarks show a small speed improvement on 64-bit CPUs such as Intel Core i5 and Intel Core i7 processors. 32-bit builds and benchmarks on older CPUs such as an AMD Athlon X2 are slightly slower with SipHash24. The performance differences are so small that they should not affect any application code.

The benchmarks were conducted on CPython default branch revision b08868fd5994 and the PEP repository [pep-456-repos]. All upstream changes were merged into the pep-456 branch. The "performance" CPU governor was configured and almost all programs were stopped so the benchmarks were able to utilize TurboBoost and the CPU caches as much as possible. The raw benchmark results of multiple machines and platforms are made available at [benchmarks].

Hash value distribution

A good distribution of hash values is important for dict and set performance. Both SipHash24 and FNV take the length of the input into account, so that strings made up entirely of NULL bytes don't have the same hash value. The last bytes of the input tend to affect the least significant bits of the hash value, too. That attribute reduces the amount of hash collisions for strings with a common prefix.

Typical length

Serhiy Storchaka has shown in [issue16427] that a modified FNV implementation with 64 bits per cycle is able to process long strings several times faster than the current FNV implementation.

However, according to statistics [issue19183], in a typical Python program as well as in the Python test suite about 50% of the hashed strings are small strings between 1 and 6 bytes. Only 5% of the strings are larger than 16 bytes.

Grand Unified Python Benchmark Suite

Initial tests with an experimental implementation and the Grand Unified Python Benchmark Suite have shown minimal deviations. The summarized total runtime of the benchmark is within 1% of the runtime of an unmodified Python 3.4 binary. The tests were run on an Intel i7-2860QM machine with a 64-bit Linux installation. The interpreter was compiled with GCC 4.7 for 64- and 32-bit.

More benchmarks will be conducted.

Backwards Compatibility

The modifications don't alter any existing API.

The output of hash() for strings and bytes is going to be different. The hash values for ASCII Unicode and ASCII bytes will stay equal.

Alternative counter measures against hash collision DoS

Three alternative countermeasures against hash collisions were discussed in the past, but are not subject of this PEP.

  1. Marc-Andre Lemburg has suggested that dicts shall count hash collisions. In case an insert operation causes too many collisions an exception shall be raised.
  2. Some applications (e.g. PHP) limit the number of keys for GET and POST HTTP requests. The approach effectively limits the impact of a hash collision attack. (XXX citation needed)
  3. Hash maps have a worst case of O(n) for insertion and lookup of keys. This results in a quadratic runtime during a hash collision attack. The introduction of a new, additional data structure with O(log n) worst case behavior would eliminate the root cause. Data structures like red-black trees or prefix trees (tries [trie]) would have other benefits, too. Prefix trees with string keys can reduce memory usage, as common prefixes are stored within the tree structure.

Discussion

Pluggable

The first draft of this PEP made the hash algorithm pluggable at runtime. It supported multiple hash algorithms in one binary to give the user the possibility to select a hash algorithm at startup. The approach was considered an unnecessary complication by several core committers [pluggable]. Subsequent versions of the PEP aim for compile time configuration.

Non-aligned memory access

The implementation of SipHash24 was criticized because it ignores the issue of non-aligned memory and therefore doesn't work on architectures that require alignment of integer types. The PEP deliberately neglects this special case and doesn't support SipHash24 on such platforms. It's simply not considered worth the trouble until proven otherwise. All major platforms like X86, X86_64 and ARMv6+ can handle unaligned memory with minimal or even no speed impact. [alignmentmyth]

Almost every block is properly aligned anyway. At present bytes' and str's data are always aligned. Only memoryviews can point to unaligned blocks under rare circumstances. The PEP implementation is optimized and simplified for the common case.

ASCII str / bytes hash collision

Since the implementation of [pep-0393] bytes and ASCII text have the same memory layout. Because of this the new hashing API will keep the invariant:

hash("ascii string") == hash(b"ascii string")

for ASCII strings and ASCII bytes. Equal hash values result in a hash collision and therefore cause a minor speed penalty for dicts and sets with mixed keys. The cause of the collision could be removed, e.g. by subtracting 2 from the hash value of bytes (-2 because hash(b"") == 0 and -1 is reserved). The PEP doesn't change the hash value.
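The invariant can be observed directly in any CPython 3 build; hash randomization does not break it, because str and bytes are hashed over the same memory layout with the same secret:

```python
s = "ascii string"
# ASCII text and the equivalent bytes produce the same hash value
assert hash(s) == hash(s.encode("ascii"))
```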

pep-0457 Syntax For Positional-Only Parameters

PEP:457
Title:Syntax For Positional-Only Parameters
Version:$Revision$
Last-Modified:$Date$
Author:Larry Hastings <larry at hastings.org>
Discussions-To:Python-Dev <python-dev at python.org>
Status:Draft
Type:Informational
Content-Type:text/x-rst
Created:08-Oct-2013

Overview

This PEP proposes a syntax for positional-only parameters in Python. Positional-only parameters are parameters without an externally-usable name; when a function accepting positional-only parameters is called, positional arguments are mapped to these parameters based solely on their position.

Rationale

Python has always supported positional-only parameters. Early versions of Python lacked the concept of specifying parameters by name, so naturally all parameters were positional-only. This changed around Python 1.0, when all parameters suddenly became positional-or-keyword. But, even in current versions of Python, many CPython "builtin" functions still only accept positional-only arguments.

Functions implemented in modern Python can accept an arbitrary number of positional-only arguments, via the variadic *args parameter. However, there is no Python syntax to specify accepting a specific number of positional-only parameters. Put another way, there are many builtin functions whose signatures are simply not expressible with Python syntax.

This PEP proposes a backwards-compatible syntax that should permit implementing any builtin in pure Python code.

Positional-Only Parameter Semantics In Current Python

There are many, many examples of builtins that only accept positional-only parameters. The resulting semantics are easily experienced by the Python programmer--just try calling one, specifying its arguments by name:

>>> pow(x=5, y=3)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: pow() takes no keyword arguments

In addition, there are some functions with particularly interesting semantics:

  • range(), which accepts an optional parameter to the left of its required parameter. [2]
  • dict(), whose mapping/iterator parameter is optional and semantically must be positional-only. Any externally visible name for this parameter would occlude that name going into the **kwargs keyword variadic parameter dict! [1]

Obviously one can simulate any of these in pure Python code by accepting (*args, **kwargs) and parsing the arguments by hand. But this results in a disconnect between the Python function's signature and what it actually accepts, not to mention the work of implementing said argument parsing.

Motivation

This PEP does not propose we implement positional-only parameters in Python. The goal of this PEP is simply to define the syntax, so that:

  • Documentation can clearly, unambiguously, and consistently express exactly how the arguments for a function will be interpreted.
  • The syntax is reserved for future use, in case the community decides someday to add positional-only parameters to the language.
  • Argument Clinic can use a variant of the syntax as part of its input when defining the arguments for built-in functions.

The Current State Of Documentation For Positional-Only Parameters

The documentation for positional-only parameters is incomplete and inconsistent:

  • Some functions denote optional groups of positional-only arguments by enclosing them in nested square brackets. [3]
  • Some functions denote optional groups of positional-only arguments by presenting multiple prototypes with varying numbers of arguments. [4]
  • Some functions use both of the above approaches. [2] [5]

One more important idea to consider: currently in the documentation there's no way to tell whether a function takes positional-only parameters. open() accepts keyword arguments, ord() does not, but there is no way of telling just by reading the documentation that this is true.

Syntax And Semantics

From the "ten-thousand foot view", and ignoring *args and **kwargs for now, the grammar for a function definition currently looks like this:

def name(positional_or_keyword_parameters, *, keyword_only_parameters):

Building on that perspective, the new syntax for functions would look like this:

def name(positional_only_parameters, /, positional_or_keyword_parameters,
         *, keyword_only_parameters):

All parameters before the / are positional-only. If / is not specified in a function signature, that function does not accept any positional-only parameters.
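A plain / separator (without option groups) was eventually added to the language by PEP 570 in Python 3.8, so these basic semantics can be tried there directly (clamp is a hypothetical example function, not part of any API):

```python
def clamp(value, lo, hi, /):
    # lo and hi are positional-only; passing them by keyword is a TypeError
    return max(lo, min(hi, value))

clamp(5, 0, 3)          # fine, returns 3
# clamp(5, lo=0, hi=3)  # TypeError: positional-only arguments passed as keyword
```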

Positional-only parameters can be optional, but the mechanism is significantly different from positional-or-keyword or keyword-only parameters. Positional-only parameters don't accept default values. Instead, positional-only parameters can be specified in optional "groups". Groups of parameters are surrounded by square brackets, like so:

def addch([y, x,] ch, [attr,] /):

Positional-only parameters that are not in an option group are "required" positional-only parameters. All "required" positional-only parameters must be contiguous.

Parameters in an optional group accept arguments as a group; you must provide arguments either for all of them or for none of them. Using the example of addch() above, you could not call addch() in such a way that x was specified but y was not (and vice versa). The mapping of positional parameters to optional groups is done based on fitting the number of parameters to the groups. Based on the above definition, addch() would assign arguments to parameters in the following way:

Number of arguments Parameter assignment
0 raises an exception
1 ch
2 ch, attr
3 y, x, ch
4 y, x, ch, attr
5 or more raises an exception
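The table's mapping can be emulated in current Python by dispatching on the argument count (an illustrative sketch, not the proposed syntax):

```python
def addch(*args):
    """Resolve the optional groups [y, x] and [attr] by argument count."""
    y = x = attr = None
    if len(args) == 1:
        (ch,) = args
    elif len(args) == 2:
        ch, attr = args
    elif len(args) == 3:
        y, x, ch = args
    elif len(args) == 4:
        y, x, ch, attr = args
    else:
        raise TypeError("addch() takes 1 to 4 arguments")
    return y, x, ch, attr
```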

More semantics of positional-only parameters:

  • Although positional-only parameters technically have names, these names are internal-only; positional-only parameters are never externally addressable by name. (Similarly to *args and **kwargs.)
  • It's possible to nest option groups.
  • If there are no required parameters, all option groups behave as if they're to the right of the required parameter group.
  • For clarity and consistency, the comma for a parameter always comes immediately after the parameter name. It's a syntax error to specify a square bracket between the name of a parameter and the following comma. (This is far more readable than putting the comma outside the square bracket, particularly for nested groups.)
  • If there are arguments after the /, then you must specify a comma after the /, just as there is a comma after the * denoting the shift to keyword-only parameters.
  • This syntax has no effect on *args or **kwargs.

It's possible to specify a function prototype where the mapping of arguments to parameters is ambiguous. Consider:

def range([start,] stop, [step,] /):

Python disambiguates these situations by preferring optional groups to the left of the required group.

Additional Limitations

Argument Clinic uses a form of this syntax for specifying builtins. It imposes further limitations that are theoretically unnecessary but make the implementation easier. Specifically:

  • A function that has positional-only parameters currently cannot have any other kind of parameter. (This will probably be relaxed slightly in the near future.)

  • Multiple option groups on either side of the required positional-only parameters must be nested, with the nesting getting deeper the further away the group is from the required positional-parameter group.

    Put another way: all the left-brackets for option groups to the left of the required group must be specified contiguously, and all the right-brackets for option groups to the right of the required group must be specified contiguously.

Notes For A Future Implementor

If we decide to implement positional-only parameters in a future version of Python, we'd have to do some additional work to preserve their semantics. The problem: how do we inform a parameter that no value was passed in for it when the function was called?

The obvious solution: add a new singleton constant to Python that is passed in when a parameter is not mapped to an argument. I propose that the value be called undefined, and be a singleton of a special class called Undefined. If a positional-only parameter did not receive an argument when called, its value would be set to undefined.

But this raises a further problem. How can we tell the difference between "this positional-only parameter did not receive an argument" and "the caller passed in undefined for this parameter"?

It'd be nice to make it illegal to pass undefined in as an argument to a function--to, say, raise an exception. But that would slow Python down, and the "consenting adults" rule appears applicable here. So making it illegal should probably be strongly discouraged but not outright prevented.

However, it should be allowed (and encouraged) for user functions to specify undefined as a default value for parameters.

Unresolved Questions

There are three types of parameters in Python:

  1. positional-only parameters,
  2. positional-or-keyword parameters, and
  3. keyword-only parameters.

Python allows functions to have both 2 and 3. And some builtins (e.g. range) have both 1 and 3. Does it make sense to have functions that have both 1 and 2? Or all of the above?

Thanks

Credit for the use of '/' as the separator between positional-only and positional-or-keyword parameters goes to Guido van Rossum, in a proposal from 2012. [6]

Credit for making left option groups higher precedence goes to Nick Coghlan. (Conversation in person at PyCon US 2013.)

[1]http://docs.python.org/3/library/stdtypes.html#dict
[2](1, 2) http://docs.python.org/3/library/functions.html#func-range
[3]http://docs.python.org/3/library/curses.html#curses.window.border
[4]http://docs.python.org/3/library/os.html#os.sendfile
[5]http://docs.python.org/3/library/curses.html#curses.window.addch
[6]Guido van Rossum, posting to python-ideas, March 2012: http://mail.python.org/pipermail/python-ideas/2012-March/014364.html and http://mail.python.org/pipermail/python-ideas/2012-March/014378.html and http://mail.python.org/pipermail/python-ideas/2012-March/014417.html

pep-0458 Surviving a Compromise of PyPI

PEP:458
Title:Surviving a Compromise of PyPI
Version:$Revision$
Last-Modified:$Date$
Author:Trishank Karthik Kuppusamy <trishank at nyu.edu>, Vladimir Diaz <vladimir.diaz at nyu.edu>, Donald Stufft <donald at stufft.io>, Justin Cappos <jcappos at nyu.edu>
BDFL-Delegate:Richard Jones <r1chardj0n3s@gmail.com>
Discussions-To:DistUtils mailing list <distutils-sig at python.org>
Status:Draft
Type:Standards Track
Content-Type:text/x-rst
Created:27-Sep-2013

Abstract

This PEP proposes how the Python Package Index (PyPI [1]) should be integrated with The Update Framework [2] (TUF). TUF was designed to be a flexible security add-on to a software updater or package manager. The framework integrates best security practices such as separating role responsibilities, adopting the many-man rule for signing packages, keeping signing keys offline, and revocation of expired or compromised signing keys. For example, attackers would have to steal multiple signing keys stored independently to compromise a role responsible for specifying a repository's available files. Another role responsible for indicating the latest snapshot of the repository may have to be similarly compromised, and independent of the first compromised role.

The proposed integration will allow modern package managers such as pip [3] to be more secure against various types of security attacks on PyPI and protect users from such attacks. Specifically, this PEP describes how PyPI processes should be adapted to generate and incorporate TUF metadata (i.e., the minimum security model). The minimum security model supports verification of PyPI distributions that are signed with keys stored on PyPI: distributions uploaded by developers are signed by PyPI, require no action from developers (other than uploading the distribution), and are immediately available for download. The minimum security model also minimizes PyPI administrative responsibilities by automating much of the signing process.

This PEP does not prescribe how package managers such as pip should be adapted to install or update projects from PyPI with TUF metadata. Package managers interested in adopting TUF on the client side may consult TUF's library documentation [27], which exists for this purpose. Support for project distributions that are signed by developers (maximum security model) is also not discussed in this PEP, but is outlined in the appendix as a possible future extension and covered in detail in PEP 480 [26]. The PEP 480 extension focuses on the maximum security model, which requires more PyPI administrative work (none by clients), but it also proposes an easy-to-use key management solution for developers, how to interface with a potential future build farm on PyPI infrastructure, and discusses the feasibility of end-to-end signing.

Motivation

In January 2013, the Python Software Foundation (PSF) announced [4] that the python.org wikis for Python, Jython, and the PSF were subjected to a security breach that caused all of the wiki data to be destroyed on January 5, 2013. Fortunately, the PyPI infrastructure was not affected by this security breach. However, the incident is a reminder that PyPI should take defensive steps to protect users as much as possible in the event of a compromise. Attacks on software repositories happen all the time [5]. The PSF must accept the possibility of security breaches and prepare PyPI accordingly because it is a valuable resource used by thousands, if not millions, of people.

Before the wiki attack, PyPI used MD5 hashes to tell package managers, such as pip, whether or not a package was corrupted in transit. However, the absence of SSL made it hard for package managers to verify transport integrity to PyPI. It was therefore easy to launch a man-in-the-middle attack between pip and PyPI, and change package content arbitrarily. Users could be tricked into installing malicious packages with man-in-the-middle attacks. After the wiki attack, several steps were proposed (some of which were implemented) to deliver a much higher level of security than was previously the case: requiring SSL to communicate with PyPI [6], restricting project names [7], and migrating from MD5 to SHA-2 hashes [8].

These steps, though necessary, are insufficient because attacks are still possible through other avenues. For example, a public mirror is trusted to honestly mirror PyPI, but some mirrors may misbehave due to malice or accident. Package managers such as pip are supposed to use signatures from PyPI to verify packages downloaded from a public mirror [9], but none are known to actually do so [10]. Therefore, it would be wise to add more security measures to detect attacks from public mirrors or content delivery networks [11] (CDNs).

Even though official mirrors are being deprecated on PyPI [12], there remain a wide variety of other attack vectors on package managers [13]. These attacks can crash client systems, cause obsolete packages to be installed, or even allow an attacker to execute arbitrary code. In September 2013 [28], a post was made to the Distutils mailing list showing that the latest version of pip (at the time) was susceptible to such attacks, and how TUF could protect users against them [14]. Specifically, testing was done to see how pip would respond to these attacks with and without TUF. Attacks tested included replay and freeze, arbitrary packages, slow retrieval, and endless data. The post also included a demonstration of how pip would respond if PyPI were compromised.

With the intent to protect PyPI against infrastructure compromises, this PEP proposes integrating PyPI with The Update Framework [2] (TUF). TUF helps secure new or existing software update systems. Software update systems are vulnerable to many known attacks, including those that can result in clients being compromised or crashed. TUF solves these problems by providing a flexible security framework that can be added to software updaters.

Threat Model

The threat model assumes the following:

  • Offline keys are safe and securely stored.
  • Attackers can compromise at least one of PyPI's trusted keys stored online, and may do so at once or over a period of time.
  • Attackers can respond to client requests.

An attacker is considered successful if they can cause a client to install (or leave installed) something other than the most up-to-date version of the software the client is updating. If the attacker is preventing the installation of updates, they want clients to not realize there is anything wrong.

Definitions

The keywords "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119 [29].

This PEP focuses on integrating TUF with PyPI; however, the reader is encouraged to read about TUF's design principles [2]. It is also RECOMMENDED that the reader be familiar with the TUF specification [16].

Terms used in this PEP are defined as follows:

  • Projects: Projects are software components that are made available for integration. Projects include Python libraries, frameworks, scripts, plugins, applications, collections of data or other resources, and various combinations thereof. Public Python projects are typically registered on the Python Package Index [17].
  • Releases: Releases are uniquely identified snapshots of a project [17].
  • Distributions: Distributions are the packaged files that are used to publish and distribute a release [17].
  • Simple index: The HTML page that contains internal links to the distributions of a project [17].
  • Roles: There is one root role in PyPI. There are multiple roles whose responsibilities are delegated to them directly or indirectly by the root role. The term top-level role refers to the root role and any role delegated by the root role. Each role has a single metadata file that it is trusted to provide.
  • Metadata: Metadata are signed files that describe roles, other metadata, and target files.
  • Repository: A repository is a resource comprised of named metadata and target files. Clients request metadata and target files stored on a repository.
  • Consistent snapshot: A set of TUF metadata and PyPI targets that capture the complete state of all projects on PyPI as they existed at some fixed point in time.
  • The snapshot (release) role: In order to prevent confusion due to the different meanings of the term "release" used in PEP 426 [17] and the TUF specification [16], the release role is renamed as the snapshot role.
  • Developer: Either the owner or maintainer of a project who is allowed to update the TUF metadata as well as distribution metadata and files for the project.
  • Online key: A private cryptographic key that MUST be stored on the PyPI server infrastructure. This is usually to allow automated signing with the key. However, an attacker who compromises the PyPI infrastructure will be able to read these keys.
  • Offline key: A private cryptographic key that MUST be stored independent of the PyPI server infrastructure. This prevents automated signing with the key. An attacker who compromises the PyPI infrastructure will not be able to immediately read these keys.
  • Threshold signature scheme: A role can increase its resilience to key compromises by specifying that at least t out of n keys are REQUIRED to sign its metadata. A compromise of t-1 keys is insufficient to compromise the role itself. Saying that a role requires (t, n) keys denotes the threshold signature property.
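
The (t, n) threshold property can be sketched in a few lines: metadata is trusted only if at least t distinct trusted keys produced valid signatures over it. The verify primitive and the signature dictionary layout below are hypothetical names for illustration, not the actual TUF implementation.

```python
# Sketch of (t, n) threshold signature checking, assuming a hypothetical
# verify(key, signature, message) primitive; illustrative only, not the
# actual TUF implementation.
def meets_threshold(metadata, signatures, trusted_keys, t, verify):
    """Return True if at least t distinct trusted keys signed the metadata."""
    valid_keyids = set()
    for sig in signatures:
        key = trusted_keys.get(sig["keyid"])
        if key is not None and verify(key, sig["sig"], metadata):
            valid_keyids.add(sig["keyid"])  # each key counts at most once
    return len(valid_keyids) >= t
```

Because duplicate signatures from the same key count only once, a compromise of t-1 keys is never sufficient to compromise the role.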

Overview of TUF

At its highest level, TUF provides applications with a secure method of obtaining files and knowing when new versions of files are available. On the surface, this all sounds simple. The basic steps for updating applications are:

  • Knowing when an update exists.
  • Downloading a correct copy of the latest version of an updated file.

The problem is that updating applications is only simple when there are no malicious activities in the picture. If an attacker is trying to interfere with these seemingly simple steps, there is plenty they can do.

Assume a software updater takes the approach of most systems (at least the ones that try to be secure). It downloads both the file it wants and a cryptographic signature of the file. The software updater already knows which key it trusts to make the signature. It checks that the signature is correct and was made by this trusted key. Unfortunately, the software updater is still at risk in many ways, including:

  • An attacker keeps giving the software updater the same update file, so it never realizes there is an update.
  • An attacker gives the software updater an older, insecure version of a file that it already has, so it downloads that one and blindly uses it thinking it is newer.
  • An attacker gives the software updater a newer version of a file it has but it is not the newest one. The file is newer to the software updater, but it may be insecure and exploitable by the attacker.
  • An attacker compromises the key used to sign these files and now the software updater downloads a malicious file that is properly signed.

TUF is designed to address these attacks, and others, by adding signed metadata (text files that describe the repository's files) to the repository and referencing the metadata files during the update procedure. Repository files are verified against the information included in the metadata before they are handed off to the software update system. The framework also provides multi-signature trust, explicit and implicit revocation of cryptographic keys, responsibility separation of the metadata, and minimizes key risk. For a full list and outline of the repository attacks and software updater weaknesses addressed by TUF, see Appendix A.
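
Two of the defenses above can be sketched concretely: rejecting rollbacks (a downloaded metadata version must never be lower than the version already trusted) and rejecting expired metadata (which bounds how long a freeze attack can persist). The field names and date format below are assumptions for illustration, not the exact TUF wire format.

```python
import datetime

# Illustrative sketch of two checks a TUF client applies to downloaded
# metadata: rejecting rollbacks (the version number must never decrease)
# and rejecting expired metadata (which bounds freeze attacks).
def check_metadata(new, trusted_version, now):
    if new["version"] < trusted_version:
        raise ValueError("rollback attack: version went backwards")
    expires = datetime.datetime.strptime(new["expires"], "%Y-%m-%dT%H:%M:%SZ")
    if now >= expires:
        raise ValueError("metadata expired; possible freeze attack")
    return new["version"]
```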

Integrating TUF with PyPI

A software update system must complete two main tasks to integrate with TUF. First, it must add the framework to the client side of the update system. For example, TUF MAY be integrated with the pip package manager. Second, the repository on the server side MUST be modified to provide signed TUF metadata. This PEP is concerned with the second part of the integration, and the changes required on PyPI to support software updates with TUF.

What Additional Repository Files are Required on PyPI?

In order for package managers like pip to download and verify packages with TUF, a few extra files MUST exist on PyPI. These extra repository files are called TUF metadata. TUF metadata contains information such as which keys are trustable, the cryptographic hashes of files, signatures to the metadata, metadata version numbers, and the date after which the metadata should be considered expired.

When a package manager wants to check for updates, it asks TUF to do the work. That is, a package manager never has to deal with this additional metadata or understand what's going on underneath. If TUF reports back that there are updates available, a package manager can then ask TUF to download these files from PyPI. TUF downloads them and checks them against the TUF metadata that it also downloads from the repository. If the downloaded target files are trustworthy, TUF then hands them over to the package manager.
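
The final hand-off step can be sketched as follows: before TUF gives a downloaded target file to the package manager, it compares the file against the length and hash recorded in the (already verified) targets metadata. The fileinfo layout shown is an assumption for illustration.

```python
import hashlib

# Minimal sketch of the trustworthiness check applied to a downloaded
# target file before it reaches the package manager. The fileinfo layout
# is an assumed shape, not the exact TUF metadata format.
def verify_target(data, fileinfo):
    if len(data) != fileinfo["length"]:
        raise ValueError("length mismatch")
    if hashlib.sha256(data).hexdigest() != fileinfo["hashes"]["sha256"]:
        raise ValueError("hash mismatch")
    return True
```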

The Metadata [30] document provides information about each of the required metadata and their expected content. The next section covers the different kinds of metadata RECOMMENDED for PyPI.

PyPI and TUF Metadata

TUF metadata provides information that clients can use to make update decisions. For example, a targets metadata lists the available distributions on PyPI and includes the distribution's signatures, cryptographic hashes, and file sizes. Different metadata files provide different information. The various metadata files are signed by different roles, which are indicated by the root role. The concept of roles allows TUF to delegate responsibilities to multiple roles and minimizes the impact of a compromised role.

TUF requires four top-level roles. These are root, timestamp, snapshot, and targets. The root role specifies the public cryptographic keys of the top-level roles (including its own). The timestamp role references the latest snapshot and can signify when a new snapshot of the repository is available. The snapshot role indicates the latest version of all the TUF metadata files (other than timestamp). The targets role lists the available target files (in our case, it will be all files on PyPI under the /simple and /packages directories). Each top-level role will serve its responsibilities without exception. Figure 1 provides a table of the roles used in TUF.

pep-0458-1.png

Figure 1: An overview of the TUF roles.

Signing Metadata and Repository Management

The top-level root role signs for the keys of the top-level timestamp, snapshot, targets, and root roles. The timestamp role signs for every new snapshot of the repository metadata. The snapshot role signs for root, targets, and all delegated roles. The bins roles (delegated roles) sign for all distributions belonging to registered PyPI projects.

Figure 2 provides an overview of the roles available within PyPI, which includes the top-level roles and the roles delegated by targets. The figure also indicates the types of keys used to sign each role and which roles are trusted to sign for files available on PyPI. The next two sections cover the details of signing repository files and the types of keys used for each role.

pep-0458-2.png

Figure 2: An overview of the role metadata available on PyPI.

The roles that change most frequently are timestamp, snapshot, and the delegated roles (bins and its delegated roles). The timestamp and snapshot metadata MUST be updated whenever root, targets, or delegated metadata are updated. Note, though, that root and targets metadata are updated far less often than delegated metadata. Therefore, timestamp and snapshot metadata will most likely be updated frequently (possibly every minute), because delegated metadata is updated frequently in order to support continuous delivery of projects. Continuous delivery is a set of processes that PyPI uses to produce snapshots that can safely coexist and be deleted independent of other snapshots [18].

Every year, PyPI administrators SHOULD sign for root and targets role keys. Automation will continuously sign for a timestamped snapshot of all projects. A repository management [31] tool is available that can sign metadata files, generate cryptographic keys, and manage a TUF repository.

How to Establish Initial Trust in the PyPI Root Keys

Package managers like pip need to ship a file called "root.json" with the installation files that users initially download. This file includes information about the keys trusted for certain roles, as well as the root keys themselves. Any new version of "root.json" that clients download is verified against the root keys that clients initially trust. If a root key is compromised, but a threshold of keys is still secure, the PyPI administrator MUST push a new release that revokes trust in the compromised keys. If a threshold of root keys is compromised, then "root.json" must be updated out-of-band; however, the threshold should be chosen so that this is extremely unlikely. The TUF client library does not require manual intervention if root keys are revoked or added: the update process handles the cases where "root.json" has changed.

To bundle the software, "root.json" MUST be included in the version of pip shipped with CPython (via ensurepip). The TUF client library then loads the root metadata and downloads the rest of the roles, including updating "root.json" if it has changed. An outline of the update process [32] is available.
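
The root-of-trust update can be sketched as follows: a newly downloaded "root.json" is accepted only if a threshold of the previously trusted root keys signed it. The verify_threshold primitive and the key field names are hypothetical, chosen here only for illustration.

```python
# Sketch of the root rotation rule: a new "root.json" is accepted only if
# a threshold of the *previously trusted* root keys signed it, so clients
# need no manual intervention unless a threshold of root keys was itself
# compromised. verify_threshold(metadata, keys, t) -> bool is an assumed
# primitive, and the field names are hypothetical.
def accept_new_root(new_root, trusted_root, verify_threshold):
    old_keys = trusted_root["root_keys"]
    t = trusted_root["root_threshold"]
    if not verify_threshold(new_root, old_keys, t):
        raise ValueError("new root not signed by a threshold of trusted root keys")
    return new_root
```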

Minimum Security Model

There are two security models to consider when integrating TUF with PyPI. The one proposed in this PEP is the minimum security model, which supports verification of PyPI distributions that are signed with private cryptographic keys stored on PyPI. Distributions uploaded by developers are signed by PyPI and immediately available for download. A possible future extension to this PEP, discussed in Appendix B, proposes the maximum security model, which allows developers to sign for their own projects. Developer keys are not stored online; therefore, projects are safe from PyPI compromises.

The minimum security model requires no action from a developer and protects against malicious CDNs [19] and public mirrors. To support continuous delivery of uploaded packages, PyPI signs for projects with an online key. This level of security prevents projects from being accidentally or deliberately tampered with by a mirror or a CDN because the mirror or CDN will not have any of the keys required to sign for projects. However, it does not protect projects from attackers who have compromised PyPI, since attackers can manipulate TUF metadata using the keys stored online.

This PEP proposes that the bins role (and its delegated roles) sign for all PyPI projects with an online key. The targets role, which only signs with an offline key, MUST delegate all PyPI projects to the bins role. This means that when a package manager such as pip (i.e., using TUF) downloads a distribution from a project on PyPI, it will consult the bins role about the TUF metadata for the project. If no bin roles delegated by bins specify the project's distribution, then the project is considered to be non-existent on PyPI.

Metadata Expiry Times

The root and targets role metadata SHOULD expire in one year, because these two metadata files are expected to change very rarely.

The timestamp, snapshot, and bins metadata SHOULD expire in one day because a CDN or mirror SHOULD synchronize itself with PyPI every day. Furthermore, this generous time frame also takes into account client clocks that are highly skewed or adrift.

Metadata Scalability

Due to the growing number of projects and distributions, TUF metadata will also grow correspondingly. For example, consider the bins role. In August 2013, it was found that the size of the bins metadata was about 42MB if the bins role itself signed for about 220K PyPI targets (which are simple indices and distributions). This PEP does not delve into the details, but TUF features a so-called "lazy bin walk [33]" scheme that splits a large targets metadata file into many small ones. This allows a TUF client updater to intelligently download only a small number of TUF metadata files in order to update any project signed for by the bins role. For example, applying this scheme to the previous repository resulted in pip downloading between 1.3KB and 111KB to install or upgrade a PyPI project via TUF.

Based on our findings as of the time of writing, PyPI SHOULD split all targets in the bins role by delegating them to 1024 delegated roles, each of which would sign for PyPI targets whose hashes fall into that "bin" or delegated role (see Figure 2). It was found that 1024 bins would result in the bins metadata, and each of its delegated roles, being about the same size (40-50KB) for about 220K PyPI targets (simple indices and distributions).
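
The hashed-bin delegation can be sketched as follows: each target path is mapped deterministically to one of the 1024 bins by its SHA-256 hash, so a client only needs the metadata for the single bin its target falls into. The exact naming and mapping scheme PyPI would use may differ; this is an assumed layout.

```python
import hashlib

NUM_BINS = 1024  # as recommended above: 2**10 delegated "bin" roles

def bin_for_target(target_path):
    """Map a target's path to one of the 1024 delegated bin roles.

    Sketch only: the real PyPI bin naming and mapping may differ.
    """
    digest = hashlib.sha256(target_path.encode("utf-8")).hexdigest()
    # 1024 = 2**10, so the first 3 hex digits (12 bits) mod 1024 select a bin.
    return int(digest[:3], 16) % NUM_BINS
```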

It is possible to make TUF metadata more compact by representing it in a binary format as opposed to the JSON text format. Nevertheless, a sufficiently large number of projects and distributions will introduce scalability challenges at some point, and therefore the bins role will still need delegations (as outlined in figure 2) in order to address the problem. Furthermore, the JSON format is an open and well-known standard for data interchange. Due to the large number of delegated metadata, compressed versions of snapshot metadata SHOULD also be made available to clients.

PyPI and Key Requirements

In this section, the kinds of keys required to sign for TUF roles on PyPI are examined. TUF is agnostic with respect to choices of digital signature algorithms. For the purpose of discussion, it is assumed that most digital signatures will be produced with the well-tested and tried RSA algorithm [20]. Nevertheless, we do NOT recommend any particular digital signature algorithm in this PEP because there are a few important constraints: first, cryptography changes over time; second, package managers such as pip may wish to perform signature verification in Python, without resorting to a compiled C library, in order to be able to run on as many systems as Python supports; and third, TUF recommends diversity of keys for certain applications.

How Should Metadata be Generated?

Project developers expect the distributions they upload to PyPI to be immediately available for download. Unfortunately, there will be problems when many readers and writers simultaneously access the same metadata and distributions. That is, there needs to be a way to ensure consistency of metadata and repository files when multiple developers simultaneously change the same metadata or distributions. There are also issues with consistency on PyPI without TUF, but the problem is more severe with signed metadata that MUST keep track of the files available on PyPI in real-time.

Suppose that PyPI generates a snapshot, which indicates the latest version of every metadata except timestamp, at version 1 and a client requests this snapshot from PyPI. While the client is busy downloading this snapshot, PyPI then timestamps a new snapshot at, say, version 2. Without ensuring consistency of metadata, the client would find itself with a copy of the snapshot metadata that disagrees with what is available on PyPI, which is indistinguishable from arbitrary metadata injected by an attacker. The problem would also occur for mirrors attempting to sync with PyPI.

Consistent Snapshots

There are problems with consistency on PyPI with or without TUF. TUF requires that its metadata be consistent with the repository files, but how would the metadata be kept consistent with projects that change all the time? As a result, this proposal MUST address the problem of producing a consistent snapshot that captures the state of all known projects at a given time. Each snapshot should safely coexist with any other snapshot, and be able to be deleted independently, without affecting any other snapshot.

The solution presented in this PEP is that every metadata or data file managed by PyPI and written to disk MUST include in its filename the cryptographic hash [34] of the file. How would this help clients that use the TUF protocol to securely and consistently install or update a project from PyPI?

The first step in the TUF protocol requires the client to download the latest timestamp metadata. However, the client would not know in advance the hash of the timestamp associated with the latest snapshot. Therefore, PyPI MUST redirect all HTTP GET requests for timestamp to the timestamp referenced in the latest snapshot. The timestamp role is the root of a tree of cryptographic hashes that points to every other metadata that is meant to exist together (i.e., clients request metadata in timestamp -> snapshot -> root -> targets order). Clients are able to retrieve any file from this snapshot by deterministically including, in the request for the file, the hash of the file in the filename. Assuming infinite disk space and no hash collisions [35], a client may safely read from one snapshot while PyPI produces another snapshot.

In this simple but effective manner, PyPI is able to capture a consistent snapshot of all projects and the associated metadata at a given time. The next subsection provides implementation details of this idea.

Note: This PEP does not prohibit using advanced file systems or tools to produce consistent snapshots. There are two important reasons for why this PEP proposes the simple solution. First, the solution does not mandate that PyPI use any particular file system or tool. Second, the generic file-system based approach allows mirrors to use extant file transfer tools such as rsync to efficiently transfer consistent snapshots from PyPI.

Producing Consistent Snapshots

Given a project, PyPI is responsible for updating the bins metadata (roles delegated by the bins role and signed with an online key). Every project MUST upload its release in a single transaction. The uploaded set of files is called the "project transaction". How PyPI MAY validate the files in a project transaction is discussed in a later section. For now, the focus is on how PyPI will respond to a project transaction.

Every metadata and target file MUST include in its filename the hex digest [36] of its SHA-256 [37] hash. For this PEP, it is RECOMMENDED that PyPI adopt a simple convention of the form: digest.filename, where filename is the original filename without a copy of the hash, and digest is the hex digest of the hash.
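
The recommended digest.filename convention can be sketched as:

```python
import hashlib

# Sketch of the recommended digest.filename convention: every
# consistent-snapshot file is written under a name that embeds the hex
# SHA-256 digest of its own contents, so distinct versions of the same
# file can safely coexist on disk.
def consistent_filename(filename, contents):
    digest = hashlib.sha256(contents).hexdigest()
    return "%s.%s" % (digest, filename)
```

Because the digest is derived from the contents, two snapshots that embed different versions of the same metadata file never collide on disk.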

When a project uploads a new transaction, the project transaction process MUST add all new targets and relevant delegated bins metadata. (It is shown later in this section why the bins role will delegate targets to a number of delegated bins roles.) Finally, the project transaction process MUST inform the snapshot process about new delegated bins metadata.

Project transaction processes SHOULD be automated and MUST also be applied atomically: either all metadata and targets -- or none of them -- are added. The project transaction and snapshot processes SHOULD work concurrently. Finally, project transaction processes SHOULD keep in memory the latest bins metadata so that they will be correctly updated in new consistent snapshots.

All project transactions MAY be placed in a single queue and processed serially. Alternatively, the queue MAY be processed concurrently in order of appearance, provided that the following rules are observed:

  1. No two project transaction processes may concurrently work on the same project.
  2. No two project transaction processes may concurrently work on projects that belong to the same delegated bins targets role.

These rules MUST be observed so that metadata is not read from or written to inconsistently.

Snapshot Process

The snapshot process is fairly simple and SHOULD be automated. The snapshot process MUST keep in memory the latest working set of root, targets, and delegated roles. Every minute or so, the snapshot process will sign for this latest working set. (Recall that project transaction processes continuously inform the snapshot process about the latest delegated metadata in a concurrency-safe manner. The snapshot process will actually sign for a copy of the latest working set while the latest working set in memory will be updated with information that is continuously communicated by the project transaction processes.) The snapshot process MUST generate and sign new timestamp metadata that will vouch for the metadata (root, targets, and delegated roles) generated in the previous step. Finally, the snapshot process MUST make available to clients the new timestamp and snapshot metadata representing the latest snapshot.

A few implementation notes are now in order. So far, we have seen only that new metadata and targets are added, but not that old metadata and targets are removed. Practical constraints are such that eventually PyPI will run out of disk space to produce a new consistent snapshot. In that case, PyPI MAY then use something like a "mark-and-sweep" algorithm to delete sufficiently old consistent snapshots: in order to preserve the latest consistent snapshot, PyPI would walk objects beginning from the root (timestamp) of the latest consistent snapshot, mark all visited objects, and delete all unmarked objects. The last few consistent snapshots may be preserved in a similar fashion. Deleting a consistent snapshot will cause clients to see nothing except HTTP 404 responses to any request for a file within that consistent snapshot. Clients SHOULD then retry (as before) their requests with the latest consistent snapshot.
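
The mark-and-sweep idea above can be sketched as: starting from the timestamp metadata of every consistent snapshot to preserve, mark each file reachable through the hash references, then report everything unmarked as deletable. The reference-graph representation (a dict of filename to referenced filenames) is an assumption for illustration.

```python
# Toy mark-and-sweep over consistent snapshots, as described above. 'refs'
# maps each file to the files it references (timestamp -> snapshot ->
# other metadata -> targets); both the graph shape and the names are
# illustrative assumptions.
def sweep(all_files, refs, roots_to_keep):
    marked = set()
    stack = list(roots_to_keep)
    while stack:
        f = stack.pop()
        if f in marked:
            continue
        marked.add(f)
        stack.extend(refs.get(f, ()))
    return all_files - marked  # unmarked files are safe to delete
```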

All clients, such as pip using the TUF protocol, MUST be modified to download every metadata and target file (except for timestamp metadata) by including, in the request for the file, the cryptographic hash of the file in the filename. Following the filename convention recommended earlier, a request for the file at filename.ext will be transformed to the equivalent request for the file at digest.filename.

Finally, PyPI SHOULD use a transaction log [38] to record project transaction processes and queues so that it will be easier to recover from errors after a server failure.

Key Compromise Analysis

This PEP has covered the minimum security model, the TUF roles that should be added to support continuous delivery of distributions, and how to generate and sign the metadata of each role. The remaining sections discuss how PyPI SHOULD audit repository metadata, and the methods PyPI can use to detect and recover from a PyPI compromise.

Table 1 summarizes a few of the attacks possible when a threshold number of private cryptographic keys (belonging to any of the PyPI roles) are compromised. The leftmost column lists the roles (or a combination of roles) that have been compromised, and the columns to its right show whether the compromised roles leave clients susceptible to malicious updates, a freeze attack, or metadata inconsistency attacks.

Role Compromise | Malicious Updates | Freeze Attack | Metadata Inconsistency Attacks
timestamp | NO (snapshot and targets or any of the bins need to cooperate) | YES (limited by earliest root, targets, or bin metadata expiry time) | NO (snapshot needs to cooperate)
snapshot | NO (timestamp and targets or any of the bins need to cooperate) | NO (timestamp needs to cooperate) | NO (timestamp needs to cooperate)
timestamp AND snapshot | NO (targets or any of the bins need to cooperate) | YES (limited by earliest root, targets, or bin metadata expiry time) | YES (limited by earliest root, targets, or bin metadata expiry time)
targets OR bin | NO (timestamp and snapshot need to cooperate) | NOT APPLICABLE (timestamp and snapshot are also needed) | NOT APPLICABLE (timestamp and snapshot are also needed)
timestamp AND snapshot AND bin | YES | YES (limited by earliest root, targets, or bin metadata expiry time) | YES (limited by earliest root, targets, or bin metadata expiry time)
root | YES | YES | YES

Table 1: Attacks possible by compromising certain combinations of role keys. In September 2013 [28], it was shown how the latest version (at the time) of pip was susceptible to these attacks and how TUF could protect users against them [14].

Note that compromising targets or any delegated role (except for project targets metadata) does not immediately allow an attacker to serve malicious updates. The attacker must also compromise the timestamp and snapshot roles (which are both online and therefore more likely to be compromised). This means that in order to launch any attack, one must not only be able to act as a man-in-the-middle but also compromise the timestamp key (or compromise the root keys and sign a new timestamp key). To launch any attack other than a freeze attack, one must also compromise the snapshot key.

Finally, a compromise of the PyPI infrastructure MAY introduce malicious updates to bins projects because the keys for these roles are online. The maximum security model discussed in the appendix addresses this issue. PEP 480 also covers the maximum security model and goes into more detail on generating developer keys and signing uploaded distributions.

In the Event of a Key Compromise

A key compromise means that a threshold of keys (belonging to the metadata roles on PyPI), as well as the PyPI infrastructure, have been compromised and used to sign new metadata on PyPI.

If a threshold number of timestamp, snapshot, or bins keys have been compromised, then PyPI MUST take the following steps:

  1. Revoke the timestamp, snapshot and targets role keys from the root role. This is done by replacing the compromised timestamp, snapshot and targets keys with newly issued keys.
  2. Revoke the bins keys from the targets role by replacing their keys with newly issued keys. Sign the new targets role metadata and discard the new keys (because, as explained earlier, this increases the security of targets metadata).
  3. All targets of the bins roles SHOULD be compared with the last known good consistent snapshot where none of the timestamp, snapshot, or bins keys were known to have been compromised. Added, updated or deleted targets in the compromised consistent snapshot that do not match the last known good consistent snapshot MAY be restored to their previous versions. After ensuring the integrity of all bins targets, the bins metadata MUST be regenerated.
  4. The bins metadata MUST have their version numbers incremented, expiry times suitably extended, and signatures renewed.
  5. A new timestamped consistent snapshot MUST be issued.

Following these steps would preemptively protect all of these roles even though only one of them may have been compromised.

If a threshold number of root keys have been compromised, then PyPI MUST take the steps taken when the targets role has been compromised. All of the root keys must also be replaced.

It is also RECOMMENDED that PyPI sufficiently document compromises with security bulletins. These security bulletins will be most informative when users of pip-with-TUF are unable to install or update a project because the keys for the timestamp, snapshot, or root roles are no longer valid. Users could then visit the PyPI web site to consult security bulletins explaining why they are no longer able to install or update, and take action accordingly.

When a threshold number of root keys have not been revoked due to a compromise, new root metadata may be safely updated, because a threshold number of existing root keys will be used to sign for the integrity of the new root metadata; TUF clients will be able to verify the new root metadata with a threshold number of previously known root keys. This is the common case. In the worst case, where a threshold number of root keys have been revoked due to a compromise, an end-user may choose to obtain new root metadata through out-of-band [39] mechanisms.

Auditing Snapshots

If a malicious party compromises PyPI, they can sign arbitrary files with any of the online keys. The roles with offline keys (i.e., root and targets) are still protected. To safely recover from a repository compromise, snapshots should be audited to ensure files are only restored to trusted versions.

When a repository compromise has been detected, the integrity of three types of information must be validated:

  1. If the online keys of the repository have been compromised, they can be revoked by having the targets role sign new metadata delegating to a new key.
  2. If the role metadata on the repository has been changed, this would impact the metadata that is signed by online keys. Any role information created since the last period should be discarded. As a result, developers of new projects will need to re-register their projects.
  3. If the packages themselves may have been tampered with, they can be validated using the stored hash information for packages that existed at the time of the last period.

In order to safely restore snapshots in the event of a compromise, PyPI SHOULD maintain a small number of its own mirrors to copy PyPI snapshots according to some schedule. The mirroring protocol can be used immediately for this purpose. The mirrors must be secured and isolated such that they are responsible only for mirroring PyPI. The mirrors can be checked against one another to detect accidental or malicious failures.

Another approach is to periodically generate the cryptographic hash of the snapshot metadata and tweet it. A user who holds the actual metadata can then come forward, and the repository maintainers can verify its cryptographic hash. Alternatively, PyPI may periodically archive its own versions of snapshot rather than rely on externally provided metadata. In this case, PyPI SHOULD take the cryptographic hash of every package on the repository and store this data on an offline device. If any package hash has changed, this indicates an attack.
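The offline hash-audit idea above can be sketched in a few lines. `record_hashes` and `find_tampered` are hypothetical helper names; a real deployment would store the recorded hashes on an offline device rather than in memory.

```python
import hashlib

def sha256_of(data: bytes) -> str:
    """Return the SHA-256 hex digest of a package's raw bytes."""
    return hashlib.sha256(data).hexdigest()

def record_hashes(packages: dict) -> dict:
    """Snapshot the hash of every package; the result is what would be
    stored offline (hypothetical helper, for illustration only)."""
    return {name: sha256_of(data) for name, data in packages.items()}

def find_tampered(packages: dict, offline_record: dict) -> list:
    """Compare current package hashes against the offline record.

    Any mismatch or missing package indicates a possible attack."""
    suspect = []
    for name, expected in offline_record.items():
        current = packages.get(name)
        if current is None or sha256_of(current) != expected:
            suspect.append(name)
    return sorted(suspect)
```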

As for attacks that serve different versions of metadata, or freeze a version of a package at a specific version, they can be handled by TUF with techniques like implicit key revocation and metadata mismatch detection [2].

Appendix A: Repository Attacks Prevented by TUF

  • Arbitrary software installation: An attacker installs anything they want on the client system. That is, an attacker can provide arbitrary files in response to download requests and the files will not be detected as illegitimate.
  • Rollback attacks: An attacker presents a software update system with older files than those the client has already seen, causing the client to use files older than those the client knows about.
  • Indefinite freeze attacks: An attacker continues to present a software update system with the same files the client has already seen. The result is that the client does not know that new files are available.
  • Endless data attacks: An attacker responds to a file download request with an endless stream of data, causing harm to clients (e.g., a disk partition filling up or memory exhaustion).
  • Slow retrieval attacks: An attacker responds to clients with a very slow stream of data that essentially results in the client never continuing the update process.
  • Extraneous dependencies attacks: An attacker indicates to clients that in order to install the software they wanted, they also need to install unrelated software. This unrelated software can be from a trusted source but may have known vulnerabilities that are exploitable by the attacker.
  • Mix-and-match attacks: An attacker presents clients with a view of a repository that includes files that never existed together on the repository at the same time. This can result in, for example, outdated versions of dependencies being installed.
  • Wrong software installation: An attacker provides a client with a trusted file that is not the one the client wanted.
  • Malicious mirrors preventing updates: An attacker in control of one repository mirror is able to prevent users from obtaining updates from other, good mirrors.
  • Vulnerability to key compromises: An attacker who is able to compromise a single key or less than a given threshold of keys can compromise clients. This includes relying on a single online key (such as only being protected by SSL) or a single offline key (such as most software update systems use to sign files).
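As a rough illustration of client-side defences against the endless data and slow retrieval attacks listed above, a download routine can cap the total payload size and enforce a minimum average transfer rate. The limits and the `safe_read` helper name below are illustrative assumptions, not taken from pip or TUF.

```python
import time

MAX_LENGTH = 10 * 1024 * 1024  # byte cap, guards against endless data attacks
CHUNK_SIZE = 8192
MIN_RATE = 1024                # bytes/sec floor, guards against slow retrieval

def safe_read(stream, max_length=MAX_LENGTH, min_rate=MIN_RATE):
    """Read from a file-like `stream`, aborting if the payload exceeds
    `max_length` bytes or the average rate falls below `min_rate` B/s."""
    start = time.monotonic()
    received = bytearray()
    while True:
        chunk = stream.read(CHUNK_SIZE)
        if not chunk:
            return bytes(received)
        received += chunk
        if len(received) > max_length:
            raise IOError("endless data attack suspected: length cap exceeded")
        elapsed = time.monotonic() - start
        if elapsed > 1.0 and len(received) / elapsed < min_rate:
            raise IOError("slow retrieval attack suspected: transfer too slow")
```

In practice the expected length would come from signed TUF metadata rather than a fixed cap.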

Appendix B: Extension to the Minimum Security Model

The maximum security model and end-to-end signing have been intentionally excluded from this PEP. Although both improve PyPI's ability to survive a repository compromise and allow developers to sign their distributions, they have been postponed for review as a potential future extension to PEP 458. PEP 480 [26], which discusses the extension in detail, is available for review to those developers interested in the end-to-end signing option. The maximum security model and end-to-end signing are briefly covered in subsections that follow.

There are several reasons for not initially supporting the features discussed in this section:

  1. A build farm (in which distribution wheels are generated on PyPI infrastructure for each supported platform) may complicate matters. PyPI wants to support a build farm in the future. Unfortunately, if wheels are auto-generated externally, developer signatures for these wheels are unlikely. However, there might still be a benefit to generating wheels from source distributions that are signed by developers (provided that reproducible wheels are possible). Another possibility is to optionally delegate trust of these wheels to an online role.

  2. An easy-to-use key management solution is needed for developers. miniLock [40] is one likely candidate for management and generation of keys. Although developer signatures can remain optional, this approach may be inadequate due to the great number of potentially unsigned dependencies each distribution may have. If any one of these dependencies is unsigned, it negates any benefit the project gains from signing its own distribution (i.e., attackers would only need to compromise one of the unsigned dependencies to attack end-users). Requiring developers to manually sign distributions and manage keys is expected to render key signing an unused feature.

  3. A two-phase approach, where the minimum security model is implemented first followed by the maximum security model, can simplify matters and give PyPI administrators time to review the feasibility of end-to-end signing.

Maximum Security Model

The maximum security model relies on developers signing their projects and uploading signed metadata to PyPI. If the PyPI infrastructure were to be compromised, attackers would be unable to serve malicious versions of claimed projects without access to the project's developer key. Figure 3 depicts the changes made to figure 2, namely that developer roles are now supported and that three new delegated roles exist: claimed, recently-claimed, and unclaimed. The bins role has been renamed unclaimed and can contain any projects that have not been added to claimed. The strength of this model (over the minimum security model) is in the offline keys provided by developers. Although the minimum security model supports continuous delivery, all of the projects are signed by an online key. An attacker can corrupt packages in the minimum security model, but not in the maximum model without also compromising a developer's key.

pep-0458-3.png

Figure 3: An overview of the metadata layout in the maximum security model. The maximum security model supports continuous delivery and survivable key compromise.

End-to-End Signing

End-to-End signing allows both PyPI and developers to sign for the metadata downloaded by clients. PyPI is trusted to make uploaded projects available to clients (they sign the metadata for this part of the process), and developers can sign the distributions that they upload.

PEP 480 [26] discusses the tools available to developers who sign the distributions that they upload to PyPI. To summarize PEP 480, developers generate cryptographic keys and sign metadata in some automated fashion, where the metadata includes the information required to verify the authenticity of the distribution. The metadata is then uploaded to PyPI by the client, where it will be available for download by package managers such as pip (i.e., package managers that support TUF metadata). The entire process is transparent to clients (using a package manager that supports TUF) who download distributions from PyPI.

Appendix C: PEP 470 and Projects Hosted Externally

How should TUF handle distributions that are not hosted on PyPI? According to PEP 470 [41], projects may opt to host their distributions externally and are only required to provide PyPI with a link to their external index, which package managers like pip can use to find the project's distributions. PEP 470 does not say whether externally hosted projects are considered unverified by default; projects that use this option are not required to submit any information about their distributions (e.g., file size and cryptographic hash) when the project is registered, nor to include a cryptographic hash of the file in download links.

Potential approaches that PyPI administrators MAY consider for handling externally hosted projects:

  1. Download external distributions but do not verify them. The targets metadata will not include information for externally hosted projects.
  2. PyPI will periodically download information from the external index. PyPI will gather the external distribution's file size and hashes and generate appropriate TUF metadata.
  3. External projects MUST submit to PyPI the file size and cryptographic hash for a distribution.
  4. External projects MUST upload to PyPI a developer public key for the index. The distribution MUST create TUF metadata that is stored at the index, and signed with the developer's corresponding private key. The client will fetch the external TUF metadata as part of the package update process.
  5. External projects MUST upload to PyPI signed TUF metadata (as allowed by the maximum security model) about the distributions that they host externally, and a developer public key. Package managers verify distributions by consulting the signed metadata uploaded to PyPI.

Only one of the options listed above should be implemented on PyPI. Option (4) or (5) is RECOMMENDED because external distributions are signed by developers. External distributions that are forged (due to a compromised PyPI account or external host) may be detected if external developers are required to sign metadata, although this requirement is likely only practical if an easy-to-use key management solution and developer scripts are provided by PyPI.
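The verification implied by options (3) through (5) amounts to checking a downloaded file against a declared file size and cryptographic hash. A minimal sketch, assuming SHA-256 digests and a hypothetical `verify_distribution` helper:

```python
import hashlib

def verify_distribution(data: bytes, declared_length: int,
                        declared_sha256: str) -> bool:
    """Accept a downloaded distribution only if both its length and its
    SHA-256 digest match the values declared in the signed metadata.

    Checking the length first cheaply rejects truncated or padded files
    before the digest is computed."""
    if len(data) != declared_length:
        return False
    return hashlib.sha256(data).hexdigest() == declared_sha256
```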

References

[1]https://pypi.python.org
[2](1, 2, 3) https://isis.poly.edu/~jcappos/papers/samuel_tuf_ccs_2010.pdf
[3]http://www.pip-installer.org
[4]https://wiki.python.org/moin/WikiAttack2013
[5]https://github.com/theupdateframework/pip/wiki/Attacks-on-software-repositories
[6]https://mail.python.org/pipermail/distutils-sig/2013-April/020596.html
[7]https://mail.python.org/pipermail/distutils-sig/2013-May/020701.html
[8]https://mail.python.org/pipermail/distutils-sig/2013-July/022008.html
[9]PEP 381, Mirroring infrastructure for PyPI, ZiadĂŠ, LĂświs http://www.python.org/dev/peps/pep-0381/
[10]https://mail.python.org/pipermail/distutils-sig/2013-September/022773.html
[11]https://mail.python.org/pipermail/distutils-sig/2013-May/020848.html
[12]PEP 449, Removal of the PyPI Mirror Auto Discovery and Naming Scheme, Stufft http://www.python.org/dev/peps/pep-0449/
[13]https://isis.poly.edu/~jcappos/papers/cappos_mirror_ccs_08.pdf
[14](1, 2) https://mail.python.org/pipermail/distutils-sig/2013-September/022755.html
[15]https://pypi.python.org/security
[16](1, 2) https://github.com/theupdateframework/tuf/blob/develop/docs/tuf-spec.txt
[17](1, 2, 3, 4, 5) PEP 426, Metadata for Python Software Packages 2.0, Coghlan, Holth, Stufft http://www.python.org/dev/peps/pep-0426/
[18]https://en.wikipedia.org/wiki/Continuous_delivery
[19]https://mail.python.org/pipermail/distutils-sig/2013-August/022154.html
[20]https://en.wikipedia.org/wiki/RSA_%28algorithm%29
[21]https://en.wikipedia.org/wiki/Key-recovery_attack
[22]http://csrc.nist.gov/publications/nistpubs/800-57/SP800-57-Part1.pdf
[23]https://www.openssl.org/
[24]https://pypi.python.org/pypi/pycrypto
[25]http://ed25519.cr.yp.to/
[26](1, 2, 3) https://www.python.org/dev/peps/pep-0480/
[27]https://github.com/theupdateframework/tuf/tree/develop/tuf/client#updaterpy
[28](1, 2) https://mail.python.org/pipermail/distutils-sig/2013-September/022755.html
[29]http://www.ietf.org/rfc/rfc2119.txt
[30]https://github.com/theupdateframework/tuf/blob/develop/METADATA.md
[31]https://github.com/theupdateframework/tuf/tree/develop/tuf#repository-management
[32]https://github.com/theupdateframework/tuf/tree/develop/tuf/client#overview-of-the-update-process.
[33]https://github.com/theupdateframework/tuf/issues/39
[34]https://en.wikipedia.org/wiki/Cryptographic_hash_function
[35]https://en.wikipedia.org/wiki/Collision_(computer_science)
[36]http://docs.python.org/2/library/hashlib.html#hashlib.hash.hexdigest
[37]https://en.wikipedia.org/wiki/SHA-2
[38]https://en.wikipedia.org/wiki/Transaction_log
[39]https://en.wikipedia.org/wiki/Out-of-band#Authentication
[40]https://minilock.io/
[41]http://www.python.org/dev/peps/pep-0470/

Acknowledgements

This material is based upon work supported by the National Science Foundation under Grants No. CNS-1345049 and CNS-0959138. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

We thank Nick Coghlan, Daniel Holth and the distutils-sig community in general for helping us to think about how to usably and efficiently integrate TUF with PyPI.

Roger Dingledine, Sebastian Hahn, Nick Mathewson, Martin Peck and Justin Samuel helped us to design TUF from its predecessor Thandy of the Tor project.

We appreciate the efforts of Konstantin Andrianov, Geremy Condra, Zane Fisher, Justin Samuel, Tian Tian, Santiago Torres, John Ward, and Yuyu Zheng to develop TUF.

Vladimir Diaz, Monzur Muhammad and Sai Teja Peddinti helped us to review this PEP.

Zane Fisher helped us to review and transcribe this PEP.

pep-0459 Standard Metadata Extensions for Python Software Packages

PEP:459
Title:Standard Metadata Extensions for Python Software Packages
Version:$Revision$
Last-Modified:$Date$
Author:Nick Coghlan <ncoghlan at gmail.com>
BDFL-Delegate:Nick Coghlan <ncoghlan@gmail.com>
Discussions-To:Distutils SIG <distutils-sig at python.org>
Status:Draft
Type:Standards Track
Content-Type:text/x-rst
Requires:426
Created:11-Nov-2013
Post-History:21-Dec-2013

Abstract

This PEP describes several standard extensions to the Python packaging metadata.

Like all metadata extensions, each standard extension format is independently versioned. Changing any of the formats requires an update to this PEP, but does not require an update to the core packaging metadata.

Note

These extensions may eventually be separated out into their own PEPs, but we're already suffering from PEP overload in the packaging metadata space.

This PEP was initially created by slicing out large sections of earlier drafts of PEP 426 and making them extensions, so some of the specifics may still be rough in the new context.

Standard Extension Namespace

The python project on the Python Package Index refers to the CPython reference interpreter. This namespace is used as the namespace for the standard metadata extensions.

The currently defined standard extensions are:

  • python.details
  • python.project
  • python.integrator
  • python.exports
  • python.commands
  • python.constraints

All standard extensions are currently at version 1.0, and thus the extension_metadata field may be omitted without losing access to any functionality.

The python.details extension

The python.details extension allows for more information to be provided regarding the software distribution.

The python.details extension contains four custom subfields:

  • license: the copyright license for the distribution
  • keywords: package index keywords for the distribution
  • classifiers: package index Trove classifiers for the distribution
  • document_names: the names of additional metadata files

All of these fields are optional. Automated tools MUST operate correctly if a distribution does not provide them, including failing cleanly when an operation depending on one of these fields is requested.

License

A short string summarising the license used for this distribution.

Note that distributions that provide this field should still specify any applicable license Trove classifiers in the Classifiers field. Even when an appropriate Trove classifier is available, the license summary can be a good way to specify a particular version of that license, or to indicate any variations or exceptions to the license.

This field SHOULD contain fewer than 512 characters and MUST contain fewer than 2048.

This field SHOULD NOT contain any line breaks.

The full license text SHOULD be included as a separate file in the source archive for the distribution. See Document names for details.

Example:

"license": "GPL version 3, excluding DRM provisions"

Keywords

A list of additional keywords to be used to assist searching for the distribution in a larger catalog.

Example:

"keywords": ["comfy", "chair", "cushions", "too silly", "monty python"]

Classifiers

A list of strings, with each giving a single classification value for the distribution. Classifiers are described in PEP 301 [2].

Example:

"classifiers": [
  "Development Status :: 4 - Beta",
  "Environment :: Console (Text Based)",
  "License :: OSI Approved :: GNU General Public License v3 (GPLv3)"
]

Document names

Filenames for supporting documents included in the distribution's dist-info metadata directory.

The following supporting documents can be named:

  • description: a file containing a long description of the distribution
  • license: a file with the full text of the distribution's license
  • changelog: a file describing changes made to the distribution

Supporting documents MUST be included directly in the dist-info directory. Directory separators are NOT permitted in document names.

The markup format (if any) for the file is indicated by the file extension. This allows index servers and other automated tools to render included text documents correctly and provide feedback on rendering errors, rather than having to guess the intended format.

If the filename has no extension, or the extension is not recognised, the default rendering format MUST be plain text.

The following markup renderers SHOULD be used for the specified file extensions:

  • Plain text: .txt, no extension, unknown extension
  • reStructured Text: .rst
  • Markdown: .md
  • AsciiDoc: .adoc, .asc, .asciidoc
  • HTML: .html, .htm

Automated tools MAY render one or more of the specified formats as plain text and MAY render other markup formats beyond those listed.
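The extension-to-renderer rules above can be captured in a small lookup table. The function below is an illustrative sketch, not a normative API; note how unknown and missing extensions both fall back to plain text, as required.

```python
import os.path

# Mapping from file extension to the renderer the specification names.
RENDERERS = {
    ".txt": "plain text",
    ".rst": "reStructured Text",
    ".md": "Markdown",
    ".adoc": "AsciiDoc", ".asc": "AsciiDoc", ".asciidoc": "AsciiDoc",
    ".html": "HTML", ".htm": "HTML",
}

def renderer_for(document_name: str) -> str:
    """Pick a markup renderer from the file extension; unknown or
    missing extensions default to plain text."""
    ext = os.path.splitext(document_name)[1].lower()
    return RENDERERS.get(ext, "plain text")
```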

Automated tools SHOULD NOT make any assumptions regarding the maximum length of supporting document content, except as necessary to protect the integrity of a service.

Example:

"document_names": {
    "description": "README.rst",
    "license": "LICENSE.rst",
    "changelog": "NEWS"
}

The python.project extension

The python.project extension allows for more information to be provided regarding the creation and maintenance of the distribution.

The python.project extension contains three custom subfields:

  • contacts: key contact points for the distribution
  • contributors: other contributors to the distribution
  • project_urls: relevant URLs for the distribution

Contact information

Details on individuals and organisations are recorded as mappings with the following subfields:

  • name: the name of an individual or group
  • email: an email address (this may be a mailing list)
  • url: a URL (such as a profile page on a source code hosting service)
  • role: one of "author", "maintainer" or "contributor"

The name subfield is required, the other subfields are optional.

If no specific role is stated, the default is contributor.

Email addresses must be in the form local-part@domain where the local-part may be up to 64 characters long and the entire email address contains no more than 254 characters. The formal specification of the format is in RFC 5322 (sections 3.2.3 and 3.4.1) and RFC 5321, with a more readable form given in the informational RFC 3696 and the associated errata.
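A sketch of just the length limits stated above; full syntax validation per RFC 5322 is deliberately out of scope, and `email_lengths_ok` is a hypothetical helper name.

```python
def email_lengths_ok(address: str) -> bool:
    """Check only the stated limits: a local part of at most 64
    characters and a total address length of at most 254 characters."""
    local, sep, domain = address.partition("@")
    if not sep or not local or not domain:
        return False  # not even of the form local-part@domain
    return len(local) <= 64 and len(address) <= 254
```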

The defined contributor roles are as follows:

  • author: the original creator of a distribution
  • maintainer: the current lead contributor for a distribution, when they are not the original creator
  • contributor: any other individuals or organizations involved in the creation of the distribution

Contact and contributor metadata is optional. Automated tools MUST operate correctly if a distribution does not provide it, including failing cleanly when an operation depending on one of these fields is requested.

Contacts

A list of contributor entries giving the recommended contact points for getting more information about the project.

The example below would be suitable for a project that was in the process of handing over from the original author to a new lead maintainer, while operating as part of a larger development group.

Example:

"contacts": [
  {
    "name": "Python Packaging Authority/Distutils-SIG",
    "email": "distutils-sig@python.org",
    "url": "https://bitbucket.org/pypa/"
  },
  {
    "name": "Samantha C.",
    "role": "maintainer",
    "email": "dontblameme@example.org"
  },
  {
    "name": "Charlotte C.",
    "role": "author",
    "email": "iambecomingasketchcomedian@example.com"
  }
]

Contributors

A list of contributor entries for other contributors not already listed as current project points of contact. The subfields within the list elements are the same as those for the main contact field.

Example:

"contributors": [
  {"name": "John C."},
  {"name": "Erik I."},
  {"name": "Terry G."},
  {"name": "Mike P."},
  {"name": "Graeme C."},
  {"name": "Terry J."}
]

Project URLs

A mapping of arbitrary text labels to additional URLs relevant to the project.

While projects are free to choose their own labels and specific URLs, it is RECOMMENDED that home page, source control, issue tracker and documentation links be provided using the labels in the example below.

URL labels MUST be treated as case insensitive by automated tools, but they are not required to be valid Python identifiers. Any legal JSON string is permitted as a URL label.

Example:

"project_urls": {
  "Documentation": "https://distlib.readthedocs.org",
  "Home": "https://bitbucket.org/pypa/distlib",
  "Repository": "https://bitbucket.org/pypa/distlib/src",
  "Tracker": "https://bitbucket.org/pypa/distlib/issues"
}

The python.integrator extension

Structurally, this extension is largely identical to the python.project extension (the extension name is the only difference).

However, where the project metadata refers to the upstream creators of the software, the integrator metadata refers to the downstream redistributor of a modified version.

If the software is being redistributed unmodified, then typically this extension will not be used. However, if the software has been patched (for example, backporting compatible fixes from a later version, or addressing a platform compatibility issue), then this extension SHOULD be used, and a local version label added to the distribution's version identifier.

If there are multiple redistributors in the chain, each one just overwrites this extension with their particular metadata.

The python.exports extension

Most Python distributions expose packages and modules for import through the Python module namespace. Distributions may also expose other interfaces when installed.

The python.exports extension contains three custom subfields:

  • modules: modules exported by the distribution
  • namespaces: namespace packages that the distribution contributes to
  • exports: other Python interfaces exported by the distribution

Export specifiers

An export specifier is a string consisting of a fully qualified name, as well as an optional extra name enclosed in square brackets. This gives the following four possible forms for an export specifier:

module
module:name
module[requires_extra]
module:name[requires_extra]

Note

The jsonschema file currently restricts qualified names using the Python 2 ASCII identifier rules. This may need to be reconsidered given the more relaxed identifier rules in Python 3.

The meaning of the subfields is as follows:

  • module: the module providing the export
  • name: if applicable, the qualified name of the export within the module
  • requires_extra: indicates the export will only work correctly if the additional dependencies named in the given extra are available in the installed environment
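The four specifier forms can be parsed with a short regular expression. The pattern below is an illustrative sketch rather than a normative grammar; it simply accepts word characters and dots in names.

```python
import re

_EXPORT_RE = re.compile(
    r"^(?P<module>[\w.]+)"           # dotted module path
    r"(?::(?P<name>[\w.]+))?"        # optional qualified name after ':'
    r"(?:\[(?P<extra>[\w.-]+)\])?$"  # optional [requires_extra]
)

def parse_export(specifier: str) -> dict:
    """Split an export specifier into module, name, and extra parts."""
    match = _EXPORT_RE.match(specifier)
    if match is None:
        raise ValueError(f"invalid export specifier: {specifier!r}")
    return match.groupdict()
```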

Note

I tried this as a mapping with subfields, and it made the examples below unreadable. While this PEP is mostly for tool use, readability still matters to some degree for debugging purposes, and because I expect snippets of the format to be reused elsewhere.

Modules

A list of qualified names of modules and packages that the distribution provides for import.

Note

The jsonschema file currently restricts qualified names using the Python 2 ASCII identifier rules. This may need to be reconsidered given the more relaxed identifier rules in Python 3.

For names that contain dots, the portion of the name before the final dot MUST appear either in the installed module list or in the namespace package list.

To help avoid name conflicts, it is RECOMMENDED that distributions provide a single top level module or package that matches the distribution name (or a lower case equivalent). This requires that the distribution name also meet the requirements of a Python identifier, which are stricter than those for distribution names. This practice will also make it easier to find authoritative sources for modules.

Index servers SHOULD allow multiple distributions to publish the same modules, but MAY notify distribution authors of potential conflicts.

Installation tools SHOULD report an error when asked to install a distribution that provides a module that is also provided by a different, previously installed, distribution.

Note that attempting to import some declared modules may result in an exception if the appropriate extras are not installed.

Example:

"modules": ["chair", "chair.cushions", "python_sketches.nobody_expects"]
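The parent-name rule above (the portion before the final dot must appear in the module list or the namespace list) can be checked mechanically. `check_module_parents` is an illustrative helper, shown against the example metadata:

```python
def check_module_parents(modules: list, namespaces: list) -> list:
    """Return dotted module names whose parent (the part before the
    final dot) appears in neither the module list nor the namespace
    list, violating the rule above."""
    known = set(modules) | set(namespaces)
    bad = []
    for name in modules:
        parent, dot, _ = name.rpartition(".")
        if dot and parent not in known:
            bad.append(name)
    return bad
```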

Note

Making this a list of export specifiers instead would allow a distribution to declare when a particular module requires a particular extra in order to run correctly. On the other hand, there's an argument to be made that that is the point where it starts to become worthwhile to split out a separate distribution rather than using extras.

Namespaces

A list of qualified names of namespace packages that the distribution contributes modules to.

Note

The jsonschema file currently restricts qualified names using the Python 2 ASCII identifier rules. This may need to be reconsidered given the more relaxed identifier rules in Python 3.

On versions of Python prior to Python 3.3 (which provides native namespace package support), installation tools SHOULD emit a suitable __init__.py file to properly initialise the namespace rather than using a distribution provided file.

Installation tools SHOULD emit a warning and MAY emit an error if a distribution declares a namespace package that conflicts with the name of an already installed module or vice-versa.

Example:

"namespaces": ["python_sketches"]
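An installation tool targeting interpreters without native namespace support could emit the conventional pkgutil-style __init__.py along these lines. `emit_namespace_init` is a hypothetical helper, not part of any specified tool API.

```python
import os

# The pkgutil-style line conventionally used to initialise a namespace
# package on Python versions without native namespace support.
NAMESPACE_INIT = (
    "__path__ = __import__('pkgutil').extend_path(__path__, __name__)\n"
)

def emit_namespace_init(site_packages: str, namespace: str) -> str:
    """Write a tool-generated __init__.py for `namespace` below
    `site_packages` and return its path.  The dotted namespace name
    maps onto a nested directory path."""
    pkg_dir = os.path.join(site_packages, *namespace.split("."))
    os.makedirs(pkg_dir, exist_ok=True)
    init_path = os.path.join(pkg_dir, "__init__.py")
    with open(init_path, "w") as f:
        f.write(NAMESPACE_INIT)
    return init_path
```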

Exports

The exports field is a mapping containing prefixed names as keys. Each key identifies an export group containing one or more exports published by the distribution.

Export group names are defined by distributions that will then make use of the published export information in some way. The primary use case is for distributions that support a plugin model: defining an export group allows other distributions to indicate which plugins they provide, how they can be imported and accessed, and which additional dependencies (if any) are needed for the plugin to work correctly.

To reduce the chance of name conflicts, export group names SHOULD use a prefix that corresponds to a module name in the distribution that defines the meaning of the export group. This practice will also make it easier to find authoritative documentation for export groups.

Each individual export group is then a mapping of arbitrary non-empty string keys to export specifiers. The meaning of export names within an export group is up to the distribution that defines the export group. Creating an appropriate definition for the export name format can allow the importing distribution to determine whether or not an export is relevant without needing to import every exporting module.

Example:

"exports": {
  "nose.plugins.0.10": {
    "chairtest": "chair:NosePlugin"
  }
}
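To illustrate how a consuming distribution might resolve an export specifier such as "chair:NosePlugin", here is a minimal sketch; load_export is a hypothetical helper assuming the module[:attribute] specifier form shown in the example above.

```python
import importlib

def load_export(specifier):
    # Hypothetical helper: resolve "module" or "module:attr.path".
    # "chair:NosePlugin" -> the NosePlugin attribute of the chair module.
    module_name, _, attr_path = specifier.partition(":")
    obj = importlib.import_module(module_name)
    for attr in attr_path.split(".") if attr_path else []:
        obj = getattr(obj, attr)
    return obj
```

A definition of the export name format (here, the keys like "chairtest") would let the plugin host filter entries before importing anything.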

The python.commands extension

The python.commands extension contains three custom subfields:

  • wrap_console: console wrapper scripts to be generated by the installer
  • wrap_gui: GUI wrapper scripts to be generated by the installer
  • prebuilt: scripts created by the distribution's build process and installed directly to the configured scripts directory

wrap_console and wrap_gui are both mappings of script names to export specifiers. The script names must follow the same naming rules as distribution names.

The export specifiers for wrapper scripts must refer to either a package with a __main__ submodule (if no name subfield is given in the export specifier) or else to a callable inside the named module.

Installation tools should generate appropriate wrappers as part of the installation process.

Note

Still needs more detail on what "appropriate wrappers" means. For now, refer to what setuptools and zc.buildout generate as wrapper scripts.
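As a rough sketch of the kind of wrapper a tool might generate for an export specifier like "chair:run_cli" (modelled loosely on setuptools-style scripts; make_console_wrapper is hypothetical, details vary by tool and platform, and the package-with-__main__ case is not handled):

```python
import sys

WRAPPER_TEMPLATE = """\
#!{python}
import sys
from {module} import {attr}

if __name__ == "__main__":
    sys.exit({attr}())
"""

def make_console_wrapper(specifier, python=sys.executable):
    # Hypothetical helper: "chair:run_cli" -> source text for a script
    # that imports chair and calls chair.run_cli().
    module, _, attr = specifier.partition(":")
    return WRAPPER_TEMPLATE.format(python=python, module=module, attr=attr)
```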

prebuilt is a list of script paths, relative to the scripts directory in a wheel file or following installation. They are provided for informational purposes only - installing them is handled through the normal processes for files created when building a distribution.

Build tools SHOULD mark this extension as requiring handling by installers.

Index servers SHOULD allow multiple distributions to publish the same commands, but MAY notify distribution authors of potential conflicts.

Installation tools SHOULD report an error when asked to install a distribution that provides a command that is also provided by a different, previously installed, distribution.

Example:

"python.commands": {
  "installer_must_handle": true,
  "wrap_console": [{"chair": "chair:run_cli"}],
  "wrap_gui": [{"chair-gui": "chair:run_gui"}],
  "prebuilt": ["reduniforms"]
}

The python.constraints extension

The python.constraints extension contains two custom subfields:

  • environments: supported installation environments
  • extension_metadata: required exact matches in extension metadata fields published by other installed components

Build tools SHOULD mark this extension as requiring handling by installers.

Index servers SHOULD allow distributions to be uploaded with constraints that cannot be satisfied using that index, but MAY notify distribution authors of any such potential compatibility issues.

Installation tools SHOULD report an error if constraints are specified by the distribution and the target installation environment fails to satisfy them, MUST at least emit a warning, and MAY allow the user to force the installation to proceed regardless.

Example:

"python.constraints": {
  "installer_must_handle": true,
  "environments": ["python_version >= 2.6"],
  "extension_metadata": {
    "fortranlib": {
      "fortranlib.compatibility": {
        "fortran_abi": "openblas-g77"
      }
    }
  }
}

Supported Environments

The environments subfield is a list of strings specifying the environments that the distribution explicitly supports. An environment is considered supported if it matches at least one of the environment markers given.

If this field is not given in the metadata, it is assumed that the distribution supports any platform supported by Python.

Individual entries are environment markers, as described in PEP 426.

The two main uses of this field are to declare which versions of Python and which underlying operating systems are supported.

Examples indicating supported Python versions:

# Supports Python 2.6+
"environments": ["python_version >= '2.6'"]

# Supports Python 2.6+ (for 2.x) or 3.3+ (for 3.x)
"environments": ["python_version >= '3.3'",
                 "'3.0' > python_version >= '2.6'"]

Examples indicating supported operating systems:

# Windows only
"environments": ["sys_platform == 'win32'"]

# Anything except Windows
"environments": ["sys_platform != 'win32'"]

# Linux or BSD only
"environments": ["'linux' in sys_platform",
                 "'bsd' in sys_platform"]

Example where the supported Python version varies by platform:

# The standard library's os module has long supported atomic renaming
# on POSIX systems, but only gained atomic renaming on Windows in Python
# 3.3. A distribution that needs atomic renaming support for reliable
# operation might declare the following supported environments.
"environment": ["python_version >= '2.6' and sys_platform != 'win32'",
                "python_version >= '3.3' and sys_platform == 'win32'"]

Extension metadata constraints

The extension_metadata subfield is a mapping from distribution names to extension metadata snippets that are expected to exactly match the metadata of the named distribution in the target installation environment.

Each submapping then consists of a mapping from metadata extension names to the exact expected values of a subset of fields.

For example, a distribution called fortranlib may publish a different FORTRAN ABI depending on how it is built, and any related projects that are installed into the same runtime environment should use matching build options. This can be handled by having the base distribution publish a custom extension that indicates the build option that was used to create the binary extensions:

"extensions": {
  "fortranlib.compatibility": {
    "fortran_abi": "openblas-g77"
  }
}

Other distributions that contain binary extensions that need to be compatible with the base distribution would then define a suitable constraint in their own metadata:

"python.constraints": {
  "installer_must_handle": true,
  "extension_metadata": {
    "fortranlib": {
      "fortranlib.compatibility": {
        "fortran_abi": "openblas-g77"
      }
    }
  }
}

This constraint specifies that:

  • fortranlib must be installed (this should also be expressed as a normal dependency so that installers ensure it is satisfied)
  • The installed version of fortranlib must include the custom fortranlib.compatibility extension in its published metadata
  • The fortran_abi subfield of that extension must have the exact value openblas-g77.

If all of these conditions are met (the distribution is installed, the specified extension is included in the metadata, the specified subfields have the exact specified value), then the constraint is considered to be satisfied.
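The exact-match rule can be sketched as a nested subset check; constraint_satisfied is a hypothetical helper taking the extension_metadata mapping and a mapping from installed distribution names to their published metadata.

```python
def constraint_satisfied(extension_metadata, installed):
    # extension_metadata: {dist name: {extension name: {field: exact value}}}
    # installed: {dist name: that distribution's published metadata}
    for dist, extensions in extension_metadata.items():
        meta = installed.get(dist)
        if meta is None:
            return False  # the named distribution is not installed
        for ext_name, fields in extensions.items():
            ext = meta.get("extensions", {}).get(ext_name)
            if ext is None:
                return False  # required extension not published
            if any(ext.get(f) != v for f, v in fields.items()):
                return False  # a subfield differs from the exact value
    return True
```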

Note

The primary intended use case here is allowing C extensions with additional ABI compatibility requirements to declare those in a way that any installation tool can enforce without needing to understand the details. In particular, many NumPy based scientific libraries need to be built using a consistent set of FORTRAN libraries, hence the "fortranlib" example.

This is the reason there's no support for pattern matching or boolean logic: even the "simple" version of this extension is relatively complex, and there's currently no compelling rationale for making it more complicated than it already is.

pep-0460 Add binary interpolation and formatting

PEP:460
Title:Add binary interpolation and formatting
Version:$Revision$
Last-Modified:$Date$
Author:Antoine Pitrou <solipsis at pitrou.net>
Status:Withdrawn
Type:Standards Track
Content-Type:text/x-rst
Created:6-Jan-2014
Python-Version:3.5

Abstract

This PEP proposes to add minimal formatting operations to bytes and bytearray objects. The proposed additions are:

  • bytes % ... and bytearray % ... for percent-formatting, similar in syntax to percent-formatting on str objects (accepting a single object, a tuple or a dict).
  • bytes.format(...) and bytearray.format(...) for a formatting similar in syntax to str.format() (accepting positional as well as keyword arguments).
  • bytes.format_map(...) and bytearray.format_map(...) for an API similar to str.format_map(...), with the same formatting syntax and semantics as bytes.format() and bytearray.format().

Rationale

In Python 2, str % args and str.format(args) allow the formatting and interpolation of bytestrings. This feature has commonly been used for the assembling of protocol messages when protocols are known to use a fixed encoding.

Python 3 generally mandates that text be stored and manipulated as unicode (i.e. str objects, not bytes). In some cases, though, it makes sense to manipulate bytes objects directly. Typical usage is binary network protocols, where you may want to interpolate and assemble several bytes objects (some of them literals, some of them computed) to produce complete protocol messages. For example, protocols such as HTTP or SIP have headers with ASCII names and opaque "textual" values using a varying and/or sometimes ill-defined encoding. Moreover, those headers can be followed by a binary body... which can be chunked and decorated with ASCII headers and trailers!

While there are reasonably efficient ways to accumulate binary data (such as using a bytearray object, the bytes.join method or even io.BytesIO), none of them leads to the kind of readable and intuitive code that is produced by a %-formatted or {}-formatted template and a formatting operation.
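For instance, assembling an HTTP request line by accumulation versus a %-formatted template (the latter only became available later, in Python 3.5, via PEP 461):

```python
method, path = b"GET", b"/index.html"

# Accumulation with a bytearray: correct, but noisy.
buf = bytearray()
buf += method
buf += b" "
buf += path
buf += b" HTTP/1.1\r\n"
by_accumulation = bytes(buf)

# The same message as a single readable template (Python 3.5+).
by_interpolation = b"%s %s HTTP/1.1\r\n" % (method, path)
```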

Binary formatting features

Supported features

In this proposal, percent-formatting for bytes and bytearray supports the following features:

  • Looking up formatting arguments by position as well as by name (i.e., %s as well as %(name)s).
  • %s will try to get a Py_buffer on the given value, and fall back on calling __bytes__. The resulting binary data is inserted at the given point in the string. This is expected to work with bytes, bytearray and memoryview objects (as well as a couple of others such as pathlib's path objects).
  • %c will accept an integer between 0 and 255, and insert a byte of the given value.

Braces-formatting for bytes and bytearray supports the following features:

  • All the kinds of argument lookup supported by str.format() (explicit positional lookup, auto-incremented positional lookup, keyword lookup, attribute lookup, etc.)
  • Insertion of binary data when no modifier or layout is specified (e.g. {}, {0}, {name}). This has the same semantics as %s for percent-formatting (see above).
  • The c modifier will accept an integer between 0 and 255, and insert a byte of the given value (same as %c above).

Unsupported features

All other features present in formatting of str objects (either through the percent operator or the str.format() method) are unsupported. Those features imply treating the recipient of the operator or method as text, which goes counter to the text / bytes separation (for example, accepting %d as a format code would imply that the bytes object really is an ASCII-compatible text string).

Amongst those unsupported features are not only most type-specific format codes, but also the various layout specifiers such as padding or alignment. Besides, str objects are not acceptable as arguments to the formatting operations, even when using e.g. the %s format code.

__format__ isn't called.

Criticisms

  • The development cost and maintenance cost.
  • In 3.3 encoding to ASCII or latin-1 is as fast as memcpy (but it still creates a separate object).
  • Developers will have to work around the lack of binary formatting anyway, if they want to support Python 3.4 and earlier.
  • bytes.join() is consistently faster than format to join bytes strings (XXX is it?).
  • Formatting functions could be implemented in a third party module, rather than added to builtin types.

Other proposals

A new datatype

It was proposed to create a new datatype specialized for "network programming". The authors of this PEP believe this is counter-productive. Python 3 already has several major types dedicated to manipulation of binary data: bytes, bytearray, memoryview, io.BytesIO.

Adding yet another type would make things more confusing for users, and interoperability between libraries more painful (also potentially sub-optimal, due to the necessary conversions).

Moreover, not one type would be needed, but two: one immutable type (to allow for hashing), and one mutable type (as efficient accumulation is often necessary when working with network messages).

Resolution

This PEP is made obsolete by the acceptance of PEP 461, which introduces a more extended formatting language for bytes objects in conjunction with the modulo operator.

pep-0461 Adding % formatting to bytes and bytearray

PEP:461
Title:Adding % formatting to bytes and bytearray
Version:$Revision$
Last-Modified:$Date$
Author:Ethan Furman <ethan at stoneleaf.us>
Status:Final
Type:Standards Track
Content-Type:text/x-rst
Created:2014-01-13
Python-Version:3.5
Post-History:2014-01-14, 2014-01-15, 2014-01-17, 2014-02-22, 2014-03-25, 2014-03-27
Resolution:http://mail.python.org/pipermail/python-dev/2014-March/133621.html

Abstract

This PEP proposes adding % formatting operations similar to Python 2's str type to bytes and bytearray [1] [2].

Rationale

While interpolation is usually thought of as a string operation, there are cases where interpolation on bytes or bytearrays make sense, and the work needed to make up for this missing functionality detracts from the overall readability of the code.

Motivation

With Python 3 and the split between str and bytes, one small but important area of programming became slightly more difficult, and much more painful -- wire format protocols [3].

This area of programming is characterized by a mixture of binary data and ASCII compatible segments of text (aka ASCII-encoded text). Bringing back a restricted %-interpolation for bytes and bytearray will aid both in writing new wire format code, and in porting Python 2 wire format code.

Common use-cases include dbf and pdf file formats, email formats, and FTP and HTTP communications, among many others.

Proposed semantics for bytes and bytearray formatting

%-interpolation

All the numeric formatting codes (d, i, o, u, x, X, e, E, f, F, g, G, and any that are subsequently added to Python 3) will be supported, and will work as they do for str, including the padding, justification and other related modifiers (currently #, 0, -, space, and + (plus any added to Python 3)). The only non-numeric codes allowed are c, b, a, and s (which is a synonym for b).
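For example, checked on Python 3.5+ where this PEP is implemented, the numeric codes accept the same modifiers as for str and yield ASCII-encoded bytes:

```python
assert b"%05d" % 42 == b"00042"      # zero padding
assert b"%-6o" % 8 == b"10    "      # left justification, octal
assert b"%+.2f" % 3.5 == b"+3.50"    # forced sign, fixed precision
assert b"%#x" % 255 == b"0xff"       # alternate form
```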

For the numeric codes, the only difference between str and bytes (or bytearray) interpolation is that the results from these codes will be ASCII-encoded text, not unicode. In other words, for any numeric formatting code %x:

b"%x" % val

is equivalent to:

("%x" % val).encode("ascii")

Examples:

>>> b'%4x' % 10
b'   a'

>>> b'%#4x' % 10
b' 0xa'

>>> b'%04X' % 10
b'000A'

%c will insert a single byte, either from an int in range(256), or from a bytes argument of length 1, not from a str.

Examples:

>>> b'%c' % 48
b'0'

>>> b'%c' % b'a'
b'a'

%b will insert a series of bytes. These bytes are collected in one of two ways:

  • input type supports Py_buffer [4]? use it to collect the necessary bytes
  • input type is something else? use its __bytes__ method [5]; if there isn't one, raise a TypeError

In particular, %b will not accept numbers nor str. str is rejected as the string to bytes conversion requires an encoding, and we are refusing to guess; numbers are rejected because:

  • what makes a number is fuzzy (float? Decimal? Fraction? some user type?)
  • allowing numbers would lead to ambiguity between numbers and textual representations of numbers (3.14 vs '3.14')
  • given the nature of wire formats, explicit is definitely better than implicit

%s is included as a synonym for %b for the sole purpose of making 2/3 code bases easier to maintain. Python 3 only code should use %b.

Examples:

>>> b'%b' % b'abc'
b'abc'

>>> b'%b' % 'some string'.encode('utf8')
b'some string'

>>> b'%b' % 3.14
Traceback (most recent call last):
...
TypeError: b'%b' does not accept 'float'

>>> b'%b' % 'hello world!'
Traceback (most recent call last):
...
TypeError: b'%b' does not accept 'str'

%a will give the equivalent of repr(some_obj).encode('ascii', 'backslashreplace') on the interpolated value. Use cases include developing a new protocol and writing landmarks into the stream; debugging data going into an existing protocol to see if the problem is the protocol itself or bad data; a fall-back for a serialization format; or any situation where defining __bytes__ would not be appropriate but a readable/informative representation is needed [6].

%r is included as a synonym for %a for the sole purpose of making 2/3 code bases easier to maintain. Python 3 only code should use %a [7].

Examples:

>>> b'%a' % 3.14
b'3.14'

>>> b'%a' % b'abc'
b"b'abc'"

>>> b'%a' % 'def'
b"'def'"

Compatibility with Python 2

As noted above, %s and %r are being included solely to help ease migration from, and/or have a single code base with, Python 2. This is important as there are modules both in the wild and behind closed doors that currently use the Python 2 str type as a bytes container, and hence are using %s as a bytes interpolator.

However, %b and %a should be used in new, Python 3 only code, so %s and %r will immediately be deprecated, but not removed from the 3.x series [7].

Proposed variations

It has been proposed to automatically use .encode('ascii','strict') for str arguments to %b.

  • Rejected as this would lead to intermittent failures. Better to have the operation always fail so the trouble-spot can be correctly fixed.

It has been proposed to have %b return the ascii-encoded repr when the value is a str (b'%b' % 'abc' --> b"'abc'").

  • Rejected as this would lead to hard to debug failures far from the problem site. Better to have the operation always fail so the trouble-spot can be easily fixed.

Originally this PEP also proposed adding format-style formatting, but it was decided that format and its related machinery were all strictly text (aka str) based, and it was dropped.

Various new special methods were proposed, such as __ascii__, __format_bytes__, etc.; such methods are not needed at this time, but can be visited again later if real-world use shows deficiencies with this solution.

A competing PEP, PEP 460 Add binary interpolation and formatting [8], also exists.

Objections

The objections raised against this PEP were mainly variations on two themes:

  • the bytes and bytearray types are for pure binary data, with no assumptions about encodings
  • offering %-interpolation that assumes an ASCII encoding will be an attractive nuisance and lead us back to the problems of the Python 2 str/unicode text model

As was seen during the discussion, bytes and bytearray are also used for mixed binary data and ASCII-compatible segments: file formats such as dbf and pdf, network protocols such as ftp and email, etc.

bytes and bytearray already have several methods which assume an ASCII-compatible encoding: upper(), isalpha(), and expandtabs(), to name just a few. %-interpolation, with its very restricted mini-language, will not be any more of a nuisance than the already existing methods.

Some have objected to allowing the full range of numeric formatting codes with the claim that decimal alone would be sufficient. However, at least two formats (dbf and pdf) make use of non-decimal numbers.

pep-0462 Core development workflow automation for CPython

PEP:462
Title:Core development workflow automation for CPython
Version:$Revision$
Last-Modified:$Date$
Author:Nick Coghlan <ncoghlan at gmail.com>
Status:Deferred
Type:Process
Content-Type:text/x-rst
Requires:474
Created:23-Jan-2014
Post-History:25-Jan-2014, 27-Jan-2014, 01-Feb-2015

Abstract

This PEP proposes investing in automation of several of the tedious, time consuming activities that are currently required for the core development team to incorporate changes into CPython. This proposal is intended to allow core developers to make more effective use of the time they have available to contribute to CPython, which should also result in an improved experience for other contributors that are reliant on the core team to get their changes incorporated.

PEP Deferral

This PEP is currently deferred pending acceptance or rejection of the Kallithea-based forge.python.org proposal in PEP 474.

Rationale for changes to the core development workflow

The current core developer workflow to merge a new feature into CPython on a POSIX system "works" as follows:

  1. If applying a change submitted to bugs.python.org by another user, first check they have signed the PSF Contributor Licensing Agreement. If not, request that they sign one before continuing with merging the change.
  2. Apply the change locally to a current checkout of the main CPython repository (the change will typically have been discussed and reviewed as a patch on bugs.python.org first, but this step is not currently considered mandatory for changes originating directly from core developers).
  3. Run the test suite locally, at least make test or ./python -m test (depending on system specs, this takes a few minutes in the default configuration, but substantially longer if all optional resources, like external network access, are enabled).
  4. Run make patchcheck to fix any whitespace issues and as a reminder of other changes that may be needed (such as updating Misc/ACKS or adding an entry to Misc/NEWS)
  5. Commit the change and push it to the main repository. If hg indicates this would create a new head in the remote repository, run hg pull --rebase (or an equivalent). Theoretically, you should rerun the tests at this point, but it's very tempting to skip that step.
  6. After pushing, monitor the stable buildbots for any new failures introduced by your change. In particular, developers on POSIX systems will often break the Windows buildbots, and vice-versa. Less commonly, developers on Linux or Mac OS X may break other POSIX systems.

The steps required on Windows are similar, but the exact commands used will be different.

Rather than being simpler, the workflow for a bug fix is more complicated than that for a new feature! New features have the advantage of only being applied to the default branch, while bug fixes also need to be considered for inclusion in maintenance branches.

  • If a bug fix is applicable to Python 2.7, then it is also separately applied to the 2.7 branch, which is maintained as an independent head in Mercurial
  • If a bug fix is applicable to the current 3.x maintenance release, then it is first applied to the maintenance branch and then merged forward to the default branch. Both branches are pushed to hg.python.org at the same time.

Documentation patches are simpler than functional patches, but not hugely so - the main benefit is only needing to check the docs build successfully rather than running the test suite.

I would estimate that even when everything goes smoothly, it would still take me at least 20-30 minutes to commit a bug fix patch that applies cleanly. Given that it should be possible to automate several of these tasks, I do not believe our current practices are making effective use of scarce core developer resources.

There are many, many frustrations involved with this current workflow, and they lead directly to some undesirable development practices.

  • Much of this overhead is incurred on a per-patch applied basis. This encourages large commits, rather than small isolated changes. The time required to commit a 500 line feature is essentially the same as that needed to commit a 1 line bug fix - the additional time needed for the larger change appears in any preceding review rather than as part of the commit process.
  • The additional overhead of working on applying bug fixes creates an additional incentive to work on new features instead, and new features are already inherently more interesting to work on - they don't need workflow difficulties giving them a helping hand!
  • Getting a preceding review on bugs.python.org is additional work, creating an incentive to commit changes directly, increasing the reliance on post-review on the python-checkins mailing list.
  • Patches on the tracker that are complete, correct and ready to merge may still languish for extended periods awaiting a core developer with the time to devote to getting them merged.
  • The risk of push races (especially when pushing a merged bug fix) creates a temptation to skip doing full local test runs (especially after a push race has already been encountered once), increasing the chance of breaking the buildbots.
  • The buildbots are sometimes red for extended periods, introducing errors into local test runs, and also meaning that they sometimes fail to serve as a reliable indicator of whether or not a patch has introduced cross platform issues.
  • Post-conference development sprints are a nightmare, as they collapse into a mire of push races. It's tempting to just leave patches on the tracker until after the sprint is over and then try to clean them up afterwards.

There are also many, many opportunities for core developers to make mistakes that inconvenience others, both in managing the Mercurial branches and in breaking the buildbots without being in a position to fix them promptly. This both makes the existing core development team cautious in granting new developers commit access, as well as making those new developers cautious about actually making use of their increased level of access.

There are also some incidental annoyances (like keeping the NEWS file up to date) that will necessarily be addressed as part of this proposal.

One of the most critical resources of a volunteer-driven open source project is the emotional energy of its contributors. The current approach to change incorporation doesn't score well on that front for anyone:

  • For core developers, the branch wrangling for bug fixes is delicate and easy to get wrong. Conflicts on the NEWS file and push races when attempting to upload changes add to the irritation of something most of us aren't being paid to spend time on (and for those that are, contributing to CPython is likely to be only one of our responsibilities). The time we spend actually getting a change merged is time we're not spending coding additional changes, writing or updating documentation or reviewing contributions from others.
  • Red buildbots make life difficult for other developers (since a local test failure may not be due to anything that developer did), release managers (since they may need to enlist assistance cleaning up test failures prior to a release) and for the developers themselves (since it creates significant pressure to fix any failures we inadvertently introduce right now, rather than at a more convenient time, as well as potentially making hg bisect more difficult to use if hg annotate isn't sufficient to identify the source of a new failure).
  • For other contributors, a core developer spending time actually getting changes merged is a developer that isn't reviewing and discussing patches on the issue tracker or otherwise helping others to contribute effectively. It is especially frustrating for contributors that are accustomed to the simplicity of a developer just being able to hit "Merge" on a pull request that has already been automatically tested in the project's CI system (which is a common workflow on sites like GitHub and BitBucket), or where the post-review part of the merge process is fully automated (as is the case for OpenStack).

Current Tools

The following tools are currently used to manage various parts of the CPython core development workflow.

  • Mercurial (hg.python.org) for version control
  • Roundup (bugs.python.org) for issue tracking
  • Rietveld (also hosted on bugs.python.org) for code review
  • Buildbot (buildbot.python.org) for automated testing

This proposal suggests replacing the use of Rietveld for code review with the more full-featured Kallithea-based forge.python.org service proposed in PEP 474. Guido has indicated that the original Rietveld implementation was primarily intended as a public demonstration application for Google App Engine, and switching to Kallithea will address some of the issues with identifying intended target branches that arise when working with patch files on Roundup and the associated reviews in the integrated Rietveld instance.

It also suggests the addition of new tools in order to automate additional parts of the workflow, as well as a critical review of the remaining tools to see which, if any, may be candidates for replacement.

Proposal

The essence of this proposal is that CPython aim to adopt a "core reviewer" development model, similar to that used by the OpenStack project.

The workflow problems experienced by the CPython core development team are not unique. The OpenStack infrastructure team have come up with a well designed automated workflow that is designed to ensure:

  • once a patch has been reviewed, further developer involvement is needed only if the automated tests fail prior to merging
  • patches never get merged without being tested relative to the current state of the branch
  • the main development branch always stays green. Patches that do not pass the automated tests do not get merged

If a core developer wants to tweak a patch prior to merging, they download it from the review tool, modify and upload it back to the review tool rather than pushing it directly to the source code repository.

The core of this workflow is implemented using a tool called Zuul [1], a Python web service created specifically for the OpenStack project, but deliberately designed with a plugin based trigger and action system to make it easier to adapt to alternate code review systems, issue trackers and CI systems. James Blair of the OpenStack infrastructure team provided an excellent overview of Zuul at linux.conf.au 2014.

While Zuul handles several workflows for OpenStack, the specific one of interest for this PEP is the "merge gating" workflow.

For this workflow, Zuul is configured to monitor the Gerrit code review system for patches which have been marked as "Approved". Once it sees such a patch, Zuul takes it, and combines it into a queue of "candidate merges". It then creates a pipeline of test runs that execute in parallel in Jenkins (in order to allow more than 24 commits a day when a full test run takes the better part of an hour), and are merged as they pass (and as all the candidate merges ahead of them in the queue pass). If a patch fails the tests, Zuul takes it out of the queue, cancels any test runs after that patch in the queue, and rebuilds the queue without the failing patch.

If a developer looks at a test which failed on merge and determines that it was due to an intermittent failure, they can then resubmit the patch for another attempt at merging.

To adapt this process to CPython, it should be feasible to have Zuul monitor Kallithea for approved pull requests (which may require a feature addition in Kallithea), submit them to Buildbot for testing on the stable buildbots, and then merge the changes appropriately in Mercurial. This idea poses a few technical challenges, which have their own section below.

For CPython, I don't believe we will need to take advantage of Zuul's ability to execute tests in parallel (certainly not in the initial iteration - if we get to a point where serial testing of patches by the merge gating system is our primary bottleneck rather than having the people we need in order to be able to review and approve patches, then that will be a very good day).

However, the merge queue itself is a very powerful concept that should directly address several of the issues described in the Rationale above.

Deferred Proposals

The OpenStack team also use Zuul to coordinate several other activities:

  • Running preliminary "check" tests against patches posted to Gerrit.
  • Creation of updated release artefacts and republishing documentation when changes are merged
  • The Elastic recheck [2] feature that uses ElasticSearch in conjunction with a spam filter to monitor test output and suggest the specific intermittent failure that may have caused a test to fail, rather than requiring users to search logs manually

While these are possibilities worth exploring in the future (and one of the possible benefits I see to seeking closer coordination with the OpenStack Infrastructure team), I don't see them as offering quite the same kind of fundamental workflow improvement that merge gating appears to provide.

However, if we find we are having too many problems with intermittent test failures in the gate, then introducing the "Elastic recheck" feature may need to be considered as part of the initial deployment.

Suggested Variants

Terry Reedy has suggested doing an initial filter which specifically looks for approved documentation-only patches (~700 of the 4000+ open CPython issues are pure documentation updates). This approach would avoid several of the issues related to flaky tests and cross-platform testing, while still allowing the rest of the automation flows to be worked out (such as how to push a patch into the merge queue).

The key downside to this approach is that Zuul wouldn't have complete control of the merge process as it usually expects, so there would potentially be additional coordination needed around that.

It may be worth keeping this approach as a fallback option if the initial deployment proves to have more trouble with test reliability than is anticipated.

It would also be possible to tweak the merge gating criteria such that it doesn't run the test suite if it detects that the patch hasn't modified any files outside the "Docs" tree, and instead only checks that the documentation builds without errors.

As yet another alternative, it may be reasonable to move some parts of the documentation (such as the tutorial and the HOWTO guides) out of the main source repository and manage them using the simpler pull request based model described in PEP 474.

Perceived Benefits

The benefits of this proposal accrue most directly to the core development team. First and foremost, it means that once we mark a patch as "Approved" in the updated code review system, we're usually done. The extra 20-30 minutes (or more) of actually applying the patch, running the tests and merging it into Mercurial would all be orchestrated by Zuul. Push races would also be a thing of the past - if lots of core developers are approving patches at a sprint, then that just means the queue gets deeper in Zuul, rather than developers getting frustrated trying to merge changes and failing. Test failures would still happen, but they would result in the affected patch being removed from the merge queue, rather than breaking the code in the main repository.

With the bulk of the time investment moved to the review process, this also encourages "development for reviewability" - smaller, easier to review patches, since the overhead of running the tests multiple times will be incurred by Zuul rather than by the core developers.

However, removing this time sink from the core development team should also improve the experience of CPython development for other contributors, as it eliminates several of the opportunities for patches to get "dropped on the floor", as well as increasing the time core developers are likely to have available for reviewing contributed patches.

Another example of benefits to other contributors is that when a sprint aimed primarily at new contributors is running with just a single core developer present (such as the sprints at PyCon AU for the last few years), the merge queue would allow that developer to focus more of their time on reviewing patches and helping the other contributors at the sprint, since accepting a patch for inclusion would now be a single click in the Kallithea UI, rather than the relatively time consuming process that it is currently. Even when multiple core developers are present, it is better to enable them to spend their time and effort on interacting with the other sprint participants than it is on things that are sufficiently mechanical that a computer can (and should) handle them.

With most of the ways to make a mistake when committing a change automated out of existence, there are also substantially fewer new things to learn when a contributor is nominated to become a core developer. This should have a dual benefit, both in making the existing core developers more comfortable with granting that additional level of responsibility, and in making new contributors more comfortable with exercising it.

Finally, a more stable default branch in CPython makes it easier for other Python projects to conduct continuous integration directly against the main repo, rather than having to wait until we get into the release candidate phase of a new release. At the moment, setting up such a system isn't particularly attractive, as it would need to include an additional mechanism to wait until CPython's own Buildbot fleet indicated that the build was in a usable state. With the proposed merge gating system, the trunk always remains usable.

Technical Challenges

Adapting Zuul from the OpenStack infrastructure to the CPython infrastructure will at least require the development of additional Zuul trigger and action plugins, and may require additional development in some of our existing tools.

Kallithea vs Gerrit

Kallithea does not currently include a voting/approval feature that is equivalent to Gerrit's. For CPython, we wouldn't need anything as sophisticated as Gerrit's voting system - a simple core-developer-only "Approved" marker to trigger action from Zuul should suffice. The core-developer-or-not flag is available in Roundup, as is the flag indicating whether or not the uploader of a patch has signed a PSF Contributor Licensing Agreement; making use of these flags may require further development to link contributor accounts between the Kallithea instance and Roundup.

Some of the existing Zuul triggers work by monitoring for particular comments (in particular, recheck/reverify comments to ask Zuul to try merging a change again if it was previously rejected due to an unrelated intermittent failure). We will likely also want similar explicit triggers for Kallithea.

The current Zuul plugins for Gerrit work by monitoring the Gerrit activity stream for particular events. If Kallithea has no equivalent, we will need to add something suitable for the events we would like to trigger on.

There would also be development effort needed to create a Zuul plugin that monitors Kallithea activity rather than Gerrit.

Mercurial vs Gerrit/git

Gerrit uses git as the actual storage mechanism for patches, and automatically handles merging of approved patches. By contrast, Kallithea uses the RhodeCode-created vcs <https://pythonhosted.org/vcs/> library as an abstraction layer over specific DVCS implementations (with Mercurial and git backends currently available).

Zuul is also directly integrated with git for patch manipulation - as far as I am aware, this part of the design currently isn't pluggable. However, at PyCon US 2014, the Mercurial core developers at the sprints expressed some interest in collaborating with the core development team and the Zuul developers on enabling the use of Zuul with Mercurial in addition to git. As Zuul is itself a Python application, migrating it to use the same DVCS abstraction library as RhodeCode and Kallithea may be a viable path towards achieving that.

Buildbot vs Jenkins

Zuul's interaction with the CI system is also pluggable, using Gearman as the preferred interface. Accordingly, adapting the CI jobs to run in Buildbot rather than Jenkins should just be a matter of writing a Gearman client that can process the requests from Zuul and pass them on to the Buildbot master. Zuul uses the pure Python gear client library to communicate with Gearman, and this library should also be useful to handle the Buildbot side of things.

Note that, in the initial iteration, I am proposing that we do not attempt to pipeline test execution. This means Zuul would be running in a very simple mode where only the patch at the head of the merge queue is being tested on the Buildbot fleet, rather than potentially testing several patches in parallel. I am picturing something equivalent to requesting a forced build from the Buildbot master, and then waiting for the result to come back before moving on to the second patch in the queue.

If we ultimately decide that this is not sufficient, and we need to start using the CI pipelining features of Zuul, then we may need to look at moving the test execution to dynamically provisioned cloud images, rather than relying on volunteer maintained statically provisioned systems as we do currently. The OpenStack CI infrastructure team are exploring the idea of replacing their current use of Jenkins masters with a simpler pure Python test runner, so if we find that we can't get Buildbot to effectively support the pipelined testing model, we'd likely participate in that effort rather than setting up a Jenkins instance for CPython.

In this case, the main technical risk would be a matter of ensuring we support testing on platforms other than Linux (as our stable buildbots currently cover Windows, Mac OS X, FreeBSD and OpenIndiana in addition to a couple of different Linux variants).

In such a scenario, the Buildbot fleet would still have a place in doing "check" runs against the master repository (either periodically or for every commit), even if it did not play a part in the merge gating process. More unusual configurations (such as building without threads, or without SSL/TLS support) would likely still be handled that way rather than being included in the gate criteria (at least initially, anyway).

Handling of maintenance branches

The OpenStack project largely leaves the question of maintenance branches to downstream vendors, rather than handling it directly. This means there are questions to be answered regarding how we adapt Zuul to handle our maintenance branches.

Python 2.7 can be handled easily enough by treating it as a separate patch queue. This would be handled natively in Kallithea by submitting separate pull requests in order to update the Python 2.7 maintenance branch.

The Python 3.x maintenance branches are potentially more complicated. My current recommendation is to simply stop using Mercurial merges to manage them, and instead treat them as independent heads, similar to the Python 2.7 branch. Separate pull requests would need to be submitted for the active Python 3 maintenance branch and the default development branch. The downside of this approach is that it increases the risk that a fix is merged only to the maintenance branch without also being submitted to the default branch, so we may want to design some additional tooling that ensures that every maintenance branch pull request either has a corresponding default branch pull request prior to being merged, or else has an explicit disclaimer indicating that it is only applicable to that branch and doesn't need to be ported forward to later branches.

Such an approach has the benefit of adjusting relatively cleanly to the intermittent periods where we have two active Python 3 maintenance branches.
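The additional tooling suggested above could be a check of roughly this shape. This is a hypothetical sketch: the `PullRequest` model and the branch-only disclaimer flag are assumptions for illustration, not existing Kallithea features.

```python
from dataclasses import dataclass

@dataclass
class PullRequest:
    # Hypothetical minimal model of a Kallithea pull request.
    issue: str                 # tracker issue the fix belongs to
    branch: str                # target branch, e.g. "3.4" or "default"
    branch_only: bool = False  # explicit "not needed on default" disclaimer

def ready_to_merge(pr, open_requests):
    """A maintenance branch pull request is mergeable only if it carries a
    branch-only disclaimer or has a counterpart targeting default."""
    if pr.branch == "default" or pr.branch_only:
        return True
    return any(
        other.issue == pr.issue and other.branch == "default"
        for other in open_requests
    )

prs = [
    PullRequest("bpo-1001", "3.4"),
    PullRequest("bpo-1001", "default"),
    PullRequest("bpo-1002", "3.4"),                    # missing its default twin
    PullRequest("bpo-1003", "3.4", branch_only=True),  # explicitly 3.4-only
]
```

Under this check the second 3.4 request would be held back until a matching default-branch request appears, while the explicitly branch-only request could merge immediately.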

This issue does suggest some potential user interface ideas for Kallithea, where it may be desirable to be able to clone a pull request in order to be able to apply it to a second branch.

Handling of security branches

For simplicity's sake, I would suggest leaving the handling of security-fix only branches alone: the release managers for those branches would continue to backport specific changes manually. The only change is that they would be able to use the Kallithea pull request workflow to do the backports if they would like others to review the updates prior to merging them.

Handling of NEWS file updates

Our current approach to handling NEWS file updates regularly results in spurious conflicts when merging bug fixes forward from an active maintenance branch to a later branch.

Issue #18967 discusses some possible improvements in that area, which would be beneficial regardless of whether or not we adopt Zuul as a workflow automation tool.

Stability of "stable" Buildbot slaves

Instability of the nominally stable buildbots has a substantially larger impact under this proposal. We would need to ensure we're genuinely happy with each of those systems gating merges to the development branches, or else move them to "unstable" status.

Intermittent test failures

Some tests, especially timing tests, exhibit intermittent failures on the existing Buildbot fleet. In particular, test systems running as VMs may sometimes exhibit timing failures when the VM host is under higher than normal load.

The OpenStack CI infrastructure includes a number of additional features to help deal with intermittent failures, the most basic of which is simply allowing developers to request that merging a patch be tried again when the original failure appears to be due to a known intermittent failure (whether that intermittent failure is in OpenStack itself or just in a flaky test).

The more sophisticated Elastic recheck [2] feature may be worth considering, especially since the output of the CPython test suite is substantially simpler than that from OpenStack's more complex multi-service testing, and hence likely even more amenable to automated analysis.

Custom Mercurial client workflow support

One useful part of the OpenStack workflow is the "git review" plugin, which makes it relatively easy to push a branch from a local git clone up to Gerrit for review.

PEP 474 mentions a draft custom Mercurial extension that automates some aspects of the existing CPython core development workflow.

As part of this proposal, that custom extension would be extended to work with the new Kallithea based review workflow in addition to the legacy Roundup/Rietveld based review workflow.

Social Challenges

The primary social challenge here is getting the core development team to change their practices. However, the tedious-but-necessary steps that are automated by the proposal should create a strong incentive for the existing developers to go along with the idea.

I believe three specific features may be needed to assure existing developers that there are no downsides to the automation of this workflow:

  • Only requiring approval from a single core developer to incorporate a patch. This could be revisited in the future, but we should preserve the status quo for the initial rollout.
  • Explicitly stating that core developers remain free to approve their own patches, except during the release candidate phase of a release. This could be revisited in the future, but we should preserve the status quo for the initial rollout.
  • Ensuring that at least release managers have a "merge it now" capability that allows them to force a particular patch to the head of the merge queue. Using a separate clone for release preparation may be sufficient for this purpose. Longer term, automatic merge gating may also allow for more automated preparation of release artefacts as well.

Practical Challenges

The PSF runs its own directly and indirectly sponsored workflow infrastructure primarily due to past experience with unacceptably poor performance and inflexibility of infrastructure provided for free to the general public. CPython development was originally hosted on SourceForge, with source control moved to self hosting when SF was both slow to offer Subversion support and suffering from CVS performance issues (see PEP 347), while issue tracking later moved to the open source Roundup issue tracker on dedicated sponsored hosting (from Upfront Systems), due to a combination of both SF performance issues and general usability issues with the SF tracker at the time (the outcome and process for the new tracker selection were captured on the python.org wiki rather than in a PEP).

Accordingly, proposals that involve setting ourselves up for "SourceForge usability and reliability issues, round two" will face significant opposition from at least some members of the CPython core development team (including the author of this PEP). This proposal respects that history by recommending only tools that are available for self-hosting as sponsored or PSF funded infrastructure, and are also open source Python projects that can be customised to meet the needs of the CPython core development team.

However, for this proposal to be a success (if it is accepted), we need to understand how we are going to carry out the necessary configuration, customisation, integration and deployment work.

The last attempt at adding a new piece to the CPython support infrastructure (speed.python.org) has unfortunately foundered due to the lack of time to drive the project from the core developers and PSF board members involved, and the difficulties of trying to bring someone else up to speed to lead the activity (the hardware donated to that project by HP is currently in use to support PyPy instead, but the situation highlights some of the challenges of relying on volunteer labour with many other higher priority demands on their time to steer projects to completion).

Even ultimately successful past projects, such as the source control migrations from CVS to Subversion and from Subversion to Mercurial, the issue tracker migration from SourceForge to Roundup, the code review integration between Roundup and Rietveld and the introduction of the Buildbot continuous integration fleet, have taken an extended period of time as volunteers worked their way through the many technical and social challenges involved.

Fortunately, as several aspects of this proposal and PEP 474 align with various workflow improvements under consideration for Red Hat's Beaker open source hardware integration testing system and other work-related projects, I have arranged to be able to devote ~1 day a week to working on CPython infrastructure projects.

Together with Rackspace's existing contributions to maintaining the pypi.python.org infrastructure, I personally believe this arrangement is indicative of a more general recognition amongst CPython redistributors and major users of the merit in helping to sustain upstream infrastructure through direct contributions of developer time, rather than expecting volunteer contributors to maintain that infrastructure entirely in their spare time or funding it indirectly through the PSF (with the additional management overhead that would entail). I consider this a positive trend, and one that I will continue to encourage as best I can.

Open Questions

Pretty much everything in the PEP. Do we want to adopt merge gating and Zuul? How do we want to address the various technical challenges? Are the Kallithea and Zuul development communities open to the kind of collaboration that would be needed to make this effort a success?

While I've arranged to spend some of my own work time on this, do we want to approach the OpenStack Foundation for additional assistance, since we're a key dependency of OpenStack itself, Zuul is a creation of the OpenStack infrastructure team, and the available development resources for OpenStack currently dwarf those for CPython?

Are other interested folks working for Python redistributors and major users also in a position to make a business case to their superiors for investing developer time in supporting this effort?

Next Steps

If pursued, this will be a follow-on project to the Kallithea-based forge.python.org proposal in PEP 474. Refer to that PEP for more details on the discussion, review and proof-of-concept pilot process currently under way.

Acknowledgements

Thanks to Jesse Noller, Alex Gaynor and James Blair for providing valuable feedback on a preliminary draft of this proposal, and to James and Monty Taylor for additional technical feedback following publication of the initial draft.

Thanks to Bradley Kuhn, Mads Kiellerich and other Kallithea developers for the discussions around PEP 474 that led to a significant revision of this proposal to be based on using Kallithea for the review component rather than the existing Rietveld installation.

pep-0463 Exception-catching expressions

PEP:463
Title:Exception-catching expressions
Version:$Revision$
Last-Modified:$Date$
Author:Chris Angelico <rosuav at gmail.com>
Status:Draft
Type:Standards Track
Content-Type:text/x-rst
Created:15-Feb-2014
Python-Version:3.5
Post-History:20-Feb-2014, 16-Feb-2014

Abstract

Just as PEP 308 introduced a means of value-based conditions in an expression, this system allows exception-based conditions to be used as part of an expression.

Motivation

A number of functions and methods have parameters which will cause them to return a specified value instead of raising an exception. The current system is ad-hoc and inconsistent, and requires that each function be individually written to have this functionality; not all support this.

  • dict.get(key, default) - second positional argument in place of KeyError
  • next(iter, default) - second positional argument in place of StopIteration
  • list.pop() - no way to return a default
  • seq[index] - no way to handle a bounds error
  • min(sequence, default=default) - keyword argument in place of ValueError
  • statistics.mean(data) - no way to handle an empty iterator
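The inconsistency is easy to demonstrate in current Python; each of the APIs above signals "no value" in a different way:

```python
import statistics

# Positional default, keyword-only default, or no default mechanism at all.
d = {"a": 1}
assert d.get("missing", 0) == 0          # second positional argument
assert next(iter([]), "done") == "done"  # second positional argument
assert min([], default=0) == 0           # keyword-only argument (3.4+)

raised = []
try:
    [].pop()                             # no way to supply a default
except IndexError:
    raised.append("IndexError")
try:
    statistics.mean([])                  # no way to supply a default
except statistics.StatisticsError:
    raised.append("StatisticsError")
```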

Had this facility existed early in Python's history, there would have been no need to create dict.get() and related methods; the one obvious way to handle an absent key would be to respond to the exception. One method is written which signals the absence in one way, and one consistent technique is used to respond to the absence. Instead, we have dict.get(), and as of Python 3.4, we also have min(... default=default), and myriad others. We have an LBYL syntax for testing inside an expression, but there is currently no EAFP notation; compare the following:

# LBYL:
if key in dic:
    process(dic[key])
else:
    process(None)
# As an expression:
process(dic[key] if key in dic else None)

# EAFP:
try:
    process(dic[key])
except KeyError:
    process(None)
# As an expression:
process(dic[key] except KeyError: None)

Python generally recommends the EAFP policy, but must then proliferate utility functions like dic.get(key, None) to enable this.

Rationale

The current system requires that a function author predict the need for a default, and implement support for it. If this is not done, a full try/except block is needed.

Since try/except is a statement, it is impossible to catch exceptions in the middle of an expression. Just as if/else does for conditionals and lambda does for function definitions, so does this allow exception catching in an expression context.

This provides a clean and consistent way for a function to provide a default: it simply raises an appropriate exception, and the caller catches it.

In some situations, an LBYL technique can be used (checking whether a sequence is long enough before indexing into it, for instance). This is not safe in all cases, but as it is often convenient, programmers will be tempted to sacrifice the safety of EAFP in favour of the notational brevity of LBYL. Additionally, some LBYL techniques (e.g. getattr with three arguments) warp the code into looking like literal strings rather than attribute lookup, which can impact readability. A convenient EAFP notation solves all of this.

There's no convenient way to write a helper function to do this; the nearest is something ugly using either lambda:

def except_(expression, exception_list, default):
    try:
        return expression()
    except exception_list:
        return default()
value = except_(lambda: 1/x, ZeroDivisionError, lambda: float("nan"))

which is clunky, and unable to handle multiple exception clauses; or eval:

def except_(expression, exception_list, default):
    try:
        return eval(expression, globals_of_caller(), locals_of_caller())
    except exception_list as exc:
        l = locals_of_caller().copy()
        l['exc'] = exc
        return eval(default, globals_of_caller(), l)

def globals_of_caller():
    return sys._getframe(2).f_globals

def locals_of_caller():
    return sys._getframe(2).f_locals

value = except_("""1/x""",ZeroDivisionError,""" "Can't divide by zero" """)

which is even clunkier, and relies on implementation-dependent hacks. (Writing globals_of_caller() and locals_of_caller() for interpreters other than CPython is left as an exercise for the reader.)
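Even extending the lambda-based helper to accept multiple exception clauses (a hypothetical variant, sketched here) only trades one kind of clunkiness for another - every expression still has to be wrapped in a lambda to keep evaluation lazy:

```python
def except_multi(expression, *clauses):
    """Evaluate expression(); on an exception, return the default from the
    first matching (exception_types, default_thunk) clause, else re-raise."""
    try:
        return expression()
    except BaseException as exc:
        for exception_types, default in clauses:
            if isinstance(exc, exception_types):
                return default()
        raise  # no clause matched: propagate as a plain try/except would

value = except_multi(
    lambda: {}["missing"],
    (KeyError, lambda: "no key"),
    (IndexError, lambda: "no index"),
)

# Non-matching exceptions still propagate to the caller.
try:
    except_multi(lambda: 1 / 0, (KeyError, lambda: None))
    propagated = False
except ZeroDivisionError:
    propagated = True
```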

Raymond Hettinger expresses [1] a desire for such a consistent API. Something similar has been requested [2] multiple [3] times [4] in the past.

Proposal

Just as the 'or' operator and the three part 'if-else' expression give short circuiting methods of catching a falsy value and replacing it, this syntax gives a short-circuiting method of catching an exception and replacing it.

This currently works:

lst = [1, 2, None, 3]
value = lst[2] or "No value"

The proposal adds this:

lst = [1, 2]
value = (lst[2] except IndexError: "No value")

Specifically, the syntax proposed is:

(expr except exception_list: default)

where expr, exception_list, and default are all expressions. First, expr is evaluated. If no exception is raised, its value is the value of the overall expression. If any exception is raised, exception_list is evaluated, and should result in either a type or a tuple, just as with the statement form of try/except. Any matching exception will result in the corresponding default expression being evaluated and becoming the value of the expression. As with the statement form of try/except, non-matching exceptions will propagate upward.

Parentheses are required around the entire expression, unless they would be completely redundant, according to the same rules as generator expressions follow. This guarantees correct interpretation of nested except-expressions, and allows for future expansion of the syntax - see below on multiple except clauses.

Note that the current proposal does not allow the exception object to be captured. Where this is needed, the statement form must be used. (See below for discussion and elaboration on this.)

This ternary operator would be between lambda and if/else in precedence.

Consider this example of a two-level cache:

for key in sequence:
    x = (lvl1[key] except KeyError: (lvl2[key] except KeyError: f(key)))
    # do something with x

This cannot be rewritten as:

x = lvl1.get(key, lvl2.get(key, f(key)))

which, despite being shorter, defeats the purpose of the cache, as it must calculate a default value to pass to get(). The .get() version calculates backwards; the exception-testing version calculates forwards, as would be expected. The nearest useful equivalent would be:

x = lvl1.get(key) or lvl2.get(key) or f(key)

which depends on the values being nonzero, as well as depending on the cache object supporting this functionality.
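The forward-evaluating behaviour is available today only via the full statement form. This spelling of the two-level cache (with a call log added to show that the fallback computation runs only on a double miss) is what the proposed expression would condense:

```python
# Two cache levels and a fallback computation; calls records when f runs.
lvl1, lvl2 = {}, {"warm": 2}
calls = []

def f(key):
    calls.append(key)  # expensive computation, only wanted on a double miss
    return 99

def lookup(key):
    try:
        return lvl1[key]
    except KeyError:
        try:
            return lvl2[key]
        except KeyError:
            return f(key)

assert lookup("warm") == 2 and calls == []  # level-2 hit: f never runs
assert lookup("cold") == 99 and calls == ["cold"]
```

The .get() chaining shown above would have appended "warm" as well, since it computes the fallback eagerly before either cache is consulted.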

Alternative Proposals

Discussion on python-ideas brought up the following syntax suggestions:

value = expr except default if Exception [as e]
value = expr except default for Exception [as e]
value = expr except default from Exception [as e]
value = expr except Exception [as e] return default
value = expr except (Exception [as e]: default)
value = expr except Exception [as e] try default
value = expr except Exception [as e] continue with default
value = default except Exception [as e] else expr
value = try expr except Exception [as e]: default
value = expr except default # Catches anything
value = expr except(Exception) default # Catches only the named type(s)
value = default if expr raise Exception
value = expr or else default if Exception
value = expr except Exception [as e] -> default
value = expr except Exception [as e] pass default

It has also been suggested that a new keyword be created, rather than reusing an existing one. Such proposals fall into the same structure as the last form, but with a different keyword in place of 'pass'. Suggestions include 'then', 'when', and 'use'. Also, in the context of the "default if expr raise Exception" proposal, it was suggested that a new keyword "raises" be used.

All forms involving the 'as' capturing clause have been deferred from this proposal in the interests of simplicity, but are preserved in the table above as an accurate record of suggestions.

The four forms most supported by this proposal are, in order:

value = (expr except Exception: default)
value = (expr except Exception -> default)
value = (expr except Exception pass default)
value = (expr except Exception then default)

All four maintain left-to-right evaluation order: first the base expression, then the exception list, and lastly the default. This is important, as the expressions are evaluated lazily. By comparison, several of the ad-hoc alternatives listed above must (by the nature of functions) evaluate their default values eagerly. The preferred form, using the colon, parallels try/except by using "except exception_list:", and parallels lambda by having "keyword name_list: subexpression"; it also can be read as mapping Exception to the default value, dict-style. Using the arrow introduces a token many programmers will not be familiar with, and which currently has no similar meaning, but is otherwise quite readable. The English word "pass" has a vaguely similar meaning (consider the common usage "pass by value/reference" for function arguments), and "pass" is already a keyword, but as its meaning is distinctly unrelated, this may cause confusion. Using "then" makes sense in English, but this introduces a new keyword to the language - albeit one not in common use, but a new keyword all the same.

Left to right evaluation order is extremely important to readability, as it parallels the order most expressions are evaluated. Alternatives such as:

value = (expr except default if Exception)

break this, by first evaluating the two ends, and then coming to the middle; while this may not seem terrible (as the exception list will usually be a constant), it does add to the confusion when multiple clauses meet, either with multiple except/if or with the existing if/else, or a combination. Using the preferred order, subexpressions will always be evaluated from left to right, no matter how the syntax is nested.

Keeping the existing notation, but shifting the mandatory parentheses, we have the following suggestion:

value = expr except (Exception: default)
value = expr except(Exception: default)

This is reminiscent of a function call, or a dict initializer. The colon cannot be confused with introducing a suite, but on the other hand, the new syntax guarantees lazy evaluation, which a dict does not. The potential to reduce confusion is considered unjustified by the corresponding potential to increase it.

Example usage

For each example, an approximately-equivalent statement form is given, to show how the expression will be parsed. These are not always strictly equivalent, but will accomplish the same purpose. It is NOT safe for the interpreter to translate one into the other.

A number of these examples are taken directly from the Python standard library, with file names and line numbers correct as of early Feb 2014. Many of these patterns are extremely common.

Retrieve an argument, defaulting to None:

cond = (args[1] except IndexError: None)

# Lib/pdb.py:803:
try:
    cond = args[1]
except IndexError:
    cond = None

Fetch information from the system if available:

pwd = (os.getcwd() except OSError: None)

# Lib/tkinter/filedialog.py:210:
try:
    pwd = os.getcwd()
except OSError:
    pwd = None

Attempt a translation, falling back on the original:

e.widget = (self._nametowidget(W) except KeyError: W)

# Lib/tkinter/__init__.py:1222:
try:
    e.widget = self._nametowidget(W)
except KeyError:
    e.widget = W

Read from an iterator, continuing with blank lines once it's exhausted:

line = (readline() except StopIteration: '')

# Lib/lib2to3/pgen2/tokenize.py:370:
try:
    line = readline()
except StopIteration:
    line = ''

Retrieve platform-specific information (note the DRY improvement); this particular example could be taken further, turning a series of separate assignments into a single large dict initialization:

# sys.abiflags may not be defined on all platforms.
_CONFIG_VARS['abiflags'] = (sys.abiflags except AttributeError: '')

# Lib/sysconfig.py:529:
try:
    _CONFIG_VARS['abiflags'] = sys.abiflags
except AttributeError:
    # sys.abiflags may not be defined on all platforms.
    _CONFIG_VARS['abiflags'] = ''

Retrieve an indexed item, defaulting to None (similar to dict.get):

def getNamedItem(self, name):
    return (self._attrs[name] except KeyError: None)

# Lib/xml/dom/minidom.py:573:
def getNamedItem(self, name):
    try:
        return self._attrs[name]
    except KeyError:
        return None
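
As the comparison suggests, dict.get already hard-codes this one special case; the proposed expression would generalize it to any exception type:

```python
# dict.get is the existing KeyError-with-default special case that
# getNamedItem implements by hand above.
attrs = {'id': 'n1'}
present = attrs.get('id')        # found: returns the value
missing = attrs.get('other')     # absent: returns None, no KeyError
```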

Translate numbers to names, falling back on the numbers:

g = (grp.getgrnam(tarinfo.gname)[2] except KeyError: tarinfo.gid)
u = (pwd.getpwnam(tarinfo.uname)[2] except KeyError: tarinfo.uid)

# Lib/tarfile.py:2198:
try:
    g = grp.getgrnam(tarinfo.gname)[2]
except KeyError:
    g = tarinfo.gid
try:
    u = pwd.getpwnam(tarinfo.uname)[2]
except KeyError:
    u = tarinfo.uid

Look up an attribute, falling back on a default:

mode = (f.mode except AttributeError: 'rb')

# Lib/aifc.py:882:
if hasattr(f, 'mode'):
    mode = f.mode
else:
    mode = 'rb'

return (sys._getframe(1) except AttributeError: None)
# Lib/inspect.py:1350:
return sys._getframe(1) if hasattr(sys, "_getframe") else None

Perform some lengthy calculations in EAFP mode, handling division by zero as a sort of sticky NaN:

value = (calculate(x) except ZeroDivisionError: float("nan"))

try:
    value = calculate(x)
except ZeroDivisionError:
    value = float("nan")

Calculate the mean of a series of numbers, falling back on zero:

value = (statistics.mean(lst) except statistics.StatisticsError: 0)

try:
    value = statistics.mean(lst)
except statistics.StatisticsError:
    value = 0

Looking up objects in a sparse list of overrides:

(overrides[x] or default except IndexError: default).ping()

try:
    (overrides[x] or default).ping()
except IndexError:
    default.ping()
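
The statement form above runs as-is with stand-in objects; `Pinger` and the values here are illustrative only:

```python
# Runnable sketch of the sparse-overrides lookup: an out-of-range
# index falls back to the default object's ping().
class Pinger:
    def __init__(self):
        self.pinged = False
    def ping(self):
        self.pinged = True

default = Pinger()
overrides = [None, Pinger()]
x = 5                                # out of range for overrides
try:
    (overrides[x] or default).ping()
except IndexError:
    default.ping()
```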

Narrowing of exception-catching scope

The following examples, taken directly from Python's standard library, demonstrate how the scope of the try/except can be conveniently narrowed. To do this with the statement form of try/except would require a temporary variable, but it's far cleaner as an expression.

Lib/ipaddress.py:343:

try:
    ips.append(ip.ip)
except AttributeError:
    ips.append(ip.network_address)

Becomes:

ips.append(ip.ip except AttributeError: ip.network_address)

The expression form is nearly equivalent to this:

try:
    _ = ip.ip
except AttributeError:
    _ = ip.network_address
ips.append(_)
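
This translation can be exercised directly; `Interface` here is an illustrative stand-in for the ipaddress objects in the original code:

```python
# Runnable version of the temporary-variable translation: only the
# attribute lookup is guarded, never the append call itself.
class Interface:
    pass

ip = Interface()
ip.network_address = '192.0.2.0'     # but no .ip attribute

ips = []
try:
    _ = ip.ip
except AttributeError:
    _ = ip.network_address
ips.append(_)
```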

Lib/tempfile.py:130:

try:
    dirlist.append(_os.getcwd())
except (AttributeError, OSError):
    dirlist.append(_os.curdir)

Becomes:

dirlist.append(_os.getcwd() except (AttributeError, OSError): _os.curdir)

Lib/asyncore.py:264:

try:
    status.append('%s:%d' % self.addr)
except TypeError:
    status.append(repr(self.addr))

Becomes:

status.append('%s:%d' % self.addr except TypeError: repr(self.addr))

In each case, the narrowed scope of the try/except ensures that an unexpected exception (for instance, AttributeError if "append" were misspelled) does not get caught by the same handler. This is sufficiently unlikely to be reason to break the call out into a separate line (as per the five line example above), but it is a small benefit gained as a side-effect of the conversion.

Comparisons with other languages

(With thanks to Andrew Barnert for compiling this section. Note that the examples given here do not reflect the current version of the proposal, and need to be edited.)

Ruby's [5] "begin…rescue…rescue…else…ensure…end" is an expression (potentially with statements inside it). It has the equivalent of an "as" clause, and the equivalent of bare except. And it uses no punctuation or keyword between the bare except/exception class/exception class with as clause and the value. (And yes, it's ambiguous unless you understand Ruby's statement/expression rules.)

x = begin computation() rescue MyException => e default(e) end;
x = begin computation() rescue MyException default() end;
x = begin computation() rescue default() end;
x = begin computation() rescue MyException default() rescue OtherException other() end;

In terms of this PEP:

x = computation() except MyException as e default(e)
x = computation() except MyException default()
x = computation() except default()
x = computation() except MyException default() except OtherException other()

Erlang [6] has a try expression that looks like this

x = try computation() catch MyException:e -> default(e) end;
x = try computation() catch MyException:e -> default(e); OtherException:e -> other(e) end;

The class and "as" name are mandatory, but you can use "_" for either. There's also an optional "when" guard on each, and a "throw" clause that you can catch, which I won't get into. To handle multiple exceptions, you just separate the clauses with semicolons, which I guess would map to commas in Python. So:

x = try computation() except MyException as e -> default(e)
x = try computation() except MyException as e -> default(e), OtherException as e -> other_default(e)

Erlang also has a "catch" expression, which, despite using the same keyword, is completely different, and you don't want to know about it.

The ML family has two different ways of dealing with this, "handle" and "try"; the difference between the two is that "try" pattern-matches the exception, which gives you the effect of multiple except clauses and as clauses. In either form, the handler clause is punctuated by "=>" in some dialects, "->" in others.

To avoid confusion, I'll write the function calls in Python style.

Here's SML's [7] "handle"

let x = computation() handle MyException => default();;

Here's OCaml's [8] "try"

let x = try computation() with MyException explanation -> default(explanation);;

let x = try computation() with

    MyException(e) -> default(e)
  | MyOtherException() -> other_default()
  | (e) -> fallback(e);;

In terms of this PEP, these would be something like:

x = computation() except MyException => default()
x = try computation() except MyException as e -> default(e)
x = (try computation()
     except MyException as e -> default(e)
     except MyOtherException -> other_default()
     except BaseException as e -> fallback(e))

Many ML-inspired but not-directly-related languages from academia mix things up, usually using more keywords and fewer symbols. So, the Oz [9] would map to Python as

x = try computation() catch MyException as e then default(e)

Many Lisp-derived languages, like Clojure, [10] implement try/catch as special forms (if you don't know what that means, think function-like macros), so you write, effectively

try(computation(), catch(MyException, explanation, default(explanation)))

try(computation(),
    catch(MyException, explanation, default(explanation)),
    catch(MyOtherException, explanation, other_default(explanation)))

In Common Lisp, this is done with a slightly clunkier "handler-case" macro, [11] but the basic idea is the same.

The Lisp style is, surprisingly, used by some languages that don't have macros, like Lua, where xpcall [12] takes functions. Writing the lambdas Python-style instead of Lua-style:

x = xpcall(lambda: expression(), lambda e: default(e))

This actually returns (true, expression()) or (false, default(e)), but I think we can ignore that part.

Haskell is actually similar to Lua here (except that it's all done with monads, of course):

x = do catch(lambda: expression(), lambda e: default(e))

You can write a pattern matching expression within the function to decide what to do with it; catching and re-raising exceptions you don't want is cheap enough to be idiomatic.

But Haskell infixing makes this nicer:

x = do expression() `catch` lambda: default()
x = do expression() `catch` lambda e: default(e)

And that makes the parallel between the lambda colon and the except colon in the proposal much more obvious:

x = expression() except Exception: default()
x = expression() except Exception as e: default(e)

Tcl [13] has the other half of Lua's xpcall; catch is a function which returns true if an exception was caught, false otherwise, and you get the value out in other ways. And it's all built around the implicit quote-and-exec that everything in Tcl is based on, making it even harder to describe in Python terms than Lisp macros, but something like

if {[ catch("computation()") "explanation"]} { default(explanation) }

Smalltalk [14] is also somewhat hard to map to Python. The basic version would be

x := computation() on:MyException do:default()

... but that's basically Smalltalk's passing-arguments-with-colons syntax, not its exception-handling syntax.

Deferred sub-proposals

Multiple except clauses

An examination of use-cases shows that this is not needed as often as it would be with the statement form, and as its syntax is a point on which consensus has not been reached, the entire feature is deferred.

Multiple 'except' keywords could be used, and they will all catch exceptions raised in the original expression (only):

# Will catch any of the listed exceptions thrown by expr;
# any exception thrown by a default expression will propagate.
value = (expr
    except Exception1: default1
    except Exception2: default2
    # ... except ExceptionN: defaultN
)

Currently, one of the following forms must be used:

# Will catch an Exception2 thrown by either expr or default1
value = (
    (expr except Exception1: default1)
    except Exception2: default2
)
# Will catch an Exception2 thrown by default1 only
value = (expr except Exception1:
    (default1 except Exception2: default2)
)
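
The difference between the two nestings can be demonstrated in statement form; the exception classes and functions here are illustrative stand-ins:

```python
# What each nesting catches when the original expression raises
# Exception2 rather than Exception1.
class Exception1(Exception): pass
class Exception2(Exception): pass

def expr():
    raise Exception2      # note: not Exception1

def default1():
    return 'default1'

# (expr except Exception1: default1) except Exception2: default2
# -- the Exception2 raised by expr IS caught:
try:
    try:
        first = expr()
    except Exception1:
        first = default1()
except Exception2:
    first = 'default2'

# expr except Exception1: (default1 except Exception2: default2)
# -- only default1 is guarded, so Exception2 from expr escapes:
try:
    try:
        second = expr()
    except Exception1:
        try:
            second = default1()
        except Exception2:
            second = 'default2'
except Exception2:
    second = 'escaped'
```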

Listing multiple exception clauses without parentheses is a syntax error (see above), and so a future version of Python is free to add this feature without breaking any existing code.

Capturing the exception object

In a try/except block, the use of 'as' to capture the exception object creates a local name binding, and implicitly deletes that binding (to avoid creating a reference loop) in a finally clause. In an expression context, this makes little sense, and a proper sub-scope would be required to safely capture the exception object - something akin to the way a list comprehension is handled. However, CPython currently implements a comprehension's subscope with a nested function call, which has consequences in some contexts such as class definitions, and is therefore unsuitable for this proposal. Should there be, in future, a way to create a true subscope (which could simplify comprehensions, except expressions, with blocks, and possibly more), then this proposal could be revived; until then, its loss is not a great one, as the simple exception handling that is well suited to the expression notation used here is generally concerned only with the type of the exception, and not its value - further analysis below.

This syntax would, admittedly, allow a convenient way to capture exceptions in interactive Python; returned values are captured by "_", but exceptions currently are not. This could be spelled:

>>> (expr except Exception as e: e)

An examination of the Python standard library shows that, while the use of 'as' is fairly common (occurring in roughly one except clause in five), it is extremely uncommon in the cases which could logically be converted into the expression form. Its few uses can simply be left unchanged. Consequently, in the interests of simplicity, the 'as' clause is not included in this proposal. A subsequent Python version can add this without breaking any existing code, as 'as' is already a keyword.

One example where this could possibly be useful is Lib/imaplib.py:568:

try: typ, dat = self._simple_command('LOGOUT')
except: typ, dat = 'NO', ['%s: %s' % sys.exc_info()[:2]]

This could become:

typ, dat = (self._simple_command('LOGOUT')
    except BaseException as e: ('NO', '%s: %s' % (type(e), e)))

Or perhaps some other variation. This is hardly the most compelling use-case, but an intelligent look at this code could tidy it up significantly. In the absence of further examples showing any need of the exception object, I have opted to defer indefinitely the recommendation.
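
The 'as' capture above spells out in current statement form as follows; the failing `simple_command` here is a stand-in for the imaplib call:

```python
# Current-Python statement form of the proposed 'as' capture: the
# exception object is bound and formatted into the fallback value.
def simple_command(cmd):
    raise OSError('connection lost')   # stand-in for a network failure

try:
    typ, dat = simple_command('LOGOUT')
except BaseException as e:
    typ, dat = 'NO', '%s: %s' % (type(e), e)
```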

Rejected sub-proposals

finally clause

The statement form try... finally or try... except... finally has no logical corresponding expression form. Therefore the finally keyword is not a part of this proposal, in any way.

Bare except having different meaning

With several of the proposed syntaxes, omitting the exception type name would be easy and concise, and would be tempting. For convenience's sake, it might be advantageous to have a bare 'except' clause mean something more useful than "except BaseException". Proposals included having it catch Exception, or some specific set of "common exceptions" (subclasses of a new type called ExpressionError), or have it look for a tuple named ExpressionError in the current scope, with a built-in default such as (ValueError, UnicodeError, AttributeError, EOFError, IOError, OSError, LookupError, NameError, ZeroDivisionError). All of these were rejected, for several reasons.

  • First and foremost, consistency with the statement form of try/except would be broken. Just as a list comprehension or ternary if expression can be explained by "breaking it out" into its vertical statement form, an expression-except should be able to be explained by a relatively mechanical translation into a near-equivalent statement. Any form of syntax common to both should therefore have the same semantics in each, and above all should not have the subtle difference of catching more in one than the other, as it will tend to attract unnoticed bugs.
  • Secondly, the set of appropriate exceptions to catch would itself be a huge point of contention. It would be impossible to predict exactly which exceptions would "make sense" to be caught; why bless some of them with convenient syntax and not others?
  • And finally (this partly because the recommendation was that a bare except should be actively encouraged, once it was reduced to a "reasonable" set of exceptions), any situation where you catch an exception you don't expect to catch is an unnecessary bug magnet.

Consequently, the use of a bare 'except' is down to two possibilities: either it is syntactically forbidden in the expression form, or it is permitted with the exact same semantics as in the statement form (namely, that it catch BaseException and be unable to capture it with 'as').

Bare except clauses

PEP 8 rightly advises against the use of a bare 'except'. While it is syntactically legal in a statement, and for backward compatibility must remain so, there is little value in encouraging its use. In an expression except clause, "except:" is a SyntaxError; use the equivalent long-hand form "except BaseException:" instead. A future version of Python MAY choose to reinstate this, which can be done without breaking compatibility.

Parentheses around the except clauses

Should it be legal to parenthesize the except clauses, separately from the expression that could raise? Example:

value = expr (
    except Exception1 [as e]: default1
    except Exception2 [as e]: default2
    # ... except ExceptionN [as e]: defaultN
)

This is more compelling when one or both of the deferred sub-proposals of multiple except clauses and/or exception capturing is included. In their absence, the parentheses would be thus:

value = expr except ExceptionType: default
value = expr (except ExceptionType: default)

The advantage is minimal, and the potential to confuse a reader into thinking the except clause is separate from the expression, or into thinking this is a function call, makes this non-compelling. The expression can, of course, be parenthesized if desired, as can the default:

value = (expr) except ExceptionType: (default)

As the entire expression is now required to be in parentheses (which had not been decided at the time when this was debated), there is less need to delineate this section, and in many cases it would be redundant.

Short-hand for "except: pass"

The following has been suggested as a similar short-hand, though not technically an expression:

statement except Exception: pass

try:
    statement
except Exception:
    pass

For instance, a common use-case is attempting the removal of a file:

os.unlink(some_file) except OSError: pass

There is an equivalent already in Python 3.4, however, in contextlib:

from contextlib import suppress
with suppress(OSError): os.unlink(some_file)

As this is already a single line (or two with a break after the colon), there is little need of new syntax and a confusion of statement vs expression to achieve this.
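
The suppress form is runnable today; the file path here is an arbitrary, almost certainly nonexistent temporary name chosen for illustration:

```python
# contextlib.suppress covers the "except: pass" use-case in a single
# statement: unlinking a missing file raises no error.
import os
import tempfile
from contextlib import suppress

some_file = os.path.join(tempfile.gettempdir(), 'pep463-demo-file')
with suppress(OSError):
    os.unlink(some_file)     # silently does nothing if absent
```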

Common objections

Colons always introduce suites

While it is true that many of Python's syntactic elements use the colon to introduce a statement suite (if, while, with, for, etcetera), this is not by any means the sole use of the colon. Currently, Python syntax includes four cases where a colon introduces a subexpression:

  • dict display - { ... key:value ... }
  • slice notation - [start:stop:step]
  • function definition - parameter : annotation
  • lambda - arg list: return value

This proposal simply adds a fifth:

  • except-expression - exception list: result
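
The four existing cases are all runnable in current Python:

```python
# Each line below uses a colon to introduce a subexpression, not a
# statement suite.
d = {'key': 'value'}                 # dict display
sliced = [1, 2, 3, 4][0:3:2]         # slice notation
def annotated(x: int) -> int:        # parameter annotation
    return x
doubled = lambda n: n * 2            # lambda body
```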

Style guides and PEP 8 should recommend not having the colon at the end of a wrapped line, which could potentially look like the introduction of a suite, but instead advocate wrapping before the exception list, keeping the colon clearly between two expressions.

pep-0464 Removal of the PyPI Mirror Authenticity API

PEP:464
Title:Removal of the PyPI Mirror Authenticity API
Version:$Revision$
Last-Modified:$Date$
Author:Donald Stufft <donald at stufft.io>
BDFL-Delegate:Richard Jones <richard@python.org>
Discussions-To:distutils-sig at python.org
Status:Accepted
Type:Process
Content-Type:text/x-rst
Created:02-Mar-2014
Post-History:04-Mar-2014
Replaces:381
Resolution:https://mail.python.org/pipermail/distutils-sig/2014-March/024027.html

Abstract

This PEP proposes the deprecation and removal of the PyPI Mirror Authenticity API; this includes the /serverkey URL and all of the URLs under /serversig.

Rationale

The PyPI mirroring infrastructure (defined in PEP 381) provides a means to mirror the content of PyPI used by the automatic installers, and as a component of that, it provides a method for verifying the authenticity of the mirrored content.

This PEP proposes the removal of this API due to:

  • There are no known implementations that utilize this API; this includes pip and setuptools.
  • Because this API uses DSA, it is vulnerable to leaking the private key if there is any bias in the random nonce.
  • This API solves one small corner of the trust problem; the problem itself is much larger, and it would be better to have a fully fledged system, such as The Update Framework, instead.

Given these issues and the lack of use, it is the opinion of this PEP that the API does not provide enough practical benefit to justify its additional complexity.

Plan for Deprecation & Removal

Immediately upon the acceptance of this PEP the Mirror Authenticity API will be considered deprecated and mirroring agents and installation tools should stop accessing it.

Instead of actually removing it from the current code base (PyPI 1.0) the current work to replace PyPI 1.0 with a new code base (PyPI 2.0) will simply not implement this API. This would cause the API to be "removed" when the switch from 1.0 to 2.0 occurs.

If PyPI 2.0 has not been deployed in place of PyPI 1.0 by Sept 01 2014 then this PEP will be implemented in the PyPI 1.0 code base instead (by removing the associated code).

No changes will be required in the installers, however PEP 381 compliant mirroring clients, such as bandersnatch and pep381client will need to be updated to no longer attempt to mirror the /serversig URLs.

pep-0465 A dedicated infix operator for matrix multiplication

PEP:465
Title:A dedicated infix operator for matrix multiplication
Version:$Revision$
Last-Modified:$Date$
Author:Nathaniel J. Smith <njs at pobox.com>
Status:Final
Type:Standards Track
Content-Type:text/x-rst
Created:20-Feb-2014
Python-Version:3.5
Post-History:13-Mar-2014

Abstract

This PEP proposes a new binary operator to be used for matrix multiplication, called @. (Mnemonic: @ is * for mATrices.)

Specification

A new binary operator is added to the Python language, together with the corresponding in-place version:

Op   Precedence/associativity   Methods
@    Same as *                  __matmul__, __rmatmul__
@=   n/a                        __imatmul__
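
A minimal sketch of the protocol: any class can opt in to @ by defining __matmul__. The Mat2 type here is illustrative only, not a proposed addition; in-place @= falls back to __matmul__ when __imatmul__ is absent, as with the other augmented operators:

```python
# Illustrative 2x2 matrix type supporting the @ operator (3.5+).
class Mat2:
    def __init__(self, a, b, c, d):
        self.m = (a, b, c, d)
    def __matmul__(self, other):
        a, b, c, d = self.m
        e, f, g, h = other.m
        # standard row-by-column 2x2 matrix product
        return Mat2(a*e + b*g, a*f + b*h,
                    c*e + d*g, c*f + d*h)

product = Mat2(1, 2, 3, 4) @ Mat2(11, 12, 13, 14)
```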

No implementations of these methods are added to the builtin or standard library types. However, a number of projects have reached consensus on the recommended semantics for these operations; see Intended usage details below for details.

For details on how this operator will be implemented in CPython, see Implementation details.

Motivation

Executive summary

In numerical code, there are two important operations which compete for use of Python's * operator: elementwise multiplication, and matrix multiplication. In the nearly twenty years since the Numeric library was first proposed, there have been many attempts to resolve this tension [13]; none have been really satisfactory. Currently, most numerical Python code uses * for elementwise multiplication, and function/method syntax for matrix multiplication; however, this leads to ugly and unreadable code in common circumstances. The problem is bad enough that significant amounts of code continue to use the opposite convention (which has the virtue of producing ugly and unreadable code in different circumstances), and this API fragmentation across codebases then creates yet more problems. There does not seem to be any good solution to the problem of designing a numerical API within current Python syntax -- only a landscape of options that are bad in different ways. The minimal change to Python syntax which is sufficient to resolve these problems is the addition of a single new infix operator for matrix multiplication.

Matrix multiplication has a singular combination of features which distinguish it from other binary operations, which together provide a uniquely compelling case for the addition of a dedicated infix operator:

  • Just as for the existing numerical operators, there exists a vast body of prior art supporting the use of infix notation for matrix multiplication across all fields of mathematics, science, and engineering; @ harmoniously fills a hole in Python's existing operator system.
  • @ greatly clarifies real-world code.
  • @ provides a smoother onramp for less experienced users, who are particularly harmed by hard-to-read code and API fragmentation.
  • @ benefits a substantial and growing portion of the Python user community.
  • @ will be used frequently -- in fact, evidence suggests it may be used more frequently than // or the bitwise operators.
  • @ allows the Python numerical community to reduce fragmentation, and finally standardize on a single consensus duck type for all numerical array objects.

Background: What's wrong with the status quo?

When we crunch numbers on a computer, we usually have lots and lots of numbers to deal with. Trying to deal with them one at a time is cumbersome and slow -- especially when using an interpreted language. Instead, we want the ability to write down simple operations that apply to large collections of numbers all at once. The n-dimensional array is the basic object that all popular numeric computing environments use to make this possible. Python has several libraries that provide such arrays, with numpy being at present the most prominent.

When working with n-dimensional arrays, there are two different ways we might want to define multiplication. One is elementwise multiplication:

[[1, 2],     [[11, 12],     [[1 * 11, 2 * 12],
 [3, 4]]  x   [13, 14]]  =   [3 * 13, 4 * 14]]

and the other is matrix multiplication [19]:

[[1, 2],     [[11, 12],     [[1 * 11 + 2 * 13, 1 * 12 + 2 * 14],
 [3, 4]]  x   [13, 14]]  =   [3 * 11 + 4 * 13, 3 * 12 + 4 * 14]]
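
The two products above can be checked with plain nested lists:

```python
# Pure-Python computation of both products for the 2x2 example.
a = [[1, 2], [3, 4]]
b = [[11, 12], [13, 14]]

elementwise = [[a[i][j] * b[i][j] for j in range(2)] for i in range(2)]
matmul = [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
          for i in range(2)]
```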

Elementwise multiplication is useful because it lets us easily and quickly perform many multiplications on a large collection of values, without writing a slow and cumbersome for loop. And this works as part of a very general schema: when using the array objects provided by numpy or other numerical libraries, all Python operators work elementwise on arrays of all dimensionalities. The result is that one can write functions using straightforward code like a * b + c / d, treating the variables as if they were simple values, but then immediately use this function to efficiently perform this calculation on large collections of values, while keeping them organized using whatever arbitrarily complex array layout works best for the problem at hand.

Matrix multiplication is more of a special case. It's only defined on 2d arrays (also known as "matrices"), and multiplication is the only operation that has an important "matrix" version -- "matrix addition" is the same as elementwise addition; there is no such thing as "matrix bitwise-or" or "matrix floordiv"; "matrix division" and "matrix to-the-power-of" can be defined but are not very useful, etc. However, matrix multiplication is still used very heavily across all numerical application areas; mathematically, it's one of the most fundamental operations there is.

Because Python syntax currently allows for only a single multiplication operator *, libraries providing array-like objects must decide: either use * for elementwise multiplication, or use * for matrix multiplication. And, unfortunately, it turns out that when doing general-purpose number crunching, both operations are used frequently, and there are major advantages to using infix rather than function call syntax in both cases. Thus it is not at all clear which convention is optimal, or even acceptable; often it varies on a case-by-case basis.

Nonetheless, network effects mean that it is very important that we pick just one convention. In numpy, for example, it is technically possible to switch between the conventions, because numpy provides two different types with different __mul__ methods. For numpy.ndarray objects, * performs elementwise multiplication, and matrix multiplication must use a function call (numpy.dot). For numpy.matrix objects, * performs matrix multiplication, and elementwise multiplication requires function syntax. Writing code using numpy.ndarray works fine. Writing code using numpy.matrix also works fine. But trouble begins as soon as we try to integrate these two pieces of code together. Code that expects an ndarray and gets a matrix, or vice-versa, may crash or return incorrect results. Keeping track of which functions expect which types as inputs, and return which types as outputs, and then converting back and forth all the time, is incredibly cumbersome and impossible to get right at any scale. Functions that defensively try to handle both types as input and DTRT, find themselves floundering into a swamp of isinstance and if statements.

PEP 238 split / into two operators: / and //. Imagine the chaos that would have resulted if it had instead split int into two types: classic_int, whose __div__ implemented floor division, and new_int, whose __div__ implemented true division. This, in a more limited way, is the situation that Python number-crunchers currently find themselves in.

In practice, the vast majority of projects have settled on the convention of using * for elementwise multiplication, and function call syntax for matrix multiplication (e.g., using numpy.ndarray instead of numpy.matrix). This reduces the problems caused by API fragmentation, but it doesn't eliminate them. The strong desire to use infix notation for matrix multiplication has caused a number of specialized array libraries to continue to use the opposing convention (e.g., scipy.sparse, pyoperators, pyviennacl) despite the problems this causes, and numpy.matrix itself still gets used in introductory programming courses, often appears in StackOverflow answers, and so forth. Well-written libraries thus must continue to be prepared to deal with both types of objects, and, of course, are also stuck using unpleasant funcall syntax for matrix multiplication. After nearly two decades of trying, the numerical community has still not found any way to resolve these problems within the constraints of current Python syntax (see Rejected alternatives to adding a new operator below).

This PEP proposes the minimum effective change to Python syntax that will allow us to drain this swamp. It splits * into two operators, just as was done for /: * for elementwise multiplication, and @ for matrix multiplication. (Why not the reverse? Because this way is compatible with the existing consensus, and because it gives us a consistent rule that all the built-in numeric operators also apply in an elementwise manner to arrays; the reverse convention would lead to more special cases.)

So that's why matrix multiplication doesn't and can't just use *. Now, in the rest of this section, we'll explain why it nonetheless meets the high bar for adding a new operator.

Why should matrix multiplication be infix?

Right now, most numerical code in Python uses syntax like numpy.dot(a, b) or a.dot(b) to perform matrix multiplication. This obviously works, so why do people make such a fuss about it, even to the point of creating API fragmentation and compatibility swamps?

Matrix multiplication shares two features with ordinary arithmetic operations like addition and multiplication on numbers: (a) it is used very heavily in numerical programs -- often multiple times per line of code -- and (b) it has an ancient and universally adopted tradition of being written using infix syntax. This is because, for typical formulas, this notation is dramatically more readable than any function call syntax. Here's an example to demonstrate:

One of the most useful tools for testing a statistical hypothesis is the linear hypothesis test for OLS regression models. It doesn't really matter what all those words I just said mean; if we find ourselves having to implement this thing, what we'll do is look up some textbook or paper on it, and encounter many mathematical formulas that look like:

S = (Hβ − r)ᵀ (HVHᵀ)⁻¹ (Hβ − r)

Here the various variables are all vectors or matrices (details for the curious: [5]).

Now we need to write code to perform this calculation. In current numpy, matrix multiplication can be performed using either the function or method call syntax. Neither provides a particularly readable translation of the formula:

import numpy as np
from numpy.linalg import inv, solve

# Using dot function:
S = np.dot((np.dot(H, beta) - r).T,
           np.dot(inv(np.dot(np.dot(H, V), H.T)), np.dot(H, beta) - r))

# Using dot method:
S = (H.dot(beta) - r).T.dot(inv(H.dot(V).dot(H.T))).dot(H.dot(beta) - r)

With the @ operator, the direct translation of the above formula becomes:

S = (H @ beta - r).T @ inv(H @ V @ H.T) @ (H @ beta - r)

Notice that there is now a transparent, 1-to-1 mapping between the symbols in the original formula and the code that implements it.

Of course, an experienced programmer will probably notice that this is not the best way to compute this expression. The repeated computation of Hβ − r should perhaps be factored out; and, expressions of the form dot(inv(A), B) should almost always be replaced by the more numerically stable solve(A, B). When using @, performing these two refactorings gives us:

# Version 1 (as above)
S = (H @ beta - r).T @ inv(H @ V @ H.T) @ (H @ beta - r)

# Version 2
trans_coef = H @ beta - r
S = trans_coef.T @ inv(H @ V @ H.T) @ trans_coef

# Version 3
S = trans_coef.T @ solve(H @ V @ H.T, trans_coef)

Notice that when comparing between each pair of steps, it's very easy to see exactly what was changed. If we apply the equivalent transformations to the code using the .dot method, then the changes are much harder to read out or verify for correctness:

# Version 1 (as above)
S = (H.dot(beta) - r).T.dot(inv(H.dot(V).dot(H.T))).dot(H.dot(beta) - r)

# Version 2
trans_coef = H.dot(beta) - r
S = trans_coef.T.dot(inv(H.dot(V).dot(H.T))).dot(trans_coef)

# Version 3
S = trans_coef.T.dot(solve(H.dot(V).dot(H.T)), trans_coef)

Readability counts! The statements using @ are shorter, contain more whitespace, can be directly and easily compared both to each other and to the textbook formula, and contain only meaningful parentheses. This last point is particularly important for readability: when using function-call syntax, the required parentheses on every operation create visual clutter that makes it very difficult to parse out the overall structure of the formula by eye, even for a relatively simple formula like this one. Eyes are terrible at parsing non-regular languages. I made and caught many errors while trying to write out the 'dot' formulas above. I know they still contain at least one error, maybe more. (Exercise: find it. Or them.) The @ examples, by contrast, are not only correct, they're obviously correct at a glance.

If we are even more sophisticated programmers, and writing code that we expect to be reused, then considerations of speed or numerical accuracy might lead us to prefer some particular order of evaluation. Because @ makes it possible to omit irrelevant parentheses, we can be certain that if we do write something like (H @ V) @ H.T, then our readers will know that the parentheses must have been added intentionally to accomplish some meaningful purpose. In the dot examples, it's impossible to know which nesting decisions are important, and which are arbitrary.

Infix @ dramatically improves matrix code usability at all stages of programmer interaction.

Transparent syntax is especially crucial for non-expert programmers

A large proportion of scientific code is written by people who are experts in their domain, but are not experts in programming. And there are many university courses run each year with titles like "Data analysis for social scientists" which assume no programming background, and teach some combination of mathematical techniques, introduction to programming, and the use of programming to implement these mathematical techniques, all within a 10-15 week period. These courses are more and more often being taught in Python rather than special-purpose languages like R or Matlab.

For these kinds of users, whose programming knowledge is fragile, the existence of a transparent mapping between formulas and code often means the difference between succeeding and failing to write that code at all. This is so important that such classes often use the numpy.matrix type which defines * to mean matrix multiplication, even though this type is buggy and heavily disrecommended by the rest of the numpy community for the fragmentation that it causes. This pedagogical use case is, in fact, the only reason numpy.matrix remains a supported part of numpy. Adding @ will benefit both beginning and advanced users with better syntax; and furthermore, it will allow both groups to standardize on the same notation from the start, providing a smoother on-ramp to expertise.

But isn't matrix multiplication a pretty niche requirement?

The world is full of continuous data, and computers are increasingly called upon to work with it in sophisticated ways. Arrays are the lingua franca of finance, machine learning, 3d graphics, computer vision, robotics, operations research, econometrics, meteorology, computational linguistics, recommendation systems, neuroscience, astronomy, bioinformatics (including genetics, cancer research, drug discovery, etc.), physics engines, quantum mechanics, geophysics, network analysis, and many other application areas. In most or all of these areas, Python is rapidly becoming a dominant player, in large part because of its ability to elegantly mix traditional discrete data structures (hash tables, strings, etc.) on an equal footing with modern numerical data types and algorithms.

We all live in our own little sub-communities, so some Python users may be surprised to realize the sheer extent to which Python is used for number crunching -- especially since much of this particular sub-community's activity occurs outside of traditional Python/FOSS channels. So, to give some rough idea of just how many numerical Python programmers are actually out there, here are two numbers: In 2013, there were 7 international conferences organized specifically on numerical Python [3] [4]. At PyCon 2014, ~20% of the tutorials appear to involve the use of matrices [6].

To quantify this further, we used Github's "search" function to look at what modules are actually imported across a wide range of real-world code (i.e., all the code on Github). We checked for imports of several popular stdlib modules, a variety of numerically oriented modules, and various other extremely high-profile modules like django and lxml (the latter of which is the #1 most downloaded package on PyPI). Starred lines indicate packages which export array- or matrix-like objects which will adopt @ if this PEP is approved:

Count of Python source files on Github matching given search terms
                 (as of 2014-04-10, ~21:00 UTC)
================ ==========  ===============  =======  ===========
module           "import X"  "from X import"    total  total/numpy
================ ==========  ===============  =======  ===========
sys                 2374638            63301  2437939         5.85
os                  1971515            37571  2009086         4.82
re                  1294651             8358  1303009         3.12
numpy ************** 337916 ********** 79065 * 416981 ******* 1.00
warnings             298195            73150   371345         0.89
subprocess           281290            63644   344934         0.83
django                62795           219302   282097         0.68
math                 200084            81903   281987         0.68
threading            212302            45423   257725         0.62
pickle+cPickle       215349            22672   238021         0.57
matplotlib           119054            27859   146913         0.35
sqlalchemy            29842            82850   112692         0.27
pylab *************** 36754 ********** 41063 ** 77817 ******* 0.19
scipy *************** 40829 ********** 28263 ** 69092 ******* 0.17
lxml                  19026            38061    57087         0.14
zlib                  40486             6623    47109         0.11
multiprocessing       25247            19850    45097         0.11
requests              30896              560    31456         0.08
jinja2                 8057            24047    32104         0.08
twisted               13858             6404    20262         0.05
gevent                11309             8529    19838         0.05
pandas ************** 14923 *********** 4005 ** 18928 ******* 0.05
sympy                  2779             9537    12316         0.03
theano *************** 3654 *********** 1828 *** 5482 ******* 0.01
================ ==========  ===============  =======  ===========

These numbers should be taken with several grains of salt (see footnote for discussion: [12]), but, to the extent they can be trusted, they suggest that numpy might be the single most-imported non-stdlib module in the entire Pythonverse; it's even more-imported than such stdlib stalwarts as subprocess, math, pickle, and threading. And numpy users represent only a subset of the broader numerical community that will benefit from the @ operator. Matrices may once have been a niche data type restricted to Fortran programs running in university labs and military clusters, but those days are long gone. Number crunching is a mainstream part of modern Python usage.

In addition, there is some precedent for adding an infix operator to handle a more-specialized arithmetic operation: the floor division operator //, like the bitwise operators, is very useful under certain circumstances when performing exact calculations on discrete values. But it seems likely that there are many Python programmers who have never had reason to use // (or, for that matter, the bitwise operators). @ is no more niche than //.
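For readers who have never reached for it, the contrast between // and / can be sketched in a few lines:

```python
# Floor division returns the floored quotient; true division always
# returns a float, even when the result is exact.
assert 7 // 2 == 3
assert 7 / 2 == 3.5

# // floors toward negative infinity, which is what makes it useful
# for exact calculations on discrete values:
assert -7 // 2 == -4
```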

So @ is good for matrix formulas, but how common are those really?

We've seen that @ makes matrix formulas dramatically easier to work with for both experts and non-experts, that matrix formulas appear in many important applications, and that numerical libraries like numpy are used by a substantial proportion of Python's user base. But numerical libraries aren't just about matrix formulas, and being important doesn't necessarily mean taking up a lot of code: if matrix formulas only occurred in one or two places in the average numerically-oriented project, then it still wouldn't be worth adding a new operator. So how common is matrix multiplication, really?

When the going gets tough, the tough get empirical. To get a rough estimate of how useful the @ operator will be, the table below shows the rate at which different Python operators are actually used in the stdlib, and also in two high-profile numerical packages -- the scikit-learn machine learning library, and the nipy neuroimaging library -- normalized by source lines of code (SLOC). Rows are sorted by the 'combined' column, which pools all three code bases together. The combined column is thus strongly weighted towards the stdlib, which is much larger than both projects put together (stdlib: 411575 SLOC, scikit-learn: 50924 SLOC, nipy: 37078 SLOC). [7]

The dot row (marked ******) counts how common matrix multiply operations are in each codebase.

====  ======  ============  ====  ========
  op  stdlib  scikit-learn  nipy  combined
====  ======  ============  ====  ========
   =    2969          5536  4932      3376 / 10,000 SLOC
   -     218           444   496       261
   +     224           201   348       231
  ==     177           248   334       196
   *     156           284   465       192
   %     121           114   107       119
  **      59           111   118        68
  !=      40            56    74        44
   /      18           121   183        41
   >      29            70   110        39
  +=      34            61    67        39
   <      32            62    76        38
  >=      19            17    17        18
  <=      18            27    12        18
 dot ***** 0 ********** 99 ** 74 ****** 16
   |      18             1     2        15
   &      14             0     6        12
  <<      10             1     1         8
  //       9             9     1         8
  -=       5            21    14         8
  *=       2            19    22         5
  /=       0            23    16         4
  >>       4             0     0         3
   ^       3             0     0         3
   ~       2             4     5         2
  |=       3             0     0         2
  &=       1             0     0         1
 //=       1             0     0         1
  ^=       1             0     0         0
 **=       0             2     0         0
  %=       0             0     0         0
 <<=       0             0     0         0
 >>=       0             0     0         0
====  ======  ============  ====  ========

These two numerical packages alone contain ~780 uses of matrix multiplication. Within these packages, matrix multiplication is used more heavily than most comparison operators (< != <= >=). Even when we dilute these counts by including the stdlib into our comparisons, matrix multiplication is still used more often in total than any of the bitwise operators, and 2x as often as //. This is true even though the stdlib, which contains a fair amount of integer arithmetic and no matrix operations, makes up more than 80% of the combined code base.

By coincidence, the numeric libraries make up approximately the same proportion of the 'combined' codebase as numeric tutorials make up of PyCon 2014's tutorial schedule, which suggests that the 'combined' column may not be wildly unrepresentative of new Python code in general. While it's impossible to know for certain, from this data it seems entirely possible that across all Python code currently being written, matrix multiplication is already used more often than // and the bitwise operations.

But isn't it weird to add an operator with no stdlib uses?

It's certainly unusual (though extended slicing existed for some time before builtin types gained support for it, Ellipsis is still unused within the stdlib, etc.). But the important thing is whether a change will benefit users, not where the software is being downloaded from. It's clear from the above that @ will be used, and used heavily. And this PEP provides the critical piece that will allow the Python numerical community to finally reach consensus on a standard duck type for all array-like objects, which is a necessary precondition to ever adding a numerical array type to the stdlib.

Compatibility considerations

Currently, the only legal use of the @ token in Python code is at statement beginning in decorators. The new operators are both infix; the one place they can never occur is at statement beginning. Therefore, no existing code will be broken by the addition of these operators, and there is no possible parsing ambiguity between decorator-@ and the new operators.

Another important kind of compatibility is the mental cost paid by users to update their understanding of the Python language after this change, particularly for users who do not work with matrices and thus do not benefit. Here again, @ has minimal impact: even comprehensive tutorials and references will only need to add a sentence or two to fully document this PEP's changes for a non-numerical audience.

Intended usage details

This section is informative, rather than normative -- it documents the consensus of a number of libraries that provide array- or matrix-like objects on how @ will be implemented.

This section uses the numpy terminology for describing arbitrary multidimensional arrays of data, because it is a superset of all other commonly used models. In this model, the shape of any array is represented by a tuple of integers. Because matrices are two-dimensional, they have len(shape) == 2, while 1d vectors have len(shape) == 1, and scalars have shape == (), i.e., they are "0 dimensional". Any array contains prod(shape) total entries. Notice that prod(()) == 1 [20] (for the same reason that sum(()) == 0); scalars are just an ordinary kind of array, not a special case. Notice also that we distinguish between a single scalar value (shape == (), analogous to 1), a vector containing only a single entry (shape == (1,), analogous to [1]), a matrix containing only a single entry (shape == (1, 1), analogous to [[1]]), etc., so the dimensionality of any array is always well-defined. Other libraries with more restricted representations (e.g., those that support 2d arrays only) might implement only a subset of the functionality described here.
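This shape model can be illustrated in plain Python (using math.prod, available since Python 3.8; the variable names here are purely illustrative):

```python
from math import prod

matrix_shape = (2, 3)   # 2d: len(shape) == 2
vector_shape = (3,)     # 1d: len(shape) == 1
scalar_shape = ()       # 0d: shape == ()

# An array contains prod(shape) total entries...
assert prod(matrix_shape) == 6
assert prod(vector_shape) == 3

# ...and prod(()) == 1 for the same reason that sum(()) == 0, so a
# scalar holds exactly one entry rather than being a special case.
assert prod(scalar_shape) == 1
assert sum(()) == 0
```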

Semantics

The recommended semantics for @ for different inputs are:

  • 2d inputs are conventional matrices, and so the semantics are obvious: we apply conventional matrix multiplication. If we write arr(2, 3) to represent an arbitrary 2x3 array, then arr(2, 3) @ arr(3, 4) returns an array with shape (2, 4).

  • 1d vector inputs are promoted to 2d by prepending or appending a '1' to the shape, the operation is performed, and then the added dimension is removed from the output. The 1 is always added on the "outside" of the shape: prepended for left arguments, and appended for right arguments. The result is that matrix @ vector and vector @ matrix are both legal (assuming compatible shapes), and both return 1d vectors; vector @ vector returns a scalar. This is clearer with examples.

    • arr(2, 3) @ arr(3, 1) is a regular matrix product, and returns an array with shape (2, 1), i.e., a column vector.
    • arr(2, 3) @ arr(3) performs the same computation as the previous (i.e., treats the 1d vector as a matrix containing a single column, shape = (3, 1)), but returns the result with shape (2,), i.e., a 1d vector.
    • arr(1, 3) @ arr(3, 2) is a regular matrix product, and returns an array with shape (1, 2), i.e., a row vector.
    • arr(3) @ arr(3, 2) performs the same computation as the previous (i.e., treats the 1d vector as a matrix containing a single row, shape = (1, 3)), but returns the result with shape (2,), i.e., a 1d vector.
    • arr(1, 3) @ arr(3, 1) is a regular matrix product, and returns an array with shape (1, 1), i.e., a single value in matrix form.
    • arr(3) @ arr(3) performs the same computation as the previous, but returns the result with shape (), i.e., a single scalar value, not in matrix form. So this is the standard inner product on vectors.

    An infelicity of this definition for 1d vectors is that it makes @ non-associative in some cases ((Mat1 @ vec) @ Mat2 != Mat1 @ (vec @ Mat2)). But this seems to be a case where practicality beats purity: non-associativity only arises for strange expressions that would never be written in practice; if they are written anyway then there is a consistent rule for understanding what will happen (Mat1 @ vec @ Mat2 is parsed as (Mat1 @ vec) @ Mat2, just like a - b - c); and, not supporting 1d vectors would rule out many important use cases that do arise very commonly in practice. No-one wants to explain to new users why to solve the simplest linear system in the obvious way, they have to type (inv(A) @ b[:, np.newaxis]).flatten() instead of inv(A) @ b, or perform an ordinary least-squares regression by typing solve(X.T @ X, X @ y[:, np.newaxis]).flatten() instead of solve(X.T @ X, X @ y). No-one wants to type (a[np.newaxis, :] @ b[:, np.newaxis])[0, 0] instead of a @ b every time they compute an inner product, or (a[np.newaxis, :] @ Mat @ b[:, np.newaxis])[0, 0] for general quadratic forms instead of a @ Mat @ b. In addition, sage and sympy (see below) use these non-associative semantics with an infix matrix multiplication operator (they use *), and they report that they haven't experienced any problems caused by it.

  • For inputs with more than 2 dimensions, we treat the last two dimensions as being the dimensions of the matrices to multiply, and 'broadcast' across the other dimensions. This provides a convenient way to quickly compute many matrix products in a single operation. For example, arr(10, 2, 3) @ arr(10, 3, 4) performs 10 separate matrix multiplies, each of which multiplies a 2x3 and a 3x4 matrix to produce a 2x4 matrix, and then returns the 10 resulting matrices together in an array with shape (10, 2, 4). The intuition here is that we treat these 3d arrays of numbers as if they were 1d arrays of matrices, and then apply matrix multiplication in an elementwise manner, where now each 'element' is a whole matrix. Note that broadcasting is not limited to perfectly aligned arrays; in more complicated cases, it allows several simple but powerful tricks for controlling how arrays are aligned with each other; see [10] for details. (In particular, it turns out that when broadcasting is taken into account, the standard scalar * matrix product is a special case of the elementwise multiplication operator *.)

    If one operand is >2d, and another operand is 1d, then the above rules apply unchanged, with 1d->2d promotion performed before broadcasting. E.g., arr(10, 2, 3) @ arr(3) first promotes to arr(10, 2, 3) @ arr(3, 1), then broadcasts the right argument to create the aligned operation arr(10, 2, 3) @ arr(10, 3, 1), multiplies to get an array with shape (10, 2, 1), and finally removes the added dimension, returning an array with shape (10, 2). Similarly, arr(2) @ arr(10, 2, 3) produces an intermediate array with shape (10, 1, 3), and a final array with shape (10, 3).

  • 0d (scalar) inputs raise an error. Scalar * matrix multiplication is a mathematically and algorithmically distinct operation from matrix @ matrix multiplication, and is already covered by the elementwise * operator. Allowing scalar @ matrix would thus both require an unnecessary special case, and violate TOOWTDI.
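The shape rules above can be condensed into a small pure-Python sketch. matmul_shape is a hypothetical helper invented for this illustration, and broadcasting of the leading ("stack") dimensions is simplified to the equal-or-absent case rather than full numpy broadcasting:

```python
def matmul_shape(a, b):
    """Result shape of `a @ b` under the recommended semantics.

    `a` and `b` are shape tuples. Raises ValueError for 0d operands
    or mismatched inner dimensions.
    """
    if a == () or b == ():
        raise ValueError("0d (scalar) operands are not allowed with @")
    sq_left, sq_right = len(a) == 1, len(b) == 1
    if sq_left:
        a = (1,) + a        # promote a left 1d argument to a single row
    if sq_right:
        b = b + (1,)        # promote a right 1d argument to a single column
    if a[-1] != b[-2]:
        raise ValueError("inner dimensions do not match")
    if a[:-2] and b[:-2] and a[:-2] != b[:-2]:
        raise ValueError("sketch handles only equal leading dimensions")
    lead = a[:-2] or b[:-2]
    core = ()
    if not sq_left:
        core += (a[-2],)    # keep the row count unless it was added
    if not sq_right:
        core += (b[-1],)    # keep the column count unless it was added
    return lead + core

# The examples from the bullets above:
assert matmul_shape((2, 3), (3, 1)) == (2, 1)   # matrix @ column
assert matmul_shape((2, 3), (3,)) == (2,)       # matrix @ vector
assert matmul_shape((3,), (3, 2)) == (2,)       # vector @ matrix
assert matmul_shape((3,), (3,)) == ()           # inner product -> scalar
assert matmul_shape((10, 2, 3), (10, 3, 4)) == (10, 2, 4)  # broadcast
assert matmul_shape((10, 2, 3), (3,)) == (10, 2)
assert matmul_shape((2,), (10, 2, 3)) == (10, 3)
```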

Adoption

We group existing Python projects which provide array- or matrix-like types based on what API they currently use for elementwise and matrix multiplication.

Projects which currently use * for elementwise multiplication, and function/method calls for matrix multiplication:

The developers of the following projects have expressed an intention to implement @ on their array-like types using the above semantics:

  • numpy
  • pandas
  • blaze
  • theano

The following projects have been alerted to the existence of the PEP, but it's not yet known what they plan to do if it's accepted. We don't anticipate that they'll have any objections, though, since everything proposed here is consistent with how they already do things:

  • pycuda
  • panda3d

Projects which currently use * for matrix multiplication, and function/method calls for elementwise multiplication:

The following projects have expressed an intention, if this PEP is accepted, to migrate from their current API to the elementwise-*, matmul-@ convention (i.e., this is a list of projects whose API fragmentation will probably be eliminated if this PEP is accepted):

  • numpy (numpy.matrix)
  • scipy.sparse
  • pyoperators
  • pyviennacl

The following projects have been alerted to the existence of the PEP, but it's not known what they plan to do if it's accepted (i.e., this is a list of projects whose API fragmentation may or may not be eliminated if this PEP is accepted):

  • cvxopt

Projects which currently use * for matrix multiplication, and which don't really care about elementwise multiplication of matrices:

There are several projects which implement matrix types, but from a very different perspective than the numerical libraries discussed above. These projects focus on computational methods for analyzing matrices in the sense of abstract mathematical objects (i.e., linear maps over free modules over rings), rather than as big bags full of numbers that need crunching. And it turns out that from the abstract math point of view, there isn't much use for elementwise operations in the first place; as discussed in the Background section above, elementwise operations are motivated by the bag-of-numbers approach. So these projects don't encounter the basic problem that this PEP exists to address, making it mostly irrelevant to them; while they appear superficially similar to projects like numpy, they're actually doing something quite different. They use * for matrix multiplication (and for group actions, and so forth), and if this PEP is accepted, their expressed intention is to continue doing so, while perhaps adding @ as an alias. These projects include:

  • sympy
  • sage

Implementation details

New functions operator.matmul and operator.__matmul__ are added to the standard library, with the usual semantics.
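On an interpreter where this PEP is implemented (CPython 3.5+), these hooks behave like the existing binary-operator ones. A minimal sketch, using a tiny made-up Mat2 class purely for illustration:

```python
import operator

class Mat2:
    """A tiny 2x2 matrix type, just to demonstrate the new hooks."""
    def __init__(self, a, b, c, d):
        self.v = (a, b, c, d)
    def __matmul__(self, other):
        a, b, c, d = self.v
        e, f, g, h = other.v
        # conventional 2x2 matrix product
        return Mat2(a * e + b * g, a * f + b * h,
                    c * e + d * g, c * f + d * h)

ident = Mat2(1, 0, 0, 1)
m = Mat2(1, 2, 3, 4)
assert (ident @ m).v == (1, 2, 3, 4)
assert operator.matmul(ident, m).v == (1, 2, 3, 4)  # same dispatch as @
```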

A corresponding function PyObject* PyObject_MatrixMultiply(PyObject *o1, PyObject *o2) is added to the C API.

A new AST node is added named MatMult, along with a new token ATEQUAL and new bytecode opcodes BINARY_MATRIX_MULTIPLY and INPLACE_MATRIX_MULTIPLY.

Two new type slots are added; whether this is to PyNumberMethods or a new PyMatrixMethods struct remains to be determined.

Rationale for specification details

Choice of operator

Why @ instead of some other spelling? There isn't any consensus across other programming languages about how this operator should be named [11]; here we discuss the various options.

Restricting ourselves only to symbols present on US English keyboards, the punctuation characters that don't already have a meaning in Python expression context are: @, backtick, $, !, and ?. Of these options, @ is clearly the best; ! and ? are already heavily freighted with inapplicable meanings in the programming context, backtick has been banned from Python by BDFL pronouncement (see PEP 3099), and $ is uglier, even more dissimilar to * and ⋅, and has Perl/PHP baggage. $ is probably the second-best option of these, though.

Symbols which are not present on US English keyboards start at a significant disadvantage (having to spend 5 minutes at the beginning of every numeric Python tutorial just going over keyboard layouts is not a hassle anyone really wants). Plus, even if we somehow overcame the typing problem, it's not clear there are any that are actually better than @. Some options that have been suggested include:

  • U+00D7 MULTIPLICATION SIGN: A × B
  • U+22C5 DOT OPERATOR: A ⋅ B
  • U+2297 CIRCLED TIMES: A ⊗ B
  • U+00B0 DEGREE: A ° B

What we need, though, is an operator that means "matrix multiplication, as opposed to scalar/elementwise multiplication". There is no conventional symbol with this meaning in either programming or mathematics, where these operations are usually distinguished by context. (And U+2297 CIRCLED TIMES is actually used conventionally to mean exactly the wrong things: elementwise multiplication -- the "Hadamard product" -- or outer product, rather than matrix/inner product like our operator). @ at least has the virtue that it looks like a funny non-commutative operator; a naive user who knows maths but not programming couldn't look at A * B versus A × B, or A * B versus A ⋅ B, or A * B versus A ° B and guess which one is the usual multiplication, and which one is the special case.

Finally, there is the option of using multi-character tokens. Some options:

  • Matlab and Julia use a .* operator. Aside from being visually confusable with *, this would be a terrible choice for us because in Matlab and Julia, * means matrix multiplication and .* means elementwise multiplication, so using .* for matrix multiplication would make us exactly backwards from what Matlab and Julia users expect.
  • APL apparently used +.×, which by combining a multi-character token, confusing attribute-access-like . syntax, and a unicode character, ranks somewhere below U+2603 SNOWMAN on our candidate list. If we like the idea of combining addition and multiplication operators as being evocative of how matrix multiplication actually works, then something like +* could be used -- though this may be too easy to confuse with *+, which is just multiplication combined with the unary + operator.
  • PEP 211 suggested ~*. This has the downside that it sort of suggests that there is a unary * operator that is being combined with unary ~, but it could work.
  • R uses %*% for matrix multiplication. In R this forms part of a general extensible infix system in which all tokens of the form %foo% are user-defined binary operators. We could steal the token without stealing the system.
  • Some other plausible candidates that have been suggested: >< (= ascii drawing of the multiplication sign ×); the footnote operator [*] or |*| (but when used in context, the use of vertical grouping symbols tends to recreate the nested parentheses visual clutter that was noted as one of the major downsides of the function syntax we're trying to get away from); ^*.

So, it doesn't matter much, but @ seems as good or better than any of the alternatives:

  • It's a friendly character that Pythoneers are already used to typing in decorators, but the decorator usage and the math expression usage are sufficiently dissimilar that it would be hard to confuse them in practice.
  • It's widely accessible across keyboard layouts (and thanks to its use in email addresses, this is true even of weird keyboards like those in phones).
  • It's round like * and ⋅.
  • The mATrices mnemonic is cute.
  • The swirly shape is reminiscent of the simultaneous sweeps over rows and columns that define matrix multiplication
  • Its asymmetry is evocative of its non-commutative nature.
  • Whatever, we have to pick something.

Precedence and associativity

There was a long discussion [15] about whether @ should be right- or left-associative (or even something more exotic [18]). Almost all Python operators are left-associative, so following this convention would be the simplest approach, but there were two arguments that suggested matrix multiplication might be worth making right-associative as a special case:

First, matrix multiplication has a tight conceptual association with function application/composition, so many mathematically sophisticated users have an intuition that an expression like RSx proceeds from right-to-left, with first S transforming the vector x, and then R transforming the result. This isn't universally agreed (and not all number-crunchers are steeped in the pure-math conceptual framework that motivates this intuition [16]), but at the least this intuition is more common than for other operations like 2⋅3⋅4 which everyone reads as going from left-to-right.

Second, if expressions like Mat @ Mat @ vec appear often in code, then programs will run faster (and efficiency-minded programmers will be able to use fewer parentheses) if this is evaluated as Mat @ (Mat @ vec) than if it is evaluated as (Mat @ Mat) @ vec.

However, weighing against these arguments are the following:

Regarding the efficiency argument, empirically, we were unable to find any evidence that Mat @ Mat @ vec type expressions actually dominate in real-life code. Parsing a number of large projects that use numpy, we found that when forced by numpy's current funcall syntax to choose an order of operations for nested calls to dot, people actually use left-associative nesting slightly more often than right-associative nesting [17]. And anyway, writing parentheses isn't so bad -- if an efficiency-minded programmer is going to take the trouble to think through the best way to evaluate some expression, they probably should write down the parentheses regardless of whether they're needed, just to make it obvious to the next reader that the order of operations matters.

In addition, it turns out that other languages, including those with much more of a focus on linear algebra, overwhelmingly make their matmul operators left-associative. Specifically, the @ equivalent is left-associative in R, Matlab, Julia, IDL, and Gauss. The only exceptions we found are Mathematica, in which a @ b @ c would be parsed non-associatively as dot(a, b, c), and APL, in which all operators are right-associative. There do not seem to exist any languages that make @ right-associative and * left-associative. And these decisions don't seem to be controversial -- I've never seen anyone complaining about this particular aspect of any of these other languages, and the left-associativity of * doesn't seem to bother users of the existing Python libraries that use * for matrix multiplication. So, at the least we can conclude from this that making @ left-associative will certainly not cause any disasters. Making @ right-associative, OTOH, would be exploring new and uncertain ground.

And another advantage of left-associativity is that it is much easier to learn and remember that @ acts like *, than it is to remember first that @ is unlike other Python operators by being right-associative, and then on top of this, also have to remember whether it is more tightly or more loosely binding than *. (Right-associativity forces us to choose a precedence, and intuitions were about equally split on which precedence made more sense. So this suggests that no matter which choice we made, no-one would be able to guess or remember it.)

On net, therefore, the general consensus of the numerical community is that while matrix multiplication is something of a special case, it's not special enough to break the rules, and @ should parse like * does.
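With the PEP implemented, this left-associativity is easy to confirm with a small recording class (illustrative names only):

```python
class Expr:
    """Records how @ groups sub-expressions."""
    def __init__(self, name):
        self.name = name
    def __matmul__(self, other):
        return Expr("({} @ {})".format(self.name, other.name))

a, b, c = Expr("a"), Expr("b"), Expr("c")
# a @ b @ c parses as (a @ b) @ c, just like a * b * c:
assert (a @ b @ c).name == "((a @ b) @ c)"
```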

(Non)-Definitions for built-in types

No __matmul__ or __matpow__ are defined for builtin numeric types (float, int, etc.) or for the numbers.Number hierarchy, because these types represent scalars, and the consensus semantics for @ are that it should raise an error on scalars.

We do not -- for now -- define a __matmul__ method on the standard memoryview or array.array objects, for several reasons. Of course this could be added if someone wants it, but these types would require quite a bit of additional work beyond __matmul__ before they could be used for numeric work -- e.g., they have no way to do addition or scalar multiplication either! -- and adding such functionality is beyond the scope of this PEP. In addition, providing a quality implementation of matrix multiplication is highly non-trivial. Naive nested loop implementations are very slow and shipping such an implementation in CPython would just create a trap for users. But the alternative -- providing a modern, competitive matrix multiply -- would require that CPython link to a BLAS library, which brings a set of new complications. In particular, several popular BLAS libraries (including the one that ships by default on OS X) currently break the use of multiprocessing [8]. Together, these considerations mean that the cost/benefit of adding __matmul__ to these types just isn't there, so for now we'll continue to delegate these problems to numpy and friends, and defer a more systematic solution to a future proposal.

There are also non-numeric Python builtins which define __mul__ (str, list, ...). We do not define __matmul__ for these types either, because matrix multiplication is meaningless for them.

Non-definition of matrix power

Earlier versions of this PEP also proposed a matrix power operator, @@, analogous to **. But on further consideration, it was decided that the utility of this was sufficiently unclear that it would be better to leave it out for now, and only revisit the issue if -- once we have more experience with @ -- it turns out that @@ is truly missed. [14]

Rejected alternatives to adding a new operator

Over the past few decades, the Python numeric community has explored a variety of ways to resolve the tension between matrix and elementwise multiplication operations. PEP 211 and PEP 225, both proposed in 2000 and last seriously discussed in 2008 [9], were early attempts to add new operators to solve this problem, but suffered from serious flaws; in particular, at that time the Python numerical community had not yet reached consensus on the proper API for array objects, or on what operators might be needed or useful (e.g., PEP 225 proposes 6 new operators with unspecified semantics). Experience since then has now led to consensus that the best solution, for both numeric Python and core Python, is to add a single infix operator for matrix multiply (together with the other new operators this implies like @=).

We review some of the rejected alternatives here.

Use a second type that defines __mul__ as matrix multiplication: As discussed above (Background: What's wrong with the status quo?), this has been tried for many years via the numpy.matrix type (and its predecessors in Numeric and numarray). The result is a strong consensus among both numpy developers and developers of downstream packages that numpy.matrix should essentially never be used, because of the problems caused by having conflicting duck types for arrays. (Of course one could then argue we should only define __mul__ to be matrix multiplication, but then we'd have the same problem with elementwise multiplication.) There have been several pushes to remove numpy.matrix entirely; the only counter-arguments have come from educators who find that its problems are outweighed by the need to provide a simple and clear mapping between mathematical notation and code for novices (see Transparent syntax is especially crucial for non-expert programmers). But, of course, starting out newbies with a dispreferred syntax and then expecting them to transition later causes its own problems. The two-type solution is a cure worse than the disease.

Add lots of new operators, or add a new generic syntax for defining infix operators: In addition to being generally un-Pythonic and repeatedly rejected by BDFL fiat, this would be using a sledgehammer to smash a fly. The scientific python community has consensus that adding one operator for matrix multiplication is enough to fix the one otherwise unfixable pain point. (In retrospect, we all think PEP 225 was a bad idea too -- or at least far more complex than it needed to be.)

Add a new @ (or whatever) operator that has some other meaning in general Python, and then overload it in numeric code: This was the approach taken by PEP 211, which proposed defining @ to be the equivalent of itertools.product. The problem with this is that when taken on its own terms, it's pretty clear that itertools.product doesn't actually need a dedicated operator. It hasn't even been deemed worthy of a builtin. (During discussions of this PEP, a similar suggestion was made to define @ as a general purpose function composition operator, and this suffers from the same problem; functools.compose isn't even useful enough to exist.) Matrix multiplication has a uniquely strong rationale for inclusion as an infix operator. There almost certainly don't exist any other binary operations that will ever justify adding any other infix operators to Python.

Add a .dot method to array types so as to allow "pseudo-infix" A.dot(B) syntax: This has been in numpy for some years, and in many cases it's better than dot(A, B). But it's still much less readable than real infix notation, and in particular still suffers from an extreme overabundance of parentheses. See Why should matrix multiplication be infix? above.
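
The contrast is easy to see even with a toy type. This sketch (purely illustrative, not numpy's actual implementation) gives a minimal square matrix both a .dot() method and, on Python 3.5+, the infix form:

```python
class Mat:
    """Toy square matrix over nested lists (illustration only)."""
    def __init__(self, rows):
        self.rows = rows
    def dot(self, other):
        n = len(self.rows)
        return Mat([[sum(self.rows[i][k] * other.rows[k][j] for k in range(n))
                     for j in range(n)] for i in range(n)])
    __matmul__ = dot  # the infix form delegates to the same multiply

A = Mat([[1, 2], [3, 4]])
B = Mat([[5, 6], [7, 8]])
C = Mat([[1, 0], [0, 1]])

# "Pseudo-infix" method syntax: parentheses and method calls accumulate
# as the chain grows.
r1 = A.dot(B).dot(C)
# Real infix syntax reads like the underlying mathematics.
r2 = A @ B @ C
print(r1.rows == r2.rows)  # True
```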

Use a 'with' block to toggle the meaning of * within a single code block: E.g., numpy could define a special context object so that we'd have:

c = a * b   # element-wise multiplication
with numpy.mul_as_dot:
    c = a * b  # matrix multiplication

However, this has two serious problems: first, it requires that every array-like type's __mul__ method know how to check some global state (numpy.mul_is_currently_dot or whatever). This is fine if a and b are numpy objects, but the world contains many non-numpy array-like objects. So this either requires non-local coupling -- every numpy competitor library has to import numpy and then check numpy.mul_is_currently_dot on every operation -- or else it breaks duck-typing, with the above code doing radically different things depending on whether a and b are numpy objects or some other sort of object. Second, and worse, with blocks are dynamically scoped, not lexically scoped; i.e., any function that gets called inside the with block will suddenly find itself executing inside the mul_as_dot world, and crash and burn horribly -- if you're lucky. So this is a construct that could only be used safely in rather limited cases (no function calls), and which would make it very easy to shoot yourself in the foot without warning.
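
The second problem can be demonstrated with a toy version of the proposal (all names here are hypothetical; numpy defines no such context manager):

```python
# A module-global flag stands in for the hypothetical
# numpy.mul_is_currently_dot state.
_mul_is_dot = False

class Arr:
    def __init__(self, v):
        self.v = v
    def __mul__(self, other):
        if _mul_is_dot:
            return Arr(self.v + other.v)  # pretend "dot" semantics
        return Arr(self.v * other.v)      # elementwise semantics

def library_helper(a, b):
    # This helper was written assuming * is elementwise...
    return a * b

class mul_as_dot:
    def __enter__(self):
        global _mul_is_dot
        _mul_is_dot = True
    def __exit__(self, *exc):
        global _mul_is_dot
        _mul_is_dot = False

a, b = Arr(3), Arr(4)
print(library_helper(a, b).v)      # 12, as the helper's author intended
with mul_as_dot():
    # ...but called inside the block, the same code silently changes meaning:
    print(library_helper(a, b).v)  # 7
```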

Use a language preprocessor that adds extra numerically-oriented operators and perhaps other syntax: (As per recent BDFL suggestion: [1]) This suggestion seems based on the idea that numerical code needs a wide variety of syntax additions. In fact, given @, most numerical users don't need any other operators or syntax; it solves the one really painful problem that cannot be solved by other means, and that causes painful reverberations through the larger ecosystem. Defining a new language (presumably with its own parser which would have to be kept in sync with Python's, etc.), just to support a single binary operator, is neither practical nor desirable. In the numerical context, Python's competition is special-purpose numerical languages (Matlab, R, IDL, etc.). Compared to these, Python's killer feature is exactly that one can mix specialized numerical code with code for XML parsing, web page generation, database access, network programming, GUI libraries, and so forth, and we also gain major benefits from the huge variety of tutorials, reference material, introductory classes, etc., which use Python. Fragmenting "numerical Python" from "real Python" would be a major source of confusion. A major motivation for this PEP is to reduce fragmentation. Having to set up a preprocessor would be an especially prohibitive complication for unsophisticated users. And we use Python because we like Python! We don't want almost-but-not-quite-Python.

Use overloading hacks to define a "new infix operator" like *dot*, as in a well-known Python recipe: (See: [2]) Beautiful is better than ugly. This is... not beautiful. And not Pythonic. And especially unfriendly to beginners, who are just trying to wrap their heads around the idea that there's a coherent underlying system behind these magic incantations that they're learning, when along comes an evil hack like this that violates that system, creates bizarre error messages when accidentally misused, and whose underlying mechanisms can't be understood without deep knowledge of how object oriented systems work.

Use a special "facade" type to support syntax like arr.M * arr: This is very similar to the previous proposal, in that the .M attribute would basically return the same object as arr *dot* would, and thus suffers the same objections about 'magicalness'. This approach also has some non-obvious complexities: for example, while arr.M * arr must return an array, arr.M * arr.M and arr * arr.M must return facade objects, or else arr.M * arr.M * arr and arr * arr.M * arr will not work. But this means that facade objects must be able to recognize both other array objects and other facade objects (which creates additional complexity for writing interoperating array types from different libraries, which must now recognize both each other's array types and their facade types). It also creates pitfalls for users who may easily type arr * arr.M or arr.M * arr.M and expect to get back an array object; instead, they will get a mysterious object that throws errors when they attempt to use it. Basically with this approach users must be careful to think of .M* as an indivisible unit that acts as an infix operator -- and as infix-operator-like token strings go, at least *dot* is prettier looking (look at its cute little ears!).
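
A minimal sketch of such a facade (hypothetical; no library actually ships this) makes the type-tracking burden concrete:

```python
def matmul(a, b):
    """Plain 2x2 matrix multiply on nested lists (illustration only)."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

class Facade:
    def __init__(self, arr):
        self.arr = arr
    def __mul__(self, other):
        # Must handle *both* raw arrays and other facades:
        if isinstance(other, Facade):
            # arr.M * arr.M must stay a facade, or longer chains break...
            return Facade(matmul(self.arr, other.arr))
        # ...while arr.M * arr unwraps back to a plain array.
        return matmul(self.arr, other)

A = [[1, 2], [3, 4]]
I = [[1, 0], [0, 1]]
M = lambda a: Facade(a)   # stands in for the arr.M attribute

print(M(A) * I)           # a plain array: [[1, 2], [3, 4]]
print(type(M(A) * M(I)))  # still a Facade -- surprising if you expected an array
```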

Discussions of this PEP

Collected here for reference:

References

[1]From a comment by GvR on a G+ post by GvR; the comment itself does not seem to be directly linkable: https://plus.google.com/115212051037621986145/posts/hZVVtJ9bK3u
[2]http://code.activestate.com/recipes/384122-infix-operators/ http://www.sagemath.org/doc/reference/misc/sage/misc/decorators.html#sage.misc.decorators.infix_operator
[3]http://conference.scipy.org/past.html
[4]http://pydata.org/events/
[5]

In this formula, β is a vector or matrix of regression coefficients, V is the estimated variance/covariance matrix for these coefficients, and we want to test the null hypothesis that Hβ = r; a large S then indicates that this hypothesis is unlikely to be true. For example, in an analysis of human height, the vector β might contain one value which was the average height of the measured men, and another value which was the average height of the measured women, and then setting H = [1, −1], r = 0 would let us test whether men and women are the same height on average. Compare to eq. 2.139 in http://sfb649.wiwi.hu-berlin.de/fedc_homepage/xplore/tutorials/xegbohtmlnode17.html

Example code is adapted from https://github.com/rerpy/rerpy/blob/0d274f85e14c3b1625acb22aed1efa85d122ecb7/rerpy/incremental_ls.py#L202

[6]

Out of the 36 tutorials scheduled for PyCon 2014 (https://us.pycon.org/2014/schedule/tutorials/), we guess that the 8 below will almost certainly deal with matrices:

  • Dynamics and control with Python
  • Exploring machine learning with Scikit-learn
  • How to formulate a (science) problem and analyze it using Python code
  • Diving deeper into Machine Learning with Scikit-learn
  • Data Wrangling for Kaggle Data Science Competitions – An etude
  • Hands-on with Pydata: how to build a minimal recommendation engine.
  • Python for Social Scientists
  • Bayesian statistics made simple

In addition, the following tutorials could easily involve matrices:

  • Introduction to game programming
  • mrjob: Snakes on a Hadoop ("We'll introduce some data science concepts, such as user-user similarity, and show how to calculate these metrics...")
  • Mining Social Web APIs with IPython Notebook
  • Beyond Defaults: Creating Polished Visualizations Using Matplotlib

This gives an estimated range of 8 to 12 / 36 = 22% to 33% of tutorials dealing with matrices; saying ~20% then gives us some wiggle room in case our estimates are high.

[7]

SLOCs were defined as physical lines which contain at least one token that is not a COMMENT, NEWLINE, ENCODING, INDENT, or DEDENT. Counts were made by using the tokenize module from Python 3.2.3 to examine the tokens in all files ending .py underneath some directory. Only tokens which occur at least once in the source trees are included in the table. The counting script is available in the PEP repository.

Matrix multiply counts were estimated by counting how often certain tokens which are used as matrix multiply function names occurred in each package. This creates a small number of false positives for scikit-learn, because we also count instances of the wrappers around dot that this package uses, and so there are a few dozen tokens which actually occur in import or def statements.

All counts were made using the latest development version of each project as of 21 Feb 2014.

'stdlib' is the contents of the Lib/ directory in commit d6aa3fa646e2 to the cpython hg repository, and treats the following tokens as indicating matrix multiply: n/a.

'scikit-learn' is the contents of the sklearn/ directory in commit 69b71623273ccfc1181ea83d8fb9e05ae96f57c7 to the scikit-learn repository (https://github.com/scikit-learn/scikit-learn), and treats the following tokens as indicating matrix multiply: dot, fast_dot, safe_sparse_dot.

'nipy' is the contents of the nipy/ directory in commit 5419911e99546401b5a13bd8ccc3ad97f0d31037 to the nipy repository (https://github.com/nipy/nipy/), and treats the following tokens as indicating matrix multiply: dot.

[8]BLAS libraries have a habit of secretly spawning threads, even when used from single-threaded programs. And threads play very poorly with fork(); the usual symptom is that attempting to perform linear algebra in a child process causes an immediate deadlock.
[9]http://fperez.org/py4science/numpy-pep225/numpy-pep225.html
[10]http://docs.scipy.org/doc/numpy/user/basics.broadcasting.html
[11]http://mail.scipy.org/pipermail/scipy-user/2014-February/035499.html
[12]

Counts were produced by manually entering the string "import foo" or "from foo import" (with quotes) into the Github code search page, e.g.: https://github.com/search?q=%22import+numpy%22&ref=simplesearch&type=Code on 2014-04-10 at ~21:00 UTC. The reported values are the numbers given in the "Languages" box on the lower-left corner, next to "Python". This also causes some undercounting (e.g., leaving out Cython code, and possibly one should also count HTML docs and so forth), but these effects are negligible (e.g., only ~1% of numpy usage appears to occur in Cython code, and probably even less for the other modules listed). The use of this box is crucial, however, because these counts appear to be stable, while the "overall" counts listed at the top of the page ("We've found ___ code results") are highly variable even for a single search -- simply reloading the page can cause this number to vary by a factor of 2 (!!). (They do seem to settle down if one reloads the page repeatedly, but nonetheless this is spooky enough that it seemed better to avoid these numbers.)

These numbers should of course be taken with multiple grains of salt; it's not clear how representative Github is of Python code in general, and limitations of the search tool make it impossible to get precise counts. AFAIK this is the best data set currently available, but it'd be nice if it were better. In particular:

  • Lines like import sys, os will only be counted in the sys row.
  • A file containing both import X and from X import will be counted twice.
  • Imports of the form from X.foo import ... are missed. We could catch these by instead searching for "from X", but this is a common phrase in English prose, so we'd end up with false positives from comments, strings, etc. For many of the modules considered this shouldn't matter too much -- for example, the stdlib modules have flat namespaces -- but it might especially lead to undercounting of django, scipy, and twisted.

Also, it's possible there exist other non-stdlib modules we didn't think to test that are even more-imported than numpy -- though we tried quite a few of the obvious suspects. If you find one, let us know! The modules tested here were chosen based on a combination of intuition and the top-100 list at pypi-ranking.info.

Fortunately, it doesn't really matter if it turns out that numpy is, say, merely the third most-imported non-stdlib module, since the point is just that numeric programming is a common and mainstream activity.

Finally, we should point out the obvious: whether a package is imported is rather different from whether it's important. No-one's claiming numpy is "the most important package" or anything like that. Certainly more packages depend on distutils, e.g., than depend on numpy -- and far fewer source files import distutils than import numpy. But this is fine for our present purposes. Most source files don't import distutils because most source files don't care how they're distributed, so long as they are; these source files thus don't care about details of how distutils' API works. This PEP is in some sense about changing how numpy's and related packages' APIs work, so the relevant metric is to look at source files that are choosing to directly interact with that API, which is sort of like what we get by looking at import statements.

[13]The first such proposal occurs in Jim Hugunin's very first email to the matrix SIG in 1995, which lays out the first draft of what became Numeric. He suggests using * for elementwise multiplication, and % for matrix multiplication: https://mail.python.org/pipermail/matrix-sig/1995-August/000002.html
[14]http://mail.scipy.org/pipermail/numpy-discussion/2014-March/069502.html
[15]http://mail.scipy.org/pipermail/numpy-discussion/2014-March/069444.html http://mail.scipy.org/pipermail/numpy-discussion/2014-March/069605.html
[16]http://mail.scipy.org/pipermail/numpy-discussion/2014-March/069610.html
[17]http://mail.scipy.org/pipermail/numpy-discussion/2014-March/069578.html
[18]http://mail.scipy.org/pipermail/numpy-discussion/2014-March/069530.html
[19]https://en.wikipedia.org/wiki/Matrix_multiplication
[20]https://en.wikipedia.org/wiki/Empty_product

pep-0466 Network Security Enhancements for Python 2.7.x

PEP:466
Title:Network Security Enhancements for Python 2.7.x
Version:$Revision$
Last-Modified:$Date$
Author:Nick Coghlan <ncoghlan at gmail.com>,
Status:Final
Type:Standards Track
Content-Type:text/x-rst
Created:23-Mar-2014
Python-Version:2.7.9
Post-History:23-Mar-2014, 24-Mar-2014, 25-Mar-2014, 26-Mar-2014, 16-Apr-2014
Resolution:https://mail.python.org/pipermail/python-dev/2014-April/134163.html

Abstract

Most CPython tracker issues are classified as errors in behaviour or proposed enhancements. Most patches to fix behavioural errors are applied to all active maintenance branches. Enhancement patches are restricted to the default branch that becomes the next Python version.

This cadence works reasonably well during Python's normal 18-24 month feature release cycle, which is still applicable to the Python 3 series. However, the age of the standard library in Python 2 has now reached a point where it is sufficiently far behind the state of the art in network security protocols for it to be causing real problems in use cases where upgrading to Python 3 in the near term may not be feasible.

In recognition of the additional practical considerations that have arisen during the 4+ year maintenance cycle for Python 2.7, this PEP allows a critical set of network security related features to be backported from Python 3.4 to upcoming Python 2.7.x maintenance releases.

While this PEP does not make any changes to the core development team's handling of security-fix-only branches that are no longer in active maintenance, it does recommend that commercial redistributors providing extended support periods for the Python standard library either backport these features to their supported versions, or else explicitly disclaim support for the use of older versions in roles that involve connecting directly to the public internet.

Implementation status

This PEP originally proposed adding all listed features to the Python 2.7.7 maintenance release. That approach proved to be too ambitious given the limited time frame between the original creation and acceptance of the PEP and the release of Python 2.7.7rc1. Instead, the progress of each individual accepted feature backport is being tracked as an independent enhancement targeting Python 2.7.

Implemented for Python 2.7.7:

Implemented for Python 2.7.8:

Implemented for Python 2.7.9 (in development):

Backwards compatibility considerations

As in the Python 3 series, the backported ssl.create_default_context() API is granted a backwards compatibility exemption that permits the protocol, options, cipher and other settings of the created SSL context to be updated in maintenance releases to use higher default security settings. This allows them to appropriately balance compatibility and security at the time of the maintenance release, rather than at the time of the original feature release.

This PEP does not grant any other exemptions to the usual backwards compatibility policy for maintenance releases. Instead, by explicitly encouraging the use of feature based checks, it is designed to make it easier to write more secure cross-version compatible Python software, while still limiting the risk of breaking currently working software when upgrading to a new Python 2.7 maintenance release.

In all cases where this proposal allows new features to be backported to the Python 2.7 release series, it is possible to write cross-version compatible code that operates by "feature detection" (for example, checking for particular attributes in a module), without needing to explicitly check the Python version.

It is then up to library and framework code to provide an appropriate warning and fallback behaviour if a desired feature is found to be missing. While some especially security sensitive software MAY fail outright if a desired security feature is unavailable, most software SHOULD instead emit a warning and continue operating using a slightly degraded security configuration.
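
For example, code targeting the ssl module can detect this PEP's backported API by attribute rather than by version number. The fallback branch below is only a sketch of the recommended warn-and-degrade behaviour; real software would substitute its own legacy code path:

```python
import ssl
import warnings

if hasattr(ssl, "create_default_context"):
    # The backported API is available (Python 2.7.9+ / 3.4+):
    context = ssl.create_default_context()
else:
    # Older maintenance release: most software SHOULD warn and continue
    # with the legacy, less secure code path rather than fail outright.
    warnings.warn("ssl.create_default_context() unavailable; "
                  "continuing with legacy SSL defaults")
    context = None  # caller falls back to e.g. ssl.wrap_socket()
```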

The backported APIs allow library and application code to perform the following actions after detecting the presence of a relevant network security related feature:

  • explicitly opt in to more secure settings (to allow the use of enhanced security features in older maintenance releases of Python with less secure default behaviour)
  • explicitly opt in to less secure settings (to allow the use of newer Python feature releases in lower security environments)
  • determine the default setting for the feature (this MAY require explicit Python version checks to determine the Python feature release, but DOES NOT require checking for a specific maintenance release)

Security related changes to other modules (such as higher level networking libraries and data format processing libraries) will continue to be made available as backports and new modules on the Python Package Index, as independent distribution remains the preferred approach to handling software that must continue to evolve to handle changing development requirements independently of the Python 2 standard library. Refer to the Motivation and Rationale section for a review of the characteristics that make the secure networking infrastructure worthy of special consideration.

OpenSSL compatibility

Under this proposal, OpenSSL may be upgraded to more recent feature releases in Python 2.7 maintenance releases. On Linux and most other POSIX systems, the specific version of OpenSSL used already varies, as CPython dynamically links to the system provided OpenSSL library by default.

For the Windows binary installers, the _ssl and _hashlib modules are statically linked with OpenSSL and the associated symbols are not exported. Marc-Andre Lemburg indicates that updating to newer OpenSSL releases in the egenix-pyopenssl binaries has not resulted in any reported compatibility issues [3].

The Mac OS X binary installers historically followed the same policy as other POSIX installations and dynamically linked to the Apple provided OpenSSL libraries. However, Apple has now ceased updating these cross-platform libraries, instead requiring that even cross-platform developers adopt Mac OS X specific interfaces to access up to date security infrastructure on their platform. Accordingly, and independently of this PEP, the Mac OS X binary installers were already going to be switched to statically linking newer versions of OpenSSL [4].

Other Considerations

Maintainability

A number of developers, including Alex Gaynor and Donald Stufft, have expressed interest in carrying out the feature backports covered by this policy, and assisting with any additional maintenance burdens that arise in the Python 2 series as a result.

Steve Dower and Brian Curtin have offered to help with the creation of the Windows installers, allowing Martin von Löwis the opportunity to step back from the task of maintaining the 2.7 Windows installer.

This PEP is primarily about establishing the consensus needed to allow them to carry out this work. For other core developers, this policy change shouldn't impose any additional effort beyond potentially reviewing the resulting patches for those developers specifically interested in the affected modules.

Security releases

This PEP does not propose any changes to the handling of security releases - those will continue to be source only releases that include only critical security fixes.

However, the recommendations for library and application developers are deliberately designed to accommodate commercial redistributors that choose to apply these changes to additional Python release series that are either in security fix only mode, or have been declared "end of life" by the core development team.

Whether or not redistributors choose to exercise that option will be up to the individual redistributor.

Integration testing

Third party integration testing services should offer users the ability to test against multiple Python 2.7 maintenance releases (at least 2.7.6 and 2.7.7+), to ensure that libraries, frameworks and applications can still test their handling of the legacy security infrastructure correctly (either failing or degrading gracefully, depending on the security sensitivity of the software), even after the features covered in this proposal have been backported to the Python 2.7 series.

Handling lower security environments with low risk tolerance

For better or for worse (mostly worse), there are some environments where the risk of latent security defects is more tolerated than even a slightly increased risk of regressions in maintenance releases. This proposal largely excludes these environments from consideration where the modules covered by the exemption are concerned - this approach is entirely inappropriate for software connected to the public internet, and defence in depth security principles suggest that it is not appropriate for most private networks either.

Downstream redistributors may still choose to cater to such environments, but they will need to handle the process of downgrading the security related modules and doing the associated regression testing themselves. The main CPython continuous integration infrastructure will not cover this scenario.

Motivation and Rationale

The creation of this PEP was prompted primarily by the aging SSL support in the Python 2 series. As of March 2014, the Python 2.7 SSL module is approaching four years of age, and the SSL support in the still popular Python 2.6 release had its feature set locked six years ago.

These are simply too old to provide a foundation that can be recommended in good conscience for secure networking software that operates over the public internet, especially in an era where it is becoming quite clearly evident that advanced persistent security threats are even more widespread and more indiscriminate in their targeting than had previously been understood. While they represented reasonable security infrastructure in their time, the state of the art has moved on, and we need to investigate mechanisms for effectively providing more up to date network security infrastructure for users that, for whatever reason, are not currently in a position to migrate to Python 3.

While the use of the system OpenSSL installation addresses many of these concerns on Linux platforms, it doesn't address all of them (in particular, it is still difficult for software to explicitly require some higher level security settings). The standard library support can be bypassed by using a third party library like PyOpenSSL or Pycurl, but this still results in a security problem, as these can be difficult dependencies to deploy, and many users will remain unaware that they might want them. Rather than explaining to potentially naive users how to obtain and use these libraries, it seems better to just fix the included batteries.

In the case of the binary installers for Windows and Mac OS X that are published on python.org, the version of OpenSSL used is entirely within the control of the Python core development team, but is currently limited to OpenSSL maintenance releases for the version initially shipped with the corresponding Python feature release.

With increased popularity comes increased responsibility, and this proposal aims to acknowledge the fact that Python's popularity and adoption is at a sufficiently high level that some of our design and policy decisions have significant implications beyond the Python development community.

As one example, the Python 2 ssl module does not support the Server Name Indication standard. While it is possible to obtain SNI support by using the third party requests client library, actually doing so currently requires using not only requests and its embedded dependencies, but also half a dozen or more additional libraries. The lack of support in the Python 2 series thus serves as an impediment to making effective use of SNI on servers, as Python 2 clients will frequently fail to handle it correctly.

Another more critical example is the lack of SSL hostname matching in the Python 2 standard library - it is currently necessary to rely on a third party library, such as requests or backports.ssl_match_hostname, to obtain that functionality in Python 2.

The Python 2 series also remains more vulnerable to remote timing attacks on security sensitive comparisons than the Python 3 series, as it lacks a standard library equivalent to the timing attack resistant hmac.compare_digest() function. While appropriate secure comparison functions can be implemented in third party extensions, many users don't even consider the issue and use ordinary equality comparisons instead - while a standard library solution doesn't automatically fix that problem, it does make the barrier to resolution much lower once the problem is pointed out.
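
The function in question, part of the set of features this PEP backports to the Python 2.7 series, is used like this:

```python
import hmac

expected = "f2ca1bb6c7e907d06dafe4687e579fce"  # e.g. a stored digest
supplied = "f2ca1bb6c7e907d06dafe4687e579fcd"  # attacker-controlled input

# Unlike ==, whose running time can leak how many leading characters
# matched, compare_digest takes time independent of where the first
# mismatch occurs.
print(hmac.compare_digest(expected, supplied))  # False
print(hmac.compare_digest(expected, expected))  # True
```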

Python 2.7 represents the only long term maintenance release the core development team has provided, and it is natural that there will be things that worked over a historically shorter maintenance lifespan that don't work over this longer support period. In the specific case of the problem described in this PEP, the simplest available solution is to acknowledge that long term maintenance of network security related modules requires the ability to add new features, even while retaining backwards compatibility for existing interfaces.

For those familiar with it, it is worth comparing the approach described in this PEP with Red Hat's handling of its long term open source support commitments: it isn't the RHEL 6.0 release itself that receives 10 years worth of support, but the overall RHEL 6 series. The individual RHEL 6.x point releases within the series then receive a wide variety of new features, including security enhancements, all while meeting strict backwards compatibility guarantees for existing software. The proposal covered in this PEP brings our approach to long term maintenance more into line with this precedent - we retain our strict backwards compatibility requirements, but make an exception to the restriction against adding new features.

To date, downstream redistributors have respected our upstream policy of "no new features in Python maintenance releases". This PEP explicitly accepts that a more nuanced policy is appropriate in the case of network security related features, and the specific change it describes is deliberately designed such that it is potentially suitable for Red Hat Enterprise Linux and its downstream derivatives.

Why these particular changes?

The key requirement for a feature to be considered for inclusion in this proposal was that it must have security implications beyond the specific application that is written in Python and the system that application is running on. Thus the focus on network security protocols, password storage and related cryptographic infrastructure - Python is a popular choice for the development of web services and clients, and thus the capabilities of widely used Python versions have implications for the security design of other services that may themselves be using newer versions of Python or other development languages, but need to interoperate with clients or servers written using older versions of Python.

The intent behind this requirement was to minimise any impact that the introduction of this policy may have on the stability and compatibility of maintenance releases, while still addressing some key security concerns relating to the particular aspects of Python 2.7. It would be thoroughly counterproductive if end users became as cautious about updating to new Python 2.7 maintenance releases as they are about updating to new feature releases within the same release series.

The ssl module changes are included in this proposal to bring the Python 2 series up to date with the past 4 years of evolution in network security standards, and make it easier for those standards to be broadly adopted in both servers and clients. Similarly the hash algorithm availability indicators in hashlib are included to make it easier for applications to detect and employ appropriate hash definitions across both Python 2 and 3.

The hmac.compare_digest() and hashlib.pbkdf2_hmac() functions are included to help lower the barriers to secure password storage and checking in Python 2 server applications.
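Together, these two functions (along with os.urandom for salt generation) support a pattern like the following sketch; the helper names and the iteration count are illustrative, not a recommendation:

```python
import hashlib
import hmac
import os

def hash_password(password, salt=None, rounds=100000):
    # Derive a salted key from the password; os.urandom supplies the salt.
    salt = salt if salt is not None else os.urandom(16)
    digest = hashlib.pbkdf2_hmac('sha256', password, salt, rounds)
    return salt, digest

def verify_password(password, salt, digest, rounds=100000):
    candidate = hashlib.pbkdf2_hmac('sha256', password, salt, rounds)
    # Constant-time comparison avoids leaking timing information.
    return hmac.compare_digest(candidate, digest)

salt, stored = hash_password(b"correct horse battery staple")
print(verify_password(b"correct horse battery staple", salt, stored))  # True
```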

The os.urandom() change has been included in this proposal to further encourage users to leave the task of providing high quality random numbers for cryptographic use cases to operating system vendors. The use of insufficiently random numbers has the potential to compromise any cryptographic system, and operating system developers have more tools available to address that problem adequately than the typical Python application runtime.

Rejected alternative: just advise developers to migrate to Python 3

This alternative represents the status quo. Unfortunately, it has proven to be unworkable in practice, as the backwards compatibility implications mean that this is a non-trivial migration process for large applications and integration projects. While the tools for migration have evolved to a point where it is possible to migrate even large applications opportunistically and incrementally (rather than all at once) by updating code to run in the large common subset of Python 2 and Python 3, using the most recent technology often isn't a priority in commercial environments.

Previously, this was considered an acceptable harm, as while it was an unfortunate problem for the affected developers to have to face, it was seen as an issue between them and their management chain to make the case for infrastructure modernisation, and this case would become naturally more compelling as the Python 3 series evolved.

However, now that we're fully aware of the impact the limitations of the Python 2 standard library may be having on the evolution of internet security standards, I no longer believe that it is reasonable to expect platform and application developers to resolve all of the latent defects in an application's Unicode correctness solely in order to gain access to the network security enhancements already available in Python 3.

While Ubuntu (and to some extent Debian as well) are committed to porting all default system services and scripts to Python 3, and to removing Python 2 from its default distribution images (but not from its archives), this is a mammoth task and won't be completed for the Ubuntu 14.04 LTS release (at least for the desktop image - it may be achieved for the mobile and server images).

Fedora has even more work to do to migrate, and it will take a non-trivial amount of time to migrate the relevant infrastructure components. While Red Hat are also actively working to make it easier for users to use more recent versions of Python on our stable platforms, it's going to take time for those efforts to start having an impact on end users' choice of version, and any such changes also don't benefit the core platform infrastructure that runs in the integrated system Python by necessity.

The OpenStack migration to Python 3 is also still in its infancy, and even though that's a project with an extensive and relatively robust automated test suite, it's still large enough that it is going to take quite some time to migrate fully to a Python 2/3 compatible code base.

And that's just three of the highest profile open source projects that make heavy use of Python. Given the likely existence of large amounts of legacy code that lacks the kind of automated regression test suite needed to help support a migration from Python 2 to Python 3, there are likely to be many cases where reimplementation (perhaps even in Python 3) proves easier than migration. The key point of this PEP is that those situations affect more people than just the developers and users of the affected application: the existence of clients and servers with outdated network security infrastructure becomes something that developers of secure networked services need to take into account as part of their security design, and that's a problem that inhibits the adoption of better security standards.

As Terry Reedy noted, if we try to persist with the status quo, the likely outcome is that commercial redistributors will attempt to do something like this on behalf of their customers anyway, but in a potentially inconsistent and ad hoc manner. By drawing the scope definition process into the upstream project we are in a better position to influence the approach taken to address the situation and to help ensure some consistency across redistributors.

The problem is real, so something needs to change, and this PEP describes my preferred approach to addressing the situation.

Rejected alternative: create and release Python 2.8

With sufficient corporate support, it likely would be possible to create and release Python 2.8 (it's highly unlikely such a project would garner enough interest to be achievable with only volunteers). However, this wouldn't actually solve the problem, as the aim is to provide a relatively low impact way to incorporate enhanced security features into integrated products and deployments that make use of Python 2.

Upgrading to a new Python feature release would mean more work for the core development team, as well as a more disruptive update that most potential end users would likely just skip entirely.

Attempting to create a Python 2.8 release would also bring in suggestions to backport many additional features from Python 3 (such as tracemalloc and the improved coroutine support), making the migration from Python 2.7 to this hypothetical 2.8 release even riskier and more disruptive.

This is not a recommended approach, as it would involve substantial additional work for a result that is actually less effective in achieving the original aim (which is to eliminate the current widespread use of the aging network security infrastructure in the Python 2 series).

Furthermore, while I can't make any commitments to actually addressing this issue on Red Hat platforms, I can categorically rule out the idea of a Python 2.8 being of any use to me in even attempting to get it addressed.

Rejected alternative: distribute the security enhancements via PyPI

While this initially appears to be an attractive and easier to manage approach, it actually suffers from several significant problems.

Firstly, this is complex, low level, cross-platform code that integrates with the underlying operating system across a variety of POSIX platforms (including Mac OS X) and Windows. The CPython BuildBot fleet is already set up to handle continuous integration in that context, but most of the freely available continuous integration services just offer Linux, and perhaps paid access to Windows. Those services work reasonably well for software that largely runs on the abstraction layers offered by Python and other dynamic languages, as well as the more comprehensive abstraction offered by the JVM, but won't suffice for the kind of code involved here.

The OpenSSL dependency for the network security support also qualifies as the kind of "complex binary dependency" that isn't yet handled well by the pip based software distribution ecosystem. Relying on a third party binary dependency also creates potential compatibility problems for pip when running on other interpreters like PyPy.

Another practical problem with the idea is the fact that pip itself relies on the ssl support in the standard library (with some additional support from a bundled copy of requests, which in turn bundles backports.ssl_match_hostname), and hence would require any replacement module to also be bundled within pip. This wouldn't pose any insurmountable difficulties (it's just another dependency to vendor), but it would mean yet another copy of OpenSSL to keep up to date.

This approach also has the same flaw as all other "improve security by renaming things" approaches: they completely miss the users who most need help, and raise significant barriers against being able to encourage users to do the right thing when their infrastructure supports it (since "use this other module" is a much higher impact change than "turn on this higher security setting"). Deprecating the aging SSL infrastructure in the standard library in favour of an external module would be even more user hostile than accepting the slightly increased risk of regressions associated with upgrading it in place.

Last, but certainly not least, this approach suffers from the same problem as the idea of doing a Python 2.8 release: likely not solving the actual problem. Commercial redistributors of Python are set up to redistribute Python, and a pre-existing set of additional packages. Getting new packages added to the pre-existing set can be done, but means approaching each and every redistributor and asking them to update their repackaging process accordingly. By contrast, the approach described in this PEP would require redistributors to deliberately opt out of the security enhancements by deliberately downgrading the provided network security infrastructure, which most of them are unlikely to do.

Rejected variant: provide a "legacy SSL infrastructure" branch

Earlier versions of this PEP included the concept of a 2.7-legacy-ssl branch that preserved the exact feature set of the Python 2.7.6 network security infrastructure.

In my opinion, anyone that actually wants this is almost certainly making a mistake, and if they insist they really do want it in their specific situation, they're welcome to either make it themselves or arrange for a downstream redistributor to make it for them.

If they are made publicly available, any such rebuilds should be referred to as "Python 2.7 with Legacy SSL" to clearly distinguish them from the official Python 2.7 releases that include more up to date network security infrastructure.

After the first Python 2.7 maintenance release that implements this PEP, it would also be appropriate to refer to Python 2.7.6 and earlier releases as "Python 2.7 with Legacy SSL".

Rejected variant: synchronise particular modules entirely with Python 3

Earlier versions of this PEP suggested synchronising the hmac, hashlib and ssl modules entirely with their Python 3 counterparts.

This approach proved too vague to build a compelling case for the exception, and has thus been replaced by the current more explicit proposal.

Rejected variant: open ended backport policy

Earlier versions of this PEP suggested a general policy change related to future Python 3 enhancements that impact the general security of the internet.

That approach created unnecessary uncertainty, so it has been simplified to propose backporting a specific, concrete set of changes. Future feature backport proposals can refer back to this PEP as precedent, but it will still be necessary to make a specific case for each feature addition to the Python 2.7 long term support release.

Disclosure of Interest

The author of this PEP currently works for Red Hat on test automation tools. If this proposal is accepted, I will be strongly encouraging Red Hat to take advantage of the resulting opportunity to help improve the overall security of the Python ecosystem. However, I do not speak for Red Hat in this matter, and cannot make any commitments on Red Hat's behalf.

Acknowledgements

Thanks to Christian Heimes and others for their efforts in greatly improving Python's SSL support in the Python 3 series, and to a variety of members of the Python community for helping me to better understand the implications of the default settings we provide in our SSL modules, and the impact that tolerating the use of SSL infrastructure that was defined in 2010 (Python 2.7) or even 2008 (Python 2.6) potentially has for the security of the web as a whole.

Thanks to Donald Stufft and Alex Gaynor for identifying a more limited set of essential security features that allowed the proposal to be made more fine-grained than backporting entire modules from Python 3.4 ([7], [8]).

Christian and Donald also provided valuable feedback on a preliminary draft of this proposal.

Thanks also to participants in the python-dev mailing list threads ([1], [2], [5], [6]), as well as the various folks I discussed this issue with at PyCon 2014 in Montreal.

pep-0467 Minor API improvements for binary sequences

PEP:467
Title:Minor API improvements for binary sequences
Version:$Revision$
Last-Modified:$Date$
Author:Nick Coghlan <ncoghlan at gmail.com>
Status:Draft
Type:Standards Track
Content-Type:text/x-rst
Created:2014-03-30
Python-Version:3.5
Post-History:2014-03-30 2014-08-15 2014-08-16

Abstract

During the initial development of the Python 3 language specification, the core bytes type for arbitrary binary data started as the mutable type that is now referred to as bytearray. Other aspects of operating in the binary domain in Python have also evolved over the course of the Python 3 series.

This PEP proposes four small adjustments to the APIs of the bytes, bytearray and memoryview types to make it easier to operate entirely in the binary domain:

  • Deprecate passing single integer values to bytes and bytearray
  • Add bytes.zeros and bytearray.zeros alternative constructors
  • Add bytes.byte and bytearray.byte alternative constructors
  • Add bytes.iterbytes, bytearray.iterbytes and memoryview.iterbytes alternative iterators

Proposals

Deprecation of current "zero-initialised sequence" behaviour

Currently, the bytes and bytearray constructors accept an integer argument and interpret it as meaning to create a zero-initialised sequence of the given size:

>>> bytes(3)
b'\x00\x00\x00'
>>> bytearray(3)
bytearray(b'\x00\x00\x00')

This PEP proposes to deprecate that behaviour in Python 3.5, and remove it entirely in Python 3.6.

No other changes are proposed to the existing constructors.

Addition of explicit "zero-initialised sequence" constructors

To replace the deprecated behaviour, this PEP proposes the addition of an explicit zeros alternative constructor as a class method on both bytes and bytearray:

>>> bytes.zeros(3)
b'\x00\x00\x00'
>>> bytearray.zeros(3)
bytearray(b'\x00\x00\x00')

It will behave just as the current constructors behave when passed a single integer.

The specific choice of zeros as the alternative constructor name is taken from the corresponding initialisation function in NumPy (although, as these are 1-dimensional sequence types rather than N-dimensional matrices, the constructors take a length as input rather than a shape tuple)

Addition of explicit "single byte" constructors

As binary counterparts to the text chr function, this PEP proposes the addition of an explicit byte alternative constructor as a class method on both bytes and bytearray:

>>> bytes.byte(3)
b'\x03'
>>> bytearray.byte(3)
bytearray(b'\x03')

These methods will only accept integers in the range 0 to 255 (inclusive):

>>> bytes.byte(512)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: bytes must be in range(0, 256)

>>> bytes.byte(1.0)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'float' object cannot be interpreted as an integer

The documentation of the ord builtin will be updated to explicitly note that bytes.byte is the inverse operation for binary data, while chr is the inverse operation for text data.

Behaviourally, bytes.byte(x) will be equivalent to the current bytes([x]) (and similarly for bytearray). The new spelling is expected to be easier to discover and easier to read (especially when used in conjunction with indexing operations on binary sequence types).

As a separate method, the new spelling will also work better with higher order functions like map.
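Since bytes.byte is only proposed here and does not yet exist, a stand-in with the proposed semantics can be written today; this hypothetical helper shows the behaviour, including use with map:

```python
def byte(x):
    # Stand-in for the proposed bytes.byte classmethod: bytes([x])
    # already raises ValueError outside range(256) and TypeError for
    # non-integer inputs such as floats.
    return bytes([x])

print(byte(3))                   # b'\x03'
print(list(map(byte, b"abc")))   # [b'a', b'b', b'c']
```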

Addition of optimised iterator methods that produce bytes objects

This PEP proposes that bytes, bytearray and memoryview gain an optimised iterbytes method that produces length 1 bytes objects rather than integers:

for x in data.iterbytes():
    # x is a length 1 ``bytes`` object, rather than an integer

The method can be used with arbitrary buffer exporting objects by wrapping them in a memoryview instance first:

for x in memoryview(data).iterbytes():
    # x is a length 1 ``bytes`` object, rather than an integer

For memoryview, the semantics of iterbytes() are defined such that:

memview.tobytes() == b''.join(memview.iterbytes())

This allows the raw bytes of the memory view to be iterated over without needing to make a copy, regardless of the defined shape and format.

The main advantage this method offers over the map(bytes.byte, data) approach is that it is guaranteed not to fail midstream with a ValueError or TypeError. By contrast, when using the map based approach, the type and value of the individual items in the iterable are only checked as they are retrieved and passed through the bytes.byte constructor.
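On versions without the proposed method, the intended semantics can be approximated with a generator; this hypothetical `iterbytes` function uses a flattened memoryview so the underlying buffer is not copied up front:

```python
def iterbytes(data):
    # Hypothetical stand-in for the proposed iterbytes method:
    # yields length-1 bytes objects rather than integers.
    view = memoryview(data).cast('B')  # flatten to raw bytes, no copy
    for i in range(len(view)):
        yield view[i:i + 1].tobytes()

print(list(iterbytes(b"abc")))  # [b'a', b'b', b'c']
```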

Design discussion

Why not rely on sequence repetition to create zero-initialised sequences?

Zero-initialised sequences can be created via sequence repetition:

>>> b'\x00' * 3
b'\x00\x00\x00'
>>> bytearray(b'\x00') * 3
bytearray(b'\x00\x00\x00')

However, this was also the case when the bytearray type was originally designed, and the decision was made to add explicit support for it in the type constructor. The immutable bytes type then inherited that feature when it was introduced in PEP 3137.

This PEP isn't revisiting that original design decision, just changing the spelling as users sometimes find the current behaviour of the binary sequence constructors surprising. In particular, there's a reasonable case to be made that bytes(x) (where x is an integer) should behave like the bytes.byte(x) proposal in this PEP. Providing both behaviours as separate class methods avoids that ambiguity.

References

[1]Initial March 2014 discussion thread on python-ideas (https://mail.python.org/pipermail/python-ideas/2014-March/027295.html)
[2]Guido's initial feedback in that thread (https://mail.python.org/pipermail/python-ideas/2014-March/027376.html)
[3]Issue proposing moving zero-initialised sequences to a dedicated API (http://bugs.python.org/issue20895)
[4]Issue proposing to use calloc() for zero-initialised binary sequences (http://bugs.python.org/issue21644)
[5]August 2014 discussion thread on python-dev (https://mail.python.org/pipermail/python-ideas/2014-March/027295.html)

pep-0468 Preserving the order of **kwargs in a function.

PEP:468
Title:Preserving the order of **kwargs in a function.
Version:$Revision$
Last-Modified:$Date$
Author:Eric Snow <ericsnowcurrently at gmail.com>
Discussions-To:python-ideas at python.org
Status:Draft
Type:Standards Track
Content-Type:text/x-rst
Created:5-Apr-2014
Python-Version:3.5
Post-History:5-Apr-2014
Resolution:

Abstract

The **kwargs syntax in a function definition indicates that the interpreter should collect all keyword arguments that do not correspond to other named parameters. However, Python does not preserve the order in which those collected keyword arguments were passed to the function. In some contexts the order matters. This PEP introduces a mechanism by which the passed order of collected keyword arguments will now be preserved.

Motivation

Python's **kwargs syntax in function definitions provides a powerful means of dynamically handling keyword arguments. In some applications of the syntax (see Use Cases), the semantics applied to the collected keyword arguments require that order be preserved. Unsurprisingly, this is similar to how OrderedDict is related to dict.

Currently, to preserve the order you have to do so manually and separately from the actual function call. This involves building an ordered mapping, whether an OrderedDict or an iterable of 2-tuples, which is then passed as a single argument to the function. [1]

With the capability described in this PEP, that boilerplate would no longer be required.

For comparison, currently:

kwargs = OrderedDict()
kwargs['eggs'] = ...
...
def spam(a, kwargs):
    ...

and with this proposal:

def spam(a, **kwargs):
    ...

Nick Coghlan, speaking of some of the use cases, summed it up well [2]:

These *can* all be done today, but *not* by using keyword arguments.
In my view, the problem to be addressed is that keyword arguments
*look* like they should work for these cases, because they have a
definite order in the source code. The only reason they don't work
is because the interpreter throws that ordering information away.

It's a textbook case of a language feature becoming an attractive
nuisance in some circumstances: the simple and obvious solution for
the above use cases *doesn't actually work* for reasons that aren't
obviously clear if you don't have a firm grasp of Python's admittedly
complicated argument handling.

This observation is supported by the appearance of this proposal over the years and the numerous times that people have been confused by the constructor for OrderedDict. [3] [4] [5]
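The OrderedDict constructor confusion follows directly from this: at the time of this PEP, the only reliable way to construct an OrderedDict in a specific order was to spell the order out explicitly, since keyword order was discarded before the constructor ever saw it. A short sketch:

```python
from collections import OrderedDict

# Explicit ordering via a list of 2-tuples -- the boilerplate this
# proposal aims to make unnecessary.
d = OrderedDict([('eggs', 1), ('spam', 2), ('ham', 3)])
print(list(d))  # ['eggs', 'spam', 'ham']

# OrderedDict(eggs=1, spam=2, ham=3) *looks* equivalent, but without
# this proposal the interpreter throws the keyword order away.
```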

Use Cases

As Nick noted, the current behavior of **kwargs is unintuitive in cases where one would expect order to matter. Aside from more specific cases outlined below, in general "anything else where you want to control the iteration order and set field names and values in a single call will potentially benefit." [6] That matters in the case of factories (e.g. __init__()) for ordered types.

Serialization

Obviously OrderedDict would benefit (both __init__() and update()) from ordered kwargs. However, the benefit also extends to serialization APIs [2]:

In the context of serialisation, one key lesson we have learned is
that arbitrary ordering is a problem when you want to minimise
spurious diffs, and sorting isn't a simple solution.

Tools like doctest don't tolerate spurious diffs at all, but are
often amenable to a sorting based answer.

The cases where it would be highly desirable to be able to use keyword
arguments to control the order of display of a collection of key
value pairs are ones like:

* printing out key:value pairs in CLI output
* mapping semantic names to column order in a CSV
* serialising attributes and elements in particular orders in XML
* serialising map keys in particular orders in human readable formats
  like JSON and YAML (particularly when they're going to be placed
  under source control)

Debugging

In the words of Raymond Hettinger [7]:

It makes it easier to debug if the arguments show-up in the order
they were created.  AFAICT, no purpose is served by scrambling them.

Other Use Cases

  • Mock objects. [8]
  • Controlling object presentation.
  • Alternate namedtuple() where defaults can be specified.
  • Specifying argument priority by order.

Concerns

Performance

As already noted, the idea of ordered keyword arguments has come up on a number of occasions. Each time it has been met with the same response, namely that preserving keyword arg order would have a sufficiently adverse effect on function call performance that it's not worth doing. However, Guido noted the following [9]:

Making **kwds ordered is still open, but requires careful design and
implementation to avoid slowing down function calls that don't benefit.

As will be noted below, there are ways to work around this at the expense of increased complication. Ultimately the simplest approach is the one that makes the most sense: pack collected keyword arguments into an OrderedDict. However, without a C implementation of OrderedDict there isn't much to discuss. That should change in Python 3.5. [10]

In some cases the difference of performance between dict and OrderedDict may be of significance. For instance: when the collected kwargs has an extended lifetime outside the originating function or the number of collected kwargs is massive. However, the difference in performance (both CPU and memory) in those cases should not be significant. Furthermore, the performance of the C OrderedDict implementation is essentially identical with dict for the non-mutating API. A concrete representation of the difference in performance will be a part of this proposal before its resolution.

Other Python Implementations

Another important issue to consider is that new features must be cognizant of the multiple Python implementations. At some point each of them would be expected to have implemented ordered kwargs. In this regard there doesn't seem to be an issue with the idea. [11] Each of the major Python implementations will be consulted regarding this proposal before its resolution.

Specification

Starting in version 3.5 Python will preserve the order of keyword arguments as passed to a function. To accomplish this the collected kwargs will now be an OrderedDict rather than a dict.

This will apply only to functions for which the definition uses the **kwargs syntax for collecting otherwise unspecified keyword arguments. Only the order of those keyword arguments will be preserved.
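On an interpreter that implements this proposal, the effect is directly observable (the function name here is illustrative):

```python
def report(**kwargs):
    # Under this proposal, the collected kwargs preserve call order.
    return list(kwargs)

print(report(zebra=1, apple=2, mango=3))  # ['zebra', 'apple', 'mango']
```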

Relationship to **-unpacking syntax

The ** unpacking syntax in function calls has no special connection with this proposal. Keyword arguments provided by unpacking will be treated in exactly the same way as they are now: those that match defined parameters are gathered there, and the remainder will be collected into the ordered kwargs (just like any other unmatched keyword argument).

Note that unpacking a mapping with undefined order, such as dict, will preserve its iteration order like normal. It's just that the order will remain undefined. The OrderedDict into which the unpacked key-value pairs will then be packed will not be able to provide any alternate ordering. This should not be surprising.

There have been brief discussions of simply passing these mappings through to the function's kwargs without unpacking and repacking them, but that is both outside the scope of this proposal and probably a bad idea regardless. (There is a reason those discussions were brief.)

Relationship to inspect.Signature

Signature objects should need no changes. The kwargs parameter of inspect.BoundArguments (returned by Signature.bind() and Signature.bind_partial()) will change from a dict to an OrderedDict.
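A sketch of where the collected keyword arguments surface in the inspect machinery (the function and argument names are illustrative):

```python
import inspect

def spam(a, **extras):
    pass

sig = inspect.signature(spam)
bound = sig.bind(1, eggs=2, ham=3)
# bound.arguments['extras'] holds the keyword arguments collected by
# **extras; under this proposal that mapping is order-preserving.
print(dict(bound.arguments['extras']))  # {'eggs': 2, 'ham': 3}
```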

C-API

TBD

Syntax

No syntax is added or changed by this proposal.

Backward-Compatibility

The following will change:

  • type(kwargs)
  • iteration order of kwargs will now be consistent (except of course in the case described above)
  • as already noted, performance will be marginally different

None of these should be an issue. However, each will be carefully considered while this proposal is under discussion.

Alternate Approaches

Opt-out Decorator

This is identical to the current proposal with the exception that Python would also provide a decorator in functools that would cause collected keyword arguments to be packed into a normal dict instead of an OrderedDict.

Prognosis:

This would only be necessary if performance is determined to be significantly different in some uncommon cases or that there are other backward-compatibility concerns that cannot be resolved otherwise.

Opt-in Decorator

The status quo would be unchanged. Instead Python would provide a decorator in functools that would register or mark the decorated function as one that should get ordered keyword arguments. The performance overhead to check the function at call time would be marginal.

Prognosis:

The only real down-side is in the case of function wrapper factories (e.g. functools.partial and many decorators) that aim to perfectly preserve keyword arguments by using kwargs in the wrapper definition and kwargs unpacking in the call to the wrapped function. Each wrapper would have to be updated separately, though having functools.wraps() do this automatically would help.
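The pass-through pattern in question looks like this sketch (decorator name is illustrative); any opt-in scheme would require each such wrapper to opt in as well, or the ordering guarantee is silently lost at the hop:

```python
import functools

def logged(func):
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        # A perfect pass-through: whatever ordering guarantee kwargs
        # carries must survive this hop, or the wrapped function sees
        # a different order than the caller wrote.
        return func(*args, **kwargs)
    return wrapper

@logged
def spam(**kwargs):
    return list(kwargs)

print(spam(b=1, a=2))  # ['b', 'a'] on an interpreter with ordered kwargs
```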

__kworder__

The order of keyword arguments would be stored separately in a list at call time. The list would be bound to __kworder__ in the function locals.

Prognosis:

This likewise complicates the wrapper case.

Compact dict with faster iteration

Raymond Hettinger has introduced the idea of a dict implementation that would result in preserving insertion order on dicts (until the first deletion). This would be a perfect fit for kwargs. [5]

Prognosis:

The idea is still uncertain in both viability and timeframe.

***kwargs

This would add a new form to a function's signature as a mutually exclusive parallel to **kwargs. The new syntax, ***kwargs (note that there are three asterisks), would indicate that kwargs should preserve the order of keyword arguments.

Prognosis:

New syntax is only added to Python under the most dire circumstances. With other available solutions, new syntax is not justifiable. Furthermore, like all opt-in solutions, the new syntax would complicate the pass-through case.

annotations

This is a variation on the decorator approach. Instead of using a decorator to mark the function, you would use a function annotation on **kwargs.

Prognosis:

In addition to the pass-through complication, annotations have been actively discouraged in Python core development. Use of annotations to opt-in to order preservation runs the risk of interfering with other application-level use of annotations.

dict.__order__

dict objects would have a new attribute, __order__, that would default to None; in the kwargs case the interpreter would use it in the same way as described above for __kworder__.

Prognosis:

It would mean zero impact on kwargs performance but the change would be pretty intrusive (Python uses dict a lot). Also, for the wrapper case the interpreter would have to be careful to preserve __order__.

KWArgsDict.__order__

This is the same as the dict.__order__ idea, but kwargs would be an instance of a new minimal dict subclass that provides the __order__ attribute. dict itself would be unchanged.

Prognosis:

Simply switching to OrderedDict is a less complicated and more intuitive change.

Acknowledgements

Thanks to Andrew Barnert for helpful feedback and to the participants of all the past email threads.

Footnotes

[1] Alternatively, you could replace ** in your function definition with * and then pass in key/value 2-tuples. This has the advantage of not requiring the keys to be valid identifier strings. See https://mail.python.org/pipermail/python-ideas/2014-April/027491.html.
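The workaround described in this footnote can be sketched as follows (the function name is illustrative):

```python
# Accept ordered key/value 2-tuples via *args instead of **kwargs.
# Order is trivially preserved, and keys need not be valid identifiers.
def configure(*pairs):
    return [key for key, value in pairs]

result = configure(("b", 2), ("a", 1), ("not an identifier!", 3))
assert result == ["b", "a", "not an identifier!"]
```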

References

[2] https://mail.python.org/pipermail/python-ideas/2014-April/027512.html
[3] https://mail.python.org/pipermail/python-ideas/2009-April/004163.html
    https://mail.python.org/pipermail/python-ideas/2010-October/008445.html
    https://mail.python.org/pipermail/python-ideas/2011-January/009037.html
    https://mail.python.org/pipermail/python-ideas/2013-February/019690.html
    https://mail.python.org/pipermail/python-ideas/2013-May/020727.html
    https://mail.python.org/pipermail/python-ideas/2014-March/027225.html
    http://bugs.python.org/issue16276
    http://bugs.python.org/issue16553
    http://bugs.python.org/issue19026
    http://bugs.python.org/issue5397#msg82972
[4] https://mail.python.org/pipermail/python-dev/2007-February/071310.html
[5] https://mail.python.org/pipermail/python-dev/2012-December/123028.html
    https://mail.python.org/pipermail/python-dev/2012-December/123105.html
    https://mail.python.org/pipermail/python-dev/2013-May/126327.html
    https://mail.python.org/pipermail/python-dev/2013-May/126328.html
[6] https://mail.python.org/pipermail/python-dev/2012-December/123105.html
[7] https://mail.python.org/pipermail/python-dev/2013-May/126327.html
[8] https://mail.python.org/pipermail/python-ideas/2009-April/004163.html
    https://mail.python.org/pipermail/python-ideas/2009-April/004165.html
    https://mail.python.org/pipermail/python-ideas/2009-April/004175.html
[9] https://mail.python.org/pipermail/python-dev/2013-May/126404.html
[10] http://bugs.python.org/issue16991
[11] https://mail.python.org/pipermail/python-dev/2012-December/123100.html

pep-0469 Migration of dict iteration code to Python 3

PEP:469
Title:Migration of dict iteration code to Python 3
Version:$Revision$
Last-Modified:$Date$
Author:Nick Coghlan <ncoghlan at gmail.com>
Status:Withdrawn
Type:Standards Track
Content-Type:text/x-rst
Created:2014-04-18
Python-Version:3.5
Post-History:2014-04-18, 2014-04-21

Abstract

For Python 3, PEP 3106 changed the design of the dict builtin and the mapping API in general to replace the separate list-based and iterator-based APIs in Python 2 with a merged, memory-efficient, set- and multiset-view-based API. This new style of dict iteration was also added to the Python 2.7 dict type as a new set of iteration methods.

This means that there are now 3 different kinds of dict iteration that may need to be migrated to Python 3 when an application makes the transition:

  • Lists as mutable snapshots: d.items() -> list(d.items())
  • Iterator objects: d.iteritems() -> iter(d.items())
  • Set based dynamic views: d.viewitems() -> d.items()

There is currently no widely agreed best practice on how to reliably convert all Python 2 dict iteration code to the common subset of Python 2 and 3, especially when test coverage of the ported code is limited. This PEP reviews the various ways the Python 2 iteration APIs may be accessed, and looks at the available options for migrating that code to Python 3 by way of the common subset of Python 2.6+ and Python 3.0+.

The PEP also considers the question of whether or not there are any additions that may be worth making to Python 3.5 that may ease the transition process for application code that doesn't need to worry about supporting earlier versions when eventually making the leap to Python 3.

PEP Withdrawal

In writing the second draft of this PEP, I came to the conclusion that the readability of hybrid Python 2/3 mapping code can actually be best enhanced by better helper functions rather than by making changes to Python 3.5+. The main value I now see in this PEP is as a clear record of the recommended approaches to migrating mapping iteration code from Python 2 to Python 3, as well as suggesting ways to keep things readable and maintainable when writing hybrid code that supports both versions.

Notably, I recommend that hybrid code avoid calling mapping iteration methods directly, and instead rely on builtin functions where possible, and some additional helper functions for cases that would be a simple combination of a builtin and a mapping method in pure Python 3 code, but need to be handled slightly differently to get the exact same semantics in Python 2.

Static code checkers like pylint could potentially be extended with an optional warning regarding direct use of the mapping iteration methods in a hybrid code base.

Mapping iteration models

Python 2.7 provides three different sets of methods to extract the keys, values and items from a dict instance, accounting for 9 out of the 18 public methods of the dict type.

In Python 3, this has been rationalised to just 3 out of 11 public methods (as the has_key method has also been removed).
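The rationalisation is easy to verify on a Python 3 interpreter, where only the view-returning methods remain:

```python
d = {}
# Python 3 keeps only the three view-returning methods...
assert all(hasattr(d, m) for m in ("keys", "values", "items"))
# ...while the Python 2 list/iterator variants (and has_key) are gone.
removed = ("iterkeys", "itervalues", "iteritems",
           "viewkeys", "viewvalues", "viewitems", "has_key")
assert not any(hasattr(d, m) for m in removed)
```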

Lists as mutable snapshots

This is the oldest of the three styles of dict iteration, and hence the one implemented by the d.keys(), d.values() and d.items() methods in Python 2.

These methods all return lists that are snapshots of the state of the mapping at the time the method was called. This has a few consequences:

  • the original object can be mutated freely without affecting iteration over the snapshot
  • the snapshot can be modified independently of the original object
  • the snapshot consumes memory proportional to the size of the original mapping

The semantic equivalents of these operations in Python 3 are list(d.keys()), list(d.values()) and list(d.items()).
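The snapshot semantics listed above can be demonstrated directly (spelled here in the Python 3 form):

```python
d = {"a": 1, "b": 2}
snapshot = list(d.items())   # Python 3 spelling of Python 2's d.items()

d["c"] = 3                   # mutating the mapping doesn't affect the snapshot
assert snapshot == [("a", 1), ("b", 2)]

snapshot.append(("x", 0))    # the snapshot can be modified independently
assert "x" not in d
```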

Iterator objects

In Python 2.2, dict objects gained support for the then-new iterator protocol, allowing direct iteration over the keys stored in the dictionary, thus avoiding the need to build a list just to iterate over the dictionary contents one entry at a time. iter(d) provides direct access to the iterator object for the keys.

Python 2 also provides a d.iterkeys() method that is essentially synonymous with iter(d), along with d.itervalues() and d.iteritems() methods.

These iterators provide live views of the underlying object, and hence may fail if the set of keys in the underlying object is changed during iteration:

>>> d = dict(a=1)
>>> for k in d:
...     del d[k]
...
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: dictionary changed size during iteration

As iterators, iteration over these objects is also a one-time operation: once the iterator is exhausted, you have to go back to the original mapping in order to iterate again.

In Python 3, direct iteration over mappings works the same way as it does in Python 2. There are no method based equivalents - the semantic equivalents of d.itervalues() and d.iteritems() in Python 3 are iter(d.values()) and iter(d.items()).
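The one-shot nature of these iterators is easy to see:

```python
d = {"a": 1, "b": 2}
it = iter(d.values())        # Python 3 spelling of Python 2's d.itervalues()

assert list(it) == [1, 2]    # the first pass consumes the iterator
assert list(it) == []        # a second pass yields nothing
```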

The six and future.utils compatibility modules also both provide iterkeys(), itervalues() and iteritems() helper functions that provide efficient iterator semantics in both Python 2 and 3.

Set based dynamic views

The model that is provided in Python 3 as a method based API is that of set based dynamic views (technically multisets in the case of the values() view).

In Python 3, the objects returned by d.keys(), d.values() and d.items() provide a live view of the current state of the underlying object, rather than taking a full snapshot of the current state as they did in Python 2. This change is safe in many circumstances, but does mean that, as with the direct iteration API, it is necessary to avoid adding or removing keys during iteration, in order to avoid encountering the following error:

>>> d = dict(a=1)
>>> for k, v in d.items():
...     del d[k]
...
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: dictionary changed size during iteration

Unlike the iteration API, these objects are iterables, rather than iterators: you can iterate over them multiple times, and each time they will iterate over the entire underlying mapping.
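Both properties, liveness and reiterability, can be shown with a short example:

```python
d = {"a": 1}
items = d.items()            # a dynamic view, not a snapshot

assert list(items) == [("a", 1)]
d["b"] = 2                   # the view reflects subsequent changes
assert list(items) == [("a", 1), ("b", 2)]
assert list(items) == [("a", 1), ("b", 2)]   # and can be iterated again
```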

These semantics are also available in Python 2.7 as the d.viewkeys(), d.viewvalues() and d.viewitems() methods.

The future.utils compatibility module also provides viewkeys(), viewvalues() and viewitems() helper functions when running on Python 2.7 or Python 3.x.

Migrating directly to Python 3

The 2to3 migration tool handles direct migrations to Python 3 in accordance with the semantic equivalents described above:

  • d.keys() -> list(d.keys())
  • d.values() -> list(d.values())
  • d.items() -> list(d.items())
  • d.iterkeys() -> iter(d.keys())
  • d.itervalues() -> iter(d.values())
  • d.iteritems() -> iter(d.items())
  • d.viewkeys() -> d.keys()
  • d.viewvalues() -> d.values()
  • d.viewitems() -> d.items()

Rather than 9 distinct mapping methods for iteration, there are now only the 3 view methods, which combine in straightforward ways with the two relevant builtin functions to cover all of the behaviours that are available as dict methods in Python 2.7.

Note that in many cases d.keys() can be replaced by just d, but the 2to3 migration tool doesn't attempt that replacement.
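For example, iterating a dict directly yields its keys, so the wrapper is often unnecessary:

```python
d = {"a": 1, "b": 2}

assert list(d.keys()) == list(d)   # equivalent spellings
assert "a" in d                    # membership also needs no .keys()
assert sorted(d) == ["a", "b"]
```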

The 2to3 migration tool also does not provide any automatic assistance for migrating references to these objects as bound or unbound methods - it only automates conversions where the API is called immediately.

Migrating to the common subset of Python 2 and 3

When migrating to the common subset of Python 2 and 3, the above transformations are not generally appropriate, as they all either result in the creation of a redundant list in Python 2, have unexpectedly different semantics in at least some cases, or both.

Since most code running in the common subset of Python 2 and 3 supports at least as far back as Python 2.6, the currently recommended approach to converting mapping iteration operations depends on two helper functions for efficient iteration over mapping values and mapping item tuples:

  • d.keys() -> list(d)
  • d.values() -> list(itervalues(d))
  • d.items() -> list(iteritems(d))
  • d.iterkeys() -> iter(d)
  • d.itervalues() -> itervalues(d)
  • d.iteritems() -> iteritems(d)

Both six and future.utils provide appropriate definitions of itervalues() and iteritems() (along with essentially redundant definitions of iterkeys()). Creating your own definitions of these functions in a custom compatibility module is also relatively straightforward:

try:
    dict.iteritems
except AttributeError:
    # Python 3
    def itervalues(d):
        return iter(d.values())
    def iteritems(d):
        return iter(d.items())
else:
    # Python 2
    def itervalues(d):
        return d.itervalues()
    def iteritems(d):
        return d.iteritems()

The greatest loss of readability currently arises when converting code that actually needs the list based snapshots that were the default in Python 2. This readability loss could likely be mitigated by also providing listvalues and listitems helper functions, allowing the affected conversions to be simplified to:

  • d.values() -> listvalues(d)
  • d.items() -> listitems(d)

The corresponding compatibility function definitions are as straightforward as their iterator counterparts:

try:
    dict.iteritems
except AttributeError:
    # Python 3
    def listvalues(d):
        return list(d.values())
    def listitems(d):
        return list(d.items())
else:
    # Python 2
    def listvalues(d):
        return d.values()
    def listitems(d):
        return d.items()

With that expanded set of compatibility functions, Python 2 code would then be converted to "idiomatic" hybrid 2/3 code as:

  • d.keys() -> list(d)
  • d.values() -> listvalues(d)
  • d.items() -> listitems(d)
  • d.iterkeys() -> iter(d)
  • d.itervalues() -> itervalues(d)
  • d.iteritems() -> iteritems(d)

This compares well for readability with the idiomatic pure Python 3 code that uses the mapping methods and builtins directly:

  • d.keys() -> list(d)
  • d.values() -> list(d.values())
  • d.items() -> list(d.items())
  • d.iterkeys() -> iter(d)
  • d.itervalues() -> iter(d.values())
  • d.iteritems() -> iter(d.items())

It's also notable that when using this approach, hybrid code would never invoke the mapping methods directly: it would always invoke either a builtin or helper function instead, in order to ensure the exact same semantics on both Python 2 and 3.

Migrating from Python 3 to the common subset with Python 2.7

While the majority of migrations are currently from Python 2 either directly to Python 3 or to the common subset of Python 2 and Python 3, there are also some migrations of newer projects that start in Python 3 and then later add Python 2 support, either due to user demand, or to gain access to Python 2 libraries that are not yet available in Python 3 (and porting them to Python 3 or creating a Python 3 compatible replacement is not a trivial exercise).

In these cases, Python 2.7 compatibility is often sufficient, and the 2.7+ only view based helper functions provided by future.utils allow the bare accesses to the Python 3 mapping view methods to be replaced with code that is compatible with both Python 2.7 and Python 3 (note, this is the only migration chart in the PEP that has Python 3 code on the left of the conversion):

  • d.keys() -> viewkeys(d)
  • d.values() -> viewvalues(d)
  • d.items() -> viewitems(d)
  • list(d.keys()) -> list(d)
  • list(d.values()) -> listvalues(d)
  • list(d.items()) -> listitems(d)
  • iter(d.keys()) -> iter(d)
  • iter(d.values()) -> itervalues(d)
  • iter(d.items()) -> iteritems(d)

As with migrations from Python 2 to the common subset, note that the hybrid code ends up never invoking the mapping methods directly - it only calls builtins and helper methods, with the latter addressing the semantic differences between Python 2 and Python 3.
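If a project prefers not to depend on future.utils, equivalent view helpers are straightforward to define in a custom compatibility module, in the same style as the itervalues()/iteritems() helpers shown earlier (a sketch; on Python 3 each helper simply calls through to the corresponding view method):

```python
try:
    dict.viewitems
except AttributeError:
    # Python 3: the standard methods already return dynamic views
    def viewkeys(d):
        return d.keys()
    def viewvalues(d):
        return d.values()
    def viewitems(d):
        return d.items()
else:
    # Python 2.7: use the dedicated view methods
    def viewkeys(d):
        return d.viewkeys()
    def viewvalues(d):
        return d.viewvalues()
    def viewitems(d):
        return d.viewitems()
```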

Possible changes to Python 3.5+

The main proposal put forward to potentially aid migration of existing Python 2 code to Python 3 is the restoration of some or all of the alternate iteration APIs to the Python 3 mapping API. In particular, the initial draft of this PEP proposed making the following conversions possible when migrating to the common subset of Python 2 and Python 3.5+:

  • d.keys() -> list(d)
  • d.values() -> list(d.itervalues())
  • d.items() -> list(d.iteritems())
  • d.iterkeys() -> d.iterkeys()
  • d.itervalues() -> d.itervalues()
  • d.iteritems() -> d.iteritems()

Possible mitigations of the additional language complexity in Python 3 created by restoring these methods included immediately deprecating them, as well as potentially hiding them from the dir() function (or perhaps even defining a way to make pydoc aware of function deprecations).

However, in the case where the list output is actually desired, the end result of that proposal is actually less readable than an appropriately defined helper function, and the function and method forms of the iterator versions are pretty much equivalent from a readability perspective.

So unless I've missed something critical, readily available listvalues() and listitems() helper functions look like they will improve the readability of hybrid code more than anything we could add back to the Python 3.5+ mapping API, and won't have any long term impact on the complexity of Python 3 itself.

Discussion

The fact that 5 years into the Python 3 migration we still have users considering the dict API changes a significant barrier to migration suggests that there are problems with previously recommended approaches. This PEP attempts to explore those issues and tries to isolate the cases where previous advice (such as it was) could prove problematic.

My assessment (largely based on feedback from Twisted devs) is that problems are most likely to arise when attempting to use d.keys(), d.values(), and d.items() in hybrid code. While superficially it seems as though there should be cases where it is safe to ignore the semantic differences, in practice, the change from "mutable snapshot" to "dynamic view" is significant enough that it is likely better to just force the use of either list or iterator semantics for hybrid code, and leave the use of the view semantics to pure Python 3 code.

This approach also creates rules that are simple enough and safe enough that it should be possible to automate them in code modernisation scripts that target the common subset of Python 2 and Python 3, just as 2to3 converts them automatically when targeting pure Python 3 code.

Acknowledgements

Thanks to the folks at the Twisted sprint table at PyCon for a very vigorous discussion of this idea (and several other topics), and especially to Hynek Schlawack for acting as a moderator when things got a little too heated :)

Thanks also to JP Calderone and Itamar Turner-Trauring for their email feedback, as well to the participants in the python-dev review of the initial version of the PEP.

pep-0470 Using Multi Repository Support for External to PyPI Package File Hosting

PEP:470
Title:Using Multi Repository Support for External to PyPI Package File Hosting
Version:$Revision$
Last-Modified:$Date$
Author:Donald Stufft <donald at stufft.io>,
BDFL-Delegate:Richard Jones <richard@python.org>
Discussions-To:distutils-sig at python.org
Status:Draft
Type:Process
Content-Type:text/x-rst
Created:12-May-2014
Post-History:14-May-2014, 05-Jun-2014, 03-Oct-2014, 13-Oct-2014
Replaces:438

Abstract

This PEP proposes a mechanism for project authors to register with PyPI an external repository where their project's downloads can be located. This information can then be included as part of the simple API so that installers can use it to tell users where the item they are attempting to install is located and what they need to do to enable this additional repository. In addition to adding discovery information to make explicit multiple repositories easy to use, this PEP also deprecates and removes the implicit multiple repository support which currently functions through directly or indirectly linking off site via the simple API. Finally, this PEP proposes deprecating and removing the functionality added by PEP 438, particularly the additional rel information and the meta tag used to indicate the API version.

This PEP does not propose mandating that all authors upload their projects to PyPI in order to exist in the index nor does it propose any change to the human facing elements of PyPI.

Rationale

Historically, PyPI did not have any method of hosting files, nor any method of automatically retrieving installables; it was instead focused on providing a central registry of names, to prevent naming collisions, and a means of discovery for finding projects to use. In the course of time, setuptools began to scrape these human-facing pages, as well as pages linked from those pages, looking for things it could automatically download and install. Eventually this became the "Simple" API, which used a similar URL structure but eliminated the extraneous links and information to make the API more efficient. Additionally, PyPI grew the ability for a project to upload release files directly, enabling PyPI to act as a repository in addition to an index.

This gives PyPI two equally important roles in the Python ecosystem: that of an index, enabling easy discovery of Python projects, and that of a central repository, enabling easy hosting, download, and installation of Python projects. Due to the history behind PyPI and its very organic growth, the lines between these two roles are blurry. This blurring has caused confusion for end users of both roles, and has in turn caused ire between people attempting to use PyPI in different capacities, most often when end users want to use PyPI as a repository but the author wants to use it solely as an index.

This confusion comes down to end users of projects not realizing whether a project is hosted on PyPI or relies on an external service. This often manifests itself when the external service is down but PyPI is not. People see that PyPI works, and other projects work, but this one specific project does not. They often do not realize who they need to contact to get this fixed, or what their remediation steps are.

By moving to using explicit multiple repositories we can make the lines between these two roles much more explicit and remove the "hidden" surprises caused by the current implementation of handling people who do not want to use PyPI as a repository. However simply moving to explicit multiple repositories is a regression in discoverability, and for that reason this PEP adds an extension to the current simple API which will enable easy discovery of the specific repository that a project can be found in.

PEP 438 attempted to solve this issue by allowing projects to explicitly declare if they were using the repository features or not, and if they were not, it had the installers classify the links it found as either "internal", "verifiable external" or "unverifiable external". PEP 438 was accepted and implemented in pip 1.4 (released on Jul 23, 2013) with the final transition implemented in pip 1.5 (released on Jan 2, 2014).

PEP 438 was successful in bringing more people to utilize PyPI's repository features, an altogether good thing given that the global CDN powering PyPI provides speed-ups for a lot of people; however, it did so by introducing a new point of confusion and pain for both end users and authors.

Key User Experience Expectations

  1. Easily allow external hosting to "just work" when appropriately configured at the system, user or virtual environment level.
  2. Easily allow package authors to tell PyPI "my releases are hosted <here>" and have that advertised in such a way that tools can clearly communicate it to users, without silently introducing unexpected dependencies on third party services.
  3. Eliminate any and all references to the confusing "verifiable external" and "unverifiable external" distinction from the user experience (both when installing and when releasing packages).
  4. The repository aspects of PyPI should become just the default package hosting location (i.e. the only one that is treated as opt-out rather than opt-in by most client tools in their default configuration). Aside from that aspect, hosting on PyPI should not otherwise provide an enhanced user experience over hosting your own package repository.
  5. Do all of the above while providing default behaviour that is secure against most attackers below the nation state adversary level.

Why Additional Repositories?

The two common installer tools, pip and easy_install/setuptools, both support the concept of additional locations to search for files to satisfy the installation requirements, and have done so for many years. This means that there is no need to "phase in" a new flag or concept, and the solution to installing a project from a repository other than PyPI will function regardless of how old (within reason) the end user's installer is. Not only has this concept existed in the Python tooling for some time, but it is a concept that exists across languages, even extending to the OS level, with OS package tools almost universally using multiple repository support, making it extremely likely that someone is already familiar with the concept.

Additionally, the multiple repository approach is a concept that is useful outside of the narrow scope of allowing projects which wish to be included on the index portion of PyPI but do not wish to utilize the repository portion of PyPI. This includes places where a company may wish to host a repository that contains their internal packages or where a project may wish to have multiple "channels" of releases, such as alpha, beta, release candidate, and final release. This could also be used for projects wishing to host files which cannot be uploaded to PyPI, such as multi-gigabyte data files or, currently at least, Linux Wheels.

Why Not PEP 438 or Similar?

While the additional search location support has existed in pip and setuptools for quite some time, support for PEP 438 has only existed in pip since version 1.4, and has yet to be implemented in setuptools. The design of PEP 438 did mean that users still benefited for projects which did not require external files, even with older installers; however, for projects which did require external files, users were still silently being given either potentially unreliable or, even worse, unsafe files to download. This system is also unique to Python, as it arises out of the history of PyPI; this means that the concept will almost certainly be foreign to most, if not all, users until they encounter it while attempting to use the Python toolchain.

Additionally, the classification system proposed by PEP 438 has, in practice, turned out to be extremely confusing to end users, so much so that it is a position of this PEP that the situation as it stands is completely untenable. The common pattern for a user with this system is to attempt to install a project and possibly get an error message (or maybe not, if the project ever uploaded something to PyPI but later switched without removing old files); see that the error message suggests --allow-external; reissue the command adding that flag, most likely getting another error message; see that this time the error message suggests also adding --allow-unverified; and issue the command a third time, finally getting the thing they wish to install.

This UX failure exists for several reasons.

  1. If pip can locate files at all for a project on the Simple API it will simply use that instead of attempting to locate more. This is generally the right thing to do as attempting to locate more would erase a large part of the benefit of PEP 438. This means that if a project ever uploaded a file that matches what the user has requested for install that will be used regardless of how old it is.

  2. PEP 438 makes an implicit assumption that most projects would either upload themselves to PyPI or would update themselves to link directly to release files. While a large number of projects did ultimately decide to upload to PyPI, some of them did so only because the UX around PEP 438 was so bad that they felt forced to do so. More concerning, however, is the fact that very few projects have opted to directly and safely link to files; instead they still simply link to pages which must be scraped in order to find the actual files, thus rendering the safe variant (--allow-external) largely useless.

  3. Even if an author wishes to directly link to their files, doing so safely is non-obvious. It requires the inclusion of an MD5 hash (for historical reasons) in the fragment of the URL. If they do not include this then their files will be considered "unverified".

  4. PEP 438 takes a security centric view and disallows any form of a global opt in for unverified projects. While this is generally a good thing, it creates extremely verbose and repetitive command invocations such as:

    $ pip install --allow-external myproject --allow-unverified myproject myproject
    $ pip install --allow-all-external --allow-unverified myproject myproject
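Regarding point 3 above: the "verified" mechanism works by embedding the expected digest in the URL fragment of the file link. A sketch of what an author would have to publish (the project name and host are illustrative):

```python
import hashlib

# pip treats a "#md5=<digest>" URL fragment as verification data for the
# linked file; omitting it leaves the file classified as "unverified".
sdist_bytes = b"contents of myproject-1.0.tar.gz"   # stand-in for the real file
digest = hashlib.md5(sdist_bytes).hexdigest()
link = "https://example.com/dist/myproject-1.0.tar.gz#md5=" + digest

# An installer can recompute the digest of the downloaded bytes and
# compare it to the fragment before trusting the file.
assert link.rsplit("#md5=", 1)[1] == hashlib.md5(sdist_bytes).hexdigest()
```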
    

Multiple Repository/Index Support

Installers SHOULD implement, or continue to offer, the ability to point the installer at multiple URL locations. The exact mechanism for a user to indicate they wish to use an additional location is left up to each individual implementation.

Additionally, the mechanism for discovering an installation candidate when multiple repositories are in use is also up to each individual implementation; however, once configured, an implementation should not discourage, warn, or otherwise cast a negative light upon the use of a repository simply because it is not the default repository.

Currently both pip and setuptools implement multiple repository support by using the best installation candidate they can find from any configured repository, essentially treating the set of repositories as if it were one large repository.

Installers SHOULD also implement some mechanism for removing or otherwise disabling use of the default repository. The exact specifics of how that is achieved is up to each individual implementation.

Installers SHOULD also implement some mechanism for whitelisting and blacklisting which projects a user wishes to install from a particular repository. The exact specifics of how that is achieved is up to each individual implementation.

External Index Discovery

One of the problems with using an additional index is discovery. Users will generally not be aware that an additional index is required at all, much less where that index can be found. Projects can attempt to convey this information in their description on the PyPI page; however, that excludes people who discover the project organically through pip search.

To support projects that wish to externally host their files, and to enable users to easily discover what additional indexes are required, PyPI will gain the ability for projects to register external index URLs along with an associated comment for each. These URLs will be made available on the simple page; however, they will not be linked or provided in a form that older installers will automatically search.

This ability will take the form of a <meta> tag. The name of this tag must be set to repository or find-link and the content will be a link to the location of the repository. An optional data-description attribute will convey any comments or description that the author has provided.

An example would look something like:

<meta name="repository" content="https://index.example.com/" data-description="Primary Repository">
<meta name="repository" content="https://index.example.com/Ubuntu-14.04/" data-description="Wheels built for Ubuntu 14.04">
<meta name="find-link" content="https://links.example.com/find-links/" data-description="A flat index for find links">

When an installer fetches the simple page for a project, if it finds this additional metadata it should use the data to tell the user how to add one or more of the additional URLs to search. This message should include any comments the project has provided, as hints about which URL the user might want (e.g. if some are only useful or compatible with certain platforms or situations). Once an installer has implemented the auto-discovery mechanism, it should also deprecate the mechanisms added for PEP 438 (such as --allow-external) for removal at the end of the deprecation period proposed by the PEP.
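As an illustration of the installer side, the proposed meta tags could be extracted from a fetched simple page with nothing more than the standard library (a sketch under the PEP's proposed tag format, not a specified implementation; the class name and sample URL are illustrative):

```python
from html.parser import HTMLParser

class RepositoryMetaParser(HTMLParser):
    """Collect the proposed repository/find-link <meta> tags from a page."""
    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "meta" and a.get("name") in ("repository", "find-link"):
            self.found.append((a["name"], a.get("content"),
                               a.get("data-description")))

page = ('<meta name="repository" content="https://index.example.com/" '
        'data-description="Primary Repository">')
parser = RepositoryMetaParser()
parser.feed(page)
assert parser.found == [("repository", "https://index.example.com/",
                         "Primary Repository")]
```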

In addition to the API for programmatic access to the registered external repositories, PyPI will also present these URLs in the UI, so that users with an installer that does not implement the discovery mechanism can still easily discover what repository the project is using to host itself.

This feature MUST be added to PyPI and be contained in a released version of pip prior to starting the deprecation and removal process for the implicit offsite hosting functionality.

Summary of Changes

Repository side

  1. Implement simple API changes to allow the addition of an external repository.
  2. (Optional, Mandatory on PyPI) Deprecate and remove the hosting modes as defined by PEP 438.
  3. (Optional, Mandatory on PyPI) Restrict simple API to only list the files that are contained within the repository and the external repository metadata.

Client side

  1. Implement multiple repository support.
  2. Implement some mechanism for removing/disabling the default repository.
  3. Implement the discovery mechanism.
  4. (Optional) Deprecate / Remove PEP 438

Impact

The largest impact of this PEP is that users of older installation clients will not get a discovery mechanism built into the install command; they will have to browse to the PyPI web UI and discover the repository there. Since any URLs required to install a project will be automatically migrated to the new format, the biggest change for users will be needing a new option to install these projects.

Looking at the numbers, the actual impact should be quite low: only 3.8% of projects host any files exclusively externally, and only 2.2% have their latest version hosted exclusively externally.

6674 unique IP addresses accessed the Simple API for these 3.8% of projects in a single day (2014-09-30). Of those, 99.5% installed something which could not be verified, and thus were open to remote code execution via a man-in-the-middle attack, while 7.9% installed at least one thing which could be verified and only 0.4% installed only things which could be verified.

This means that 99.5% of the users of these features, both new and old, are doing something unsafe, and anyone using an older copy of pip, or using setuptools at all, is silently unsafe.

Projects Which Rely on Externally Hosted files

This is determined by crawling the simple index and looking for installable files, using a detection method similar to the ones pip and setuptools use. The "latest" version is determined using pkg_resources.parse_version sort order, and it is used to show whether the latest version is hosted externally or only old versions are.
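
The "latest version" computation described here can be sketched with parse_version as a sort key (the fallback import is our assumption; the analysis itself names pkg_resources.parse_version):

```python
try:
    from pkg_resources import parse_version       # as named in this analysis
except ImportError:
    # pkg_resources ships with setuptools; packaging is its modern successor
    from packaging.version import parse as parse_version

# pre-releases sort before the final release, post-releases after it
versions = ['1.0rc1', '0.9', '1.0', '1.0.post1']
latest = max(versions, key=parse_version)
assert latest == '1.0.post1'
```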

         PyPI   External (old)  External (latest)  Total
Safe     43313  16              39                 43368
Unsafe   0      756             1092               1848
Total    43313  772             1131               45216

Top Externally Hosted Projects by Requests

This is determined by looking at the number of requests the /simple/<project>/ page had gotten in a single day. The total number of requests during that day was 10,623,831.

Project Requests
PIL 63869
Pygame 2681
mysql-connector-python 1562
pyodbc 724
elementtree 635
salesforce-python-toolkit 316
wxPython 295
PyXML 251
RBTools 235
python-graph-core 123
cElementTree 121

Top Externally Hosted Projects by Unique IPs

This is determined by looking at the IP addresses of requests the /simple/<project>/ page had gotten in a single day. The total number of unique IP addresses during that day was 124,604.

Project Unique IPs
PIL 4553
mysql-connector-python 462
Pygame 202
pyodbc 181
elementtree 166
wxPython 126
RBTools 114
PyXML 87
salesforce-python-toolkit 76
pyDes 76

Rejected Proposals

Keep the current classification system but adjust the options

This PEP rejects several related proposals which attempt to fix some of the usability problems with the current system while still keeping the general gist of PEP 438.

This includes:

  • Default to allowing safely externally hosted files, but disallow unsafely hosted.
  • Default to disallowing safely externally hosted files with only a global flag to enable them, but disallow unsafely hosted.
  • Continue on the suggested path of PEP 438 and remove the option to unsafely host externally but continue to allow the option to safely host externally.

These proposals are rejected because:

  • The classification system introduced in PEP 438 is an entirely unique concept to PyPI, not generically applicable even in the context of Python packaging. Adding additional concepts comes at a cost.
  • The classification system itself is non-obvious to explain, and pre-determining which classification of link a project will require entails inspecting the project's /simple/<project>/ page, and possibly any URLs linked from that page.
  • The ability to host externally while still being linked for automatic discovery is mostly a historic relic which causes a fair amount of pain and complexity for little reward.
  • The installer's ability to optimize or clean up the user interface is limited due to the nature of the implicit link scraping which would need to be done. This extends to the --allow-* options as well as the inability to determine if a link is expected to fail or not.
  • The mechanism paints with a very broad brush when enabling an option, while PEP 438 attempts to limit this with per-package options. However, a project that has existed for an extended period of time may often have several different URLs listed in its simple index, and it is not unusual for at least one of these to no longer be under the project's control. While an unregistered domain will sit there relatively harmless most of the time, pip will continue to attempt to install from it on every discovery phase. This means that an attacker simply needs to look at projects which rely on unsafe external URLs and register expired domains to attack users.

pep-0471 os.scandir() function -- a better and faster directory iterator

PEP:471
Title:os.scandir() function -- a better and faster directory iterator
Version:$Revision$
Last-Modified:$Date$
Author:Ben Hoyt <benhoyt at gmail.com>
BDFL-Delegate:Victor Stinner <victor.stinner@gmail.com>
Status:Final
Type:Standards Track
Content-Type:text/x-rst
Created:30-May-2014
Python-Version:3.5
Post-History:27-Jun-2014, 8-Jul-2014, 14-Jul-2014

Abstract

This PEP proposes including a new directory iteration function, os.scandir(), in the standard library. This new function adds useful functionality and increases the speed of os.walk() by 2-20 times (depending on the platform and file system) by avoiding calls to os.stat() in most cases.

Rationale

Python's built-in os.walk() is significantly slower than it needs to be, because -- in addition to calling os.listdir() on each directory -- it executes the stat() system call or GetFileAttributes() on each file to determine whether the entry is a directory or not.

But the underlying system calls -- FindFirstFile / FindNextFile on Windows and readdir on POSIX systems -- already tell you whether the files returned are directories or not, so no further system calls are needed. Further, the Windows system calls return all the information for a stat_result object on the directory entry, such as file size and last modification time.

In short, you can reduce the number of system calls required for a tree function like os.walk() from approximately 2N to N, where N is the total number of files and directories in the tree. (And because directory trees are usually wider than they are deep, it's often much better than this.)

In practice, removing all those extra system calls makes os.walk() about 8-9 times as fast on Windows, and about 2-3 times as fast on POSIX systems. So we're not talking about micro-optimizations. See more benchmarks here [1].

Somewhat relatedly, many people (see Python Issue 11406 [2]) are also keen on a version of os.listdir() that yields filenames as it iterates instead of returning them as one big list. This improves memory efficiency for iterating very large directories.

So, as well as providing a scandir() iterator function for calling directly, Python's existing os.walk() function can be sped up a huge amount.

Implementation

The implementation of this proposal was written by Ben Hoyt (initial version) and Tim Golden (who helped a lot with the C extension module). It lives on GitHub at benhoyt/scandir [3]. (The implementation may lag behind the updates to this PEP a little.)

Note that this module has been used and tested (see "Use in the wild" section in this PEP), so it's more than a proof-of-concept. However, it is marked as beta software and is not extensively battle-tested. It will need some cleanup and more thorough testing before going into the standard library, as well as integration into posixmodule.c.

Specifics of proposal

os.scandir()

Specifically, this PEP proposes adding a single function to the os module in the standard library, scandir, that takes a single, optional string as its argument:

scandir(path='.') -> generator of DirEntry objects

Like listdir, scandir calls the operating system's directory iteration system calls to get the names of the files in the given path, but it's different from listdir in two ways:

  • Instead of returning bare filename strings, it returns lightweight DirEntry objects that hold the filename string and provide simple methods that allow access to the additional data the operating system may have returned.
  • It returns a generator instead of a list, so that scandir acts as a true iterator instead of returning the full list immediately.

scandir() yields a DirEntry object for each file and sub-directory in path. Just like listdir, the '.' and '..' pseudo-directories are skipped, and the entries are yielded in system-dependent order. Each DirEntry object has the following attributes and methods:

  • name: the entry's filename, relative to the scandir path argument (corresponds to the return values of os.listdir)
  • path: the entry's full path name (not necessarily an absolute path) -- the equivalent of os.path.join(scandir_path, entry.name)
  • is_dir(*, follow_symlinks=True): similar to pathlib.Path.is_dir(), but the return value is cached on the DirEntry object; doesn't require a system call in most cases; doesn't follow symbolic links if follow_symlinks is False
  • is_file(*, follow_symlinks=True): similar to pathlib.Path.is_file(), but the return value is cached on the DirEntry object; doesn't require a system call in most cases; doesn't follow symbolic links if follow_symlinks is False
  • is_symlink(): similar to pathlib.Path.is_symlink(), but the return value is cached on the DirEntry object; doesn't require a system call in most cases
  • stat(*, follow_symlinks=True): like os.stat(), but the return value is cached on the DirEntry object; does not require a system call on Windows (except for symlinks); doesn't follow symbolic links (i.e. behaves like os.lstat()) if follow_symlinks is False

All methods may perform system calls in some cases and therefore possibly raise OSError -- see the "Notes on exception handling" section for more details.

The DirEntry attribute and method names were chosen to be the same as those in the new pathlib module where possible, for consistency. The only difference in functionality is that the DirEntry methods cache their values on the entry object after the first call.

Like the other functions in the os module, scandir() accepts either a bytes or str object for the path parameter, and returns the DirEntry.name and DirEntry.path attributes with the same type as path. However, it is strongly recommended to use the str type, as this ensures cross-platform support for Unicode filenames. (On Windows, bytes filenames have been deprecated since Python 3.3).
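
This type round-tripping can be checked directly (a quick demonstration using a temporary directory; os.fsencode obtains the bytes form of a path):

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as d:
    open(os.path.join(d, 'x.txt'), 'w').close()

    # str path in -> str names out (the recommended form)
    assert all(isinstance(e.name, str) for e in os.scandir(d))

    # bytes path in -> bytes names out
    assert all(isinstance(e.name, bytes) for e in os.scandir(os.fsencode(d)))
```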

os.walk()

As part of this proposal, os.walk() will also be modified to use scandir() rather than listdir() and os.path.isdir(). This will increase the speed of os.walk() very significantly (as mentioned above, by 2-20 times, depending on the system).

Examples

First, a very simple example of scandir() showing use of the DirEntry.name attribute and the DirEntry.is_dir() method:

def subdirs(path):
    """Yield directory names not starting with '.' under given path."""
    for entry in os.scandir(path):
        if not entry.name.startswith('.') and entry.is_dir():
            yield entry.name

This subdirs() function will be significantly faster with scandir than os.listdir() and os.path.isdir() on both Windows and POSIX systems, especially on medium-sized or large directories.

Or, for getting the total size of files in a directory tree, showing use of the DirEntry.stat() method and DirEntry.path attribute:

def get_tree_size(path):
    """Return total size of files in given path and subdirs."""
    total = 0
    for entry in os.scandir(path):
        if entry.is_dir(follow_symlinks=False):
            total += get_tree_size(entry.path)
        else:
            total += entry.stat(follow_symlinks=False).st_size
    return total

This also shows the use of the follow_symlinks parameter to is_dir() -- in a recursive function like this, we probably don't want to follow links. (To properly follow links in a recursive function like this we'd want special handling for the case where following a symlink leads to a recursive loop.)

Note that get_tree_size() will get a huge speed boost on Windows, because no extra stat calls are needed, but on POSIX systems the size information is not returned by the directory iteration functions, so this function won't gain anything there.

Notes on caching

The DirEntry objects are relatively dumb -- the name and path attributes are obviously always cached, and the is_X and stat methods cache their values (immediately on Windows via FindNextFile, and on first use on POSIX systems via a stat system call) and never refetch from the system.

For this reason, DirEntry objects are intended to be used and thrown away after iteration, not stored in long-lived data structures with their methods called again and again.

If developers want "refresh" behaviour (for example, for watching a file's size change), they can simply use pathlib.Path objects, or call the regular os.stat() or os.path.getsize() functions which get fresh data from the operating system every call.
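
A small demonstration of the contrast, assuming a regular file on a local file system: the DirEntry caches its first stat() result, while os.stat() asks the operating system every time:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, 'example')
    with open(path, 'wb') as f:
        f.write(b'abc')

    entry = next(e for e in os.scandir(d) if e.name == 'example')
    assert entry.stat().st_size == 3      # fetched once, then cached

    with open(path, 'ab') as f:
        f.write(b'def')                   # file grows to 6 bytes

    assert entry.stat().st_size == 3      # DirEntry: stale cached value
    assert os.stat(path).st_size == 6     # os.stat(): fresh data every call
```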

Notes on exception handling

DirEntry.is_X() and DirEntry.stat() are explicitly methods rather than attributes or properties, to make it clear that they may not be cheap operations (although they often are), and they may do a system call. As a result, these methods may raise OSError.

For example, DirEntry.stat() will always make a system call on POSIX-based systems, and the DirEntry.is_X() methods will make a stat() system call on such systems if readdir() does not support d_type or returns a d_type with a value of DT_UNKNOWN, which can occur under certain conditions or on certain file systems.

Often this does not matter -- for example, os.walk() as defined in the standard library only catches errors around the listdir() calls.

Also, because the exception-raising behaviour of the DirEntry.is_X methods matches that of pathlib -- which only raises OSError in the case of permissions or other fatal errors, but returns False if the path doesn't exist or is a broken symlink -- it's often not necessary to catch errors around the is_X() calls.
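
pathlib's convention is easy to verify (the path below is an arbitrary name assumed not to exist):

```python
from pathlib import Path

p = Path('no-such-path-pep471-example')
assert p.is_dir() is False    # missing path: False, not OSError
assert p.is_file() is False
assert p.exists() is False
```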

However, when a user requires fine-grained error handling, it may be desirable to catch OSError around all method calls and handle as appropriate.

For example, below is a version of the get_tree_size() example shown above, but with fine-grained error handling added:

def get_tree_size(path):
    """Return total size of files in path and subdirs. If
    is_dir() or stat() fails, print an error message to stderr
    and assume zero size (for example, file has been deleted).
    """
    total = 0
    for entry in os.scandir(path):
        try:
            is_dir = entry.is_dir(follow_symlinks=False)
        except OSError as error:
            print('Error calling is_dir():', error, file=sys.stderr)
            continue
        if is_dir:
            total += get_tree_size(entry.path)
        else:
            try:
                total += entry.stat(follow_symlinks=False).st_size
            except OSError as error:
                print('Error calling stat():', error, file=sys.stderr)
    return total

Support

The scandir module on GitHub has been forked and used quite a bit (see "Use in the wild" in this PEP), but there's also been a fair bit of direct support for a scandir-like function from core developers and others on the python-dev and python-ideas mailing lists. A sampling:

  • python-dev: a good number of +1's and very few negatives for scandir and PEP 471 on this June 2014 python-dev thread
  • Nick Coghlan, a core Python developer: "I've had the local Red Hat release engineering team express their displeasure at having to stat every file in a network mounted directory tree for info that is present in the dirent structure, so a definite +1 to os.scandir from me, so long as it makes that info available." [source1]
  • Tim Golden, a core Python developer, supports scandir enough to have spent time refactoring and significantly improving scandir's C extension module. [source2]
  • Christian Heimes, a core Python developer: "+1 for something like yielddir()" [source3] and "Indeed! I'd like to see the feature in 3.4 so I can remove my own hack from our code base." [source4]
  • Gregory P. Smith, a core Python developer: "As 3.4beta1 happens tonight, this isn't going to make 3.4 so i'm bumping this to 3.5. I really like the proposed design outlined above." [source5]
  • Guido van Rossum on the possibility of adding scandir to Python 3.5 (as it was too late for 3.4): "The ship has likewise sailed for adding scandir() (whether to os or pathlib). By all means experiment and get it ready for consideration for 3.5, but I don't want to add it to 3.4." [source6]

Support for this PEP itself (meta-support?) was given by Nick Coghlan on python-dev: "A PEP reviewing all this for 3.5 and proposing a specific os.scandir API would be a good thing." [source7]

Use in the wild

To date, the scandir implementation is definitely useful, but has been clearly marked "beta", so it's uncertain how much use of it there is in the wild. Ben Hoyt has had several reports from people using it. For example:

  • Chris F: "I am processing some pretty large directories and was half expecting to have to modify getdents. So thanks for saving me the effort." [via personal email]
  • bschollnick: "I wanted to let you know about this, since I am using Scandir as a building block for this code. Here's a good example of scandir making a radical performance improvement over os.listdir." [source8]
  • Avram L: "I'm testing our scandir for a project I'm working on. Seems pretty solid, so first thing, just want to say nice work!" [via personal email]
  • Matt Z: "I used scandir to dump the contents of a network dir in under 15 seconds. 13 root dirs, 60,000 files in the structure. This will replace some old VBA code embedded in a spreadsheet that was taking 15-20 minutes to do the exact same thing." [via personal email]

Others have requested a PyPI package [4] for it, which has been created. See PyPI package [5].

GitHub stats don't mean too much, but scandir does have several watchers, issues, forks, etc. Here's the run-down of the stats as of July 7, 2014:

  • Watchers: 17
  • Stars: 57
  • Forks: 20
  • Issues: 4 open, 26 closed

Also, because this PEP will increase the speed of os.walk() significantly, there are thousands of developers and scripts, and a lot of production code, that would benefit from it. For example, on GitHub, there are almost as many uses of os.walk (194,000) as there are of os.mkdir (230,000).

Rejected ideas

Naming

The only other real contender for this function's name was iterdir(). However, iterX() functions in Python (mostly found in Python 2) tend to be simple iterator equivalents of their non-iterator counterparts. For example, dict.iterkeys() is just an iterator version of dict.keys(), but the objects returned are identical. In scandir()'s case, however, the return values are quite different objects (DirEntry objects vs filename strings), so this should probably be reflected by a difference in name -- hence scandir().

See some relevant discussion on python-dev.

Wildcard support

FindFirstFile/FindNextFile on Windows support passing a "wildcard" like *.jpg, so at first folks (this PEP's author included) felt it would be a good idea to include a windows_wildcard keyword argument to the scandir function so users could pass this in.

However, on further thought and discussion it was decided that this would be a bad idea, unless it could be made cross-platform (a pattern keyword argument or similar). This seems easy enough at first -- just use the OS wildcard support on Windows, and something like fnmatch or re afterwards on POSIX-based systems.

Unfortunately the exact Windows wildcard matching rules aren't really documented anywhere by Microsoft, and they're quite quirky (see this blog post), meaning it's very problematic to emulate using fnmatch or regexes.

So the consensus was that Windows wildcard support was a bad idea. It would be possible to add at a later date if there's a cross-platform way to achieve it, but not for the initial version.
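
One cross-platform shape such support might take is applying fnmatch uniformly on every platform, sidestepping the Windows quirks entirely (the scandir_pattern helper below is ours, not proposed by the PEP):

```python
import fnmatch
import os
import tempfile

def scandir_pattern(path, pattern):
    """Hypothetical helper: filter entries with fnmatch on every platform,
    instead of relying on Windows' quirky native wildcard rules."""
    for entry in os.scandir(path):
        if fnmatch.fnmatch(entry.name, pattern):
            yield entry

with tempfile.TemporaryDirectory() as d:
    for name in ('photo.jpg', 'notes.txt'):
        open(os.path.join(d, name), 'w').close()
    assert [e.name for e in scandir_pattern(d, '*.jpg')] == ['photo.jpg']
```

The trade-off is that Windows would no longer get the (small) benefit of filtering inside the FindFirstFile/FindNextFile calls themselves.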

Read more on this Nov 2012 python-ideas thread and this June 2014 python-dev thread on PEP 471.

DirEntry attributes being properties

In some ways it would be nicer for the DirEntry is_X() and stat() to be properties instead of methods, to indicate they're very cheap or free. However, this isn't quite the case, as stat() will require an OS call on POSIX-based systems but not on Windows. Even is_dir() and friends may perform an OS call on POSIX-based systems if the dirent.d_type value is DT_UNKNOWN (on certain file systems).

Also, people would expect the attribute access entry.is_dir to only ever raise AttributeError, not OSError in the case it makes a system call under the covers. Calling code would have to have a try/except around what looks like a simple attribute access, and so it's much better to make them methods.

See this May 2013 python-dev thread where this PEP's author makes this case and there's agreement from other core developers.

DirEntry fields being "static" attribute-only objects

In this July 2014 python-dev message, Paul Moore suggested a solution that was a "thin wrapper round the OS feature", where the DirEntry object had only static attributes: name, path, and is_X, with the st_X attributes only present on Windows. The idea was to use this simpler, lower-level function as a building block for higher-level functions.

At first there was general agreement that simplifying in this way was a good thing. However, there were two problems with this approach. First, it assumes the is_dir and similar attributes are always present on POSIX, which isn't the case (if d_type is not present or is DT_UNKNOWN). Second, it's a much harder-to-use API in practice: because the is_dir attributes aren't always present on POSIX, callers would need to test for them with hasattr() and then call os.stat() when they weren't present.

See this July 2014 python-dev response from this PEP's author detailing why this option is a non-ideal solution, and the subsequent reply from Paul Moore voicing agreement.

DirEntry fields being static with an ensure_lstat option

Another seemingly simpler and attractive option was suggested by Nick Coghlan in this June 2014 python-dev message: make DirEntry.is_X and DirEntry.lstat_result properties, and populate DirEntry.lstat_result at iteration time, but only if the new argument ensure_lstat=True was specified on the scandir() call.

This does have the advantage over the above in that you can easily get the stat result from scandir() if you need it. However, it has the serious disadvantage that fine-grained error handling is messy, because stat() will be called (and hence potentially raise OSError) during iteration, leading to a rather ugly, hand-made iteration loop:

it = os.scandir(path)
while True:
    try:
        entry = next(it)
    except OSError as error:
        handle_error(path, error)
    except StopIteration:
        break

Or it means that scandir() would have to accept an onerror argument -- a function to call when stat() errors occur during iteration. This seems to this PEP's author neither as direct nor as Pythonic as try/except around a DirEntry.stat() call.

Another drawback is that os.scandir() is written to make code faster. Always calling os.lstat() on POSIX would not bring any speedup. In most cases, you don't need the full stat_result object -- the is_X() methods are enough and this information is already known.

See Ben Hoyt's July 2014 reply to the discussion summarizing this and detailing why he thinks the original PEP 471 proposal is "the right one" after all.

Return values being (name, stat_result) two-tuples

Initially this PEP's author proposed this concept as a function called iterdir_stat() which yielded two-tuples of (name, stat_result). This does have the advantage that there are no new types introduced. However, the stat_result is only partially filled on POSIX-based systems (most fields set to None and other quirks), so they're not really stat_result objects at all, and this would have to be thoroughly documented as different from os.stat().

Also, Python has good support for proper objects with attributes and methods, which makes for a saner and simpler API than two-tuples. It also makes the DirEntry objects more extensible and future-proof as operating systems add functionality and we want to include this in DirEntry.


Return values being overloaded stat_result objects

Another alternative discussed was making the return values to be overloaded stat_result objects with name and path attributes. However, apart from this being a strange (and strained!) kind of overloading, this has the same problems mentioned above -- most of the stat_result information is not fetched by readdir() on POSIX systems, only (part of) the st_mode value.

Return values being pathlib.Path objects

With Antoine Pitrou's new standard library pathlib module, it at first seems like a great idea for scandir() to return instances of pathlib.Path. However, pathlib.Path's is_X() and stat() functions are explicitly not cached, whereas scandir has to cache them by design, because it's (often) returning values from the original directory iteration system call.

And if the pathlib.Path instances returned by scandir cached stat values, but the ordinary pathlib.Path objects explicitly don't, that would be more than a little confusing.

Guido van Rossum explicitly rejected pathlib.Path caching stat in the context of scandir here, making pathlib.Path objects a bad choice for scandir return values.

Possible improvements

There are many possible improvements one could make to scandir, but here is a short list of some this PEP's author has in mind:

  • scandir could potentially be further sped up by calling readdir / FindNextFile say 50 times per Py_BEGIN_ALLOW_THREADS block so that it stays in the C extension module for longer, and may be somewhat faster as a result. This approach hasn't been tested, but was suggested on Issue 11406 by Antoine Pitrou. [source9]
  • scandir could use a free list to avoid the cost of memory allocation for each iteration -- a short free list of 10 or maybe even 1 may help. Suggested by Victor Stinner on a python-dev thread on June 27 [6].


pep-0472 Support for indexing with keyword arguments

PEP:472
Title:Support for indexing with keyword arguments
Version:$Revision$
Last-Modified:$Date$
Author:Stefano Borini, Joseph Martinot-Lagarde
Discussions-To:python-ideas at python.org
Status:Draft
Type:Standards Track
Content-Type:text/x-rst
Created:24-Jun-2014
Python-Version:3.6
Post-History:02-Jul-2014

Abstract

This PEP proposes an extension of the indexing operation to support keyword arguments. Notations in the form a[K=3, R=2] would become legal syntax. For future-proofing considerations, a[1:2, K=3, R=4] is considered and may be allowed as well, depending on the choice of implementation. In addition to a change in the parser, the index protocol (__getitem__, __setitem__ and __delitem__) will also potentially require adaptation.

Motivation

The indexing syntax carries a strong semantic content, differentiating it from a method call: it implies referring to a subset of data. We believe this semantic association to be important, and wish to expand the strategies allowed to refer to this data.

As a general observation, the number of indices needed by an indexing operation depends on the dimensionality of the data: one-dimensional data (e.g. a list) requires one index (e.g. a[3]), two-dimensional data (e.g. a matrix) requires two indices (e.g. a[2,3]) and so on. Each index is a selector along one of the axes of the dimensionality, and the position in the index tuple is the metainformation needed to associate each index to the corresponding axis.

The current Python syntax focuses exclusively on position to express the association to the axes, and also contains syntactic sugar to refer to non-punctiform selection (slices):

>>> a[3]       # returns the fourth element of a
>>> a[1:10:2]  # slice notation (extract a non-trivial data subset)
>>> a[3,2]     # multiple indexes (for multidimensional arrays)

The additional notation proposed in this PEP would allow notations involving keyword arguments in the indexing operation, e.g.

>>> a[K=3, R=2]

which would allow referring to axes by conventional names.

One must additionally consider the extended form that allows both positional and keyword specification:

>>> a[3,R=3,K=4]

This PEP will explore different strategies to enable the use of these notations.

Use cases

The following practical use cases present two broad categories of usage of a keyworded specification: Indexing and contextual option. For indexing:

  1. To provide a more communicative meaning to the index, preventing e.g. accidental inversion of indexes

    >>> gridValues[x=3, y=5, z=8]
    >>> rain[time=0:12, location=location]
    
  2. In some domains, such as computational physics and chemistry, a notation such as Basis[Z=5] is a Domain Specific Language notation used to represent a level of accuracy

    >>> low_accuracy_energy = computeEnergy(molecule, BasisSet[Z=3])
    

    In this case, the index operation would return a basis set at the chosen level of accuracy (represented by the parameter Z). The reason for using indexing is that the BasisSet object could be internally represented as a numeric table, where rows (the "coefficient" axis, hidden from the user in this example) are associated to individual elements (e.g. rows 0:5 contain coefficients for element 1, rows 5:8 coefficients for element 2) and each column is associated to a given degree of accuracy (the "accuracy" or "Z" axis), so that the first column is low accuracy, the second column is medium accuracy, and so on. With that indexing, the user would obtain another object representing the contents of the column of the internal table for accuracy level 3.

Additionally, the keyword specification can be used as an option contextual to the indexing. Specifically:

  1. A "default" option allows specifying a default return value when the index is not present

    >>> lst = [1, 2, 3]
    >>> value = lst[5, default=0]  # value is 0
    
  2. For a sparse dataset, to specify an interpolation strategy to infer a missing point from e.g. its surrounding data.

    >>> value = array[1, 3, interpolate=spline_interpolator]
    
  3. A unit could be specified with the same mechanism

    >>> value = array[1, 3, unit="degrees"]
    

How the notation is interpreted is up to the implementing class.

Current implementation

Currently, the indexing operation is handled by the methods __getitem__, __setitem__ and __delitem__. These methods' signatures accept one argument for the index (with __setitem__ accepting an additional argument for the value to set). In the following, we will analyze __getitem__(self, idx) exclusively, with the same considerations implied for the remaining two methods.

When an indexing operation is performed, __getitem__(self, idx) is called. Traditionally, the full content between square brackets is turned into a single object passed to argument idx:

  • When a single element is passed, e.g. a[2], idx will be 2.
  • When multiple elements are passed, they must be separated by commas: a[2, 3]. In this case, idx will be a tuple (2, 3). With a[2, 3, "hello", {}] idx will be (2, 3, "hello", {}).
  • A slicing notation e.g. a[2:10] will produce a slice object, or a tuple containing slice objects if multiple values were passed.
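These rules can be verified with a minimal class whose __getitem__ simply echoes the raw index object it receives:

```python
class Echo:
    """A minimal class whose __getitem__ returns the raw index object."""
    def __getitem__(self, idx):
        return idx

a = Echo()
assert a[2] == 2                          # single element: passed as-is
assert a[2, 3] == (2, 3)                  # comma-separated: packed in a tuple
assert a[2, 3, "hello", {}] == (2, 3, "hello", {})
assert a[2:10] == slice(2, 10)            # slice notation: a slice object
assert a[2:10, 4] == (slice(2, 10), 4)    # mixed: a tuple containing slices
```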

Except for its unique ability to handle slice notation, the indexing operation has similarities to a plain method call: it acts like one when invoked with a single element; if more than one element is passed, the idx argument behaves like *args. However, as stated in the Motivation section, an indexing operation carries the strong semantic implication of extracting a subset out of a larger set, which is not automatically associated with a regular method call unless an appropriate name is chosen. Moreover, its distinct visual style is important for readability.

Specifications

The implementation should try to preserve the current signature for __getitem__, or modify it in a backward-compatible way. We will present different alternatives, taking into account the possible cases that need to be addressed:

C0. a[1]; a[1,2]         # Traditional indexing
C1. a[Z=3]
C2. a[Z=3, R=4]
C3. a[1, Z=3]
C4. a[1, Z=3, R=4]
C5. a[1, 2, Z=3]
C6. a[1, 2, Z=3, R=4]
C7. a[1, Z=3, 2, R=4]    # Interposed ordering

Strategy "Strict dictionary"

This strategy acknowledges that __getitem__ is special in accepting only one object, and that the nature of that object must be unambiguous in its specification of the axes: either by order, or by name. As a consequence, in the presence of keyword arguments the passed entity is a dictionary, and all labels must be specified.

C0. a[1]; a[1,2]      -> idx = 1; idx = (1, 2)
C1. a[Z=3]            -> idx = {"Z": 3}
C2. a[Z=3, R=4]       -> idx = {"Z": 3, "R": 4}
C3. a[1, Z=3]         -> raise SyntaxError
C4. a[1, Z=3, R=4]    -> raise SyntaxError
C5. a[1, 2, Z=3]      -> raise SyntaxError
C6. a[1, 2, Z=3, R=4] -> raise SyntaxError
C7. a[1, Z=3, 2, R=4] -> raise SyntaxError

Pros

  • Strong conceptual similarity between the tuple case and the dictionary case. In the first case, we are specifying a tuple, so we are naturally defining a plain set of values separated by commas. In the second, we are specifying a dictionary, so we are specifying a homogeneous set of key/value pairs, as in dict(Z=3, R=4);
  • Simple and easy to parse on the __getitem__ side: if it gets a tuple, determine the axes using positioning. If it gets a dictionary, use the keywords.
  • C interface does not need changes.

Neutral

  • Degeneracy of a[{"Z": 3, "R": 4}] with a[Z=3, R=4] means the notation is syntactic sugar.

Cons

  • Very strict.
  • Destroys ordering of the passed arguments. Preserving the order would be possible with an OrderedDict as drafted by PEP-468 [5].
  • Does not allow use cases with mixed positional/keyword arguments such as a[1, 2, default=5].
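Since under this strategy a[Z=3, R=4] would be sugar for a[{"Z": 3, "R": 4}], the dispatch logic described in the Pros can already be sketched today with an explicit dictionary; the Grid class and its Z/R axis names are hypothetical:

```python
class Grid:
    """Hypothetical 2-axis container; the axes are named Z and R."""
    def __init__(self, data):
        self.data = data          # nested dict: data[z][r]

    def __getitem__(self, idx):
        if isinstance(idx, dict):
            # Strict dictionary case: every axis must be named.
            return self.data[idx["Z"]][idx["R"]]
        # Tuple case: resolve axes by position.
        z, r = idx
        return self.data[z][r]

g = Grid({3: {4: "cell"}})
# a[Z=3, R=4] would be sugar for the explicit-dictionary call:
assert g[3, 4] == g[{"Z": 3, "R": 4}] == "cell"
```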

Strategy "mixed dictionary"

This strategy relaxes the above constraint to return a dictionary containing both numbers and strings as keys.

C0. a[1]; a[1,2]      -> idx = 1; idx = (1, 2)
C1. a[Z=3]            -> idx = {"Z": 3}
C2. a[Z=3, R=4]       -> idx = {"Z": 3, "R": 4}
C3. a[1, Z=3]         -> idx = { 0: 1, "Z": 3}
C4. a[1, Z=3, R=4]    -> idx = { 0: 1, "Z": 3, "R": 4}
C5. a[1, 2, Z=3]      -> idx = { 0: 1, 1: 2, "Z": 3}
C6. a[1, 2, Z=3, R=4] -> idx = { 0: 1, 1: 2, "Z": 3, "R": 4}
C7. a[1, Z=3, 2, R=4] -> idx = { 0: 1, "Z": 3, 2: 2, "R": 4}

Pros

  • Opens for mixed cases.

Cons

  • Destroys ordering information for string keys. We have no way of saying if "Z" in C7 was in position 1 or 3.
  • Implies switching from a tuple to a dict as soon as one specified index has a keyword argument. May be confusing to parse.

Strategy "named tuple"

Return a named tuple for idx instead of a tuple. Keyword arguments would have their stated name as the key, and positional arguments would be named with an underscore followed by their position:

C0. a[1]; a[1,2]      -> idx = 1; idx = (_0=1, _1=2)
C1. a[Z=3]            -> idx = (Z=3)
C2. a[Z=3, R=4]       -> idx = (Z=3, R=4)
C3. a[1, Z=3]         -> idx = (_0=1, Z=3)
C4. a[1, Z=3, R=4]    -> idx = (_0=1, Z=3, R=4)
C5. a[1, 2, Z=3]      -> idx = (_0=1, _1=2, Z=3)
C6. a[1, 2, Z=3, R=4] -> (_0=1, _1=2, Z=3, R=4)
C7. a[1, Z=3, 2, R=4] -> (_0=1, Z=3, _1=2, R=4)
                      or (_0=1, Z=3, _2=2, R=4)
                      or raise SyntaxError

The required typename of the namedtuple could be Index, or the name of the argument in the function definition. This solution keeps the ordering and is easy to analyze via the _fields attribute. It is backward compatible, except that C0 with more than one entry would now pass a namedtuple instead of a plain tuple.

Pros

  • Looks nice. namedtuple transparently replaces tuple and gracefully degrades to the old behavior.
  • Does not require a change in the C interface

Cons

  • According to some sources [4], namedtuple is not well developed. Relying on it for such an important role would probably require rework and improvement;
  • The namedtuple fields, and thus the type, will have to change according to the passed arguments. This can be a performance bottleneck, and makes it impossible to guarantee that two subsequent index accesses get the same Index class;
  • The _n "magic" fields are a bit unusual, but IPython already uses them for result history.
  • namedtuple is not a builtin: it is currently available in the collections module of the standard library.
  • Differently from a function, the two notations gridValues[x=3, y=5, z=8] and gridValues[3,5,8] would not gracefully match if the order is modified at call time (e.g. we ask for gridValues[y=5, z=8, x=3]). In a function, we can pre-define argument names so that keyword arguments are properly matched. Not so in __getitem__, leaving the task for interpreting and matching to __getitem__ itself.
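The dynamic creation of the Index type, and the resulting loss of type identity mentioned in the Cons, can be illustrated with today's collections.namedtuple. Note that the factory rejects leading-underscore field names, so i0, i1 stand in for the proposed _0, _1; the make_index helper is hypothetical:

```python
from collections import namedtuple

def make_index(*args, **kwargs):
    """Hypothetical helper mimicking what the interpreter would build."""
    # collections.namedtuple rejects leading-underscore names,
    # so i0, i1, ... stand in for the proposed _0, _1, ...
    fields = ["i%d" % n for n in range(len(args))] + list(kwargs)
    Index = namedtuple("Index", fields)   # a new type per field layout
    return Index(*args, **kwargs)

idx = make_index(1, 2, Z=3)
assert idx == (1, 2, 3)                   # gracefully degrades to a tuple
assert idx.Z == 3 and idx._fields == ("i0", "i1", "Z")
# The type changes with the layout: two lookups need not share a class.
assert type(make_index(1, Z=3)) is not type(make_index(1, 2, Z=3))
```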

Strategy "New argument contents"

In the current implementation, when many arguments are passed to __getitem__, they are grouped in a tuple and this tuple is passed to __getitem__ as the single argument idx. This strategy keeps the current signature, but expands the range of variability in type and contents of idx to more complex representations.

We identify four possible ways to implement this strategy:

  • P1: uses a single dictionary for the keyword arguments.
  • P2: uses individual single-item dictionaries.
  • P3: similar to P2, but replaces single-item dictionaries with a (key, value) tuple.
  • P4: similar to P2, but uses a special and additional new object: keyword()

Some of these possibilities lead to degenerate notations, i.e. indistinguishable from an already possible representation. Once again, the proposed notation becomes syntactic sugar for these representations.

Under this strategy, the old behavior for C0 is unchanged.

C0: a[1]        -> idx = 1                    # integer
    a[1,2]      -> idx = (1,2)                # tuple

In C1, we can use either a dictionary or a tuple to represent the key/value pair for the specific indexing entry. We need a tuple of tuples in C1 because otherwise we could not differentiate a["Z", 3] from a[Z=3].

C1: a[Z=3]      -> idx = {"Z": 3}             # P1/P2 dictionary with single key
                or idx = (("Z", 3),)          # P3 tuple of tuples
                or idx = keyword("Z", 3)      # P4 keyword object

As you can see, notation P1/P2 implies that a[Z=3] and a[{"Z": 3}] will call __getitem__ passing the exact same value, and is therefore syntactic sugar for the latter. The same situation occurs, although with a different index, for P3. Using a keyword object as in P4 would remove this degeneracy.
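The keyword object of P4 does not exist; a minimal stand-in shows how it would remove the degeneracy (the class below is a hypothetical sketch, analogous in spirit to slice()):

```python
class keyword:
    """Hypothetical P4 marker object, analogous in spirit to slice()."""
    def __init__(self, name, value):
        self.name, self.value = name, value

class Echo:
    def __getitem__(self, idx):
        return idx

a = Echo()
# Under P4, a[Z=3] would arrive as keyword("Z", 3), which cannot be
# confused with an explicitly passed dictionary:
kw = a[keyword("Z", 3)]
assert isinstance(kw, keyword) and (kw.name, kw.value) == ("Z", 3)
assert not isinstance(a[{"Z": 3}], keyword)   # the dict case stays distinct
```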

For the C2 case:

C2. a[Z=3, R=4] -> idx = {"Z": 3, "R": 4}     # P1 dictionary/ordereddict
                or idx = ({"Z": 3}, {"R": 4}) # P2 tuple of two single-key dict
                or idx = (("Z", 3), ("R", 4)) # P3 tuple of tuples
                or idx = (keyword("Z", 3),
                          keyword("R", 4) )   # P4 keyword objects

P1 naturally maps to the traditional **kwargs behavior; however, it breaks the convention that two or more entries for the index produce a tuple. P2 preserves this behavior, and additionally preserves the order. Preserving the order would also be possible with an OrderedDict, as drafted by PEP-468 [5].

The remaining cases are here shown:

C3. a[1, Z=3]   -> idx = (1, {"Z": 3})                     # P1/P2
                or idx = (1, ("Z", 3))                     # P3
                or idx = (1, keyword("Z", 3))              # P4

C4. a[1, Z=3, R=4] -> idx = (1, {"Z": 3, "R": 4})          # P1
                   or idx = (1, {"Z": 3}, {"R": 4})        # P2
                   or idx = (1, ("Z", 3), ("R", 4))        # P3
                   or idx = (1, keyword("Z", 3),
                                keyword("R", 4))           # P4

C5. a[1, 2, Z=3]   -> idx = (1, 2, {"Z": 3})               # P1/P2
                   or idx = (1, 2, ("Z", 3))               # P3
                   or idx = (1, 2, keyword("Z", 3))        # P4

C6. a[1, 2, Z=3, R=4] -> idx = (1, 2, {"Z":3, "R": 4})     # P1
                      or idx = (1, 2, {"Z": 3}, {"R": 4})  # P2
                      or idx = (1, 2, ("Z", 3), ("R", 4))  # P3
                      or idx = (1, 2, keyword("Z", 3),
                                      keyword("R", 4))     # P4

C7. a[1, Z=3, 2, R=4] -> idx = (1, 2, {"Z": 3, "R": 4})    # P1. Pack the keyword arguments. Ugly.
                      or raise SyntaxError                 # P1. Same behavior as in function calls.
                      or idx = (1, {"Z": 3}, 2, {"R": 4})  # P2
                      or idx =  (1, ("Z", 3), 2, ("R", 4)) # P3
                      or idx =  (1, keyword("Z", 3),
                                 2, keyword("R", 4))       # P4

Pros

  • Signature is unchanged;
  • P2/P3 can preserve ordering of keyword arguments as specified at indexing,
  • P1 needs an OrderedDict, but would destroy interposed ordering if allowed: all keyword indexes would be dumped into the dictionary;
  • Stays within traditional types: tuples and dicts, possibly OrderedDict;
  • Some proposed strategies are similar in behavior to a traditional function call;
  • The C interface for PyObject_GetItem and family would remain unchanged.

Cons

  • Apparently complex and wasteful;
  • Degeneracy in notation (e.g. a[Z=3] and a[{"Z":3}] are equivalent and indistinguishable notations at the __[get|set|del]item__ level). This behavior may or may not be acceptable.
  • For P4, an additional object similar in nature to slice() is needed, but only to disambiguate the above degeneracy.
  • The idx type and layout seem to change depending on the whims of the caller;
  • May be complex to parse what is passed, especially in the case of tuples of tuples;
  • P2 creates many single-key dictionaries as members of a tuple. Looks ugly. P3 would be lighter and easier to use than the tuple of dicts, and still preserves order (unlike a regular dict), but would result in clumsy extraction of keywords.
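The clumsy keyword extraction for P3 mentioned in the last bullet can be made concrete; split_index is a hypothetical helper, and the second assertion shows the ambiguity it introduces:

```python
def split_index(idx):
    """Hypothetical helper: separate positional entries from P3 pairs."""
    if not isinstance(idx, tuple):
        return (idx,), {}
    positional, keywords = [], {}
    for item in idx:
        if (isinstance(item, tuple) and len(item) == 2
                and isinstance(item[0], str)):
            keywords[item[0]] = item[1]     # a ("name", value) pair
        else:
            positional.append(item)
    return tuple(positional), keywords

# C6: a[1, 2, Z=3, R=4] under P3
assert split_index((1, 2, ("Z", 3), ("R", 4))) == ((1, 2), {"Z": 3, "R": 4})
# The clumsiness: a genuine two-element tuple index is misread as a keyword.
assert split_index((("x", 0),)) == ((), {"x": 0})
```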

Strategy "kwargs argument"

__getitem__ accepts an optional **kwargs argument for the keyword-only entries. idx also becomes optional, to support the case where no positional indexes are given. The signature would then be one of:

__getitem__(self, idx)
__getitem__(self, idx, **kwargs)
__getitem__(self, **kwargs)

Applied to our cases would produce:

C0. a[1,2]            -> idx=(1,2);  kwargs={}
C1. a[Z=3]            -> idx=None ;  kwargs={"Z":3}
C2. a[Z=3, R=4]       -> idx=None ;  kwargs={"Z":3, "R":4}
C3. a[1, Z=3]         -> idx=1    ;  kwargs={"Z":3}
C4. a[1, Z=3, R=4]    -> idx=1    ;  kwargs={"Z":3, "R":4}
C5. a[1, 2, Z=3]      -> idx=(1,2);  kwargs={"Z":3}
C6. a[1, 2, Z=3, R=4] -> idx=(1,2);  kwargs={"Z":3, "R":4}
C7. a[1, Z=3, 2, R=4] -> raise SyntaxError # in agreement to function behavior

Empty indexing a[] of course remains invalid syntax.

Pros

  • Similar to function call, evolves naturally from it;
  • Use of keyword indexing with an object whose __getitem__ does not accept **kwargs will fail in an obvious way. That is not the case for the other strategies.

Cons

  • It doesn't preserve order, unless an OrderedDict is used;
  • Forbids C7, but is it really needed?
  • Requires a change in the C interface to pass an additional PyObject for the keyword arguments.
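Until such syntax exists, the proposed signature can only be exercised by calling __getitem__ directly; the sketch below shows what the dispatch could look like (the Sparse class and the "default" option are illustrative assumptions):

```python
class Sparse:
    """Illustrative container supporting an optional default= keyword."""
    def __init__(self, data):
        self.data = data

    def __getitem__(self, idx=None, **kwargs):
        if "default" in kwargs and idx not in self.data:
            return kwargs["default"]
        return self.data[idx]

s = Sparse({1: "a", 2: "b"})
assert s[1] == "a"                           # plain indexing is unaffected
# What s[5, default="?"] would mean under this strategy:
assert s.__getitem__(5, default="?") == "?"
```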

C interface

As briefly introduced in the previous analysis, the C interface would potentially have to change to allow the new feature. Specifically, PyObject_GetItem and related routines would have to accept an additional PyObject *kw argument for Strategy "kwargs argument". The remaining strategies would not require a change in the C function signatures, but the different nature of the passed object would potentially require adaptation.

Strategy "named tuple" would behave correctly without any change: the namedtuple factory in collections returns a subclass of tuple, meaning that the PyTuple_* functions can handle the resulting object.

Alternative Solutions

In this section, we present alternative solutions that would work around the missing feature and make the proposed enhancement not worth implementing.

Use a method

One could keep the indexing as is, and use a traditional get() method for those cases where basic indexing is not enough. This is a good point, but as already reported in the introduction, methods have a different semantic weight from indexing, and you can't use slices directly in methods. Compare e.g. a[1:3, Z=2] with a.get(slice(1,3), Z=2).

The authors however recognize this argument as compelling: the advantage in semantic expressivity of keyword-based indexing may be offset by its being a rarely used feature that does not bring enough benefit and may see limited adoption.

Emulate requested behavior by abusing the slice object

This extremely creative method exploits the behavior of slice objects, provided that one accepts using strings (or instantiating properly named placeholder objects) for the keys, and using ":" instead of "=".

>>> a["K":3]
slice('K', 3, None)
>>> a["K":3, "R":4]
(slice('K', 3, None), slice('R', 4, None))
>>>

While clearly smart, this approach does not allow easy inquiry of the key/value pair, it is too clever and esoteric, and it does not allow passing a slice as in a[K=1:10:2].
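The trick does work today; unpacking the resulting slice objects back into key/value pairs is exactly the inconvenient inquiry mentioned above (the Table class is illustrative):

```python
class Table:
    """Illustrative class unpacking slice-abused "keywords"."""
    def __getitem__(self, idx):
        items = idx if isinstance(idx, tuple) else (idx,)
        # Each "keyword" arrives as slice(key, value, None).
        return {s.start: s.stop for s in items if isinstance(s, slice)}

t = Table()
assert t["K":3] == {"K": 3}
assert t["K":3, "R":4] == {"K": 3, "R": 4}
```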

However, Tim Delaney comments:

"I really do think that a[b=c, d=e] should just be syntax sugar for a['b':c, 'd':e]. It's simple to explain, and gives the greatest backwards compatibility. In particular, libraries that already abused slices in this way will just continue to work with the new syntax."

We think this behavior would produce inconvenient results. The library Pandas uses strings as labels, allowing notation such as

>>> a[:, "A":"F"]

to extract data from column "A" to column "F". Under the above comment, this notation would be equally obtained with

>>> a[:, A="F"]

which is weird and collides with the intended meaning of keywords in indexing, that is, specifying the axis through conventional names rather than positioning.

Pass a dictionary as an additional index

>>> a[1, 2, {"K": 3}]

This notation, although less elegant, can already be used and achieves similar results. It is evident that the proposed Strategy "New argument contents" can be interpreted as syntactic sugar for this notation.

Additional Comments

Commenters also expressed the following relevant points:

Relevance of ordering of keyword arguments

As part of the discussion of this PEP, it is important to decide whether the ordering information of the keyword arguments is relevant, and whether indexes and keys can be ordered arbitrarily (e.g. a[1, Z=3, 2, R=4]). PEP-468 [5] tries to address the first point by proposing the use of an OrderedDict; however, one would be inclined to accept that keyword arguments in indexing are equivalent to kwargs in function calls, and therefore, as of today, equally unordered and subject to the same restrictions.

Need for homogeneity of behavior

Relative to Strategy "New argument contents", a comment from Ian Cordasco points out that

"it would be unreasonable for just one method to behave totally differently from the standard behaviour in Python. It would be confusing for only __getitem__ (and ostensibly, __setitem__) to take keyword arguments but instead of turning them into a dictionary, turn them into individual single-item dictionaries." We agree with his point; however, it must be pointed out that __getitem__ is already special in some regards when it comes to the passed arguments.

Chris Angelico also states:

"it seems very odd to start out by saying "here, let's give indexing the option to carry keyword args, just like with function calls", and then come back and say "oh, but unlike function calls, they're inherently ordered and carried very differently"." Again, we agree on this point. The most straightforward strategy to keep homogeneity would be Strategy "kwargs argument", opening to a **kwargs argument on __getitem__.

One of the authors (Stefano Borini) thinks that only the "strict dictionary" strategy is worth implementing. It is unambiguous, simple, does not force complex parsing, and addresses the problem of referring to axes either by position or by name. The "options" use case is probably best handled with a different approach, and may be irrelevant for this PEP. The alternative "named tuple" is another valid choice.

Having .get() become obsolete for indexing with default fallback

Introducing a "default" keyword could make dict.get() obsolete, which would be replaced by d["key", default=3]. Chris Angelico however states:

"Currently, you need to write __getitem__ (which raises an exception on finding a problem) plus something else, e.g. get(), which returns a default instead. By your proposal, both branches would go inside __getitem__, which means they could share code; but there still need to be two branches."

Additionally, Chris continues:

"There'll be an ad-hoc and fairly arbitrary puddle of names (some will go default=, others will say that's way too long and go def=, except that that's a keyword so they'll use dflt= or something...), unless there's a strong force pushing people to one consistent name.".

This argument is valid but it's equally valid for any function call, and is generally fixed by established convention and documentation.

On degeneracy of notation

User Drekin commented: "The case of a[Z=3] and a[{"Z": 3}] is similar to current a[1, 2] and a[(1, 2)]. Even though one may argue that the parentheses are actually not part of tuple notation but are just needed because of syntax, it may look as degeneracy of notation when compared to function call: f(1, 2) is not the same thing as f((1, 2)).".

References

[1] "keyword-only args in __getitem__" (http://article.gmane.org/gmane.comp.python.ideas/27584)
[2] "Accepting keyword arguments for __getitem__" (https://mail.python.org/pipermail/python-ideas/2014-June/028164.html)
[3] "PEP pre-draft: Support for indexing with keyword arguments" (https://mail.python.org/pipermail/python-ideas/2014-July/028250.html)
[4] "namedtuple is not as good as it should be" (https://mail.python.org/pipermail/python-ideas/2013-June/021257.html)
[5] "Preserving the order of **kwargs in a function." (http://legacy.python.org/dev/peps/pep-0468/)

pep-0473 Adding structured data to built-in exceptions

PEP:473
Title:Adding structured data to built-in exceptions
Version:$Revision$
Last-Modified:$Date$
Author:Sebastian Kreft <skreft at deezer.com>
Status:Draft
Type:Standards Track
Content-Type:text/x-rst
Created:29-Mar-2014
Post-History:

Abstract

Exceptions like AttributeError, IndexError, KeyError, LookupError, NameError, TypeError, and ValueError do not provide all the information programmers need to debug and better understand what caused them. Furthermore, in some cases the messages have slightly different formats, which makes it really difficult for tools to automatically provide additional information to diagnose the problem. To tackle the former and to lay the ground for the latter, it is proposed to expand these exceptions so that they hold both the offending and affected entities.

Rationale

The main issue this PEP aims to solve is that current error messages are not very expressive and lack key information needed to resolve the exceptions. Additionally, the information present in the error message is not always in the same format, which makes it very difficult for third-party libraries to provide automated diagnosis of the error.

These automated tools could, for example, detect typos or display or log extra debug information. These could be particularly useful when running tests or in a long-running application.

Although such libraries are possible in theory, they need to resort to hacks to achieve their goal. One such example is python-improved-exceptions [1], which modifies the bytecode to keep references to the possibly interesting objects and also parses the error messages to extract information like types or names. Unfortunately, such an approach is extremely fragile and not portable.

A similar proposal [2] has been implemented for ImportError, and in the same fashion this idea has received support [3]. Additionally, almost 10 years ago Guido asked in [11] for a clean API to access the affected objects in exceptions like KeyError, AttributeError, NameError, and IndexError. Similar issues and proposal ideas have been raised over the last year. Other issues have been created as well, but despite receiving support they were eventually abandoned. References to the created issues are listed in the References section below.

To move forward with the development and to centralize the information and discussion, this PEP aims to be a meta-issue summarizing all the above discussions and ideas.

Examples

IndexError

The error message references neither the list's length nor the index used.

a = [1, 2, 3, 4, 5]
a[5]
IndexError: list index out of range

KeyError

By convention the key is the first element of the error's arguments, but there is no other information regarding the affected dictionary (key types, size, etc.).

b = {'foo': 1}
b['fo']
KeyError: 'fo'

AttributeError

The object's type and the offending attribute are part of the error message. However, there are several different formats and the information is not always available. Furthermore, although the object type is useful in some cases, given the dynamic nature of Python it would be much more useful to have a reference to the object itself. Additionally, the reference to the type is not fully qualified, and in some cases the type is just too generic to provide useful information, for example when accessing a module's attribute.

c = object()
c.foo
AttributeError: 'object' object has no attribute 'foo'

import string
string.foo
AttributeError: 'module' object has no attribute 'foo'

a = string.Formatter()
a.foo
AttributeError: 'Formatter' object has no attribute 'foo'

NameError

The error message typically provides the name.

foo = 1
fo
NameError: global name 'fo' is not defined

Other Cases

Issues are even harder to debug when the target object is the result of another expression, for example:

a[b[c[0]]]

This issue is also related to the fact that opcodes only have line number information and not the offset. This proposal would help in this case but not as much as having offsets.

Proposal

Extend the exceptions AttributeError, IndexError, KeyError, LookupError, NameError, TypeError, and ValueError with the following:

  • AttributeError: target^w, attribute
  • IndexError: target^w, key^w, index (just an alias to key)
  • KeyError: target^w, key^w
  • LookupError: target^w, key^w
  • NameError: name, scope?
  • TypeError: unexpected_type
  • ValueError: unexpected_value^w

Attributes marked with the superscript w may need to be weak references [12] to prevent memory cycles. However, this may add unnecessary extra complexity, as noted by R. David Murray [13]. This is especially true given that builtin types do not support being weakly referenced.
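The caveat about builtin types can be checked directly with today's weakref module: instances of user-defined classes accept weak references, while instances of dict (and other builtins) do not:

```python
import weakref

class Target:
    pass

obj = Target()
wr = weakref.ref(obj)        # user-defined classes can be weakly referenced
assert wr() is obj

try:
    weakref.ref({})          # instances of builtin types cannot
    builtin_supported = True
except TypeError:
    builtin_supported = False
assert not builtin_supported
```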

TODO(skreft): expand this with examples of corner cases.

To remain backwards compatible, these new attributes will be optional and keyword-only.

It is proposed to add this information, rather than just improve the error messages, as the former would enable new debugging frameworks and tools, and would also allow switching to lazily generated messages in the future. Generated messages are discussed in [2], although they are not implemented at the moment. They would not only save some resources, but also make the messages uniform.

The stdlib will then be gradually changed to start using these new attributes.

Potential Uses

An automated tool could, for example, search for similar keys within the object, allowing it to display the following:

a = {'foo': 1}
a['fo']
KeyError: 'fo'. Did you mean 'foo'?

foo = 1
fo
NameError: global name 'fo' is not defined. Did you mean 'foo'?

See [3] for the output a TestRunner could display.
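As a sketch of such a tool, a helper based on difflib can produce the suggestion; it reads the key from exc.args[0] (today's convention), whereas the proposal would expose exc.key and exc.target directly (did_you_mean is a hypothetical helper):

```python
import difflib

def did_you_mean(exc, candidates):
    """Hypothetical helper: suggest a close match for the offending key."""
    # Today the key must be fished out of exc.args[0]; the proposal
    # would let tools read exc.key (and exc.target) directly.
    matches = difflib.get_close_matches(str(exc.args[0]), list(candidates), n=1)
    return matches[0] if matches else None

a = {'foo': 1}
try:
    a['fo']
except KeyError as e:
    suggestion = did_you_mean(e, a.keys())
assert suggestion == 'foo'
```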

Performance

Filling these new attributes would only require two extra parameters with data already available, so the impact should be marginal. However, KeyError may need special care, as the following pattern is already widespread:

try:
  a[foo] = a[foo] + 1
except KeyError:
  a[foo] = 0

Note as well that storing these objects into the error itself would allow the lazy generation of the error message, as discussed in [2].

References

[1] Python Exceptions Improved (https://www.github.com/sk-/python-exceptions-improved)
[2] ImportError needs attributes for module and file name (http://bugs.python.org/issue1559549)
[3] Enhance exceptions by attaching some more information to them (https://mail.python.org/pipermail/python-ideas/2014-February/025601.html)
[4] Specifity in AttributeError (https://mail.python.org/pipermail/python-ideas/2013-April/020308.html)
[5] Add an 'attr' attribute to AttributeError (http://bugs.python.org/issue18156)
[6] Add index attribute to IndexError (http://bugs.python.org/issue18162)
[7] Add a 'key' attribute to KeyError (http://bugs.python.org/issue18163)
[8] Add 'unexpected_type' to TypeError (http://bugs.python.org/issue18165)
[9] 'value' attribute for ValueError (http://bugs.python.org/issue18166)
[10] making builtin exceptions more informative (http://bugs.python.org/issue1182143)
[11] LookupError etc. need API to get the key (http://bugs.python.org/issue614557)
[12] weakref - Weak References (https://docs.python.org/3/library/weakref.html)
[13] Message by R. David Murray: Weak refs on exceptions? (http://bugs.python.org/issue18163#msg190791)

pep-0474 Creating forge.python.org

PEP:474
Title:Creating forge.python.org
Version:$Revision$
Last-Modified:$Date$
Author:Nick Coghlan <ncoghlan at gmail.com>
Status:Draft
Type:Process
Content-Type:text/x-rst
Created:19-Jul-2014
Post-History:19-Jul-2014, 08-Jan-2015, 01-Feb-2015

Abstract

This PEP proposes setting up a new PSF provided resource, forge.python.org, as a location for maintaining various supporting repositories (such as the repository for Python Enhancement Proposals) in a way that is more accessible to new contributors, and easier to manage for core developers.

This PEP does not propose any changes to the core development workflow for CPython itself (see PEP 462 in relation to that).

Proposal

This PEP proposes that an instance of the self-hosted Kallithea code repository management system be deployed as "forge.python.org".

Individual repositories (such as the developer guide or the PEPs repository) may then be migrated from the existing hg.python.org infrastructure to the new forge.python.org infrastructure on a case by case basis. Each migration will need to decide whether to retain a read-only mirror on hg.python.org, or whether to just migrate wholesale to the new location.

In addition to supporting read-only mirrors on hg.python.org, forge.python.org will also aim to support hosting mirrors on popular proprietary hosting sites like GitHub and BitBucket. The aim will be to allow users familiar with these sites to submit and discuss pull requests using their preferred workflow, with forge.python.org automatically bringing those contributions over to the master repository.

Given the availability and popularity of commercially backed "free for open source projects" repository hosting services, this would not be a general purpose hosting site for arbitrary Python projects. The initial focus will be specifically on CPython and other repositories currently hosted on hg.python.org. In the future, this could potentially be expanded to consolidating other PSF managed repositories that are currently externally hosted to gain access to a pull request based workflow, such as the repository for the python.org Django application. As with the initial migrations, any such future migrations would be considered on a case-by-case basis, taking into account the preferences of the primary users of each repository.

Rationale

Currently, hg.python.org hosts more than just the core CPython repository; it also hosts other repositories, such as those for the CPython developer guide and for Python Enhancement Proposals, along with various "sandbox" repositories for core developer experimentation.

While the simple "pull request" style workflow made popular by code hosting sites like GitHub and BitBucket isn't adequate for the complex branching model needed for parallel maintenance and development of the various CPython releases, it's a good fit for several of the ancillary projects that surround CPython that we don't wish to move to a proprietary hosting site.

The key requirements proposed for a PSF provided software forge are:

  • MUST support simple "pull request" style workflows
  • MUST support online editing for simple changes
  • MUST be backed by an active development organisation (community or commercial)

Additional recommended requirements that are satisfied by this proposal, but may be negotiable if a sufficiently compelling alternative is presented:

  • SHOULD support self-hosting on PSF infrastructure without ongoing fees
  • SHOULD be a fully open source application written in Python
  • SHOULD support Mercurial (for consistency with existing tooling)
  • SHOULD support Git (to provide that option to users that prefer it)
  • SHOULD allow users of git and Mercurial clients to transparently collaborate on the same repository
  • SHOULD be open to customisation to meet the needs of CPython core development, including providing a potential path forward for the proposed migration to a core reviewer model in PEP 462

The preference for self-hosting without ongoing fees rules out the free-as-in-beer providers like GitHub and BitBucket, in addition to the various proprietary source code management offerings.

The preference for Mercurial support not only rules out GitHub, but also other Git-only solutions like GitLab and Gitorious.

The hard requirement for online editing support rules out the Apache Allura/HgForge combination.

The preference for a fully open source solution rules out RhodeCode.

Of the various options considered by the author of this proposal, that leaves Kallithea SCM as the proposed foundation for a forge.python.org service.

Kallithea is a full GPLv3 application (derived from the clearly and unambiguously GPLv3 licensed components of RhodeCode) that is being developed under the auspices of the Software Freedom Conservancy. The Conservancy has affirmed that the Kallithea codebase is completely and validly licensed under GPLv3. In addition to their role in building the initial Kallithea community, the Conservancy is also the legal home of both the Mercurial and Git projects. Other SFC member projects that may be familiar to Python users include Twisted, Gevent, BuildBot and PyPy.

Intended Benefits

The primary benefit of deploying Kallithea as forge.python.org is that supporting repositories such as the developer guide and the PEP repo could potentially be managed using pull requests and online editing. This would be much simpler than the current workflow which requires PEP editors and other core developers to act as intermediaries to apply updates suggested by other users.

The richer administrative functionality would also make it substantially easier to grant users access to particular repositories for collaboration purposes, without having to grant them general access to the entire installation. This helps lower barriers to entry, as trust can more readily be granted and earned incrementally, rather than being an all-or-nothing decision around granting core developer access.

Sustaining Engineering Considerations

Even with its current workflow, CPython itself remains one of the largest open source projects in the world (in the top 2% of projects tracked on OpenHub). Unfortunately, we have been significantly less effective at encouraging contributions to the projects that make up CPython's workflow infrastructure, including ensuring that our installations track upstream, and that wherever feasible, our own customisations are contributed back to the original project.

As such, a core component of this proposal is to actively engage with the upstream Kallithea community to lower the barriers to working with and on the Kallithea SCM, as well as with the PSF Infrastructure team to ensure the forge.python.org service integrates cleanly with the PSF's infrastructure automation.

This approach aims to provide a number of key benefits:

  • allowing those of us contributing to maintenance of this service to be as productive as possible in the time we have available
  • offering a compelling professional development opportunity to those volunteers that choose to participate in maintenance of this service
  • making the Kallithea project itself more attractive to other potential users by making it as easy as possible to adopt, deploy and manage
  • as a result of the above benefits, attracting sufficient contributors both in the upstream Kallithea community, and within the CPython infrastructure community, to allow the forge.python.org service to evolve effectively to meet changing developer expectations

Some initial steps have already been taken to address these sustaining engineering concerns:

  • Tymoteusz Jankowski has been working with Donald Stufft to work out what would be involved in deploying Kallithea using the PSF's Salt based infrastructure automation.
  • Graham Dumpleton and I have been working on making it easy to deploy demonstration Kallithea instances to the free tier of Red Hat's open source hosting service, OpenShift Online. (See the comments on that post, or the quickstart issue tracker, for links to Graham's follow-on work.)

The next major step to be undertaken is to come up with a local development workflow that allows contributors on Windows, Mac OS X and Linux to run the Kallithea tests locally, without interfering with the operation of their own system. The currently planned approach for this is to focus on Vagrant, which is a popular automated virtual machine management system specifically aimed at developers running local VMs for testing purposes. The Vagrant based development guidelines for OpenShift Origin provide an extended example of the kind of workflow this approach enables. It's also worth noting that Vagrant is one of the options for working with a local build of the main python.org website.

If these workflow proposals end up working well for Kallithea, they may also be worth proposing for use by the upstream projects backing other PSF and CPython infrastructure services, including Roundup, BuildBot, and the main python.org web site.

Funding of development

As several aspects of this proposal and PEP 462 align with various workflow improvements under consideration for Red Hat's Beaker open source hardware integration testing system and other work-related projects, I have arranged to be able to devote ~1 day a week to working on CPython infrastructure projects.

Together with Rackspace's existing contributions to maintaining the pypi.python.org infrastructure, I personally believe this arrangement is indicative of a more general recognition amongst CPython redistributors and major users of the merit in helping to sustain upstream infrastructure through direct contributions of developer time, rather than expecting volunteer contributors to maintain that infrastructure entirely in their spare time or funding it indirectly through the PSF (with the additional management overhead that would entail). I consider this a positive trend, and one that I will continue to encourage as best I can.

Personal Motivation

As of March 2015, having moved from Boeing Defence Australia (where I had worked since September 1998) to Red Hat back in June 2011, I now work as a software development workflow designer and process architect, focusing on the open source cross-platform Atomic Developer Bundle, which is part of the tooling ecosystem for the Project Atomic container hosting platform. Two of the key pieces of that bundle will be familiar to many readers: Docker for container management, and Vagrant for cross-platform local development VM management.

However, rather than being a developer for the downstream Red Hat Enterprise Linux Container Development Kit, I work with the development teams for a range of Red Hat's internal services, encouraging the standardisation of internal development tooling and processes on the Atomic Developer Bundle, contributing upstream as required to ensure it meets our needs and expectations. As with other Red Hat community web service development projects like PatternFly, this approach helps enable standardisation across internal services, community projects, and commercial products, while still leaving individual development teams with significant scope to appropriately prioritise their process improvement efforts by focusing on the limitations currently causing the most difficulties for them and their users.

In that role, I'll be focusing on effectively integrating the Developer Bundle with tools and technologies used across Red Hat's project and product portfolio. As Red Hat is an open source system integrator, that means touching on a wide range of services and technologies, including GitHub, GerritHub, standalone Gerrit, GitLab, Bugzilla, JIRA, Jenkins, Docker, Kubernetes, OpenShift, OpenStack, oVirt, Ansible, Puppet, and more.

However, as noted above in the section on sustaining engineering considerations, I've also secured agreement to spend a portion of my work time on similarly applying these cross-platform tools to improving the developer experience for the maintenance of Python Software Foundation infrastructure, starting with this proposal for a Kallithea-based forge.python.org service.

Between them, my day job and my personal open source engagement have given me visibility into a lot of what the popular source code management services do well and what they do poorly. While Kallithea certainly has plenty of flaws of its own, it's the one I consider most fixable from a personal perspective, as it allows me to get directly involved in tailoring it to meet the needs of the CPython core development community in a way that wouldn't be possible with a proprietary service like GitHub or BitBucket, or practical with a PHP-based service like Phabricator or a Ruby-based service like GitLab.

Technical Concerns and Challenges

Introducing a new service into the CPython infrastructure presents a number of interesting technical concerns and challenges. This section covers several of the most significant ones.

Service hosting

The default position of this PEP is that the new forge.python.org service will be integrated into the existing PSF Salt infrastructure and hosted on the PSF's Rackspace cloud infrastructure.

However, other hosting options will also be considered, in particular, possible deployment as a Kubernetes hosted web service on either Google Container Engine or the next generation of Red Hat's OpenShift Online service, by using either GCEPersistentDisk or the open source GlusterFS distributed filesystem to hold the source code repositories.

Ongoing infrastructure maintenance

Ongoing infrastructure maintenance is an area of concern within the PSF, as we currently lack a system administrator mentorship program equivalent to the Fedora Infrastructure Apprentice or GNOME Infrastructure Apprentice programs.

Instead, systems tend to be maintained largely by developers as a part time activity on top of their development related contributions, rather than seeking to recruit folks that are more interested in operations (i.e. keeping existing systems running well) than they are in development (i.e. making changes to the services to provide new features or a better user experience, or to address existing issues).

While I'd personally like to see the PSF operating such a program at some point in the future, I don't consider setting one up to be a feasible near term goal. However, I do consider it feasible to continue laying the groundwork for such a program by extending the PSF's existing usage of modern infrastructure technologies like OpenStack and Salt to cover more services, as well as starting to explore the potential benefits of containers and container platforms when it comes to maintaining and enhancing PSF provided services.

I also plan to look into the question of whether or not an open source cloud management platform like ManageIQ may help us bring our emerging "cloud sprawl" problem across Rackspace, Google, Amazon and other services more under control.

User account management

Ideally we'd like to be able to offer a single account that spans all python.org services, including Kallithea, Roundup/Rietveld, PyPI and the back end for the new python.org site, but actually implementing that would be a distinct infrastructure project, independent of this PEP. (It's also worth noting that the fine-grained control of ACLs offered by such a capability is a prerequisite for setting up an effective system administrator mentorship program.)

For the initial rollout of forge.python.org, we will likely create yet another identity silo within the PSF infrastructure. A potentially superior alternative would be to add support for python-social-auth to Kallithea, but actually doing so would not be a requirement for the initial rollout of the service (the main technical concern there is that Kallithea is a Pylons application that has not yet been ported to Pyramid, so integration will require either adding a Pylons backend to python-social-auth, or else embarking on the Pyramid migration in Kallithea).

Integration with Roundup

Kallithea provides configurable issue tracker integration. This will need to be set up appropriately to integrate with the Roundup issue tracker at bugs.python.org before the initial rollout of the forge.python.org service.
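
The shape of that configuration can be sketched as an ini fragment (a sketch only; the key names follow Kallithea's example configuration file, but the exact names and patterns should be checked against the deployed version):

```ini
## Hypothetical fragment of the Kallithea .ini file: turn "#NNNN" references
## in commit messages and review comments into links to the matching
## Roundup issues on bugs.python.org.
issue_pat = #(\d+)
issue_server_link = https://bugs.python.org/issue{id}
issue_prefix = #
```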

Accepting pull requests on GitHub and BitBucket

The initial rollout of forge.python.org would support publication of read-only mirrors, both on hg.python.org and other services, as that is a relatively straightforward operation that can be implemented in a commit hook.
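
The commit-hook approach can be sketched with a standard Mercurial changegroup hook (the hook mechanism is real Mercurial configuration; the repository URL is purely an illustrative placeholder):

```ini
# .hg/hgrc on the master repository (URL is an illustrative placeholder)
[hooks]
# After each pushed changegroup, forward the new changesets to the mirror.
changegroup.mirror = hg push -f ssh://hg@hg.python.org/peps
```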

While a highly desirable feature, accepting pull requests on external services, and mirroring them as submissions to the master repositories on forge.python.org is a more complex problem, and would likely not be included as part of the initial rollout of the forge.python.org service.

Transparent Git and Mercurial interoperability

Kallithea's native support for both Git and Mercurial offers an opportunity to make it relatively straightforward for developers to use the client of their choice to interact with repositories hosted on forge.python.org.

This transparent interoperability does not exist yet, but running our own multi-VCS repository hosting service provides the opportunity to make this capability a reality, rather than passively waiting for a proprietary provider to deign to provide a feature that likely isn't in their commercial interest. There's a significant misalignment of incentives between open source communities and commercial providers in this particular area, as even though offering VCS client choice can significantly reduce community friction by eliminating the need for projects to make autocratic decisions that force particular tooling choices on potential contributors, top down enforcement of tool selection (regardless of developer preference) is currently still the norm in the corporate and other organisational environments that produce GitHub and Atlassian's paying customers.

Prior to acceptance, in the absence of transparent interoperability, this PEP should propose specific recommendations for inclusion in the CPython developer's guide section for git users for creating pull requests against forge.python.org hosted Mercurial repositories.

Pilot Objectives and Timeline

This proposal is part of Brett Cannon's current evaluation of improvement proposals for various aspects of the CPython development workflow. Key dates in that timeline are:

  • Feb 1: Draft proposal published (for Kallithea, this PEP)
  • Apr 8: Discussion of final proposals at Python Language Summit
  • May 1: Brett's decision on which proposal to accept
  • Sep 13: Python 3.5 released, adopting new workflows for Python 3.6

If this proposal is selected for further development, it is proposed to start with the rollout of the following pilot deployment:

  • a reference implementation operational at kallithea-pilot.python.org, containing at least the developer guide and PEP repositories. This will be a "throwaway" instance, allowing core developers and other contributors to experiment freely without worrying about the long term consequences for the repository history.
  • read-only live mirrors of the Kallithea hosted repositories on GitHub and BitBucket. As with the pilot service itself, these would be temporary repos, to be discarded after the pilot period ends.
  • clear documentation on using those mirrors to create pull requests against Kallithea hosted Mercurial repositories (for the pilot, this will likely not include using the native pull request workflows of those hosted services)
  • automatic linking of issue references in code review comments and commit messages to the corresponding issues on bugs.python.org
  • draft updates to PEP 1 explaining the Kallithea based PEP editing and submission workflow

The following items would be needed for a production migration, but there doesn't appear to be an obvious way to trial an updated implementation as part of the pilot:

  • adjusting the PEP publication process and the developer guide publication process to be based on the relocated Mercurial repos

The following items would be objectives of the overall workflow improvement process, but are considered "desirable, but not essential" for the initial adoption of the new service in September (if this proposal is the one selected and the proposed pilot deployment is successful):

  • allowing the use of python-social-auth to authenticate against the PSF hosted Kallithea instance
  • allowing the use of the GitHub and BitBucket pull request workflows to submit pull requests to the main Kallithea repo
  • allowing easy triggering of forced BuildBot runs based on Kallithea hosted repos and pull requests (prior to the implementation of PEP 462, this would be intended for use with sandbox repos rather than the main CPython repo)

Future Implications for CPython Core Development

The workflow requirements for the main CPython development repository are significantly more complex than those for the repositories being discussed in this PEP. These concerns are covered in more detail in PEP 462.

Given Guido's recommendation to replace Rietveld with a more actively maintained code review system, my current plan is to rewrite that PEP to use Kallithea as the proposed glue layer, with enhanced Kallithea pull requests eventually replacing the current practice of uploading patch files directly to the issue tracker.

I've also started working with Pierre-Yves David on a custom Mercurial extension that automates some aspects of the CPython core development workflow.

pep-0475 Retry system calls failing with EINTR

PEP:475
Title:Retry system calls failing with EINTR
Version:$Revision$
Last-Modified:$Date$
Author:Charles-François Natali <cf.natali at gmail.com>, Victor Stinner <victor.stinner at gmail.com>
BDFL-Delegate:Antoine Pitrou <solipsis@pitrou.net>
Status:Final
Type:Standards Track
Content-Type:text/x-rst
Created:29-July-2014
Python-Version:3.5
Resolution:https://mail.python.org/pipermail/python-dev/2015-February/138018.html

Abstract

System call wrappers provided in the standard library should be retried automatically when they fail with EINTR, to relieve application code from the burden of doing so.

By system calls, we mean the functions exposed by the standard C library pertaining to I/O or handling of other system resources.

Rationale

Interrupted system calls

On POSIX systems, signals are common. Code calling system calls must be prepared to handle them. Examples of signals:

  • The most common signal is SIGINT, the signal sent when CTRL+c is pressed. By default, Python raises a KeyboardInterrupt exception when this signal is received.
  • When running subprocesses, the SIGCHLD signal is sent when a child process exits.
  • Resizing the terminal sends the SIGWINCH signal to the applications running in the terminal.
  • Putting the application in background (ex: press CTRL-z and then type the bg command) sends the SIGCONT signal.

Writing a C signal handler is difficult: only "async-signal-safe" functions can be called (for example, printf() and malloc() are not async-signal safe), and there are issues with reentrancy. Therefore, when a signal is received by a process during the execution of a system call, the system call can fail with the EINTR error to give the program an opportunity to handle the signal without the restriction on signal-safe functions.

This behaviour is system-dependent: on certain systems, using the SA_RESTART flag, some system calls are retried automatically instead of failing with EINTR. Regardless, Python's signal.signal() function clears the SA_RESTART flag when setting the signal handler: all system calls will probably fail with EINTR in Python.

Since receiving a signal is a non-exceptional occurrence, robust POSIX code must be prepared to handle EINTR (which, in most cases, means retry in a loop in the hope that the call eventually succeeds). Without special support from Python, this can make application code much more verbose than it needs to be.

Status in Python 3.4

In Python 3.4, handling the InterruptedError exception (EINTR's dedicated exception class) is duplicated at every call site on a case-by-case basis. Only a few Python modules actually handle this exception, and fixes have usually taken several years to cover a whole module. Example of code retrying file.read() on InterruptedError:

while True:
    try:
        data = file.read(size)
        break
    except InterruptedError:
        continue
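
That boilerplate can of course be factored out; a minimal sketch of such a helper follows (retry_on_eintr is a hypothetical name, not a stdlib API):

```python
import functools

def retry_on_eintr(func):
    """Retry func until it completes without being interrupted by a signal.

    Hypothetical helper illustrating the boilerplate PEP 475 removes;
    this is not a standard library API.
    """
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        while True:
            try:
                return func(*args, **kwargs)
            except InterruptedError:
                continue  # a signal arrived; its handler ran, so try again
    return wrapper
```

Call sites then shrink to `data = retry_on_eintr(file.read)(size)`, but PEP 475 removes even that wrapper by retrying inside the standard library itself.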

List of Python modules in the standard library which handle InterruptedError:

  • asyncio
  • asyncore
  • io, _pyio
  • multiprocessing
  • selectors
  • socket
  • socketserver
  • subprocess

Other programming languages like Perl, Java and Go retry system calls failing with EINTR at a lower level, so that libraries and applications needn't bother.

Use Case 1: Don't Bother With Signals

In most cases, you don't want to be interrupted by signals and you don't expect to get InterruptedError exceptions. For example, do you really want to write such complex code for a "Hello World" example?

while True:
    try:
        print("Hello World")
        break
    except InterruptedError:
        continue

InterruptedError can happen in unexpected places. For example, os.close() and FileIO.close() may raise InterruptedError: see the article close() and EINTR.

The Python issues related to EINTR section below gives examples of bugs caused by EINTR.

The expectation in this use case is that Python hides the InterruptedError and retries system calls automatically.

Use Case 2: Be notified of signals as soon as possible

Sometimes, however, you expect some signals and you want to handle them as soon as possible. For example, you may want to immediately quit a program using the CTRL+c keyboard shortcut.

Besides, some signals are not interesting and should not disrupt the application. There are two options to interrupt an application on only some signals:

  • Set up a custom signal handler which raises an exception, such as KeyboardInterrupt for SIGINT.
  • Use an I/O multiplexing function like select() together with Python's signal wakeup file descriptor: see the function signal.set_wakeup_fd().

The expectation in this use case is for the Python signal handler to be executed in a timely manner, and for the system call to fail if the handler raised an exception -- otherwise, to be restarted.
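
On a POSIX system, the first option can be demonstrated with SIGALRM: because the handler raises an exception, the interrupted sleep fails immediately instead of being restarted (a minimal sketch; the timings are illustrative):

```python
import signal
import time

def handler(signum, frame):
    # Raising here makes the interrupted system call fail with this exception.
    raise KeyboardInterrupt

signal.signal(signal.SIGALRM, handler)
signal.setitimer(signal.ITIMER_REAL, 0.1)  # deliver SIGALRM after 100 ms

start = time.monotonic()
try:
    time.sleep(10)  # would block for 10 seconds if not interrupted
    interrupted = False
except KeyboardInterrupt:
    interrupted = True
elapsed = time.monotonic() - start
# The sleep was aborted well before its 10 second timeout.
```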

Proposal

This PEP proposes to handle EINTR and retries at the lowest level, i.e. in the wrappers provided by the stdlib (as opposed to higher-level libraries and applications).

Specifically, when a system call fails with EINTR, its Python wrapper must call the given signal handler (using PyErr_CheckSignals()). If the signal handler raises an exception, the Python wrapper bails out and fails with the exception.

If the signal handler returns successfully, the Python wrapper retries the system call automatically. If the system call involves a timeout parameter, the timeout is recomputed.
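
On Python 3.5+ (POSIX), both rules are observable with time.sleep(): the signal handler runs mid-sleep, and because it returns normally the sleep is resumed with a recomputed timeout (a minimal sketch; the timings are illustrative):

```python
import signal
import time

received = []
signal.signal(signal.SIGALRM, lambda signum, frame: received.append(signum))
signal.setitimer(signal.ITIMER_REAL, 0.1)  # interrupt the sleep after 100 ms

start = time.monotonic()
time.sleep(0.5)  # interrupted once, handler runs, sleep automatically resumed
elapsed = time.monotonic() - start
# The handler ran, yet the full 0.5 second timeout was still honoured.
```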

Modified functions

Examples of standard library functions that need to be modified to comply with this PEP:

  • open(), os.open(), io.open()
  • functions of the faulthandler module
  • os functions:
    • os.fchdir()
    • os.fchmod()
    • os.fchown()
    • os.fdatasync()
    • os.fstat()
    • os.fstatvfs()
    • os.fsync()
    • os.ftruncate()
    • os.mkfifo()
    • os.mknod()
    • os.posix_fadvise()
    • os.posix_fallocate()
    • os.pread()
    • os.pwrite()
    • os.read()
    • os.readv()
    • os.sendfile()
    • os.wait3()
    • os.wait4()
    • os.wait()
    • os.waitid()
    • os.waitpid()
    • os.write()
    • os.writev()
    • special cases: os.close() and os.dup2() now ignore the EINTR error; the syscall is not retried
  • select.select(), select.poll.poll(), select.epoll.poll(), select.kqueue.control(), select.devpoll.poll()
  • socket.socket() methods:
    • accept()
    • connect() (except for non-blocking sockets)
    • recv()
    • recvfrom()
    • recvmsg()
    • send()
    • sendall()
    • sendmsg()
    • sendto()
  • signal.sigtimedwait(), signal.sigwaitinfo()
  • time.sleep()

(Note: the selectors module already retries on InterruptedError, but it doesn't recompute the timeout yet)

os.close(), the close() methods and os.dup2() are a special case: they will ignore EINTR instead of retrying. The reason is complex, but involves behaviour under Linux and the fact that the file descriptor may really be closed even if EINTR is returned.

The socket.socket.connect() method does not retry connect() for non-blocking sockets if it is interrupted by a signal (i.e. fails with EINTR); the connection then continues asynchronously in the background. The caller is responsible for waiting until the socket becomes writable (e.g. using select.select()) and then calling socket.socket.getsockopt(socket.SOL_SOCKET, socket.SO_ERROR) to check whether the connection succeeded (getsockopt() returns 0) or failed.
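
That caller-side pattern can be sketched as follows (finish_nonblocking_connect is a hypothetical helper name, not a stdlib API):

```python
import os
import select
import socket

def finish_nonblocking_connect(sock, addr, timeout=5.0):
    """Complete a non-blocking connect using the pattern described above."""
    try:
        sock.connect(addr)
    except BlockingIOError:
        pass  # EINPROGRESS: the connection continues in the background
    # Wait until the socket becomes writable, then check SO_ERROR.
    _, writable, _ = select.select([], [sock], [], timeout)
    if not writable:
        raise TimeoutError("connect timed out")
    err = sock.getsockopt(socket.SOL_SOCKET, socket.SO_ERROR)
    if err != 0:
        raise OSError(err, os.strerror(err))
```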

InterruptedError handling

Since interrupted system calls are automatically retried, the InterruptedError exception should not occur anymore when calling those system calls. Therefore, manual handling of InterruptedError as described in Status in Python 3.4 can be removed, which will simplify standard library code.

Backward compatibility

Applications relying on the fact that system calls are interrupted with InterruptedError will hang. The authors of this PEP don't think that such applications exist, since they would be exposed to other issues such as race conditions (there is an opportunity for deadlock if the signal comes before the system call). Besides, such code would be non-portable.

In any case, those applications must be fixed to handle signals differently, to have a reliable behaviour on all platforms and all Python versions. A possible strategy is to set up a signal handler raising a well-defined exception, or use a wakeup file descriptor.

For applications using event loops, signal.set_wakeup_fd() is the recommended option for handling signals. Python's low-level signal handler will write signal numbers into the file descriptor and the event loop will be woken up to read them. The event loop can handle those signals without the restrictions on signal handlers (for example, the loop can be woken up from any thread, not just the main thread).

Appendix

Wakeup file descriptor

Since Python 3.3, signal.set_wakeup_fd() writes the signal number into the file descriptor, whereas it only wrote a null byte before. It becomes possible to distinguish between signals using the wakeup file descriptor.
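
A minimal POSIX sketch of reading a signal number from the wakeup file descriptor (note the non-blocking requirement on the fd, which dates from Python 3.5):

```python
import os
import signal

r, w = os.pipe()
os.set_blocking(w, False)        # the wakeup fd must be non-blocking
signal.set_wakeup_fd(w)
# A Python-level handler must be installed for the signal to reach
# Python's C signal handler (which performs the wakeup write).
signal.signal(signal.SIGUSR1, lambda signum, frame: None)

os.kill(os.getpid(), signal.SIGUSR1)

data = os.read(r, 16)            # one byte per received signal: its number
signal.set_wakeup_fd(-1)         # restore the default behaviour
```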

Linux has a signalfd() system call which provides more information on each signal. For example, it's possible to know the pid and uid who sent the signal. This function is not exposed in Python yet (see issue 12304).

On Unix, the asyncio module uses the wakeup file descriptor to wake up its event loop.

Multithreading

A C signal handler can be called from any thread, but Python signal handlers will always be called in the main Python thread.

Python's C API provides the PyErr_SetInterrupt() function which calls the SIGINT signal handler in order to interrupt the main Python thread.

Signals on Windows

Control events

Windows uses "control events":

  • CTRL_BREAK_EVENT: Break (SIGBREAK)
  • CTRL_CLOSE_EVENT: Close event
  • CTRL_C_EVENT: CTRL+C (SIGINT)
  • CTRL_LOGOFF_EVENT: Logoff
  • CTRL_SHUTDOWN_EVENT: Shutdown

The SetConsoleCtrlHandler() function can be used to install a control handler.

The CTRL_C_EVENT and CTRL_BREAK_EVENT events can be sent to a process using the GenerateConsoleCtrlEvent() function. This function is exposed in Python as os.kill().

Signals

The following signals are supported on Windows:

  • SIGABRT
  • SIGBREAK (CTRL_BREAK_EVENT): signal only available on Windows
  • SIGFPE
  • SIGILL
  • SIGINT (CTRL_C_EVENT)
  • SIGSEGV
  • SIGTERM

SIGINT

The default Python signal handler for SIGINT sets a Windows event object: sigint_event.

time.sleep() is implemented with WaitForSingleObjectEx(): it waits on the sigint_event object, using the time.sleep() parameter as the timeout, so the sleep can be interrupted by SIGINT.

_winapi.WaitForMultipleObjects() automatically adds sigint_event to the list of watched handles, so it can also be interrupted.

PyOS_StdioReadline() also uses sigint_event when fgets() fails, to check whether Ctrl-C or Ctrl-Z was pressed.

Implementation

The implementation is tracked in issue 23285. It was committed on February 07, 2015.

pep-0476 Enabling certificate verification by default for stdlib http clients

PEP:476
Title:Enabling certificate verification by default for stdlib http clients
Version:$Revision$
Last-Modified:$Date$
Author:Alex Gaynor <alex.gaynor at gmail.com>
Status:Final
Type:Standards Track
Content-Type:text/x-rst
Created:28-August-2014

Abstract

Currently, when a standard library HTTP client (the urllib, urllib2, http, and httplib modules) encounters an https:// URL it will wrap the network HTTP traffic in a TLS stream, as is necessary to communicate with such a server. However, during the TLS handshake it will not actually check that the server's X509 certificate is signed by a CA in any trust root, nor will it verify that the Common Name (or Subject Alternate Name) on the presented certificate matches the requested host.

The failure to do these checks means that anyone with a privileged network position can trivially execute a man-in-the-middle attack against a Python application using any of these HTTP clients, and change traffic at will.

This PEP proposes to enable verification of X509 certificate signatures, as well as hostname verification for Python's HTTP clients by default, subject to opt-out on a per-call basis. This change would be applied to Python 2.7, Python 3.4, and Python 3.5.

Rationale

The "S" in "HTTPS" stands for secure. When Python's users type "HTTPS" they are expecting a secure connection, and Python should adhere to a reasonable standard of care in delivering this. Currently we are failing at this, and in doing so, APIs which appear simple are misleading users.

When asked, many Python users state that they were not aware that Python failed to perform these validations, and are shocked.

The popularity of requests (which enables these checks by default) demonstrates that these checks are not overly burdensome in any way, and the fact that it is widely recommended as a major security improvement over the standard library clients demonstrates that many expect a higher standard for "security by default" from their tools.

The failure of various applications to note Python's negligence in this matter is a source of regular CVE assignment [1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11].

[1]https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2010-4340
[2]https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2012-3533
[3]https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2012-5822
[4]https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2012-5825
[5]https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2013-1909
[6]https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2013-2037
[7]https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2013-2073
[8]https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2013-2191
[9]https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2013-4111
[10]https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2013-6396
[11]https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2013-6444

Technical Details

Python would use the system provided certificate database on all platforms. Failure to locate such a database would be an error, and users would need to explicitly specify a location to fix it.

This will be achieved by adding a new ssl._create_default_https_context function, which is the same as ssl.create_default_context.

http.client can then replace its usage of ssl._create_stdlib_context with the ssl._create_default_https_context.

Additionally ssl._create_stdlib_context is renamed ssl._create_unverified_context (an alias is kept around for backwards compatibility reasons).
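
The difference between the verified and unverified context factories can be checked directly:

```python
import ssl

# The default HTTPS context verifies certificate chains and hostnames...
verified = ssl.create_default_context()

# ...while the private opt-out helper disables both checks.
unverified = ssl._create_unverified_context()
```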

Trust database

This PEP proposes using the system-provided certificate database. Previous discussions have suggested bundling Mozilla's certificate database and using that by default. This was decided against for several reasons:

  • Using the platform trust database imposes a lower maintenance burden on the Python developers -- shipping our own trust database would require doing a release every time a certificate was revoked.
  • Linux vendors, and other downstreams, would unbundle the Mozilla certificates, resulting in a more fragmented set of behaviors.
  • Using the platform stores makes it easier to handle situations such as corporate internal CAs.

OpenSSL also has a pair of environment variables, SSL_CERT_DIR and SSL_CERT_FILE which can be used to point Python at a different certificate database.
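
The locations and environment variable names OpenSSL actually uses for a given build can be queried via ssl.get_default_verify_paths():

```python
import ssl

# Reports where this build of Python/OpenSSL looks for its trust database,
# and which environment variables override those locations.
paths = ssl.get_default_verify_paths()
```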

Backwards compatibility

This change will have the appearance of causing some HTTPS connections to "break", because they will now raise an exception during the handshake.

This is misleading, however: in fact these connections are presently failing silently. An HTTPS URL indicates an expectation of confidentiality and authentication, so the fact that Python does not actually verify the user's request is a bug. Further: "Errors should never pass silently."

Nevertheless, users who have a need to access servers with self-signed or incorrect certificates would be able to do so by providing a context with custom trust roots or which disables validation (documentation should strongly recommend the former where possible). Users will also be able to add necessary certificates to system trust stores in order to trust them globally.

Twisted's 14.0 release made this same change, and it has been met with almost no opposition.

Opting out

Users who wish to opt out of certificate verification on a single connection can do so by providing the context argument to urllib.request.urlopen:

import ssl
import urllib.request

# This restores the same behavior as before.
context = ssl._create_unverified_context()
urllib.request.urlopen("https://no-valid-cert", context=context)

It is also possible, though highly discouraged, to globally disable verification by monkeypatching the ssl module in versions of Python that implement this PEP:

import ssl

try:
    _create_unverified_https_context = ssl._create_unverified_context
except AttributeError:
    # Legacy Python that doesn't verify HTTPS certificates by default
    pass
else:
    # Handle target environment that doesn't support HTTPS verification
    ssl._create_default_https_context = _create_unverified_https_context

This guidance is aimed primarily at system administrators who wish to adopt newer versions of Python that implement this PEP in legacy environments that do not yet support certificate verification on HTTPS connections. For example, an administrator may opt out by adding the monkeypatch above to sitecustomize.py in their Standard Operating Environment for Python. Applications and libraries SHOULD NOT make this change process-wide (except perhaps in response to a system-administrator-controlled configuration setting).

Particularly security sensitive applications should always provide an explicit application defined SSL context rather than relying on the default behaviour of the underlying Python implementation.

Other protocols

This PEP only proposes requiring this level of validation for HTTP clients, not for other protocols such as SMTP.

This is because while a high percentage of HTTPS servers have correct certificates, as a result of the validation performed by browsers, for other protocols self-signed or otherwise incorrect certificates are far more common. Note that for SMTP at least, this appears to be changing and should be reviewed for a potential similar PEP in the future.

Python Versions

This PEP describes changes that will occur on the 3.4.x, 3.5, and 2.7.x branches. For 2.7.x this will require backporting the context (SSLContext) argument to httplib, in addition to the features already backported in PEP 466.

Implementation

  • LANDED: Issue 22366 adds the context argument to urllib.request.urlopen.
  • Issue 22417 implements the substance of this PEP.

pep-0477 Backport ensurepip (PEP 453) to Python 2.7

PEP:477
Title:Backport ensurepip (PEP 453) to Python 2.7
Version:$Revision$
Last-Modified:$Date$
Author:Donald Stufft <donald at stufft.io>, Nick Coghlan <ncoghlan at gmail.com>
BDFL-Delegate:Benjamin Peterson <benjamin@python.org>
Status:Final
Type:Standards Track
Content-Type:text/x-rst
Created:26-Aug-2014
Post-History:1-Sep-2014
Resolution:https://mail.python.org/pipermail/python-dev/2014-September/136238.html

Abstract

This PEP proposes that the ensurepip module, added to Python 3.4 by PEP 453, be backported to Python 2.7. It also proposes that automatic invocation of ensurepip be added to the Python 2.7 Windows and OSX installers. However, it does not propose that automatic invocation be added to the Makefile.

It also proposes that the documentation changes for the package distribution and installation guides be updated to match that in 3.4, which references using the ensurepip module to bootstrap the installer.

Rationale

Python 2.7 is effectively an LTS release of Python which represents the end of the 2.x series, and there is still a very large contingent of users who are still using Python 2.7 as their primary version. These users, in order to participate in the wider Python ecosystem, must manually go out and find the correct way to bootstrap the packaging tools.

It is the opinion of this PEP that making it as easy as possible for end users to participate in the wider Python ecosystem is important for four primary reasons:

  1. The Python 2.x to 3.x migration has a number of pain points that are eased by third party modules such as six [1], modernize [2], or future [3]. However, relying on these tools requires that everyone who uses the project have a way to install these packages.
  2. In addition to tooling that aids migration from Python 2.x to 3.x, there are also a number of modules that are new in Python 3 for which backports are available on PyPI. These backports also help people write 2.x- and 3.x-compatible software, and enable them to use some of the newer features of Python 3 on Python 2.
  3. Users will also need a number of tools in order to create Python packages that conform to the newer standards that are being proposed. Things like setuptools [4], Wheel [5], and twine [6] are enabling a safer, faster, and more reliable packaging tool chain. These tools can be difficult for people to use if they must first be told how to go out and install the package manager.
  4. One of Python's biggest strengths is the huge ecosystem of libraries and projects that have been built on top of it, most of which are distributed through PyPI. However, benefiting meaningfully from this wide ecosystem requires end users, some of whom are going to be new, to decide which package manager they should get, learn how to get it, and then actually install it first.

Furthermore, alternative implementations of Python are recognizing the benefits of PEP 453 and both PyPy and Jython have plans to backport ensurepip to their 2.7 runtimes.

Automatic Invocation

PEP 453 has ensurepip automatically invoked by default in the Makefile and the Windows and OSX installers. This allowed it to ensure that, by default, all users would get Python with pip already installed. This PEP, however, holds that while this is fine for the Python 2.7 Windows and Mac OS X installers, it is not appropriate for the Python 2.7 Makefile in general.

The primary consumers of the Makefile are downstream package managers which distribute Python themselves. These downstream distributors typically do not want pip to be installed via ensurepip and would prefer that end users install it with their own package manager. Not invoking ensurepip automatically from the Makefile would allow these distributors to simply ignore the fact that ensurepip has been backported and still not end up with pip installed via it.

The primary consumers of the OSX and Windows installers are end users who are attempting to install Python on their own machine. These users have no package manager through which they could install pip into their Python via a more supported mechanism. For this reason it is the belief of this PEP that installing by default on OSX and Windows is the best course of action.

Documentation

As part of this PEP, the updated packaging distribution and installation guides for Python 3.4 would be backported to Python 2.7.

Disabling ensurepip by Downstream Distributors

Due to its use in the venv module, downstream distributors cannot disable the ensurepip module in Python 3.4. However, since Python 2.7 has no venv module, downstream distributors are explicitly allowed to patch the ensurepip module to prevent it from installing anything.

If a downstream distributor wishes to disable ensurepip completely in Python 2.7, they should still at least provide the module and allow python -m ensurepip style invocation. However, it should raise an error or otherwise exit with a non-zero exit code, and print a message on stderr directing users to what they can/should use instead of ensurepip.

pep-0478 Python 3.5 Release Schedule

PEP:478
Title:Python 3.5 Release Schedule
Version:$Revision$
Last-Modified:$Date$
Author:Larry Hastings <larry at hastings.org>
Status:Active
Type:Informational
Content-Type:text/x-rst
Created:22-Sep-2014
Python-Version:3.5

Abstract

This document describes the development and release schedule for Python 3.5. The schedule primarily concerns itself with PEP-sized items.

Release Manager and Crew

  • 3.5 Release Manager: Larry Hastings
  • Windows installers: Steve Dower
  • Mac installers: Ned Deily
  • Documentation: Georg Brandl

Release Schedule

The releases:

  • 3.5.0 alpha 1: February 8, 2015
  • 3.5.0 alpha 2: March 9, 2015
  • 3.5.0 alpha 3: March 29, 2015
  • 3.5.0 alpha 4: April 19, 2015
  • 3.5.0 beta 1: May 24, 2015
  • 3.5.0 beta 2: May 31, 2015
  • 3.5.0 beta 3: July 5, 2015
  • 3.5.0 beta 4: July 26, 2015
  • 3.5.0 candidate 1: August 9, 2015
  • 3.5.0 candidate 2: August 23, 2015
  • 3.5.0 candidate 3: September 6, 2015
  • 3.5.0 final: September 13, 2015

(Beta 1 is also "feature freeze"--no new features beyond this point.)

Features for 3.5

Implemented / Final PEPs:

  • PEP 465, a new matrix multiplication operator
  • PEP 461, %-formatting for binary strings
  • PEP 471, os.scandir()
  • PEP 479, change StopIteration handling inside generators
  • PEP 441, improved Python zip application support
  • PEP 448, additional unpacking generalizations
  • PEP 486, make the Python Launcher aware of virtual environments
  • PEP 475, retrying system calls that fail with EINTR
  • PEP 492, coroutines with async and await syntax
  • PEP 488, elimination of PYO files
  • PEP 484, type hints
  • PEP 489, redesigning extension module loading
  • PEP 485, math.isclose(), a function for testing approximate equality

Proposed changes for 3.5:

  • PEP 431, improved support for time zone databases
  • PEP 432, simplifying Python's startup sequence
  • PEP 436, a build tool generating boilerplate for extension modules
  • PEP 447, support for __locallookup__ metaclass method
  • PEP 455, key transforming dictionary
  • PEP 468, preserving the order of **kwargs in a function

pep-0479 Change StopIteration handling inside generators

PEP:479
Title:Change StopIteration handling inside generators
Version:$Revision$
Last-Modified:$Date$
Author:Chris Angelico <rosuav at gmail.com>, Guido van Rossum <guido at python.org>
Status:Final
Type:Standards Track
Content-Type:text/x-rst
Created:15-Nov-2014
Python-Version:3.5
Post-History:15-Nov-2014, 19-Nov-2014, 5-Dec-2014

Abstract

This PEP proposes a change to generators: when StopIteration is raised inside a generator, it is replaced with RuntimeError. (More precisely, this happens when the exception is about to bubble out of the generator's stack frame.) Because the change is backwards incompatible, the feature is initially introduced using a __future__ statement.

Acceptance

This PEP was accepted by the BDFL on November 22. Because of the exceptionally short period from first draft to acceptance, the main objections brought up after acceptance were carefully considered and have been reflected in the "Alternate proposals" section below. However, none of the discussion changed the BDFL's mind and the PEP's acceptance is now final. (Suggestions for clarifying edits are still welcome -- unlike IETF RFCs, the text of a PEP is not cast in stone after its acceptance, although the core design/plan/specification should not change after acceptance.)

Rationale

The interaction of generators and StopIteration is currently somewhat surprising, and can conceal obscure bugs. An unexpected exception should not result in subtly altered behaviour, but should cause a noisy and easily-debugged traceback. Currently, StopIteration raised accidentally inside a generator function will be interpreted as the end of the iteration by the loop construct driving the generator.

The main goal of the proposal is to ease debugging in the situation where an unguarded next() call (perhaps several stack frames deep) raises StopIteration and causes the iteration controlled by the generator to terminate silently. (Whereas, when some other exception is raised, a traceback is printed pinpointing the cause of the problem.)

This is particularly pernicious in combination with the yield from construct of PEP 380 [1], as it breaks the abstraction that a subgenerator may be factored out of a generator. That PEP acknowledges this limitation, but notes that "use cases for these [are] rare to non-existent". Unfortunately while intentional use is rare, it is easy to stumble on these cases by accident:

import contextlib

@contextlib.contextmanager
def transaction():
    print('begin')
    try:
        yield from do_it()
    except:
        print('rollback')
        raise
    else:
        print('commit')

def do_it():
    print('Refactored initial setup')
    yield # Body of with-statement is executed here
    print('Refactored finalization of successful transaction')

def gene():
    for i in range(2):
        with transaction():
            yield i
            # return
            raise StopIteration  # This is wrong
        print('Should not be reached')

for i in gene():
    print('main: i =', i)

Here factoring out do_it into a subgenerator has introduced a subtle bug: if the wrapped block raises StopIteration, under the current behavior this exception will be swallowed by the context manager; and, worse, the finalization is silently skipped! Similarly problematic behavior occurs when an asyncio coroutine raises StopIteration, causing it to terminate silently, or when next is used to take the first result from an iterator that unexpectedly turns out to be empty, for example:

# using the same context manager as above
import pathlib

with transaction():
    print('commit file {}'.format(
        # I can never remember what the README extension is
        next(pathlib.Path('/some/dir').glob('README*'))))

In both cases, the refactoring abstraction of yield from breaks in the presence of bugs in client code.

Additionally, the proposal reduces the difference between list comprehensions and generator expressions, preventing surprises such as the one that started this discussion [2]. Henceforth, the following statements will produce the same result if either produces a result at all:

a = list(F(x) for x in xs if P(x))
a = [F(x) for x in xs if P(x)]

With the current state of affairs, it is possible to write a function F(x) or a predicate P(x) that causes the first form to produce a (truncated) result, while the second form raises an exception (namely, StopIteration). With the proposed change, both forms will raise an exception at this point (albeit RuntimeError in the first case and StopIteration in the second).
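A minimal, self-contained demonstration of this divergence, with the identity in place of F(x) and a concrete buggy predicate for P(x), run under the post-transition semantics (Python 3.7+, where the new behaviour is enabled everywhere):

```python
def P(x):
    if x == 3:
        raise StopIteration  # a buggy predicate
    return True

# Generator expression: the StopIteration escaping the generator frame
# is converted to RuntimeError.
try:
    list(x for x in range(5) if P(x))
except RuntimeError as e:
    print("genexp:", type(e).__name__)    # genexp: RuntimeError

# List comprehension: no generator frame, so StopIteration propagates.
try:
    [x for x in range(5) if P(x)]
except StopIteration as e:
    print("listcomp:", type(e).__name__)  # listcomp: StopIteration
```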

Finally, the proposal also clears up the confusion about how to terminate a generator: the proper way is return, not raise StopIteration.

As an added bonus, the above changes bring generator functions much more in line with regular functions. If you wish to take a piece of code presented as a generator and turn it into something else, you can usually do this fairly simply, by replacing every yield with a call to print() or list.append(); however, if there are any bare next() calls in the code, you have to be aware of them. If the code was originally written without relying on StopIteration terminating the function, the transformation would be that much easier.

Background information

When a generator frame is (re)started as a result of a __next__() (or send() or throw()) call, one of three outcomes can occur:

  • A yield point is reached, and the yielded value is returned.
  • The frame is returned from; StopIteration is raised.
  • An exception is raised, which bubbles out.

In the latter two cases the frame is abandoned (and the generator object's gi_frame attribute is set to None).

Proposal

If a StopIteration is about to bubble out of a generator frame, it is replaced with RuntimeError, which causes the next() call (which invoked the generator) to fail, passing that exception out. From then on it's just like any old exception. [3]

This affects the third outcome listed above, without altering any other effects. Furthermore, it only affects this outcome when the exception raised is StopIteration (or a subclass thereof).

Note that the proposed replacement happens at the point where the exception is about to bubble out of the frame, i.e. after any except or finally blocks that could affect it have been exited. The StopIteration raised by returning from the frame is not affected (the point being that StopIteration means that the generator terminated "normally", i.e. it did not raise an exception).

A subtle issue is what will happen if the caller, having caught the RuntimeError, calls the generator object's __next__() method again. The answer is that from this point on it will raise StopIteration -- the behavior is the same as when any other exception was raised by the generator.
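A short sketch of this behaviour under the new semantics (Python 3.7+):

```python
def g():
    raise StopIteration  # converted at the frame boundary
    yield  # unreachable, but makes g a generator function

gen = g()
try:
    next(gen)
except RuntimeError:
    print("first call: RuntimeError")
# The generator is now exhausted; further calls raise plain
# StopIteration, so next() falls back to the default.
print(next(gen, "exhausted"))  # exhausted
```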

Another logical consequence of the proposal: if someone uses g.throw(StopIteration) to throw a StopIteration exception into a generator, if the generator doesn't catch it (which it could do using a try/except around the yield), it will be transformed into RuntimeError.

During the transition phase, the new feature must be enabled per-module using:

from __future__ import generator_stop

Any generator function constructed under the influence of this directive will have the REPLACE_STOPITERATION flag set on its code object, and generators with the flag set will behave according to this proposal. Once the feature becomes standard, the flag may be dropped; code should not inspect generators for it.

A proof-of-concept patch has been created to facilitate testing. [4]

Consequences for existing code

This change will affect existing code that depends on StopIteration bubbling up. The pure Python reference implementation of groupby [5] currently has comments "Exit on StopIteration" where it is expected that the exception will propagate and then be handled. This will be unusual, but not unknown, and such constructs will fail. Other examples abound, e.g. [6], [7].

(Nick Coghlan comments: """If you wanted to factor out a helper function that terminated the generator you'd have to do "return yield from helper()" rather than just "helper()".""")

There are also examples of generator expressions floating around that rely on a StopIteration raised by the expression, the target or the predicate (rather than by the __next__() call implied in the for loop proper).

Writing backwards and forwards compatible code

With the exception of hacks that raise StopIteration to exit a generator expression, it is easy to write code that works equally well under older Python versions as under the new semantics.

This is done by enclosing those places in the generator body where a StopIteration is expected (e.g. bare next() calls or in some cases helper functions that are expected to raise StopIteration) in a try/except construct that returns when StopIteration is raised. The try/except construct should appear directly in the generator function; doing this in a helper function that is not itself a generator does not work. If raise StopIteration occurs directly in a generator, simply replace it with return.
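The pattern can be sketched concretely; pairwise_sums is a hypothetical example helper, not part of any stdlib API:

```python
# The try/except guarding the bare next() calls sits directly in the
# generator body, so the code behaves identically under the old and
# new semantics.
def pairwise_sums(it):
    it = iter(it)
    while True:
        try:
            a = next(it)
            b = next(it)
        except StopIteration:
            return  # ends the iteration under both semantics
        yield a + b

print(list(pairwise_sums([1, 2, 3, 4, 5])))  # [3, 7] (trailing 5 dropped)
```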

Examples of breakage

Generators which explicitly raise StopIteration can generally be changed to simply return instead. This will be compatible with all existing Python versions, and will not be affected by __future__. Here are some illustrations from the standard library.

Lib/ipaddress.py:

if other == self:
    raise StopIteration

Becomes:

if other == self:
    return

In some cases, this can be combined with yield from to simplify the code, such as Lib/difflib.py:

if context is None:
    while True:
        yield next(line_pair_iterator)

Becomes:

if context is None:
    yield from line_pair_iterator
    return

(The return is necessary for a strictly-equivalent translation, though in this particular file, there is no further code, and the return can be omitted.) For compatibility with pre-3.3 versions of Python, this could be written with an explicit for loop:

if context is None:
    for line in line_pair_iterator:
        yield line
    return

More complicated iteration patterns will need explicit try/except constructs. For example, a hypothetical parser like this:

def parser(f):
    while True:
        data = next(f)
        while True:
            line = next(f)
            if line == "- end -": break
            data += line
        yield data

would need to be rewritten as:

def parser(f):
    while True:
        try:
            data = next(f)
            while True:
                line = next(f)
                if line == "- end -": break
                data += line
            yield data
        except StopIteration:
            return

or possibly:

def parser(f):
    for data in f:
        while True:
            line = next(f)
            if line == "- end -": break
            data += line
        yield data

The latter form obscures the iteration by purporting to iterate over the file with a for loop, but then also fetches more data from the same iterator during the loop body. It does, however, clearly differentiate between a "normal" termination (StopIteration instead of the initial line) and an "abnormal" termination (failing to find the end marker in the inner loop, which will now raise RuntimeError).

This effect of StopIteration has been used to cut a generator expression short, creating a form of takewhile:

def stop():
    raise StopIteration
print(list(x for x in range(10) if x < 5 or stop()))
# prints [0, 1, 2, 3, 4]

Under the current proposal, this form of non-local flow control is not supported, and would have to be rewritten in statement form:

def gen():
    for x in range(10):
        if x >= 5: return
        yield x
print(list(gen()))
# prints [0, 1, 2, 3, 4]

While this is a small loss of functionality, it is functionality that often comes at the cost of readability, and just as lambda has restrictions compared to def, so does a generator expression have restrictions compared to a generator function. In many cases, the transformation to full generator function will be trivially easy, and may improve structural clarity.
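For comparison, the same truncation is available without any exception tricks via itertools.takewhile, which works on all Python versions:

```python
from itertools import takewhile

# Stop yielding as soon as the predicate fails.
print(list(takewhile(lambda x: x < 5, range(10))))  # [0, 1, 2, 3, 4]
```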

Explanation of generators, iterators, and StopIteration

The proposal does not change the relationship between generators and iterators: a generator object is still an iterator, and not all iterators are generators. Generators have additional methods that iterators don't have, like send and throw. All this is unchanged. Nothing changes for generator users -- only authors of generator functions may have to learn something new. (This includes authors of generator expressions that depend on early termination of the iteration by a StopIteration raised in a condition.)

An iterator is an object with a __next__ method. Like many other special methods, it may either return a value, or raise a specific exception - in this case, StopIteration - to signal that it has no value to return. In this, it is similar to __getattr__ (can raise AttributeError), __getitem__ (can raise KeyError), and so on. A helper function for an iterator can be written to follow the same protocol; for example:

def helper(x, y):
    if x > y: return 1 / (x - y)
    raise StopIteration

def __next__(self):
    if self.a: return helper(self.b, self.c)
    return helper(self.d, self.e)

Both forms of signalling are carried through: a returned value is returned, an exception bubbles up. The helper is written to match the protocol of the calling function.

A generator function is one which contains a yield expression. Each time it is (re)started, it may either yield a value, or return (including "falling off the end"). A helper function for a generator can also be written, but it must also follow generator protocol:

def helper(x, y):
    if x > y: yield 1 / (x - y)

def gen(self):
    if self.a: return (yield from helper(self.b, self.c))
    return (yield from helper(self.d, self.e))

In both cases, any unexpected exception will bubble up. Due to the nature of generators and iterators, an unexpected StopIteration inside a generator will be converted into RuntimeError, but beyond that, all exceptions will propagate normally.

Transition plan

  • Python 3.5: Enable new semantics under __future__ import; silent deprecation warning if StopIteration bubbles out of a generator not under __future__ import.
  • Python 3.6: Non-silent deprecation warning.
  • Python 3.7: Enable new semantics everywhere.

Alternate proposals

Raising something other than RuntimeError

Rather than the generic RuntimeError, it might make sense to raise a new exception type UnexpectedStopIteration. This has the downside of implicitly encouraging that it be caught; the correct action is to catch the original StopIteration, not the chained exception.

Supplying a specific exception to raise on return

Nick Coghlan suggested a means of providing a specific StopIteration instance to the generator; if any other instance of StopIteration is raised, it is an error, but if that particular one is raised, the generator has properly completed. This subproposal has been withdrawn in favour of better options, but is retained for reference.

Making return-triggered StopIterations obvious

For certain situations, a simpler and fully backward-compatible solution may be sufficient: when a generator returns, instead of raising StopIteration, it raises a specific subclass of StopIteration (GeneratorReturn) which can then be detected. If it is not that subclass, it is an escaping exception rather than a return statement.

The inspiration for this alternative proposal was Nick's observation [8] that if an asyncio coroutine [9] accidentally raises StopIteration, it currently terminates silently, which may present a hard-to-debug mystery to the developer. The main proposal turns such accidents into clearly distinguishable RuntimeError exceptions, but if that is rejected, this alternate proposal would enable asyncio to distinguish between a return statement and an accidentally-raised StopIteration exception.

Of the three outcomes listed above, two change:

  • If a yield point is reached, the value, obviously, would still be returned.
  • If the frame is returned from, GeneratorReturn (rather than StopIteration) is raised.
  • If an instance of GeneratorReturn would be raised, instead an instance of StopIteration would be raised. Any other exception bubbles up normally.

In the third case, the StopIteration would have the value of the original GeneratorReturn, and would reference the original exception in its __cause__. If uncaught, this would clearly show the chaining of exceptions.

This alternative does not affect the discrepancy between generator expressions and list comprehensions, but allows generator-aware code (such as the contextlib and asyncio modules) to reliably differentiate between the second and third outcomes listed above.

However, once code exists that depends on this distinction between GeneratorReturn and StopIteration, a generator that invokes another generator and relies on the latter's StopIteration to bubble out would still be potentially wrong, depending on the use made of the distinction between the two exception types.

Converting the exception inside next()

Mark Shannon suggested [10] that the problem could be solved in next() rather than at the boundary of generator functions. By having next() catch StopIteration and instead raise ValueError, all unexpected StopIteration bubbling would be prevented; however, the backward-incompatibility concerns are far more serious than for the current proposal, as every next() call would need to be rewritten to guard against ValueError instead of StopIteration - not to mention that there is no way to write one block of code which reliably works on multiple versions of Python. (Using a dedicated exception type, perhaps subclassing ValueError, would help this; however, all code would still need to be rewritten.)

Note that calling next(it, default) catches StopIteration and substitutes the given default value; this feature is often useful to avoid a try/except block.
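A one-line illustration of the two-argument form:

```python
# next() with a default replaces the try/except guard entirely.
it = iter([])
print(next(it, "missing"))  # missing
```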

Sub-proposal: decorator to explicitly request current behaviour

Nick Coghlan suggested [11] that the situations where the current behaviour is desired could be supported by means of a decorator:

from itertools import allow_implicit_stop

@allow_implicit_stop
def my_generator():
    ...
    yield next(it)
    ...

Which would be semantically equivalent to:

def my_generator():
    try:
        ...
        yield next(it)
        ...
    except StopIteration:
        return

but be faster, as it could be implemented by simply permitting the StopIteration to bubble up directly.

Single-source Python 2/3 code would also benefit in a 3.7+ world, since libraries like six and python-future could just define their own version of "allow_implicit_stop" that referred to the new builtin in 3.5+, and was implemented as an identity function in other versions.

However, due to the implementation complexities required, the ongoing compatibility issues created, the subtlety of the decorator's effect, and the fact that it would encourage the "quick-fix" solution of just slapping the decorator onto all generators instead of properly fixing the code in question, this sub-proposal has been rejected. [12]

Criticism

Unofficial and apocryphal statistics suggest that this is seldom, if ever, a problem. [13] Code does exist which relies on the current behaviour (e.g. [3], [6], [7]), and there is the concern that this would be unnecessary code churn to achieve little or no gain.

Steven D'Aprano started an informal survey on comp.lang.python [14]; at the time of writing only two responses have been received: one was in favor of changing list comprehensions to match generator expressions (!), the other was in favor of this PEP's main proposal.

The existing model has been compared to the perfectly-acceptable issues inherent to every other case where an exception has special meaning. For instance, an unexpected KeyError inside a __getitem__ method will be interpreted as failure, rather than permitted to bubble up. However, there is a difference. Special methods use return to indicate normality, and raise to signal abnormality; generators yield to indicate data, and return to signal the abnormal state. This makes explicitly raising StopIteration entirely redundant, and potentially surprising. If other special methods had dedicated keywords to distinguish between their return paths, they too could turn unexpected exceptions into RuntimeError; the fact that they cannot should not preclude generators from doing so.

Why not fix all __next__() methods?

When implementing a regular __next__() method, the only way to indicate the end of the iteration is to raise StopIteration. So catching StopIteration here and converting it to RuntimeError would defeat the purpose. This is a reminder of the special status of generator functions: in a generator function, raising StopIteration is redundant since the iteration can be terminated by a simple return.

References

[1] PEP 380 - Syntax for Delegating to a Subgenerator (https://www.python.org/dev/peps/pep-0380)
[2] Initial mailing list comment (https://mail.python.org/pipermail/python-ideas/2014-November/029906.html)
[3] Proposal by GvR (https://mail.python.org/pipermail/python-ideas/2014-November/029953.html)
[4] Tracker issue with Proof-of-Concept patch (http://bugs.python.org/issue22906)
[5] Pure Python implementation of groupby (https://docs.python.org/3/library/itertools.html#itertools.groupby)
[6] Split a sequence or generator using a predicate (http://code.activestate.com/recipes/578416-split-a-sequence-or-generator-using-a-predicate/)
[7] wrap unbounded generator to restrict its output (http://code.activestate.com/recipes/66427-wrap-unbounded-generator-to-restrict-its-output/)
[8] Post from Nick Coghlan mentioning asyncio (https://mail.python.org/pipermail/python-ideas/2014-November/029961.html)
[9] Coroutines in asyncio (https://docs.python.org/3/library/asyncio-task.html#coroutines)
[10] Post from Mark Shannon with alternate proposal (https://mail.python.org/pipermail/python-dev/2014-November/137129.html)
[11] Idea from Nick Coghlan (https://mail.python.org/pipermail/python-dev/2014-November/137201.html)
[12] Rejection of above idea by GvR (https://mail.python.org/pipermail/python-dev/2014-November/137243.html)
[13] Response by Steven D'Aprano (https://mail.python.org/pipermail/python-ideas/2014-November/029994.html)
[14] Thread on comp.lang.python started by Steven D'Aprano (https://mail.python.org/pipermail/python-list/2014-November/680757.html)

pep-0480 Surviving a Compromise of PyPI: The Maximum Security Model

PEP:480
Title:Surviving a Compromise of PyPI: The Maximum Security Model
Version:$Revision$
Last-Modified:$Date$
Author:Trishank Karthik Kuppusamy <trishank at nyu.edu>, Vladimir Diaz <vladimir.diaz at nyu.edu>, Donald Stufft <donald at stufft.io>, Justin Cappos <jcappos at nyu.edu>
BDFL-Delegate:Richard Jones <r1chardj0n3s@gmail.com>
Discussions-To:DistUtils mailing list <distutils-sig at python.org>
Status:Draft
Type:Standards Track
Content-Type:text/x-rst
Requires:458
Created:8-Oct-2014

Abstract

Proposed is an extension to PEP 458 that adds support for end-to-end signing and the maximum security model. End-to-end signing allows both PyPI and developers to sign for the distributions that are downloaded by clients. The minimum security model proposed by PEP 458 supports continuous delivery of distributions (because they are signed by online keys), but that model does not protect distributions in the event that PyPI is compromised. In the minimum security model, attackers may sign for malicious distributions by compromising the signing keys stored on PyPI infrastructure. The maximum security model, described in this PEP, retains the benefits of PEP 458 (e.g., immediate availability of distributions that are uploaded to PyPI), but additionally ensures that end-users are not at risk of installing forged software if PyPI is compromised.

This PEP discusses the changes made to PEP 458 but excludes its informational elements to primarily focus on the maximum security model. For example, an overview of The Update Framework or the basic mechanisms in PEP 458 are not covered here. The changes to PEP 458 include modifications to the snapshot process, key compromise analysis, auditing snapshots, and the steps that should be taken in the event of a PyPI compromise. The signing and key management process that PyPI MAY RECOMMEND is discussed but not strictly defined. How the release process should be implemented to manage keys and metadata is left to the implementors of the signing tools. That is, this PEP delineates the expected cryptographic key type and signature format included in metadata that MUST be uploaded by developers in order to support end-to-end verification of distributions.

Rationale

PEP 458 [1] proposes how PyPI should be integrated with The Update Framework (TUF) [2]. It explains how modern package managers like pip can be made more secure, and the types of attacks that can be prevented if PyPI is modified on the server side to include TUF metadata. Package managers can reference the TUF metadata available on PyPI to download distributions more securely.

PEP 458 also describes the metadata layout of the PyPI repository and employs the minimum security model, which supports continuous delivery of projects and uses online cryptographic keys to sign the distributions uploaded by developers. Although the minimum security model guards against most attacks on software updaters [5] [7], such as mix-and-match and extraneous dependencies attacks, it can be improved to support end-to-end signing and to prohibit forged distributions in the event that PyPI is compromised.

The main strength of PEP 458 and the minimum security model is the automated and simplified release process: developers may upload distributions and then have PyPI sign for their distributions. Much of the release process is handled in an automated fashion by online roles and this approach requires storing cryptographic signing keys on the PyPI infrastructure. Unfortunately, cryptographic keys that are stored online are vulnerable to theft. The maximum security model, proposed in this PEP, permits developers to sign for the distributions that they make available to PyPI users, and does not put end-users at risk of downloading malicious distributions if the online keys stored on PyPI infrastructure are compromised.

Threat Model

The threat model assumes the following:

  • Offline keys are safe and securely stored.
  • Attackers can compromise at least one of PyPI's trusted keys that are stored online, and may do so at once or over a period of time.
  • Attackers can respond to client requests.
  • Attackers may control any number of developer keys for projects a client does not want to install.

Attackers are considered successful if they can cause a client to install (or leave installed) something other than the most up-to-date version of the software the client is updating. When an attacker is preventing the installation of updates, the attacker's goal is that clients not realize that anything is wrong.

Definitions

The keywords "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119 [13].

This PEP focuses on integrating TUF with PyPI; however, the reader is encouraged to read about TUF's design principles [2]. It is also RECOMMENDED that the reader be familiar with the TUF specification [3], and PEP 458 [1] (which this PEP is extending).

Terms used in this PEP are defined as follows:

  • Projects: Projects are software components that are made available for integration. Projects include Python libraries, frameworks, scripts, plugins, applications, collections of data or other resources, and various combinations thereof. Public Python projects are typically registered on the Python Package Index [4].
  • Releases: Releases are uniquely identified snapshots of a project [4].
  • Distributions: Distributions are the packaged files that are used to publish and distribute a release.
  • Simple index: The HTML page that contains internal links to the distributions of a project [4].
  • Roles: There is one root role in PyPI. There are multiple roles whose responsibilities are delegated to them directly or indirectly by the root role. The term "top-level role" refers to the root role and any role delegated by the root role. Each role has a single metadata file that it is trusted to provide.
  • Metadata: Metadata are files that describe roles, other metadata, and target files.
  • Repository: A repository is a resource comprised of named metadata and target files. Clients request metadata and target files stored on a repository.
  • Consistent snapshot: A set of TUF metadata and PyPI targets that capture the complete state of all projects on PyPI as they existed at some fixed point in time.
  • The snapshot (release) role: In order to prevent confusion due to the different meanings of the term "release" used in PEP 426 [1] and the TUF specification [3], the release role is renamed to the snapshot role.
  • Developer: Either the owner or maintainer of a project who is allowed to update TUF metadata, as well as distribution metadata and files for a given project.
  • Online key: A private cryptographic key that MUST be stored on the PyPI server infrastructure. This usually allows automated signing with the key. An attacker who compromises the PyPI infrastructure will be able to immediately read these keys.
  • Offline key: A private cryptographic key that MUST be stored independent of the PyPI server infrastructure. This prevents automated signing with the key. An attacker who compromises the PyPI infrastructure will not be able to immediately read these keys.
  • Threshold signature scheme: A role can increase its resilience to key compromises by specifying that at least t out of n keys are REQUIRED to sign its metadata. A compromise of t-1 keys is insufficient to compromise the role itself. Saying that a role requires (t, n) keys denotes the threshold signature property.
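The (t, n) threshold property defined above can be sketched in a few lines. This is an illustrative sketch, not TUF's implementation: the dictionary fields and the verify callable are hypothetical stand-ins for a real metadata format and signature check (e.g., Ed25519).

```python
# Illustrative sketch of TUF's (t, n) threshold property: a role's
# metadata is trusted only if at least ``t`` of its ``n`` registered
# keys produced a valid signature over it. ``verify`` is a stand-in
# for a real cryptographic signature check.

def meets_threshold(signatures, trusted_keyids, threshold, verify):
    """Return True if at least ``threshold`` distinct trusted keys signed."""
    valid = {
        sig["keyid"]
        for sig in signatures
        if sig["keyid"] in trusted_keyids and verify(sig)
    }
    return len(valid) >= threshold

# A (2, 3) role: compromising a single key is insufficient to
# compromise the role itself.
sigs = [{"keyid": "k1", "ok": True}, {"keyid": "k2", "ok": True}]
assert meets_threshold(sigs, {"k1", "k2", "k3"}, 2, lambda s: s["ok"])
assert not meets_threshold(sigs[:1], {"k1", "k2", "k3"}, 2, lambda s: s["ok"])
```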

Maximum Security Model

The maximum security model permits developers to sign their projects and to upload signed metadata to PyPI. If the PyPI infrastructure were compromised, attackers would be unable to serve malicious versions of a claimed project without having access to that project's developer key. Figure 1 depicts the changes made to the metadata layout of the minimum security model, namely that developer roles are now supported and that three new delegated roles exist: claimed, recently-claimed, and unclaimed. The bins role from the minimum security model has been renamed unclaimed and can contain any projects that have not been added to claimed. The unclaimed role functions just as before (i.e., as explained in PEP 458, projects added to this role are signed by PyPI with an online key). Offline keys provided by developers ensure the strength of the maximum security model over the minimum model. Although the minimum security model supports continuous delivery of projects, all projects are signed by an online key. That is, an attacker is able to corrupt packages in the minimum security model, but not in the maximum model, without also compromising a developer's key.

pep-0480-1.png

Figure 1: An overview of the metadata layout in the maximum security model. The maximum security model supports continuous delivery and survivable key compromise.

Projects that are signed by developers and uploaded to PyPI for the first time are added to the recently-claimed role. The recently-claimed role uses an online key, so projects uploaded for the first time are immediately available to clients. After some time has passed, PyPI administrators MAY periodically move (e.g., every month) projects listed in recently-claimed to the claimed role for maximum security. The claimed role uses an offline key, thus projects added to this role cannot be easily forged if PyPI is compromised.

The recently-claimed role is separate from the unclaimed role for usability and efficiency, not security. If new project delegations were prepended to unclaimed metadata, unclaimed would need to be re-downloaded every time a project obtained a key. By separating out new projects, the amount of data retrieved is reduced. From a usability standpoint, it also makes it easier for administrators to see which projects are now claimed. This information is needed when moving keys from recently-claimed to claimed, which is discussed in more detail in the "Producing Consistent Snapshots" section.

End-to-End Signing

End-to-end signing allows both PyPI and developers to sign for the metadata downloaded by clients. PyPI is trusted to make uploaded projects available to clients (PyPI signs the metadata for this part of the process), and developers sign the distributions that they upload to PyPI.

In order to delegate trust to a project, developers are required to submit a public key to PyPI. PyPI takes the project's public key and adds it to parent metadata that PyPI then signs. After the initial trust is established, developers are required to sign distributions that they upload to PyPI using the public key's corresponding private key. The signed TUF metadata that developers upload to PyPI includes information like the distribution's file size and hash, which package managers use to verify distributions that are downloaded.

The practical implications of end-to-end signing are the extra administrative work needed to delegate trust to a project, and the signed metadata that developers MUST upload to PyPI along with the distribution. Specifically, PyPI is expected to periodically sign metadata with an offline key by adding projects to the claimed metadata file and signing it. In contrast, projects are only ever signed with an online key in the minimum security model. End-to-end signing does require manual intervention to delegate trust (i.e., to sign metadata with an offline key), but this is a one-time cost and projects have stronger protections against PyPI compromises thereafter.

Metadata Signatures, Key Management, and Signing Distributions

This section discusses the tools, signature scheme, and signing methods that PyPI MAY recommend to implementors of the signing tools. Developers are expected to use these tools to sign and upload distributions to PyPI. To summarize the RECOMMENDED tools and schemes discussed in the subsections below: developers MAY generate cryptographic keys and sign metadata (with the Ed25519 signature scheme) in some automated fashion, where the metadata includes the information required to verify the authenticity of the distribution. Developers then upload metadata to PyPI, where it will be available for download by package managers such as pip (i.e., package managers that support TUF metadata). The entire process is transparent to end-users who download distributions from PyPI with a package manager that supports TUF.

The first three subsections (Cryptographic Signature Scheme, Cryptographic Key Files, and Key Management) cover the cryptographic components of the developer release process. That is, which key type PyPI supports, how keys may be stored, and how keys may be generated. The two subsections that follow the first three discuss the PyPI modules that SHOULD be modified to support TUF metadata. For example, Twine and Distutils are two projects that SHOULD be modified. Finally, the last subsection goes over the automated key management and signing solution that is RECOMMENDED for the signing tools.

TUF's design is flexible with respect to cryptographic key types, signatures, and signing methods. The tools, modification, and methods discussed in the following sections are RECOMMENDATIONS for the implementors of the signing tools.

Cryptographic Signature Scheme: Ed25519

The package manager (pip) shipped with CPython MUST work on non-CPython interpreters and cannot have dependencies that have to be compiled (i.e., the PyPI+TUF integration MUST NOT require compilation of C extensions in order to verify cryptographic signatures). Verification of signatures MUST be done in Python, and verifying RSA [11] signatures in pure-Python may be impractical due to speed. Therefore, PyPI MAY use the Ed25519 [14] signature scheme.

Ed25519 [12] is a public-key signature system that uses small cryptographic signatures and keys. A pure-Python implementation [15] of the Ed25519 signature scheme is available. Verification of Ed25519 signatures is fast even when performed in Python.

Cryptographic Key Files

The implementation MAY encrypt key files with AES-256-CTR mode and strengthen passwords with PBKDF2-HMAC-SHA256 (100K iterations by default, though this may be overridden by the developer). The current Python implementation of TUF can use any cryptographic library (support for PyCA cryptography will be added in the future), may override the default number of PBKDF2 iterations, and allows the KDF to be tweaked to taste.
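The password-strengthening step described above can be reproduced with Python's standard library. The salt, iteration count, and derived key length here are illustrative; a real implementation would feed the derived key into the AES-256-CTR encryption of the on-disk key file.

```python
import hashlib
import os

# Derive a 256-bit key-encryption key from a developer password using
# PBKDF2-HMAC-SHA256 with 100K iterations, as described above. The
# derived key would then encrypt the private key file on disk; the
# salt must be stored alongside the encrypted file.
salt = os.urandom(16)
iterations = 100_000  # the default; MAY be overridden by the developer

def derive_key(password: str, salt: bytes, iterations: int) -> bytes:
    return hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), salt, iterations, dklen=32
    )

key = derive_key("correct horse battery staple", salt, iterations)
assert len(key) == 32
# The derivation is deterministic for a fixed password and salt.
assert key == derive_key("correct horse battery staple", salt, iterations)
```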

Key Management: miniLock

An easy-to-use key management solution is needed. One solution is to derive a private key from a password so that developers do not have to manage cryptographic key files across multiple computers. miniLock [16] is an example of how this can be done. Developers may view the cryptographic key as a secondary password. miniLock also works well with a signature scheme like Ed25519, which only needs a very small key.

Third-party Upload Tools: Twine

Third-party tools like Twine [17] MAY be modified (if they wish to support distributions that include TUF metadata) to sign and upload developer projects to PyPI. Twine is a utility for interacting with PyPI that uses TLS to upload distributions, and prevents MITM attacks on usernames and passwords.

Distutils

Distutils [18] MAY be modified to sign metadata and to upload signed distributions to PyPI. Distutils comes packaged with CPython and is the most widely-used tool for uploading distributions to PyPI.

Automated Signing Solution

An easy-to-use key management solution is RECOMMENDED for developers. One approach is to generate a cryptographic private key from a user password, akin to miniLock. Although developer signatures can remain optional, this approach may be inadequate due to the great number of potentially unsigned dependencies each distribution may have. If any one of these dependencies is unsigned, it negates any benefit the project gains from signing its own distribution (i.e., attackers would only need to compromise one of the unsigned dependencies to attack end-users). Requiring developers to manually sign distributions and manage keys is expected to render key signing an unused feature.

A default, PyPI-mediated key management and package signing solution that is transparent [19] to developers and does not require a key escrow (sharing of encrypted private keys with PyPI) is RECOMMENDED for the signing tools. Additionally, the signing tools SHOULD circumvent the sharing of private keys across multiple machines of each developer.

The following outlines an automated signing solution that a new developer MAY follow to upload a distribution to PyPI:

  1. Register a PyPI project.
  2. Enter a secondary password (independent of the PyPI user account password).
  3. Optional: Add a new identity to the developer's PyPI user account from a second machine (after a password prompt).
  4. Upload project.

Step 1 is the normal procedure followed by developers to register a PyPI project [20].

Step 2 generates an encrypted key file (private), uploads an Ed25519 public key to PyPI, and signs the TUF metadata that is generated for the distribution.

Optionally adding a new identity from a second machine, by simply entering a password, in step 3 also generates an encrypted private key file and uploads an Ed25519 public key to PyPI. Separate identities MAY be created to allow a developer, or other project maintainers, to sign releases on multiple machines. An existing verified identity (its public key is contained in project metadata or has been uploaded to PyPI) signs for new identities. By default, project metadata has a signature threshold of "1" and other verified identities may create new releases to satisfy the threshold.

Step 4 uploads the distribution file and TUF metadata to PyPI. The "Snapshot Process" section discusses in detail the procedure followed by developers to upload a distribution to PyPI.

Generation of cryptographic files and signatures is transparent to the developers in the default case: developers need not be aware that packages are automatically signed. However, the signing tools should be flexible; a single project key may also be shared between multiple machines if manual key management is preferred (e.g., ssh-copy-id).

The repository [21] and developer [22] TUF tools currently support all of the recommendations previously mentioned, except for the automated signing solution, which SHOULD be added to Distutils, Twine, and other third-party signing tools. The automated signing solution calls available repository tool functions to sign metadata and to generate the cryptographic key files.

Snapshot Process

The snapshot process is fairly simple and SHOULD be automated. The snapshot process MUST keep in memory the latest working set of root, targets, and delegated roles. Every minute or so the snapshot process will sign for this latest working set. (Recall that project transaction processes continuously inform the snapshot process about the latest delegated metadata in a concurrency-safe manner. The snapshot process will actually sign for a copy of the latest working set while the latest working set in memory will be updated with information that is continuously communicated by the project transaction processes.) The snapshot process MUST generate and sign new timestamp metadata that will vouch for the metadata (root, targets, and delegated roles) generated in the previous step. Finally, the snapshot process MUST make available to clients the new timestamp and snapshot metadata representing the latest snapshot.

A claimed or recently-claimed project will need to upload in its transaction to PyPI not just targets (a simple index as well as distributions) but also TUF metadata. The project MAY do so by uploading a ZIP file containing two directories, /metadata/ (containing delegated targets metadata files) and /targets/ (containing targets such as the project simple index and distributions that are signed by the delegated targets metadata).

Whenever the project uploads metadata or targets to PyPI, PyPI SHOULD check the project TUF metadata for at least the following properties:

  • A threshold number of the developer keys registered with PyPI by that project MUST have signed for the delegated targets metadata file that represents the "root" of targets for that project (e.g. metadata/targets/project.txt).
  • The signatures of delegated targets metadata files MUST be valid.
  • The delegated targets metadata files MUST NOT have expired.
  • The delegated targets metadata MUST be consistent with the targets.
  • A delegator MUST NOT delegate targets that were not delegated to itself by another delegator.
  • A delegatee MUST NOT sign for targets that were not delegated to itself by a delegator.

If PyPI chooses to check the project TUF metadata, then PyPI MAY choose to reject publishing any set of metadata or targets that do not meet these requirements.
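The checks listed above can be sketched as a single validation pass over parsed metadata. This is an illustrative sketch only: the field names ("expires", "signatures", "targets") loosely follow TUF's metadata format, and the verify callable stands in for a real signature check.

```python
import time

# Illustrative sketch of the per-project checks listed above, applied
# to a parsed delegated targets metadata file.

def check_project_metadata(meta, trusted_keyids, threshold, verify,
                           uploaded_targets, now=None):
    now = time.time() if now is None else now
    # The delegated targets metadata MUST NOT have expired.
    if meta["expires"] <= now:
        return False
    # A threshold of registered developer keys MUST have signed for it,
    # and those signatures MUST be valid.
    valid = {s["keyid"] for s in meta["signatures"]
             if s["keyid"] in trusted_keyids and verify(s)}
    if len(valid) < threshold:
        return False
    # The metadata MUST be consistent with the uploaded targets.
    return set(meta["targets"]) == set(uploaded_targets)

meta = {
    "expires": 2_000_000_000,
    "signatures": [{"keyid": "k1"}],
    "targets": {"pkg-1.0.tar.gz": {}},
}
ok = check_project_metadata(meta, {"k1"}, 1, lambda s: True,
                            ["pkg-1.0.tar.gz"], now=1_700_000_000)
assert ok
```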

PyPI MUST enforce access control by ensuring that each project can only write to the TUF metadata for which it is responsible. It MUST do so by ensuring that project transaction processes write to the correct metadata as well as correct locations within those metadata. For example, a project transaction process for an unclaimed project MUST write to the correct target paths in the correct delegated unclaimed metadata for the targets of the project.

On rare occasions, PyPI MAY wish to extend the TUF metadata format for projects in a backward-incompatible manner. Note that PyPI will NOT be able to automatically rewrite existing TUF metadata on behalf of projects in order to upgrade the metadata to the new backward-incompatible format because this would invalidate the signatures of the metadata as signed by developer keys. Instead, package managers SHOULD be written to recognize and handle multiple incompatible versions of TUF metadata so that claimed and recently-claimed projects could be offered a reasonable time to migrate their metadata to newer but backward-incompatible formats.

If PyPI eventually runs out of disk space to produce a new consistent snapshot, then PyPI MAY then use something like a "mark-and-sweep" algorithm to delete sufficiently outdated consistent snapshots. That is, only outdated metadata like timestamp and snapshot that are no longer used are deleted. Specifically, in order to preserve the latest consistent snapshot, PyPI would walk objects -- beginning from the root (timestamp) -- of the latest consistent snapshot, mark all visited objects, and delete all unmarked objects. The last few consistent snapshots may be preserved in a similar fashion. Deleting a consistent snapshot will cause clients to see nothing except HTTP 404 responses to any request for a target of the deleted consistent snapshot. Clients SHOULD then retry (as before) their requests with the latest consistent snapshot.
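The mark-and-sweep walk described above can be sketched as a reachability traversal over the object graph. The object names and edge structure here are hypothetical; the real graph would be derived from the hashes recorded in each metadata file.

```python
# Sketch of the "mark-and-sweep" deletion described above: walk the
# object graph starting from each root (timestamp) object of the
# snapshots to preserve, mark everything reachable, then delete the
# unmarked remainder.

def sweep(objects, edges, roots):
    """objects: all stored files; edges: file -> files it references;
    roots: timestamp objects of the snapshots to preserve.
    Returns the set of unmarked (deletable) objects."""
    marked = set()
    stack = list(roots)
    while stack:
        obj = stack.pop()
        if obj in marked:
            continue
        marked.add(obj)
        stack.extend(edges.get(obj, ()))
    return objects - marked

objects = {"ts1", "snap1", "targets1", "ts2", "snap2", "targets2"}
edges = {"ts1": ["snap1"], "snap1": ["targets1"],
         "ts2": ["snap2"], "snap2": ["targets2"]}
# Preserve only the latest consistent snapshot (rooted at ts2).
assert sweep(objects, edges, ["ts2"]) == {"ts1", "snap1", "targets1"}
```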

All package managers that support TUF metadata MUST be modified to download every metadata and target file (except for timestamp metadata) by including, in the request for the file, the cryptographic hash of the file in the filename. Following the filename convention RECOMMENDED in the next subsection, a request for the file at filename.ext will be transformed to the equivalent request for the file at digest.filename.

Finally, PyPI SHOULD use a transaction log [23] to record project transaction processes and queues so that it will be easier to recover from errors after a server failure.

Producing Consistent Snapshots

PyPI is responsible for updating, depending on the project, either the claimed, recently-claimed, or unclaimed metadata and associated delegated metadata. Every project MUST upload its set of metadata and targets in a single transaction. The uploaded set of files is called the "project transaction." How PyPI MAY validate files in a project transaction is discussed in a later section. The focus of this section is on how PyPI will respond to a project transaction.

Every metadata and target file MUST include in its filename the hex digest [24] of its SHA-256 [25] hash, which PyPI may prepend to filenames after the files have been uploaded. For this PEP, it is RECOMMENDED that PyPI adopt a simple convention of the form: digest.filename, where filename is the original filename without a copy of the hash, and digest is the hex digest of the hash.
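The digest.filename convention described above is a one-line transformation; the sketch below shows it with the standard library, using a hypothetical filename and contents.

```python
import hashlib

# The digest.filename convention: prepend the SHA-256 hex digest of a
# file's contents to its original filename, separated by a dot.

def consistent_name(filename: str, contents: bytes) -> str:
    digest = hashlib.sha256(contents).hexdigest()
    return f"{digest}.{filename}"

name = consistent_name("project.txt", b'{"signed": "..."}')
digest, _, original = name.partition(".")
assert len(digest) == 64  # SHA-256 hex digest
assert original == "project.txt"
```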

When an unclaimed project uploads a new transaction, a project transaction process MUST add all new targets and relevant delegated unclaimed metadata. The project transaction process MUST inform the snapshot process about new delegated unclaimed metadata.

When a recently-claimed project uploads a new transaction, a project transaction process MUST add all new targets and delegated targets metadata for the project. If the project is new, then the project transaction process MUST also add new recently-claimed metadata with the public keys (which MUST be part of the transaction) for the project. recently-claimed projects have a threshold value of "1" set by the transaction process. Finally, the project transaction process MUST inform the snapshot process about new recently-claimed metadata, as well as the current set of delegated targets metadata for the project.

The transaction process for a claimed project is slightly different in that PyPI administrators periodically move (a manual process that MAY occur every two weeks to a month) projects from the recently-claimed role to the claimed role. (Moving a project from recently-claimed to claimed is a manual process because PyPI administrators have to use an offline key to sign the claimed project's distribution.) A project transaction process MUST then add new recently-claimed and claimed metadata to reflect this migration. As is the case for a recently-claimed project, the project transaction process MUST always add all new targets and delegated targets metadata for the claimed project. Finally, the project transaction process MUST inform the consistent snapshot process about new recently-claimed or claimed metadata, as well as the current set of delegated targets metadata for the project.

Project transaction processes SHOULD be automated, except when PyPI administrators move a project from the recently-claimed role to the claimed role. Project transaction processes MUST also be applied atomically: either all metadata and targets -- or none of them -- are added. The project transaction processes and snapshot process SHOULD work concurrently. Finally, project transaction processes SHOULD keep in memory the latest claimed, recently-claimed, and unclaimed metadata so that they will be correctly updated in new consistent snapshots.

The queue MAY be processed concurrently in order of appearance, provided that the following rules are observed:

  1. No pair of project transaction processes may concurrently work on the same project.
  2. No pair of project transaction processes may concurrently work on unclaimed projects that belong to the same delegated unclaimed role.
  3. No pair of project transaction processes may concurrently work on new recently-claimed projects.
  4. No pair of project transaction processes may concurrently work on new claimed projects.
  5. No project transaction process may work on a new claimed project while another project transaction process is working on a new recently-claimed project and vice versa.

These rules MUST be observed to ensure that metadata is not read from or written to inconsistently.
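The rules above can be modeled as lock acquisition: each transaction takes its project lock plus any applicable role-level lock, so conflicting transactions serialize while disjoint ones proceed. The lock names below are hypothetical; note that rules 3, 4, and 5 collapse into a single "new-claim" lock, since work on new claimed and new recently-claimed projects must also exclude each other.

```python
# Sketch: express the five concurrency rules as lock sets. Two
# transactions may run concurrently only if their lock sets are
# disjoint.

def required_locks(project, delegated_role=None, new_claim=False):
    locks = {("project", project)}                    # rule 1
    if delegated_role is not None:
        locks.add(("unclaimed-bin", delegated_role))  # rule 2
    if new_claim:
        locks.add(("new-claim",))                     # rules 3, 4, 5
    return locks

def may_run_concurrently(a, b):
    return not (a & b)

t1 = required_locks("foo", delegated_role="bin-07")
t2 = required_locks("bar", delegated_role="bin-07")
t3 = required_locks("baz", new_claim=True)
assert not may_run_concurrently(t1, t2)  # same delegated unclaimed role
assert may_run_concurrently(t1, t3)      # disjoint locks
```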

Auditing Snapshots

If a malicious party compromises PyPI, they can sign arbitrary files with any of the online keys. The roles with offline keys (i.e., root and targets) are still protected. To safely recover from a repository compromise, snapshots should be audited to ensure that files are only restored to trusted versions.

When a repository compromise has been detected, the integrity of three types of information must be validated:

  1. If the online keys of the repository have been compromised, they can be revoked by having the targets role sign new metadata, delegated to a new key.
  2. If the role metadata on the repository has been changed, this will impact the metadata that is signed by online keys. Any role information created since the compromise should be discarded. As a result, developers of new projects will need to re-register their projects.
  3. If the packages themselves have been tampered with, this can be detected using the stored hash information for packages that existed in trusted metadata before the compromise. Also, new distributions that are signed by developers in the claimed role may be safely retained. However, any distributions signed by developers in the recently-claimed or unclaimed roles should be discarded.
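The package-validation step above amounts to comparing each stored package against the hash recorded in metadata trusted from before the compromise. The sketch below is illustrative, with hypothetical package names:

```python
import hashlib

# Sketch: validate stored packages against pre-compromise trusted
# hashes; anything unmatched (tampered, or added after the compromise
# without claimed-role developer signatures) is discarded.

def audit(packages, trusted_hashes):
    """packages: name -> bytes; trusted_hashes: name -> sha256 hexdigest.
    Returns (retained, discarded) sets of package names."""
    retained, discarded = set(), set()
    for name, contents in packages.items():
        expected = trusted_hashes.get(name)
        if expected and hashlib.sha256(contents).hexdigest() == expected:
            retained.add(name)
        else:
            discarded.add(name)
    return retained, discarded

good = b"release contents"
packages = {"pkg-1.0.tar.gz": good, "pkg-1.1.tar.gz": b"tampered"}
trusted = {"pkg-1.0.tar.gz": hashlib.sha256(good).hexdigest()}
retained, discarded = audit(packages, trusted)
assert retained == {"pkg-1.0.tar.gz"}
assert discarded == {"pkg-1.1.tar.gz"}
```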

In order to safely restore snapshots in the event of a compromise, PyPI SHOULD maintain a small number of its own mirrors to copy PyPI snapshots according to some schedule. The mirroring protocol can be used immediately for this purpose. The mirrors must be secured and isolated such that they are responsible only for mirroring PyPI. The mirrors can be checked against one another to detect accidental or malicious failures.

Another approach is to periodically generate the cryptographic hash of the snapshot metadata and publish it, for example by tweeting it. Upon receiving the tweet, a user can come forward with the actual metadata, and the repository maintainers are then able to verify the metadata's cryptographic hash. Alternatively, PyPI may periodically archive its own versions of the snapshot rather than rely on externally provided metadata. In this case, PyPI SHOULD take the cryptographic hash of every package on the repository and store this data on an offline device. If any package hash has changed, an attack has occurred.

Attacks that serve different versions of metadata or that freeze a package at a specific version can be handled by TUF with techniques such as implicit key revocation and metadata mismatch detection [2].

Key Compromise Analysis

This PEP has covered the maximum security model, the TUF roles that should be added to support continuous delivery of distributions, how to generate and sign the metadata of each role, and how to support distributions that have been signed by developers. The remaining sections discuss how PyPI SHOULD audit repository metadata, and the methods PyPI can use to detect and recover from a PyPI compromise.

Table 1 summarizes a few of the attacks possible when a threshold number of private cryptographic keys (belonging to any of the PyPI roles) are compromised. The leftmost column lists the roles (or a combination of roles) that have been compromised, and the columns to the right show whether the compromised roles leave clients susceptible to malicious updates, freeze attacks, or metadata inconsistency attacks.

Role compromised: timestamp
  Malicious updates: NO (snapshot and targets or any of the delegated roles need to cooperate)
  Freeze attack: YES (limited by earliest root, targets, or bin metadata expiry time)
  Metadata inconsistency: NO (snapshot needs to cooperate)

Role compromised: snapshot
  Malicious updates: NO (timestamp and targets or any of the delegated roles need to cooperate)
  Freeze attack: NO (timestamp needs to cooperate)
  Metadata inconsistency: NO (timestamp needs to cooperate)

Role compromised: timestamp AND snapshot
  Malicious updates: NO (targets or any of the delegated roles need to cooperate)
  Freeze attack: YES (limited by earliest root, targets, or bin metadata expiry time)
  Metadata inconsistency: YES (limited by earliest root, targets, or bin metadata expiry time)

Role compromised: targets OR claimed OR recently-claimed OR unclaimed OR project
  Malicious updates: NO (timestamp and snapshot need to cooperate)
  Freeze attack: NOT APPLICABLE (need timestamp and snapshot)
  Metadata inconsistency: NOT APPLICABLE (need timestamp and snapshot)

Role compromised: (timestamp AND snapshot) AND project
  Malicious updates: YES
  Freeze attack: YES (limited by earliest root, targets, or bin metadata expiry time)
  Metadata inconsistency: YES (limited by earliest root, targets, or bin metadata expiry time)

Role compromised: (timestamp AND snapshot) AND (recently-claimed OR unclaimed)
  Malicious updates: YES (but only of projects not delegated by claimed)
  Freeze attack: YES (limited by earliest root, targets, claimed, recently-claimed, project, or unclaimed metadata expiry time)
  Metadata inconsistency: YES (limited by earliest root, targets, claimed, recently-claimed, project, or unclaimed metadata expiry time)

Role compromised: (timestamp AND snapshot) AND (targets OR claimed)
  Malicious updates: YES
  Freeze attack: YES (limited by earliest root, targets, claimed, recently-claimed, project, or unclaimed metadata expiry time)
  Metadata inconsistency: YES (limited by earliest root, targets, claimed, recently-claimed, project, or unclaimed metadata expiry time)

Role compromised: root
  Malicious updates: YES
  Freeze attack: YES
  Metadata inconsistency: YES

Table 1: Attacks that are possible by compromising certain combinations of role keys. In September 2013 [26], it was shown how the latest version (at the time) of pip was susceptible to these attacks and how TUF could protect users against them [8]. Roles signed by offline keys are in bold.

Note that compromising targets or any delegated role (except for project targets metadata) does not immediately allow an attacker to serve malicious updates. The attacker must also compromise the timestamp and snapshot roles (which are both online and therefore more likely to be compromised). This means that in order to launch any attack, one must not only be able to act as a man-in-the-middle, but also compromise the timestamp key (or compromise the root keys and sign a new timestamp key). To launch any attack other than a freeze attack, one must also compromise the snapshot key. Finally, a compromise of the PyPI infrastructure MAY introduce malicious updates to recently-claimed projects because the keys for these roles are online.

In the Event of a Key Compromise

A key compromise means that a threshold of keys belonging to developers or the roles on PyPI, as well as the PyPI infrastructure, have been compromised and used to sign new metadata on PyPI.

If a threshold number of developer keys of a project have been compromised, the project MUST take the following steps:

  1. The project metadata and targets MUST be restored to the last known good consistent snapshot where the project was not known to be compromised. This can be done by developers repackaging and resigning all targets with the new keys.
  2. The project's metadata MUST have its version numbers incremented, expiry times suitably extended, and signatures renewed.

Whereas PyPI MUST take the following steps:

  1. Revoke the compromised developer keys from the recently-claimed or claimed role. This is done by replacing the compromised developer keys with newly issued developer keys.
  2. A new timestamped consistent snapshot MUST be issued.

If a threshold number of timestamp, snapshot, recently-claimed, or unclaimed keys have been compromised, then PyPI MUST take the following steps:

  1. Revoke the timestamp, snapshot, and targets role keys from the root role. This is done by replacing the compromised timestamp, snapshot, and targets keys with newly issued keys.
  2. Revoke the recently-claimed and unclaimed keys from the targets role by replacing their keys with newly issued keys. Sign the new targets role metadata and discard the new keys (because, as we explained earlier, this increases the security of targets metadata).
  3. Clear all targets or delegations in the recently-claimed role and delete all associated delegated targets metadata. Recently registered projects SHOULD register their developer keys again with PyPI.
  4. All targets of the recently-claimed and unclaimed roles SHOULD be compared with the last known good consistent snapshot where none of the timestamp, snapshot, recently-claimed, or unclaimed keys were known to have been compromised. Added, updated, or deleted targets in the compromised consistent snapshot that do not match the last known good consistent snapshot SHOULD be restored to their previous versions. After ensuring the integrity of all unclaimed targets, the unclaimed metadata MUST be regenerated.
  5. The recently-claimed and unclaimed metadata MUST have their version numbers incremented, expiry times suitably extended, and signatures renewed.
  6. A new timestamped consistent snapshot MUST be issued.
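Step 4 above amounts to a three-way diff between the possibly compromised consistent snapshot and the last known good one. As a minimal sketch, both snapshots can be modeled as simple `{target_path: sha256_hexdigest}` dicts; this representation is hypothetical and is not the actual TUF metadata format:

```python
# Sketch of comparing a possibly compromised consistent snapshot against
# the last known good one, flagging targets to restore.
def suspicious_targets(compromised, last_known_good):
    """Targets added, updated, or deleted relative to the good snapshot."""
    added   = set(compromised) - set(last_known_good)
    deleted = set(last_known_good) - set(compromised)
    updated = {
        t for t in set(compromised) & set(last_known_good)
        if compromised[t] != last_known_good[t]
    }
    return {"added": sorted(added), "deleted": sorted(deleted),
            "updated": sorted(updated)}
```

Any target flagged in the result would then be restored to its previous version before the unclaimed metadata is regenerated.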

This would preemptively protect all of these roles even though only one of them may have been compromised.

If a threshold number of the targets or claimed keys have been compromised, then there is little that an attacker would be able to do without the timestamp and snapshot keys. In this case, PyPI MUST simply revoke the compromised targets or claimed keys by replacing them with new keys in the root and targets roles, respectively.

If a threshold number of the timestamp, snapshot, and claimed keys have been compromised, then PyPI MUST take the following steps in addition to the steps taken when either the timestamp or snapshot keys are compromised:

  1. Revoke the claimed role keys from the targets role and replace them with newly issued keys.
  2. All project targets of the claimed roles SHOULD be compared with the last known good consistent snapshot where none of the timestamp, snapshot, or claimed keys were known to have been compromised. Added, updated, or deleted targets in the compromised consistent snapshot that do not match the last known good consistent snapshot MAY be restored to their previous versions. After ensuring the integrity of all claimed project targets, the claimed metadata MUST be regenerated.
  3. The claimed metadata MUST have their version numbers incremented, expiry times suitably extended, and signatures renewed.

Following these steps would preemptively protect all of these roles even though only one of them may have been compromised.

If a threshold number of root keys have been compromised, then PyPI MUST take the steps taken when the targets role has been compromised. All of the root keys must also be replaced.

It is also RECOMMENDED that PyPI sufficiently document compromises with security bulletins. These security bulletins will be most informative when users of pip-with-TUF are unable to install or update a project because the keys for the timestamp, snapshot, or root roles are no longer valid. Users could then visit the PyPI web site to consult security bulletins that would help to explain why users are no longer able to install or update, and then take action accordingly. When a threshold number of root keys have not been revoked due to a compromise, then new root metadata may be safely updated because a threshold number of existing root keys will be used to sign for the integrity of the new root metadata. TUF clients will be able to verify the integrity of the new root metadata with a threshold number of previously known root keys. This will be the common case. In the worst case, where a threshold number of root keys have been revoked due to a compromise, an end-user may choose to update new root metadata with out-of-band [27] mechanisms.
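The threshold check in the common case above can be sketched as follows. This is a simplified illustration, not the TUF client implementation: signatures are modeled as a `{key_id: signature}` dict and cryptographic verification is stubbed out with a callable:

```python
# Sketch of accepting new root metadata only if it carries valid signatures
# from at least `threshold` previously trusted root keys.
def enough_trusted_signatures(signatures, trusted_key_ids, threshold,
                              verify=lambda key_id, sig: True):
    """True if at least `threshold` signatures come from trusted keys
    and verify successfully (verification is a stub here)."""
    valid = sum(
        1 for key_id, sig in signatures.items()
        if key_id in trusted_key_ids and verify(key_id, sig)
    )
    return valid >= threshold
```

When this check fails because a threshold of root keys was revoked, the client falls back to the out-of-band update mentioned above.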

Appendix A: PyPI Build Farm and End-to-End Signing

PyPI administrators intend to support a central build farm. The PyPI build farm will auto-generate a Wheel [28] on PyPI infrastructure, on supported platforms, for each distribution that is uploaded by developers. Package managers will likely install projects by downloading these PyPI Wheels (which can be installed much faster than source distributions) rather than the source distributions signed by developers. The implications of having a central build farm with end-to-end signing SHOULD be investigated before the maximum security model is implemented.

An issue with a central build farm and end-to-end signing is that developers are unlikely to sign Wheel distributions once they have been generated on PyPI infrastructure. However, generating wheels from source distributions that are signed by developers can still be beneficial, provided that building Wheels is a deterministic process. If deterministic builds are infeasible, developers may delegate trust of these wheels to a PyPI role that signs for wheels with an online key.
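The determinism requirement above is testable: building the same source distribution twice should yield byte-identical wheels. A minimal sketch of that check, with hypothetical file names, looks like this:

```python
# Sketch of a determinism check: two independent builds of the same source
# distribution should produce byte-for-byte identical wheel files.
import hashlib

def sha256_file(path):
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

def build_is_deterministic(wheel_a, wheel_b):
    """True if two independently built wheels are identical."""
    return sha256_file(wheel_a) == sha256_file(wheel_b)
```

If this property holds, anyone can reproduce the build and confirm that the PyPI-generated wheel corresponds to the developer-signed source distribution.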

References

[1] https://www.python.org/dev/peps/pep-0458/
[2] https://isis.poly.edu/~jcappos/papers/samuel_tuf_ccs_2010.pdf
[3] https://github.com/theupdateframework/tuf/blob/develop/docs/tuf-spec.txt
[4] http://www.python.org/dev/peps/pep-0426/
[5] https://github.com/theupdateframework/pip/wiki/Attacks-on-software-repositories
[6] https://mail.python.org/pipermail/distutils-sig/2013-September/022773.html
[7] https://isis.poly.edu/~jcappos/papers/cappos_mirror_ccs_08.pdf
[8] https://mail.python.org/pipermail/distutils-sig/2013-September/022755.html
[9] https://pypi.python.org/security
[10] https://mail.python.org/pipermail/distutils-sig/2013-August/022154.html
[11] https://en.wikipedia.org/wiki/RSA_%28algorithm%29
[12] http://ed25519.cr.yp.to/
[13] http://www.ietf.org/rfc/rfc2119.txt
[14] http://ed25519.cr.yp.to/
[15] https://github.com/pyca/ed25519
[16] https://github.com/kaepora/miniLock#-minilock
[17] https://github.com/pypa/twine
[18] https://docs.python.org/2/distutils/index.html#distutils-index
[19] https://en.wikipedia.org/wiki/Transparency_%28human%E2%80%93computer_interaction%29
[20] https://pypi.python.org/pypi?:action=register_form
[21] https://github.com/theupdateframework/tuf/blob/develop/tuf/README.md
[22] https://github.com/theupdateframework/tuf/blob/develop/tuf/README-developer-tools.md
[23] https://en.wikipedia.org/wiki/Transaction_log
[24] http://docs.python.org/2/library/hashlib.html#hashlib.hash.hexdigest
[25] https://en.wikipedia.org/wiki/SHA-2
[26] https://mail.python.org/pipermail/distutils-sig/2013-September/022755.html
[27] https://en.wikipedia.org/wiki/Out-of-band#Authentication
[28] http://wheel.readthedocs.org/en/latest/

Acknowledgements

This material is based upon work supported by the National Science Foundation under Grants No. CNS-1345049 and CNS-0959138. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

We thank Nick Coghlan, Daniel Holth and the distutils-sig community in general for helping us to think about how to usably and efficiently integrate TUF with PyPI.

Roger Dingledine, Sebastian Hahn, Nick Mathewson, Martin Peck and Justin Samuel helped us to design TUF from its predecessor Thandy of the Tor project.

We appreciate the efforts of Konstantin Andrianov, Geremy Condra, Zane Fisher, Justin Samuel, Tian Tian, Santiago Torres, John Ward, and Yuyu Zheng to develop TUF.

pep-0481 Migrate CPython to Git, Github, and Phabricator

PEP:481
Title:Migrate CPython to Git, Github, and Phabricator
Version:$Revision$
Last-Modified:$Date$
Author:Donald Stufft <donald at stufft.io>
Status:Draft
Type:Process
Content-Type:text/x-rst
Created:29-Nov-2014
Post-History:29-Nov-2014

Abstract

This PEP proposes migrating the repository hosting of CPython and the supporting repositories to Git and GitHub. It also proposes adding Phabricator as an alternative to GitHub Pull Requests to handle reviewing changes. This particular PEP is offered as an alternative to PEP 474 and PEP 462, which aim to achieve the same overall benefits but restrict themselves to tools that support Mercurial and are completely Open Source.

Rationale

CPython is an open source project which relies on a number of volunteers donating their time. As an open source project it relies on attracting new volunteers as well as retaining existing ones in order to continue to have a healthy amount of manpower available. In addition to increasing the amount of manpower available to the project, it also needs to make effective use of the manpower it has.

The current toolchain of the CPython project is a custom and unique combination of tools which mandates a workflow that is similar to one found in a lot of older projects, but which is becoming less and less popular as time goes on.

The one-off nature of the CPython toolchain and workflow means that any new contributor is going to need to spend time learning the tools and workflow before they can start contributing to CPython. Once a new contributor goes through the process of learning the CPython workflow, they are also unlikely to be able to take that knowledge and apply it to future projects they wish to contribute to. This acts as a barrier to contribution which will scare off potential new contributors.

In addition, the tooling that CPython uses is under-maintained, antiquated, and lacks important features that would enable committers to use their time more effectively when reviewing and approving changes. Because it is under-maintained, bugs are likely to persist for longer, if they ever get fixed, and the tooling is more likely to go down for extended periods of time. Because it is antiquated, it does not effectively harness the capabilities of the modern web platform. Finally, because it lacks important features such as pre-testing of commits and an automatic merge tool, committers have to do needless busy work to commit even the simplest of changes.

Version Control System

The first decision that needs to be made is the VCS of the primary server side repository. Currently the CPython repository, as well as a number of supporting repositories, uses Mercurial. When evaluating the VCS we must consider the capabilities of the VCS itself as well as the network effect and mindshare of the community around that VCS.

There are really only two options for this: Mercurial and Git. Between the two of them the technical capabilities are largely equivalent. For this reason this PEP will largely ignore the technical arguments about the VCS itself and will instead focus on the social aspects.

It is not possible to get exact numbers for the number of projects or people which are using a particular VCS, however we can infer this by looking at several sources of information for what VCS projects are using.

The Open Hub (previously Ohloh) statistics [1] show that 37% of the repositories indexed by The Open Hub are using Git (second only to SVN which has 48%) while Mercurial has just 2% (beating only bazaar which has 1%). This has Git being just over 18 times as popular as Mercurial on The Open Hub.

Another source of information on the popularity of the different VCSs is PyPI itself. This source is more targeted at the Python community itself since it represents projects developed for Python. Unfortunately PyPI does not have a standard location for representing this information, so this requires manual processing. If we limit our search to the top 100 projects on PyPI (ordered by download counts), we can see that 62% of them use Git, 22% use Mercurial, and 13% use something else. This has Git being just under 3 times as popular as Mercurial for the top 100 projects on PyPI.

Obviously from these numbers Git is by far the more popular DVCS for open source projects and choosing the more popular VCS has a number of positive benefits.

For new contributors it increases the likelihood that they will have already learned the basics of Git as part of working with another project, or, if they are just now learning Git, that they will be able to take that knowledge and apply it to other projects. Additionally, a larger community means more people writing how-to guides, answering questions, and writing articles about Git, which makes it easier for a new user to find answers and information about the tool they are trying to learn.

Another benefit is that, by nature of having a larger community, there will be more tooling written around it. This increases options for everything from GUI clients to helper scripts to repository hosting.

Repository Hosting

This PEP proposes allowing GitHub Pull Requests to be submitted, however GitHub does not have a way to submit Pull Requests against a repository that is not hosted on GitHub. This PEP also proposes that in addition to GitHub Pull Requests Phabricator's Differential app can also be used to submit proposed changes and Phabricator does allow submitting changes against a repository that is not hosted on Phabricator.

For this reason this PEP proposes using GitHub as the canonical location of the repository with a read-only mirror located in Phabricator. If at some point in the future GitHub is no longer desired, then repository hosting can easily be moved solely to Phabricator and the ability to accept GitHub Pull Requests dropped.

In addition to hosting the repositories on Github, a read only copy of all repositories will also be mirrored onto the PSF Infrastructure.

Code Review

Currently CPython uses a custom fork of Rietveld, modified to run outside of Google App Engine, which only one person is currently able to maintain. In addition, it is missing features that are present in many modern code review tools.

This PEP proposes allowing both Github Pull Requests and Phabricator changes to propose changes and review code. It suggests both so that contributors can select which tool best enables them to submit changes, and reviewers can focus on reviewing changes in the tooling they like best.

GitHub Pull Requests

GitHub is a very popular code hosting site and is increasingly becoming the primary place people look to contribute to a project. Enabling users to contribute through GitHub means they can contribute using tooling that they are likely already familiar with, and that, if they are not, they will be able to apply to other projects.

GitHub Pull Requests have a fairly major advantage over the older "submit a patch to a bug tracker" model. It allows developers to work completely within their VCS using standard VCS tooling so it does not require creating a patch file and figuring out what the right location is to upload it to. This lowers the barrier for sending a change to be reviewed.

On the reviewing side, GitHub Pull Requests are far easier to review: they have nicely syntax-highlighted diffs which can operate in either unified or side-by-side views; they allow expanding the context on a diff up to and including the entire file; and they allow commenting both inline and on the pull request as a whole, presented in a unified way that also hides comments which no longer apply. GitHub also provides a "rendered diff" view which enables easily viewing a diff of rendered markup (such as reST) instead of needing to review the diff of the raw markup.

The Pull Request workflow also makes it trivial to enable the ability to pre-test a change before actually merging it. Any particular pull request can have any number of different types of "commit statuses" applied to it, marking the commit (and thus the pull request) as either in a pending, successful, errored, or failure state. This makes it easy to see inline if the pull request is passing all of the tests, if the contributor has signed a CLA, etc.
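As a sketch of how a CI system would report such a status, GitHub's REST API exposes a `POST /repos/{owner}/{repo}/statuses/{sha}` endpoint. The repository and SHA below are placeholders, and the request is only constructed here, not sent (actually sending it requires an authenticated request):

```python
# Sketch of building a GitHub "commit status" request; the four states
# match those in the Pull Request workflow: pending, success, error, failure.
import json

def build_status_request(owner, repo, sha, state, context, description):
    assert state in {"pending", "success", "error", "failure"}
    url = f"https://api.github.com/repos/{owner}/{repo}/statuses/{sha}"
    payload = json.dumps({
        "state": state,
        "context": context,
        "description": description,
    })
    return url, payload
```

A build bot would post one `pending` status when testing starts and replace it with `success` or `failure` when testing ends, which is what produces the inline check marks on the pull request.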

Actually merging a GitHub Pull Request is quite simple: a core reviewer simply needs to press the "Merge" button once the status of all the checks on the Pull Request is green for successful.

GitHub also has a good workflow for submitting pull requests to a project completely through their web interface. This would enable the Python documentation to have "Edit on GitHub" buttons on every page and people who discover things like typos, inaccuracies, or just want to make improvements to the docs they are currently writing can simply hit that button and get an in browser editor that will let them make changes and submit a pull request all from the comfort of their browser.

Phabricator

In addition to GitHub Pull Requests this PEP also proposes setting up a Phabricator instance and pointing it at the GitHub hosted repositories. This will allow utilizing the Phabricator review applications of Differential and Audit.

Differential functions similarly to GitHub Pull Requests, except that it requires installing the arc command line tool to upload patches to Phabricator.

Whether to enable Phabricator for any particular repository can be chosen on a case-by-case basis; this PEP only proposes that it must be enabled for the CPython repository. For smaller repositories, such as the PEP repository, it may not be worth the effort.

Criticism

X is not written in Python

One feature that the current tooling (Mercurial, Rietveld) has is that all of the pieces are primarily written in Python. It is this PEP's belief that we should focus on the best tools for the job and not the best tools that happen to be written in Python. Volunteer time is a precious resource to any open source project, and we can best respect and utilize that time by focusing on the benefits and downsides of the tools themselves rather than on what language their authors happened to write them in.

One concern is the ability to modify tools to work for us; however, one of the goals here is to not modify software to work for us and instead to adapt ourselves to a more standard workflow. This standardization pays off in the ability to reuse tools out of the box, freeing up developer time to actually work on Python itself, as well as enabling knowledge sharing between projects.

However, if we do need to modify the tooling, Git itself is largely written in C, the same as CPython. It can also have commands written for it using any language, including Python. Phabricator is written in PHP, which is a fairly common language in the web world and fairly easy to pick up. GitHub itself is largely written in Ruby, but given that it is not Open Source there is no ability to modify it, so its implementation language is completely meaningless.

GitHub is not Free/Open Source

GitHub is a big part of this proposal, and someone who tends more to ideology rather than practicality may be opposed to this PEP on those grounds alone. It is this PEP's belief that, while using entirely Free/Open Source software is an attractive idea and a noble goal, valuing the time of the contributors, by giving them good tooling that is well maintained and that they either already know or can apply to other projects once learned, is a more important concern than making Free/Open Source a hard requirement.

However, history has shown us that sometimes benevolent proprietary companies can stop being benevolent. This is hedged against in a few ways:

  • We are not utilizing the GitHub Issue Tracker, both because it is not powerful enough for CPython but also because for the primary CPython repository the ability to take our issues and put them somewhere else if we ever need to leave GitHub relies on GitHub continuing to allow API access.
  • We are utilizing the GitHub Pull Request workflow, however all of those changes live inside of Git. So a mirror of the GitHub repositories can easily contain all of those Pull Requests. We would potentially lose any comments if GitHub suddenly turned "evil", but the changes themselves would still exist.
  • We are utilizing the GitHub repository hosting feature, however since this is just git moving away from GitHub is as simple as pushing the repository to a different location. Data portability for the repository itself is extremely high.
  • We are also utilizing Phabricator to provide an alternative for people who do not wish to use GitHub. This also acts as a fallback option which will already be in place if we ever need to stop using GitHub.

Relying on GitHub comes with a number of benefits beyond just the benefits of the platform itself. Since it is a commercially backed venture it has a full time staff responsible for maintaining its services. This includes making sure they stay up, making sure they stay patched for various security vulnerabilities, and further improving the software and infrastructure as time goes on.

Mercurial is better than Git

Whether Mercurial or Git is better on a technical level is a highly subjective opinion. This PEP does not state whether the mechanics of Git or Mercurial are better, and instead focuses on the network effect that is available for either option. Since this PEP proposes switching to Git, this leaves out the people who prefer Mercurial; however, those users can easily continue to work with Mercurial by using the hg-git [2] extension, which lets Mercurial work with a repository which is Git on the server side.

CPython Workflow is too Complicated

One sentiment that came out of previous discussions was that the multi-branch model of CPython was too complicated for GitHub Pull Requests. It is the belief of this PEP that this statement is not accurate.

Currently any particular change requires manually creating a patch for 2.7 and 3.x, which won't change at all in this regard.

If someone submits a fix for the current stable branch (currently 3.4), the GitHub Pull Request workflow can be used to create, in the browser, a Pull Request to merge the current stable branch into the master branch (assuming there are no merge conflicts). If there is a merge conflict, it would need to be handled locally. This provides an improvement over the current situation, where the merge must always happen locally.

Finally, if someone submits a fix for the current development branch, it has to be manually applied to the stable branch if it is desired to include it there as well. This must also happen locally in the new workflow; however, for minor changes it could easily be accomplished in the GitHub web editor.

Looking at this, I do not believe that any system can hide the complexities involved in maintaining several long running branches. The only thing that the tooling can do is make it as easy as possible to submit changes.

Example: Scientific Python

One of the key ideas behind the move to both Git and GitHub is that a feature of a DVCS, the repository hosting, and the workflow used is the social network and size of the community using said tools. We can see that this is true by looking at an example from a sub-community of the Python community: the Scientific Python community. They have already migrated most of the key pieces of the SciPy stack onto GitHub using the Pull Request based workflow. This process started with IPython, and as more projects moved over it became a natural default for new projects in the community.

They claim to have seen a great benefit from this move, in that it enables casual contributors to easily move between different projects within their sub-community without having to learn a special, bespoke workflow and a different toolchain for each project. They've found that when people can use their limited time on actually contributing instead of learning the different tools and workflows, not only do they contribute more to one project, but that they also expand out and contribute to other projects. This move has also been attributed to the increased tendency for members of that community to go so far as publishing their research and educational materials on Github as well.

This example showcases the real power behind moving to a highly popular toolchain and workflow, as each variance introduces yet another hurdle for new and casual contributors to get past and it makes the time spent learning that workflow less reusable with other projects.

References

[1] Open Hub Statistics <https://www.openhub.net/repositories/compare>
[2] Hg-Git mercurial plugin <https://hg-git.github.io/>

pep-0482 Literature Overview for Type Hints

PEP:482
Title:Literature Overview for Type Hints
Version:$Revision$
Last-Modified:$Date$
Author:Łukasz Langa <lukasz at langa.pl>
Discussions-To:Python-Ideas <python-ideas at python.org>
Status:Draft
Type:Informational
Content-Type:text/x-rst
Created:08-Jan-2015
Post-History:
Resolution:

Abstract

This PEP is one of three related to type hinting. This PEP gives a literature overview of related work. The main spec is PEP 484.

Existing Approaches for Python

mypy

(This section is a stub, since mypy [mypy] is essentially what we're proposing.)

Reticulated Python

Reticulated Python [reticulated] by Michael Vitousek is an example of a slightly different approach to gradual typing for Python. It is described in an actual academic paper [reticulated-paper] written by Vitousek with Jeremy Siek and Jim Baker (the latter of Jython fame).

PyCharm

PyCharm by JetBrains has been providing a way to specify and check types for about four years. The type system suggested by PyCharm [pycharm] grew from simple class types to tuple types, generic types, function types, etc. based on feedback of many users who shared their experience of using type hints in their code.

Others

TBD: Add sections on pyflakes [pyflakes], pylint [pylint], numpy [numpy], Argument Clinic [argumentclinic], pytypedecl [pytypedecl], numba [numba], obiwan [obiwan].

Existing Approaches in Other Languages

ActionScript

ActionScript [actionscript] is a class-based, single inheritance, object-oriented superset of ECMAScript. It supports interfaces and strong runtime-checked static typing. Compilation supports a “strict dialect” where type mismatches are reported at compile-time.

Example code with types:

package {
  import flash.events.Event;

  public class BounceEvent extends Event {
    public static const BOUNCE:String = "bounce";
    private var _side:String = "none";

    public function get side():String {
      return _side;
    }

    public function BounceEvent(type:String, side:String){
      super(type, true);
      _side = side;
    }

    public override function clone():Event {
      return new BounceEvent(type, _side);
    }
  }
}

Dart

Dart [dart] is a class-based, single inheritance, object-oriented language with C-style syntax. It supports interfaces, abstract classes, reified generics, and optional typing.

Types are inferred when possible. The runtime differentiates between two modes of execution: checked mode, aimed at development (catching type errors at runtime), and production mode, recommended for fast execution (ignoring types and asserts).

Example code with types:

class Point {
    final num x, y;

    Point(this.x, this.y);

    num distanceTo(Point other) {
        var dx = x - other.x;
        var dy = y - other.y;
        return math.sqrt(dx * dx + dy * dy);
    }
}

Hack

Hack [hack] is a programming language that interoperates seamlessly with PHP. It provides opt-in static type checking, type aliasing, generics, nullable types, and lambdas.

Example code with types:

<?hh
class MyClass {
  private ?string $x = null;

  public function alpha(): int {
    return 1;
  }

  public function beta(): string {
    return 'hi test';
  }
}

function f(MyClass $my_inst): string {
  // Will generate a hh_client error
  return $my_inst->alpha();
}

TypeScript

TypeScript [typescript] is a typed superset of JavaScript that adds interfaces, classes, mixins and modules to the language.

  • Type checks are duck typed.
  • Multiple valid function signatures are specified by supplying overloaded function declarations.
  • Functions and classes can use generics as type parametrization.
  • Interfaces can have optional fields.
  • Interfaces can specify array and dictionary types.
  • Classes can have constructors that implicitly add arguments as fields.
  • Classes can have static fields.
  • Classes can have private fields.
  • Classes can have getters/setters for fields (like property).
  • Types are inferred.

Example code with types:

interface Drivable {
    start(): void;
    drive(distance: number): boolean;
    getPosition(): number;
}

class Car implements Drivable {
    private _isRunning: boolean;
    private _distanceFromStart: number;

    constructor() {
        this._isRunning = false;
        this._distanceFromStart = 0;
    }

    public start() {
        this._isRunning = true;
    }

    public drive(distance: number): boolean {
        if (this._isRunning) {
            this._distanceFromStart += distance;
            return true;
        }
        return false;
    }

    public getPosition(): number {
        return this._distanceFromStart;
    }
}

pep-0483 The Theory of Type Hints

PEP:483
Title:The Theory of Type Hints
Version:$Revision$
Last-Modified:$Date$
Author:Guido van Rossum <guido at python.org>
Discussions-To:Python-Ideas <python-ideas at python.org>
Status:Draft
Type:Informational
Content-Type:text/x-rst
Created:19-Dec-2014
Post-History:
Resolution:

Abstract

This PEP lays out the theory referenced by PEP 484.

Introduction

This document lays out the theory of the new type hinting proposal for Python 3.5. It's not quite a full proposal or specification because there are many details that need to be worked out, but it lays out the theory without which it is hard to discuss more detailed specifications. We start by explaining gradual typing; then we state some conventions and general rules; then we define the new special types (such as Union) that can be used in annotations; and finally we define the approach to generic types. (TODO: The latter section needs more fleshing out; sorry!)

Specification

Summary of gradual typing

We define a new relationship, is-consistent-with, which is similar to is-subclass-of, except it is not transitive when the new type Any is involved. (Neither relationship is symmetric.) Assigning x to y is OK if the type of x is consistent with the type of y. (Compare this to "... if the type of x is a subclass of the type of y," which states one of the fundamentals of OO programming.) The is-consistent-with relationship is defined by three rules:

  • A type t1 is consistent with a type t2 if t1 is a subclass of t2. (But not the other way around.)
  • Any is consistent with every type. (But Any is not a subclass of every type.)
  • Every type is a subclass of Any. (Which also makes every type consistent with Any, via rule 1.)

That's all! See Jeremy Siek's blog post What is Gradual Typing for a longer explanation and motivation. Note that rule 3 places Any at the root of the class graph. This makes it very similar to object. The difference is that object is not consistent with most types (e.g. you can't use an object() instance where an int is expected). IOW both Any and object mean "any type is allowed" when used to annotate an argument, but only Any can be passed no matter what type is expected (in essence, Any shuts up complaints from the static checker).
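A short illustrative sketch (not from the PEP) of the practical difference: both object and Any accept every argument, but only an Any value may in turn be passed where a specific type is expected.

```python
from typing import Any

def expects_int(x: int) -> int:
    return x + 1

a = 42  # type: Any
o = 42  # type: object

expects_int(a)   # OK for a static checker: Any is consistent with int
expects_int(o)   # flagged by a checker: object is not consistent with int

# Both calls run fine at runtime, since no checking happens there.
print(expects_int(41))  # prints 42
```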

Here's an example showing how these rules work out in practice:

Say we have an Employee class, and a subclass Manager:

  • class Employee: ...
  • class Manager(Employee): ...

Let's say variable e is declared with type Employee:

  • e = Employee() # type: Employee

Now it's okay to assign a Manager instance to e (rule 1):

  • e = Manager()

It's not okay to assign an Employee instance to a variable declared with type Manager:

  • m = Manager() # type: Manager
  • m = Employee() # Fails static check

However, suppose we have a variable whose type is Any:

  • a = some_func() # type: Any

Now it's okay to assign a to e (rule 2):

  • e = a # OK

Of course it's also okay to assign e to a (rule 3), but we didn't need the concept of consistency for that:

  • a = e # OK

Notational conventions

  • t1, t2 etc. and u1, u2 etc. are types or classes. Sometimes we write ti or tj to refer to "any of t1, t2, etc."
  • X, Y etc. are type variables (defined with TypeVar(), see below).
  • C, D etc. are classes defined with a class statement.
  • x, y etc. are objects or instances.
  • We use the terms type and class interchangeably. Note that PEP 484 makes a distinction (a type is a concept for the type checker, while a class is a runtime concept). In this PEP we're only interested in the types anyway, and if this bothers you, you can reinterpret this PEP with every occurrence of "class" replaced by "type".

General rules

  • Instance-ness is derived from class-ness, e.g. x is an instance of t1 if the type of x is a subclass of t1.
  • No types defined below (i.e. Any, Union etc.) can be instantiated. (But non-abstract subclasses of Generic can be.)
  • No types defined below can be subclassed, except for Generic and classes derived from it.
  • Where a type is expected, None can be substituted for type(None); e.g. Union[t1, None] == Union[t1, type(None)].

Types

  • Any. Every class is a subclass of Any; however, to the static type checker it is also consistent with every class (see above).
  • Union[t1, t2, ...]. Classes that are subclasses of at least one of t1 etc. are subclasses of this. So are unions whose components are all subclasses of t1 etc. (Example: Union[int, str] is a subclass of Union[int, float, str].) The order of the arguments doesn't matter. (Example: Union[int, str] == Union[str, int].) If ti is itself a Union the result is flattened. (Example: Union[int, Union[float, str]] == Union[int, float, str].) If ti and tj have a subclass relationship, the less specific type survives. (Example: Union[Employee, Manager] == Employee.) Union[t1] returns just t1. Union[] is illegal, so is Union[()]. Corollary: Union[..., Any, ...] returns Any; Union[..., object, ...] returns object; to cut a tie, Union[Any, object] == Union[object, Any] == Any.
  • Optional[t1]. Alias for Union[t1, None], i.e. Union[t1, type(None)].
  • Tuple[t1, t2, ..., tn]. A tuple whose items are instances of t1 etc.. Example: Tuple[int, float] means a tuple of two items, the first is an int, the second a float; e.g., (42, 3.14). Tuple[u1, u2, ..., um] is a subclass of Tuple[t1, t2, ..., tn] if they have the same length (n==m) and each ui is a subclass of ti. To spell the type of the empty tuple, use Tuple[()]. A variadic homogeneous tuple type can be written Tuple[t1, ...]. (That's three dots, a literal ellipsis; and yes, that's a valid token in Python's syntax.)
  • Callable[[t1, t2, ..., tn], tr]. A function with positional argument types t1 etc., and return type tr. The argument list may be empty (n==0). There is no way to indicate optional or keyword arguments, nor varargs, but you can say the argument list is entirely unchecked by writing Callable[..., tr] (again, a literal ellipsis). This is covariant in the return type, but contravariant in the arguments. "Covariant" here means that for two callable types that differ only in the return type, the subclass relationship for the callable types follows that of the return types. (Example: Callable[[], Manager] is a subclass of Callable[[], Employee].) "Contravariant" here means that for two callable types that differ only in the type of one argument, the subclass relationship for the callable types goes in the opposite direction as for the argument types. (Example: Callable[[Employee], None] is a subclass of Callable[[Manager], None]. Yes, you read that right.)
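A runnable sketch of the two variance rules above, using hypothetical Employee and Manager classes; the # type: comments record what a static checker would accept.

```python
from typing import Callable, List

class Employee: ...
class Manager(Employee): ...

def make_manager() -> Manager:
    return Manager()

greeted = []  # type: List[Employee]

def greet_employee(e: Employee) -> None:
    greeted.append(e)

# Covariant return types: a Callable[[], Manager] can stand in where a
# Callable[[], Employee] is expected -- its result is still an Employee.
employee_factory = make_manager    # type: Callable[[], Employee]

# Contravariant argument types: a handler accepting any Employee can stand
# in where a Callable[[Manager], None] is expected -- every Manager it is
# given is also an Employee.
manager_handler = greet_employee   # type: Callable[[Manager], None]

manager_handler(employee_factory())
```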

We might add:

  • Intersection[t1, t2, ...]. Classes that are subclasses of each of t1, etc. are subclasses of this. (Compare to Union, which has at least one instead of each in its definition.) The order of the arguments doesn't matter. Nested intersections are flattened, e.g. Intersection[int, Intersection[float, str]] == Intersection[int, float, str]. An intersection of fewer types is a subclass of an intersection of more types, e.g. Intersection[int, str] is a subclass of Intersection[int, float, str]. An intersection of one argument is just that argument, e.g. Intersection[int] is int. When arguments have a subclass relationship, the more specific class survives, e.g. Intersection[str, Employee, Manager] is Intersection[str, Manager]. Intersection[] is illegal, so is Intersection[()]. Corollary: Any disappears from the argument list, e.g. Intersection[int, str, Any] == Intersection[int, str]. Intersection[Any, object] is object. The interaction between Intersection and Union is complex but should be no surprise if you understand the interaction between intersections and unions in set theory (note that sets of types can be infinite in size, since there is no limit on the number of new subclasses).

Pragmatics

Some things are irrelevant to the theory but make practical use more convenient. (This is not a full list; I probably missed a few and some are still controversial or not fully specified.)

  • Type aliases, e.g.
    • Point = Tuple[float, float]
    • def distance(p: Point) -> float: ...
  • Forward references via strings, e.g.
    • class C:
      • def compare(self, other: 'C') -> int: ...
  • If a default of None is specified, the type is implicitly Optional, e.g.
    • def get(key: KT, default: VT = None) -> VT: ...
  • Don't use dynamic type expressions; use builtins and imported types only. No 'if'.
    • def display(message: str if WINDOWS else bytes): # NOT OK
  • Type declaration in comments, e.g.
    • x = [] # type: Sequence[int]
  • Type declarations using Undefined, e.g.
    • x = Undefined(str)
  • Casts using cast(T, x), e.g.
    • x = cast(Any, frobozz())
  • Other things, e.g. overloading and stub modules; best left to an actual PEP.
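Several of these pragmatics can be seen together in one small, illustrative snippet (the names Point, Tree and distance are hypothetical):

```python
from typing import Tuple, Sequence

Point = Tuple[float, float]        # a type alias

class Tree:
    # a forward reference to the enclosing class, spelled as a string
    def add_child(self, child: 'Tree') -> None: ...

def distance(p: Point, q: Point) -> float:
    return ((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2) ** 0.5

xs = []  # type: Sequence[int]  -- a type declaration in a comment

print(distance((0.0, 0.0), (3.0, 4.0)))  # prints 5.0
```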

Generic types

(TODO: Explain more. See also the mypy docs on generics.)

  • X = TypeVar('X'). Declares a unique type variable. The name must match the variable name.
  • Y = TypeVar('Y', t1, t2, ...). Ditto, constrained to t1 etc. Behaves like Union[t1, t2, ...] for most purposes, but when used as a type variable, subclasses of t1 etc. are replaced by the most-derived base class among t1 etc.
  • Example of constrained type variables:
    • AnyStr = TypeVar('AnyStr', str, bytes)
    • def longest(a: AnyStr, b: AnyStr) -> AnyStr:
      • return a if len(a) >= len(b) else b
    • x = longest('a', 'abc') # The inferred type for x is str
    • y = longest('a', b'abc') # Fails static type check
    • In this example, both arguments to longest() must have the same type (str or bytes), and moreover, even if the arguments are instances of a common str subclass, the return type is still str, not that subclass (see next example).
  • For comparison, if the type variable was unconstrained, the common subclass would be chosen as the return type, e.g.:
    • S = TypeVar('S')
    • def longest(a: S, b: S) -> S:
      • return a if len(a) >= len(b) else b
    • class MyStr(str): ...
    • x = longest(MyStr('a'), MyStr('abc'))
    • The inferred type of x is MyStr (whereas in the AnyStr example it would be str).
  • Also for comparison, if a Union is used, the return type also has to be a Union:
    • U = Union[str, bytes]
    • def longest(a: U, b: U) -> U:
      • return a if len(a) >= len(b) else b
    • x = longest('a', 'abc')
    • The inferred type of x is still Union[str, bytes], even though both arguments are str.
  • class C(Generic[X, Y, ...]): ... Define a generic class C over type variables X etc. C itself becomes parameterizable, e.g. C[int, str, ...] is a specific class with substitutions X->int etc.
  • TODO: Explain use of generic types in function signatures. E.g. Sequence[X], Sequence[int], Sequence[Tuple[X, Y, Z]], and mixtures. Think about co*variance. No gimmicks like deriving from Sequence[Union[int, str]] or Sequence[Union[int, X]].

Predefined generic types and Protocols in typing.py

(See also the typing.py module.)

  • Everything from collections.abc (but Set renamed to AbstractSet).
  • Dict, List, Set, FrozenSet, a few more.
  • re.Pattern[AnyStr], re.Match[AnyStr].
  • io.IO[AnyStr], io.TextIO ~ io.IO[str], io.BinaryIO ~ io.IO[bytes].

pep-0484 Type Hints

PEP:484
Title:Type Hints
Version:$Revision$
Last-Modified:$Date$
Author:Guido van Rossum <guido at python.org>, Jukka Lehtosalo <jukka.lehtosalo at iki.fi>, Łukasz Langa <lukasz at langa.pl>
BDFL-Delegate:Mark Shannon
Discussions-To:Python-Dev <python-dev at python.org>
Status:Accepted
Type:Standards Track
Content-Type:text/x-rst
Created:29-Sep-2014
Post-History:16-Jan-2015,20-Mar-2015,17-Apr-2015,20-May-2015,22-May-2015
Resolution:https://mail.python.org/pipermail/python-dev/2015-May/140104.html

Abstract

PEP 3107 introduced syntax for function annotations, but the semantics were deliberately left undefined. There has now been enough 3rd party usage for static type analysis that the community would benefit from a standard vocabulary and baseline tools within the standard library.

This PEP introduces a provisional module to provide these standard definitions and tools, along with some conventions for situations where annotations are not available.

Note that this PEP still explicitly does NOT prevent other uses of annotations, nor does it require (or forbid) any particular processing of annotations, even when they conform to this specification. It simply enables better coordination, as PEP 333 did for web frameworks.

For example, here is a simple function whose argument and return type are declared in the annotations:

def greeting(name: str) -> str:
    return 'Hello ' + name

While these annotations are available at runtime through the usual __annotations__ attribute, no type checking happens at runtime. Instead, the proposal assumes the existence of a separate off-line type checker which users can run over their source code voluntarily. Essentially, such a type checker acts as a very powerful linter. (While it would of course be possible for individual users to employ a similar checker at run time for Design By Contract enforcement or JIT optimization, those tools are not yet as mature.)
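For instance, the annotations on greeting above are ordinary runtime data, and calling the function performs no checks:

```python
def greeting(name: str) -> str:
    return 'Hello ' + name

# The annotations are stored on the function object as a plain dict.
print(greeting.__annotations__)   # {'name': <class 'str'>, 'return': <class 'str'>}

# No type checking happens at call time; the call simply executes.
print(greeting('world'))          # prints 'Hello world'
```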

The proposal is strongly inspired by mypy [mypy]. For example, the type "sequence of integers" can be written as Sequence[int]. The square brackets mean that no new syntax needs to be added to the language. The example here uses a custom type Sequence, imported from a pure-Python module typing. The Sequence[int] notation works at runtime by implementing __getitem__() in the metaclass (but its significance is primarily to an offline type checker).

The type system supports unions, generic types, and a special type named Any which is consistent with (i.e. assignable to and from) all types. This latter feature is taken from the idea of gradual typing. Gradual typing and the full type system are explained in PEP 483.

Other approaches from which we have borrowed or to which ours can be compared and contrasted are described in PEP 482.

Rationale and Goals

PEP 3107 added support for arbitrary annotations on parts of a function definition. Although no meaning was assigned to annotations then, there has always been an implicit goal to use them for type hinting [gvr-artima], which is listed as the first possible use case in said PEP.

This PEP aims to provide a standard syntax for type annotations, opening up Python code to easier static analysis and refactoring, potential runtime type checking, and (perhaps, in some contexts) code generation utilizing type information.

Of these goals, static analysis is the most important. This includes support for off-line type checkers such as mypy, as well as providing a standard notation that can be used by IDEs for code completion and refactoring.

Non-goals

While the proposed typing module will contain some building blocks for runtime type checking -- in particular the get_type_hints() function -- third party packages would have to be developed to implement specific runtime type checking functionality, for example using decorators or metaclasses. Using type hints for performance optimizations is left as an exercise for the reader.

It should also be emphasized that Python will remain a dynamically typed language, and the authors have no desire to ever make type hints mandatory, even by convention.

The meaning of annotations

Any function without annotations should be treated as having the most general type possible, or ignored, by any type checker. Functions with the @no_type_check decorator or with a # type: ignore comment should be treated as having no annotations.

It is recommended but not required that checked functions have annotations for all arguments and the return type. For a checked function, the default annotation for arguments and for the return type is Any. An exception is that the first argument of instance and class methods does not need to be annotated; it is assumed to have the type of the containing class for instance methods, and a type object type corresponding to the containing class object for class methods. For example, in class A the first argument of an instance method has the implicit type A. In a class method, the precise type of the first argument cannot be represented using the available type notation.

(Note that the return type of __init__ ought to be annotated with -> None. The reason for this is subtle. If __init__ assumed a return annotation of -> None, would that mean that an argument-less, un-annotated __init__ method should still be type-checked? Rather than leaving this ambiguous or introducing an exception to the exception, we simply say that __init__ ought to have a return annotation; the default behavior is thus the same as for other methods.)

A type checker is expected to check the body of a checked function for consistency with the given annotations. The annotations may also be used to check correctness of calls appearing in other checked functions.

Type checkers are expected to attempt to infer as much information as necessary. The minimum requirement is to handle the builtin decorators @property, @staticmethod and @classmethod.
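A minimal sketch of what that minimum requirement implies, using a hypothetical Celsius class: the checker must understand that degrees reads as a float, that freezing is callable on the class itself, and that cls in from_fahrenheit stands for the containing class.

```python
class Celsius:
    def __init__(self, degrees: float) -> None:
        self._degrees = degrees

    @property
    def degrees(self) -> float:        # reads as float, not as a method
        return self._degrees

    @staticmethod
    def freezing() -> 'Celsius':       # no implicit first argument
        return Celsius(0.0)

    @classmethod
    def from_fahrenheit(cls, f: float) -> 'Celsius':  # cls is the class
        return cls((f - 32) * 5 / 9)

c = Celsius.from_fahrenheit(212.0)
print(c.degrees)   # prints 100.0
```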

Type Definition Syntax

The syntax leverages PEP 3107-style annotations with a number of extensions described in sections below. In its basic form, type hinting is used by filling function annotation slots with classes:

def greeting(name: str) -> str:
    return 'Hello ' + name

This states that the expected type of the name argument is str. Similarly, the expected return type is str.

Expressions whose type is a subtype of a specific argument type are also accepted for that argument.

Acceptable type hints

Type hints may be built-in classes (including those defined in the standard library or in third-party extension modules), abstract base classes, types available in the types module, and user-defined classes (including those defined in the standard library or in third-party modules).

While annotations are normally the best format for type hints, there are times when it is more appropriate to represent them by a special comment, or in a separately distributed stub file. (See below for examples.)

Annotations must be valid expressions that evaluate without raising exceptions at the time the function is defined (but see below for forward references).

Annotations should be kept simple or static analysis tools may not be able to interpret the values. For example, dynamically computed types are unlikely to be understood. (This is an intentionally somewhat vague requirement; specific inclusions and exclusions may be added to future versions of this PEP as warranted by the discussion.)

In addition to the above, the following special constructs defined below may be used: None, Any, Union, Tuple, Callable, all ABCs and stand-ins for concrete classes exported from typing (e.g. Sequence and Dict), type variables, and type aliases.

All newly introduced names used to support features described in following sections (such as Any and Union) are available in the typing module.

Using None

When used in a type hint, the expression None is considered equivalent to type(None).

Type aliases

Type aliases are defined by simple variable assignments:

Url = str

def retry(url: Url, retry_count: int) -> None: ...

Note that we recommend capitalizing alias names, since they represent user-defined types, which (like user-defined classes) are typically spelled that way.

Type aliases may be as complex as type hints in annotations -- anything that is acceptable as a type hint is acceptable in a type alias:

from typing import TypeVar, Iterable, Tuple

T = TypeVar('T', int, float, complex)
Vector = Iterable[Tuple[T, T]]

def inproduct(v: Vector) -> T:
    return sum(x*y for x, y in v)

This is equivalent to:

from typing import TypeVar, Iterable, Tuple

T = TypeVar('T', int, float, complex)

def inproduct(v: Iterable[Tuple[T, T]]) -> T:
    return sum(x*y for x, y in v)

Callable

Frameworks expecting callback functions of specific signatures might be type hinted using Callable[[Arg1Type, Arg2Type], ReturnType]. Examples:

from typing import Callable

def feeder(get_next_item: Callable[[], str]) -> None:
    ...  # Body

def async_query(on_success: Callable[[int], None],
                on_error: Callable[[int, Exception], None]) -> None:
    ...  # Body

It is possible to declare the return type of a callable without specifying the call signature by substituting a literal ellipsis (three dots) for the list of arguments:

def partial(func: Callable[..., str], *args) -> Callable[..., str]:
    ...  # Body

Note that there are no square brackets around the ellipsis. The arguments of the callback are completely unconstrained in this case (and keyword arguments are acceptable).

Since using callbacks with keyword arguments is not perceived as a common use case, there is currently no support for specifying keyword arguments with Callable. Similarly, there is no support for specifying callback signatures with a variable number of arguments of a specific type.

Because typing.Callable does double-duty as a replacement for collections.abc.Callable, isinstance(x, typing.Callable) is implemented by deferring to isinstance(x, collections.abc.Callable). However, isinstance(x, typing.Callable[...]) is not supported.
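This runtime behavior can be demonstrated directly:

```python
import typing
import collections.abc

# The bare form defers to collections.abc.Callable:
assert isinstance(len, typing.Callable)
assert isinstance(len, collections.abc.Callable)
assert not isinstance(42, typing.Callable)

# The parametrized form is rejected for instance checks:
try:
    isinstance(len, typing.Callable[..., int])
except TypeError:
    print('parametrized Callable cannot be used with isinstance')
```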

Generics

Since type information about objects kept in containers cannot be statically inferred in a generic way, abstract base classes have been extended to support subscription to denote expected types for container elements. Example:

from typing import Mapping, Set

def notify_by_email(employees: Set[Employee], overrides: Mapping[str, str]) -> None: ...

Generics can be parametrized by using a new factory available in typing called TypeVar. Example:

from typing import Sequence, TypeVar

T = TypeVar('T')      # Declare type variable

def first(l: Sequence[T]) -> T:   # Generic function
    return l[0]

In this case the contract is that the returned value is consistent with the elements held by the collection.

A TypeVar() expression must always directly be assigned to a variable (it should not be used as part of a larger expression). The argument to TypeVar() must be a string equal to the variable name to which it is assigned. Type variables must not be redefined.

TypeVar supports constraining parametric types to a fixed set of possible types. For example, we can define a type variable that ranges over just str and bytes. By default, a type variable ranges over all possible types. Example of constraining a type variable:

from typing import TypeVar

AnyStr = TypeVar('AnyStr', str, bytes)

def concat(x: AnyStr, y: AnyStr) -> AnyStr:
    return x + y

The function concat can be called with either two str arguments or two bytes arguments, but not with a mix of str and bytes arguments.

There should be at least two constraints, if any; specifying a single constraint is disallowed.
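At runtime, of course, concat simply executes; only the mixed call is ruled out (by the checker statically, and incidentally by str/bytes semantics as well):

```python
from typing import TypeVar

AnyStr = TypeVar('AnyStr', str, bytes)

def concat(x: AnyStr, y: AnyStr) -> AnyStr:
    return x + y

print(concat('spam', 'eggs'))     # prints spameggs
print(concat(b'spam', b'eggs'))   # prints b'spameggs'
# concat('spam', b'eggs') would be flagged by a static checker
# (and happens to fail at runtime too: str + bytes is a TypeError).
```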

Subtypes of types constrained by a type variable should be treated as their respective explicitly listed base types in the context of the type variable. Consider this example:

class MyStr(str): ...

x = concat(MyStr('apple'), MyStr('pie'))

The call is valid but the type variable AnyStr will be set to str and not MyStr. In effect, the inferred type of the return value assigned to x will also be str.

Additionally, Any is a valid value for every type variable. Consider the following:

from typing import Any, List

def count_truthy(elements: List[Any]) -> int:
    return sum(1 for elem in elements if elem)

This is equivalent to omitting the generic notation and just saying elements: List.

User-defined generic types

You can include a Generic base class to define a user-defined class as generic. Example:

from typing import TypeVar, Generic

T = TypeVar('T')

class LoggedVar(Generic[T]):
    def __init__(self, value: T, name: str, logger: Logger) -> None:
        self.name = name
        self.logger = logger
        self.value = value

    def set(self, new: T) -> None:
        self.log('Set ' + repr(self.value))
        self.value = new

    def get(self) -> T:
        self.log('Get ' + repr(self.value))
        return self.value

    def log(self, message: str) -> None:
        self.logger.info('{}: {}'.format(self.name, message))

Generic[T] as a base class defines that the class LoggedVar takes a single type parameter T. This also makes T valid as a type within the class body.

The Generic base class uses a metaclass that defines __getitem__ so that LoggedVar[t] is valid as a type:

from typing import Iterable

def zero_all_vars(vars: Iterable[LoggedVar[int]]) -> None:
    for var in vars:
        var.set(0)

A generic type can have any number of type variables, and type variables may be constrained. This is valid:

from typing import TypeVar, Generic
...

T = TypeVar('T')
S = TypeVar('S')

class Pair(Generic[T, S]):
    ...

Each type variable argument to Generic must be distinct. This is thus invalid:

from typing import TypeVar, Generic
...

T = TypeVar('T')

class Pair(Generic[T, T]):   # INVALID
    ...

You can use multiple inheritance with Generic:

from typing import TypeVar, Generic, Sized

T = TypeVar('T')

class LinkedList(Sized, Generic[T]):
    ...

Subclassing a generic class without specifying type parameters assumes Any for each position. In the following example, MyIterable is not generic but implicitly inherits from Iterable[Any]:

from typing import Iterable

class MyIterable(Iterable): # Same as Iterable[Any]
    ...

Generic metaclasses are not supported.

Instantiating generic classes and type erasure

Generic types like List or Sequence cannot be instantiated. However, user-defined classes derived from them can be instantiated. Suppose we write a Node class inheriting from Generic[T]:

from typing import TypeVar, Generic

T = TypeVar('T')

class Node(Generic[T]):
    ...

Now there are two ways we can instantiate this class; the type inferred by a type checker may be different depending on the form we use. The first way is to give the value of the type parameter explicitly -- this overrides whatever type inference the type checker would otherwise perform:

x = Node[T]() # The type inferred for x is Node[T].

y = Node[int]() # The type inferred for y is Node[int].

If no explicit types are given, the type checker is given some freedom. Consider this code:

x = Node()

The inferred type could be Node[Any], as there isn't enough context to infer a more precise type. Alternatively, a type checker may reject the line and require an explicit annotation, like this:

x = Node() # type: Node[int] # Inferred type is Node[int].

A type checker with more powerful type inference could look at how x is used elsewhere in the file and try to infer a more precise type such as Node[int] even without an explicit type annotation. However, it is probably impossible to make such type inference work well in all cases, since Python programs can be very dynamic.

This PEP doesn't specify the details of how type inference should work. We allow different tools to experiment with various approaches. We may give more explicit rules in future revisions.

At runtime the type is not preserved, and the class of x is just Node in all cases. This behavior is called "type erasure"; it is common practice in languages with generics (e.g. Java, TypeScript).
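Type erasure is easy to observe with the Node class above (repeated here so the snippet is self-contained):

```python
from typing import TypeVar, Generic

T = TypeVar('T')

class Node(Generic[T]):
    ...

x = Node[int]()
y = Node()

# The type parameter is erased at runtime: both objects are plain Nodes.
assert type(x) is Node
assert type(y) is Node
```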

Arbitrary generic types as base classes

Generic[T] is only valid as a base class -- it's not a proper type. However, user-defined generic types such as LinkedList[T] from the above example and built-in generic types and ABCs such as List[T] and Iterable[T] are valid both as types and as base classes. For example, we can define a subclass of Dict that specializes type arguments:

from typing import Dict, List, Optional

class Node:
    ...

class SymbolTable(Dict[str, List[Node]]):
    def push(self, name: str, node: Node) -> None:
        self.setdefault(name, []).append(node)

    def pop(self, name: str) -> Node:
        return self[name].pop()

    def lookup(self, name: str) -> Optional[Node]:
        nodes = self.get(name)
        if nodes:
            return nodes[-1]
        return None

SymbolTable is a subclass of dict and a subtype of Dict[str, List[Node]].
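Since the specialization happens at the level of types, SymbolTable behaves as a plain dict at runtime; the class is repeated below (with a minimal, hypothetical Node carrying a name) so the snippet is self-contained:

```python
from typing import Dict, List, Optional

class Node:
    def __init__(self, name: str) -> None:
        self.name = name

class SymbolTable(Dict[str, List[Node]]):
    def push(self, name: str, node: Node) -> None:
        self.setdefault(name, []).append(node)

    def pop(self, name: str) -> Node:
        return self[name].pop()

    def lookup(self, name: str) -> Optional[Node]:
        nodes = self.get(name)
        if nodes:
            return nodes[-1]
        return None

table = SymbolTable()
table.push('x', Node('outer'))
table.push('x', Node('inner'))
assert table.lookup('x').name == 'inner'   # innermost binding wins
assert isinstance(table, dict)             # SymbolTable is a real dict
table.pop('x')
assert table.lookup('x').name == 'outer'
```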

If a generic base class has a type variable as a type argument, this makes the defined class generic. For example, we can define a generic LinkedList class that is iterable and a container:

from typing import TypeVar, Iterable, Container

T = TypeVar('T')

class LinkedList(Iterable[T], Container[T]):
    ...

Now LinkedList[int] is a valid type. Note that we can use T multiple times in the base class list, as long as we don't use the same type variable T multiple times within Generic[...].

Also consider the following example:

from typing import TypeVar, Mapping

T = TypeVar('T')

class MyDict(Mapping[str, T]):
    ...

In this case MyDict has a single parameter, T.

Abstract generic types

The metaclass used by Generic is a subclass of abc.ABCMeta. A generic class can be an ABC by including abstract methods or properties, and generic classes can also have ABCs as base classes without a metaclass conflict.

Type variables with an upper bound

A type variable may specify an upper bound using bound=<type>. This means that an actual type substituted (explicitly or implicitly) for the type variable must be a subclass of the boundary type. A common example is the definition of a Comparable type that works well enough to catch the most common errors:

from abc import ABCMeta, abstractmethod
from typing import Any, TypeVar

class Comparable(metaclass=ABCMeta):
    @abstractmethod
    def __lt__(self, other: Any) -> bool: ...
    ... # __gt__ etc. as well

CT = TypeVar('CT', bound=Comparable)

def min(x: CT, y: CT) -> CT:
    if x < y:
        return x
    else:
        return y

min(1, 2) # ok, return type int
min('x', 'y') # ok, return type str

(Note that this is not ideal -- for example min('x', 1) is invalid at runtime but a type checker would simply infer the return type Comparable. Unfortunately, addressing this would require introducing a much more powerful and also much more complicated concept, F-bounded polymorphism. We may revisit this in the future.)

An upper bound cannot be combined with type constraints (as used in AnyStr in the example earlier); type constraints cause the inferred type to be _exactly_ one of the constraint types, while an upper bound just requires that the actual type is a subclass of the boundary type.

Covariance and contravariance

Consider a class Employee with a subclass Manager. Now suppose we have a function with an argument annotated with List[Employee]. Should we be allowed to call this function with a variable of type List[Manager] as its argument? Many people would answer "yes, of course" without even considering the consequences. But unless we know more about the function, a type checker should reject such a call: the function might append an Employee instance to the list, which would violate the variable's type in the caller.
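The hazard can be demonstrated at runtime; if a checker allowed the call below, the caller's List[Manager] would silently acquire a plain Employee (names are illustrative):

```python
from typing import List

class Employee: ...
class Manager(Employee): ...

def add_new_hire(team: List[Employee]) -> None:
    # Perfectly type-correct for a List[Employee]...
    team.append(Employee())

managers = [Manager()]  # type: List[Manager]

# A checker must reject this call: after it runs, the list annotated
# as List[Manager] contains an element that is not a Manager.
add_new_hire(managers)

assert not isinstance(managers[-1], Manager)  # the list is now "polluted"
```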

It turns out such an argument acts _contravariantly_, whereas the intuitive answer (which is correct in case the function doesn't mutate its argument!) requires the argument to act _covariantly_. A longer introduction to these concepts can be found on Wikipedia [wiki-variance]; here we just show how to control a type checker's behavior.

By default type variables are considered _invariant_, which means that values for arguments annotated with types like List[Employee] must exactly match the type annotation -- no subclasses or superclasses of the type parameter (in this example Employee) are allowed.

To facilitate the declaration of container types where covariant type checking is acceptable, a type variable can be declared using covariant=True. For the (rare) case where contravariant behavior is desirable, pass contravariant=True. At most one of these may be passed.

A typical example involves defining an immutable (or read-only) container class:

from typing import TypeVar, Generic, Iterable, Iterator

T = TypeVar('T', covariant=True)

class ImmutableList(Generic[T]):
    def __init__(self, items: Iterable[T]) -> None: ...
    def __iter__(self) -> Iterator[T]: ...
    ...

class Employee: ...

class Manager(Employee): ...

def dump_employees(emps: ImmutableList[Employee]) -> None:
    for emp in emps:
        ...

mgrs = ImmutableList([Manager()])  # type: ImmutableList[Manager]
dump_employees(mgrs)  # OK

The read-only collection classes in typing are all defined using a covariant type variable (e.g. Mapping and Sequence). The mutable collection classes (e.g. MutableMapping and MutableSequence) are defined using regular invariant type variables. The one example of a contravariant type variable is the Generator type, which is contravariant in the send() argument type (see below).

Note: variance affects type parameters for generic types -- it does not affect regular parameters. For example, the following example is fine:

from typing import TypeVar

class Employee: ...

class Manager(Employee): ...

E = TypeVar('E', bound=Employee)  # Invariant

def dump_employee(e: E) -> None: ...

dump_employee(Manager())  # OK

The numeric tower

PEP 3141 defines Python's numeric tower, and the stdlib module numbers implements the corresponding ABCs (Number, Complex, Real, Rational and Integral). There are some issues with these ABCs, but the built-in concrete numeric classes complex, float and int are ubiquitous (especially the latter two :-).

Rather than requiring that users write import numbers and then use numbers.Float etc., this PEP proposes a straightforward shortcut that is almost as effective: when an argument is annotated as having type float, an argument of type int is acceptable; similarly, for an argument annotated as having type complex, arguments of type float or int are acceptable. This does not handle classes implementing the corresponding ABCs or the fractions.Fraction class, but we believe those use cases are exceedingly rare.
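For example, under this shortcut all of the calls below pass type checking (the function names are illustrative):

```python
def scale(x: float, factor: float) -> float:
    return x * factor

# int arguments are acceptable where float is expected:
scale(3, 2)

def magnitude(z: complex) -> float:
    return abs(z)

# float or int arguments are acceptable where complex is expected:
magnitude(3.0)
magnitude(4)
```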

The bytes types

There are three different builtin classes used for arrays of bytes (not counting the classes available in the array module): bytes, bytearray and memoryview. Of these, bytes and bytearray have many behaviors in common (though not all -- bytearray is mutable).

While there is an ABC ByteString defined in collections.abc and a corresponding type in typing, functions accepting bytes (of some form) are so common that it would be cumbersome to have to write typing.ByteString everywhere. So, as a shortcut similar to that for the builtin numeric classes, when an argument is annotated as having type bytes, arguments of type bytearray or memoryview are acceptable. (Again, there are situations where this isn't sound, but we believe those are exceedingly rare in practice.)
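A quick sketch of the shortcut in action (the checksum function is illustrative):

```python
def checksum(data: bytes) -> int:
    # Iterating over any of the three types yields ints in Python 3.
    return sum(data) % 256

# Under the shortcut, bytearray and memoryview arguments are acceptable
# where bytes is expected:
checksum(b'abc')
checksum(bytearray(b'abc'))
checksum(memoryview(b'abc'))
```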

Forward references

When a type hint contains names that have not been defined yet, that definition may be expressed as a string literal, to be resolved later.

A situation where this occurs commonly is the definition of a container class, where the class being defined occurs in the signature of some of the methods. For example, the following code (the start of a simple binary tree implementation) does not work:

class Tree:
    def __init__(self, left: Tree, right: Tree):
        self.left = left
        self.right = right

To address this, we write:

class Tree:
    def __init__(self, left: 'Tree', right: 'Tree'):
        self.left = left
        self.right = right

The string literal should contain a valid Python expression (i.e., compile(lit, '', 'eval') should be a valid code object) and it should evaluate without errors once the module has been fully loaded. The local and global namespace in which it is evaluated should be the same namespaces in which default arguments to the same function would be evaluated.

Moreover, the expression should be parseable as a valid type hint, i.e., it is constrained by the rules from the section Acceptable type hints above.

It is allowable to use string literals as part of a type hint, for example:

class Tree:
    ...
    def leaves(self) -> List['Tree']:
        ...

A common use for forward references is when e.g. Django models are needed in the signatures. Typically, each model is in a separate file, and has methods taking arguments whose type involves other models. Because of the way circular imports work in Python, it is often not possible to import all the needed models directly:

# File models/a.py
from models.b import B
class A(Model):
    def foo(self, b: B): ...

# File models/b.py
from models.a import A
class B(Model):
    def bar(self, a: A): ...

# File main.py
from models.a import A
from models.b import B

Assuming main is imported first, this will fail with an ImportError at the line from models.a import A in models/b.py, which is being imported from models/a.py before models/a.py has defined class A. The solution is to switch to module-only imports and reference the models by their _module_._class_ name:

# File models/a.py
from models import b
class A(Model):
    def foo(self, b: 'b.B'): ...

# File models/b.py
from models import a
class B(Model):
    def bar(self, a: 'a.A'): ...

# File main.py
from models.a import A
from models.b import B

Union types

Since accepting a small, limited set of expected types for a single argument is common, there is a new special factory called Union. Example:

from typing import Union

def handle_employees(e: Union[Employee, Sequence[Employee]]) -> None:
    if isinstance(e, Employee):
        e = [e]
    ...

A type factored by Union[T1, T2, ...] responds True to issubclass checks for T1 and any of its subtypes, T2 and any of its subtypes, and so on.

A common case of union types is the optional type. By default, None is an invalid value for any type, unless a default value of None has been provided in the function definition. Examples:

def handle_employee(e: Union[Employee, None]) -> None: ...

As a shorthand for Union[T1, None] you can write Optional[T1]; for example, the above is equivalent to:

from typing import Optional

def handle_employee(e: Optional[Employee]) -> None: ...

An optional type is also automatically assumed when the default value is None, for example:

def handle_employee(e: Employee = None): ...

This is equivalent to:

def handle_employee(e: Optional[Employee] = None) -> None: ...

The Any type

A special kind of type is Any. Every type is a subtype of Any. This is also true for the builtin type object. However, to the static type checker these are completely different.

When the type of a value is object, the type checker will reject almost all operations on it, and assigning it to a variable (or using it as a return value) of a more specialized type is a type error. On the other hand, when a value has type Any, the type checker will allow all operations on it, and a value of type Any can be assigned to a variable (or used as a return value) of a more constrained type.
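A sketch of how this plays out (the function names are illustrative; the comments describe a static checker's view):

```python
from typing import Any

def shout(x: Any) -> str:
    # x has type Any, so the checker allows any operation, including
    # attribute access it cannot verify:
    return x.upper()

def shout_object(x: object) -> str:
    # x has type object, so the checker would reject x.upper();
    # the type must be narrowed first:
    if isinstance(x, str):
        return x.upper()
    raise TypeError('expected str')
```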

Version and platform checking

Type checkers are expected to understand simple version and platform checks, e.g.:

import sys

if sys.version_info[0] >= 3:
    # Python 3 specific definitions
else:
    # Python 2 specific definitions

if sys.platform == 'win32':
    # Windows specific definitions
else:
    # Posix specific definitions

Don't expect a checker to understand obfuscations like "".join(reversed(sys.platform)) == "xunil".

Default argument values

In stubs it may be useful to declare an argument as having a default without specifying the actual default value. For example:

def foo(x: AnyStr, y: AnyStr = ...) -> AnyStr: ...

What should the default value look like? Any of the options "", b"" or None fails to satisfy the type constraint (actually, None will modify the type to become Optional[AnyStr]).

In such cases the default value may be specified as a literal ellipsis, i.e. the above example is literally what you would write.

Compatibility with other uses of function annotations

A number of existing or potential use cases for function annotations exist, which are incompatible with type hinting. These may confuse a static type checker. However, since type hinting annotations have no runtime behavior (other than evaluation of the annotation expression and storing annotations in the __annotations__ attribute of the function object), this does not make the program incorrect -- it just may cause a type checker to emit spurious warnings or errors.

To mark portions of the program that should not be covered by type hinting, you can use one or more of the following:

  • a # type: ignore comment;
  • a @no_type_check decorator on a class or function;
  • a custom class or function decorator marked with @no_type_check_decorator.

For more details see later sections.

For maximal compatibility with offline type checking it may eventually be a good idea to switch interfaces that rely on annotations to a different mechanism, for example a decorator. In Python 3.5 there is no pressure to do this, however. See also the longer discussion under Rejected alternatives below.

Type comments

No first-class syntax support for explicitly marking variables as being of a specific type is added by this PEP. To help with type inference in complex cases, a comment of the following format may be used:

x = []   # type: List[Employee]
x, y, z = [], [], []  # type: List[int], List[int], List[str]
x, y, z = [], [], []  # type: (List[int], List[int], List[str])
x = [
   1,
   2,
]  # type: List[int]

Type comments should be put on the last line of the statement that contains the variable definition. They can also be placed on with statements and for statements, right after the colon.

Examples of type comments on with and for statements:

with frobnicate() as foo:  # type: int
    # Here foo is an int
    ...

for x, y in points:  # type: float, float
    # Here x and y are floats
    ...

In stubs it may be useful to declare the existence of a variable without giving it an initial value. This can be done using a literal ellipsis:

from typing import IO

stream = ...  # type: IO[str]

In non-stub code, there is a similar special case:

from typing import IO

stream = None # type: IO[str]

Type checkers should not complain about this (despite the value None not matching the given type), nor should they change the inferred type to Optional[...] (despite the rule that does this for annotated arguments with a default value of None). The assumption here is that other code will ensure that the variable is given a value of the proper type, and all uses can assume that the variable has the given type.

The # type: ignore comment should be put on the line that the error refers to:

import http.client
errors = {
    'not_found': http.client.NOT_FOUND  # type: ignore
}

A # type: ignore comment on a line by itself disables all type checking for the rest of the file.

If type hinting proves useful in general, a syntax for typing variables may be provided in a future Python version.

Casts

Occasionally the type checker may need a different kind of hint: the programmer may know that an expression is of a more constrained type than a type checker may be able to infer. For example:

from typing import List, cast

def find_first_str(a: List[object]) -> str:
    index = next(i for i, x in enumerate(a) if isinstance(x, str))
    # We only get here if there's at least one string in a
    return cast(str, a[index])

Some type checkers may not be able to infer that the type of a[index] is str and only infer object or Any, but we know that (if the code gets to that point) it must be a string. The cast(t, x) call tells the type checker that we are confident that the type of x is t. At runtime a cast always returns the expression unchanged -- it does not check the type, and it does not convert or coerce the value.

Casts differ from type comments (see the previous section). When using a type comment, the type checker should still verify that the inferred type is consistent with the stated type. When using a cast, the type checker should blindly believe the programmer. Also, casts can be used in expressions, while type comments only apply to assignments.
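A minimal sketch of the contrast (the comments describe a checker's behavior):

```python
from typing import List, cast

a = []  # type: List[int]
# A type comment is *checked*: the checker verifies that the stated
# type is consistent with the inferred type of the value.

b = cast(List[int], a)
# A cast is *believed*: the checker accepts the stated type without
# verification. At runtime the value is returned unchanged -- no
# check, no conversion:
assert b is a
```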

Stub Files

Stub files are files containing type hints that are only for use by the type checker, not at runtime. There are several use cases for stub files:

  • Extension modules
  • Third-party modules whose authors have not yet added type hints
  • Standard library modules for which type hints have not yet been written
  • Modules that must be compatible with Python 2 and 3
  • Modules that use annotations for other purposes

Stub files have the same syntax as regular Python modules. There is one feature of the typing module that may only be used in stub files: the @overload decorator described below.

The type checker should only check function signatures in stub files; it is recommended that function bodies in stub files just be a single ellipsis (...).

The type checker should have a configurable search path for stub files. If a stub file is found the type checker should not read the corresponding "real" module.

While stub files are syntactically valid Python modules, they use the .pyi extension to make it possible to maintain stub files in the same directory as the corresponding real module. This also reinforces the notion that no runtime behavior should be expected of stub files.

Additional notes on stub files:

  • Modules and variables imported into the stub are not considered exported from the stub unless the import uses the import ... as ... form.
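The distinction can be sketched with a stub's import section (the modules here are stdlib stand-ins, used purely for illustration):

```python
# Sketch of the import section of a stub file.

import sys                   # internal to the stub: NOT re-exported
from os import sep           # NOT re-exported

import json as json          # re-exported: uses the "import ... as ..." form
from os import path as path  # re-exported
```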

Function overloading

The @overload decorator allows describing functions that support multiple different combinations of argument types. This pattern is used frequently in builtin modules and types. For example, the __getitem__() method of the bytes type can be described as follows:

from typing import overload

class bytes:
    ...
    @overload
    def __getitem__(self, i: int) -> int: ...
    @overload
    def __getitem__(self, s: slice) -> bytes: ...

This description is more precise than would be possible using unions (which cannot express the relationship between the argument and return types):

from typing import Union

class bytes:
    ...
    def __getitem__(self, a: Union[int, slice]) -> Union[int, bytes]: ...

Another example where @overload comes in handy is the type of the builtin map() function, which takes a different number of arguments depending on the type of the callable:

from typing import Callable, Iterable, Iterator, Tuple, TypeVar, overload

T1 = TypeVar('T1')
T2 = TypeVar('T2')
S = TypeVar('S')

@overload
def map(func: Callable[[T1], S], iter1: Iterable[T1]) -> Iterator[S]: ...
@overload
def map(func: Callable[[T1, T2], S],
        iter1: Iterable[T1], iter2: Iterable[T2]) -> Iterator[S]: ...
# ... and we could add more items to support more than two iterables

Note that we could also easily add items to support map(None, ...):

@overload
def map(func: None, iter1: Iterable[T1]) -> Iterable[T1]: ...
@overload
def map(func: None,
        iter1: Iterable[T1],
        iter2: Iterable[T2]) -> Iterable[Tuple[T1, T2]]: ...

The @overload decorator may only be used in stub files. While it would be possible to provide a multiple dispatch implementation using this syntax, its implementation would require using sys._getframe(), which is frowned upon. Also, designing and implementing an efficient multiple dispatch mechanism is hard, which is why previous attempts were abandoned in favor of functools.singledispatch(). (See PEP 443, especially its section "Alternative approaches".) In the future we may come up with a satisfactory multiple dispatch design, but we don't want such a design to be constrained by the overloading syntax defined for type hints in stub files. In the meantime, using the @overload decorator or calling overload() directly raises RuntimeError.

A constrained TypeVar type can often be used instead of using the @overload decorator. For example, the definitions of concat1 and concat2 in this stub file are equivalent:

from typing import TypeVar, overload

AnyStr = TypeVar('AnyStr', str, bytes)

def concat1(x: AnyStr, y: AnyStr) -> AnyStr: ...

@overload
def concat2(x: str, y: str) -> str: ...

@overload
def concat2(x: bytes, y: bytes) -> bytes: ...

Some functions, such as map or bytes.__getitem__ above, can't be represented precisely using type variables. However, unlike @overload, type variables can also be used outside stub files. We recommend that @overload is only used in cases where a type variable is not sufficient, due to its special stub-only status.

Another important difference between type variables such as AnyStr and using @overload is that the former can also be used to define constraints for generic class type parameters. For example, the type parameter of the generic class typing.IO is constrained (only IO[str], IO[bytes] and IO[Any] are valid):

class IO(Generic[AnyStr]): ...

Storing and distributing stub files

The easiest form of stub file storage and distribution is to put them alongside Python modules in the same directory. This makes them easy to find by both programmers and the tools. However, since package maintainers are free not to add type hinting to their packages, third-party stubs installable by pip from PyPI are also supported. In this case we have to consider three issues: naming, versioning, installation path.

This PEP does not provide a recommendation on a naming scheme that should be used for third-party stub file packages. Discoverability will hopefully be based on package popularity, like with Django packages for example.

Third-party stubs have to be versioned using the lowest version of the source package that is compatible. Example: FooPackage has versions 1.0, 1.1, 1.2, 1.3, 2.0, 2.1, 2.2. There are API changes in versions 1.1, 2.0 and 2.2. The stub file package maintainer is free to release stubs for all versions but at least 1.0, 1.1, 2.0 and 2.2 are needed to enable the end user to type check all versions. This is because the user knows that the closest lower or equal version of stubs is compatible. In the provided example, for FooPackage 1.3 the user would choose stubs version 1.1.
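The "closest lower or equal version" rule can be sketched as a small lookup (the helper name is hypothetical, not part of any tool):

```python
import bisect

def pick_stub_version(available, requested):
    """Return the closest available stub version that is lower than or
    equal to the requested source version (versions as (major, minor))."""
    available = sorted(available)
    i = bisect.bisect_right(available, requested)
    if i == 0:
        raise LookupError('no compatible stubs for %r' % (requested,))
    return available[i - 1]

# Stubs released for 1.0, 1.1, 2.0 and 2.2, as in the example above:
stubs = [(1, 0), (1, 1), (2, 0), (2, 2)]
assert pick_stub_version(stubs, (1, 3)) == (1, 1)  # FooPackage 1.3 -> stubs 1.1
assert pick_stub_version(stubs, (2, 1)) == (2, 0)
```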

Note that if the user decides to use the "latest" available source package, using the "latest" stub files should generally also work if they're updated often.

Third-party stub packages can use any location for stub storage. Type checkers should search for them using PYTHONPATH. A default fallback directory that is always checked is shared/typehints/python3.5/ (or 3.6, etc.). Since there can only be one package installed for a given Python version per environment, no additional versioning is performed under that directory (just like bare directory installs by pip in site-packages). Stub file package authors might use the following snippet in setup.py:

...
data_files=[
    (
        'shared/typehints/python{}.{}'.format(*sys.version_info[:2]),
        pathlib.Path(SRC_PATH).glob('**/*.pyi'),
    ),
],
...

The Typeshed Repo

There is a shared repository where useful stubs are being collected [typeshed]. Note that stubs for a given package will not be included here without the explicit consent of the package owner. Further policies regarding the stubs collected here will be decided at a later time, after discussion on python-dev, and reported in the typeshed repo's README.

Exceptions

No syntax for listing explicitly raised exceptions is proposed. Currently the only known use case for this feature is documentational, in which case the recommendation is to put this information in a docstring.

The typing Module

To open the usage of static type checking to Python 3.5 as well as older versions, a uniform namespace is required. For this purpose, a new module in the standard library is introduced called typing.

It defines the fundamental building blocks for constructing types (e.g. Any), types representing generic variants of builtin collections (e.g. List), types representing generic collection ABCs (e.g. Sequence), and a small collection of convenience definitions.

Fundamental building blocks:

  • Any, used as def get(key: str) -> Any: ...
  • Union, used as Union[Type1, Type2, Type3]
  • Callable, used as Callable[[Arg1Type, Arg2Type], ReturnType]
  • Tuple, used by listing the element types, for example Tuple[int, int, str]. Arbitrary-length homogeneous tuples can be expressed using one type and ellipsis, for example Tuple[int, ...]. (The ... here are part of the syntax, a literal ellipsis.)
  • TypeVar, used as X = TypeVar('X', Type1, Type2, Type3) or simply Y = TypeVar('Y') (see above for more details)
  • Generic, used to create user-defined generic classes

Generic variants of builtin collections:

  • Dict, used as Dict[key_type, value_type]
  • List, used as List[element_type]
  • Set, used as Set[element_type]. See remark for AbstractSet below.
  • FrozenSet, used as FrozenSet[element_type]

Note: Dict, List, Set and FrozenSet are mainly useful for annotating return values. For arguments, prefer the abstract collection types defined below, e.g. Mapping, Sequence or AbstractSet.
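For example, a function argument is best annotated with the abstract type, while the concrete return type can stay precise (the function is illustrative):

```python
from typing import Dict, Mapping

def invert(m: Mapping[str, int]) -> Dict[int, str]:
    # Accepting Mapping lets callers pass any read-only mapping;
    # returning Dict documents the concrete type actually produced.
    return {v: k for k, v in m.items()}
```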

Generic variants of container ABCs (and a few non-containers):

  • ByteString
  • Callable (see above, listed here for completeness)
  • Container
  • Generator, used as Generator[yield_type, send_type, return_type]. This represents the return value of generator functions. It is a subtype of Iterable and it has additional type variables for the type accepted by the send() method (which is contravariant -- a generator that accepts sending it Employee instances is valid in a context where a generator is required that accepts sending it Manager instances) and the return type of the generator.
  • Hashable (not generic, but present for completeness)
  • ItemsView
  • Iterable
  • Iterator
  • KeysView
  • Mapping
  • MappingView
  • MutableMapping
  • MutableSequence
  • MutableSet
  • Sequence
  • Set, renamed to AbstractSet. This name change was required because Set in the typing module means set() with generics.
  • Sized (not generic, but present for completeness)
  • ValuesView

A few one-off types are defined that test for single special methods (similar to Hashable or Sized):

  • Reversible, to test for __reversed__
  • SupportsAbs, to test for __abs__
  • SupportsComplex, to test for __complex__
  • SupportsFloat, to test for __float__
  • SupportsInt, to test for __int__
  • SupportsRound, to test for __round__
  • SupportsBytes, to test for __bytes__

Convenience definitions:

  • Optional, defined by Optional[t] == Union[t, type(None)]
  • AnyStr, defined as TypeVar('AnyStr', str, bytes)
  • NamedTuple, used as NamedTuple(type_name, [(field_name, field_type), ...]) and equivalent to collections.namedtuple(type_name, [field_name, ...]). This is useful to declare the types of the fields of a named tuple type.
  • cast(), described earlier
  • @no_type_check, a decorator to disable type checking per class or function (see below)
  • @no_type_check_decorator, a decorator to create your own decorators with the same meaning as @no_type_check (see below)
  • @overload, described earlier
  • get_type_hints(), a utility function to retrieve the type hints from a function or method. Given a function or method object, it returns a dict with the same format as __annotations__, but evaluating forward references (which are given as string literals) as expressions in the context of the original function or method definition.
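A short sketch of get_type_hints() resolving a forward reference (the Node class is illustrative):

```python
from typing import List, get_type_hints

class Node:
    def add(self, other: 'Node') -> List['Node']: ...

hints = get_type_hints(Node.add)
# The string literals have been evaluated to the actual types:
assert hints['other'] is Node
assert hints['return'] == List[Node]
```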

Types available in the typing.io submodule:

  • IO (generic over AnyStr)
  • BinaryIO (a simple subtype of IO[bytes])
  • TextIO (a simple subtype of IO[str])

Types available in the typing.re submodule:

  • Match and Pattern, types of re.match() and re.compile() results (generic over AnyStr)

Rejected Alternatives

During discussion of earlier drafts of this PEP, various objections were raised and alternatives were proposed. We discuss the main ones here and explain why we reject them.

Which brackets for generic type parameters?

Most people are familiar with the use of angular brackets (e.g. List<int>) in languages like C++, Java, C# and Swift to express the parametrization of generic types. The problem with these is that they are really hard to parse, especially for a simple-minded parser like Python. In most languages the ambiguities are usually dealt with by only allowing angular brackets in specific syntactic positions, where general expressions aren't allowed. (And also by using very powerful parsing techniques that can backtrack over an arbitrary section of code.)

But in Python, we'd like type expressions to be (syntactically) the same as other expressions, so that we can use e.g. variable assignment to create type aliases. Consider this simple type expression:

List<int>

From the Python parser's perspective, the expression begins with the same four tokens (NAME, LESS, NAME, GREATER) as a chained comparison:

a < b > c  # I.e., (a < b) and (b > c)

We can even make up an example that could be parsed both ways:

a < b > [ c ]

Assuming we had angular brackets in the language, this could be interpreted as either of the following two:

(a<b>)[c]      # I.e., (a<b>).__getitem__(c)
a < b > ([c])  # I.e., (a < b) and (b > [c])

It would surely be possible to come up with a rule to disambiguate such cases, but to most users the rules would feel arbitrary and complex. It would also require us to dramatically change the CPython parser (and every other parser for Python). It should be noted that Python's current parser is intentionally "dumb" -- a simple grammar is easier for users to reason about.

For all these reasons, square brackets (e.g. List[int]) are (and have long been) the preferred syntax for generic type parameters. They can be implemented by defining the __getitem__() method on the metaclass, and no new syntax is required at all. This option works in all recent versions of Python (starting with Python 2.2). Python is not alone in this syntactic choice -- generic classes in Scala also use square brackets.
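A minimal sketch of the mechanism (a toy, not the actual typing implementation; the class names are illustrative):

```python
class GenericMeta(type):
    # Defining __getitem__ on the metaclass makes the *class itself*
    # subscriptable, so List[int] is just a method call -- no new syntax.
    def __getitem__(cls, params):
        if not isinstance(params, tuple):
            params = (params,)
        name = '%s[%s]' % (cls.__name__,
                           ', '.join(p.__name__ for p in params))
        return type(name, (cls,), {'__parameters__': params})

class List(metaclass=GenericMeta):
    pass

specialized = List[int]
assert specialized.__name__ == 'List[int]'
assert specialized.__parameters__ == (int,)
```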

What about existing uses of annotations?

One line of argument points out that PEP 3107 explicitly supports the use of arbitrary expressions in function annotations. The new proposal is then considered incompatible with the specification of PEP 3107.

Our response to this is that, first of all, the current proposal does not introduce any direct incompatibilities, so programs using annotations in Python 3.4 will still work correctly and without prejudice in Python 3.5.

We do hope that type hints will eventually become the sole use for annotations, but this will require additional discussion and a deprecation period after the initial roll-out of the typing module with Python 3.5. The current PEP will have provisional status (see PEP 411) until Python 3.6 is released. The fastest conceivable scheme would introduce silent deprecation of non-type-hint annotations in 3.6, full deprecation in 3.7, and declare type hints as the only allowed use of annotations in Python 3.8. This should give authors of packages that use annotations plenty of time to devise another approach, even if type hints become an overnight success.

Another possible outcome would be that type hints will eventually become the default meaning for annotations, but that there will always remain an option to disable them. For this purpose the current proposal defines a decorator @no_type_check which disables the default interpretation of annotations as type hints in a given class or function. It also defines a meta-decorator @no_type_check_decorator which can be used to decorate a decorator (!), causing annotations in any function or class decorated with the latter to be ignored by the type checker.

There are also # type: ignore comments, and static checkers should support configuration options to disable type checking in selected packages.

Despite all these options, proposals have been circulated to allow type hints and other forms of annotations to coexist for individual arguments. One proposal suggests that if an annotation for a given argument is a dictionary literal, each key represents a different form of annotation, and the key 'type' would be used for type hints. The problem with this idea and its variants is that the notation becomes very "noisy" and hard to read. Also, in most cases where existing libraries use annotations, there would be little need to combine them with type hints. So the simpler approach of selectively disabling type hints appears sufficient.

The problem of forward declarations

The current proposal is admittedly sub-optimal when type hints must contain forward references. Python requires all names to be defined by the time they are used. Apart from circular imports this is rarely a problem: "use" here means "look up at runtime", and with most "forward" references there is no problem in ensuring that a name is defined before the function using it is called.

The problem with type hints is that annotations (per PEP 3107, and similar to default values) are evaluated at the time a function is defined, and thus any names used in an annotation must be already defined when the function is being defined. A common scenario is a class definition whose methods need to reference the class itself in their annotations. (More generally, it can also occur with mutually recursive classes.) This is natural for container types, for example:

class Node:
    """Binary tree node."""

    def __init__(self, left: Node, right: Node):
        self.left = left
        self.right = right

As written this will not work, because of the peculiarity in Python that class names become defined once the entire body of the class has been executed. Our solution, which isn't particularly elegant, but gets the job done, is to allow using string literals in annotations. Most of the time you won't have to use this though -- most uses of type hints are expected to reference builtin types or types defined in other modules.

A counterproposal would change the semantics of type hints so they aren't evaluated at runtime at all (after all, type checking happens off-line, so why would type hints need to be evaluated at runtime at all). This of course would run afoul of backwards compatibility, since the Python interpreter doesn't actually know whether a particular annotation is meant to be a type hint or something else.

A compromise is possible where a __future__ import could enable turning all annotations in a given module into string literals, as follows:

from __future__ import annotations

class ImSet:
    def add(self, a: ImSet) -> List[ImSet]: ...

assert ImSet.add.__annotations__ == {'a': 'ImSet', 'return': 'List[ImSet]'}

Such a __future__ import statement may be proposed in a separate PEP.

The double colon

A few creative souls have tried to invent solutions for this problem. For example, it was proposed to use a double colon (::) for type hints, solving two problems at once: disambiguating between type hints and other annotations, and changing the semantics to preclude runtime evaluation. There are several things wrong with this idea, however.

  • It's ugly. The single colon in Python has many uses, and all of them look familiar because they resemble the use of the colon in English text. This is a general rule of thumb by which Python abides for most forms of punctuation; the exceptions are typically well known from other programming languages. But this use of :: is unheard of in English, and in other languages (e.g. C++) it is used as a scoping operator, which is a very different beast. In contrast, the single colon for type hints reads naturally -- and no wonder, since it was carefully designed for this purpose (the idea long predates PEP 3107 [gvr-artima]). It is also used in the same fashion in other languages from Pascal to Swift.
  • What would you do for return type annotations?
  • It's actually a feature that type hints are evaluated at runtime.
    • Making type hints available at runtime allows runtime type checkers to be built on top of type hints.
    • It catches mistakes even when the type checker is not run. Since it is a separate program, users may choose not to run it (or even install it), but might still want to use type hints as a concise form of documentation. Broken type hints are no use even for documentation.
  • Because it's new syntax, using the double colon for type hints would limit them to code that works with Python 3.5 only. By using existing syntax, the current proposal can easily work for older versions of Python 3. (And in fact mypy supports Python 3.2 and newer.)
  • If type hints become successful we may well decide to add new syntax in the future to declare the type for variables, for example var age: int = 42. If we were to use a double colon for argument type hints, for consistency we'd have to use the same convention for future syntax, perpetuating the ugliness.

Other forms of new syntax

A few other forms of alternative syntax have been proposed, e.g. the introduction of a where keyword [roberge], and Cobra-inspired requires clauses. But these all share a problem with the double colon: they won't work for earlier versions of Python 3. The same would apply to a new __future__ import.

Other backwards compatible conventions

The ideas put forward include:

  • A decorator, e.g. @typehints(name=str, returns=str). This could work, but it's pretty verbose (an extra line, and the argument names must be repeated), and a far cry in elegance from the PEP 3107 notation.
  • Stub files. We do want stub files, but they are primarily useful for adding type hints to existing code that doesn't lend itself to adding type hints, e.g. 3rd party packages, code that needs to support both Python 2 and Python 3, and especially extension modules. For most situations, having the annotations in line with the function definitions makes them much more useful.
  • Docstrings. There is an existing convention for docstrings, based on the Sphinx notation (:type arg1: description). This is pretty verbose (an extra line per parameter), and not very elegant. We could also make up something new, but the annotation syntax is hard to beat (because it was designed for this very purpose).

It's also been proposed to simply wait another release. But what problem would that solve? It would just be procrastination.

PEP Development Process

A live draft for this PEP lives on GitHub [github]. There is also an issue tracker [issues], where much of the technical discussion takes place.

The draft on GitHub is updated regularly in small increments. The official PEPs repo [peps] is (usually) only updated when a new draft is posted to python-dev.

Acknowledgements

This document could not be completed without valuable input, encouragement and advice from Jim Baker, Jeremy Siek, Michael Matson Vitousek, Andrey Vlasovskikh, Radomir Dopieralski, Peter Ludemann, and the BDFL-Delegate, Mark Shannon.

Influences include existing languages, libraries and frameworks mentioned in PEP 482. Many thanks to their creators, in alphabetical order: Stefan Behnel, William Edwards, Greg Ewing, Larry Hastings, Anders Hejlsberg, Alok Menghrajani, Travis E. Oliphant, Joe Pamer, Raoul-Gabriel Urma, and Julien Verlaguet.

pep-0485 A Function for testing approximate equality

PEP:485
Title:A Function for testing approximate equality
Version:$Revision$
Last-Modified:$Date$
Author:Christopher Barker <Chris.Barker at noaa.gov>
Status:Final
Type:Standards Track
Content-Type:text/x-rst
Created:20-Jan-2015
Python-Version:3.5
Post-History:
Resolution:https://mail.python.org/pipermail/python-dev/2015-February/138598.html

Abstract

This PEP proposes the addition of an isclose() function to the standard library math module that determines whether one value is approximately equal or "close" to another value.

Rationale

Floating point values have limited precision, which results in their being unable to represent some values exactly, and in errors accumulating with repeated computation. As a result, it is common advice to use an equality comparison only in very specific situations. Often an inequality comparison fits the bill, but there are times (often in testing) where the programmer wants to determine whether a computed value is "close" to an expected value, without requiring them to be exactly equal. This is common enough, particularly in testing, and not always obvious how to do it, that it would be a useful addition to the standard library.

Existing Implementations

The standard library includes the unittest.TestCase.assertAlmostEqual method, but it:

  • Is buried in the unittest.TestCase class
  • Is an assertion, so you can't use it as a general test at the command line, etc. (easily)
  • Is an absolute difference test. Often the measure of difference requires, particularly for floating point numbers, a relative error, i.e. "Are these two values within x% of each-other?", rather than an absolute error. Particularly when the magnitude of the values is unknown a priori.

The numpy package has the allclose() and isclose() functions, but they are only available with numpy.

The statistics package's test suite includes an implementation used for its unit tests.

One can also find discussion and sample implementations on Stack Overflow and other help sites.

Many other non-Python systems provide such a test, including the Boost C++ library and the APL language [4].

These existing implementations indicate that this is a common need and not trivial to write oneself, making it a candidate for the standard library.

Proposed Implementation

NOTE: this PEP is the result of extended discussions on the python-ideas list [1].

The new function will go into the math module, and have the following signature:

isclose(a, b, rel_tol=1e-9, abs_tol=0.0)

a and b: are the two values to be tested to relative closeness

rel_tol: is the relative tolerance -- it is the amount of error allowed, relative to the larger absolute value of a or b. For example, to set a tolerance of 5%, pass rel_tol=0.05. The default tolerance is 1e-9, which assures that the two values are the same within about 9 decimal digits. rel_tol must be greater than 0.0.

abs_tol: is a minimum absolute tolerance level -- useful for comparisons near zero.

Modulo error checking, etc, the function will return the result of:

abs(a-b) <= max( rel_tol * max(abs(a), abs(b)), abs_tol )
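A pure-Python sketch of this behavior (illustrative only -- the actual implementation in the math module is in C, and the special-value handling shown here anticipates the rules described below) might be:

```python
import math

def is_close_sketch(a, b, rel_tol=1e-9, abs_tol=0.0):
    """Illustrative sketch of the proposed test, not the stdlib code."""
    a, b = float(a), float(b)
    if a == b:                        # exact equality, including inf == inf
        return True
    if math.isinf(a) or math.isinf(b):
        return False                  # differing infinities are never close
    # NaN fails all the checks above and this comparison, so returns False
    return abs(a - b) <= max(rel_tol * max(abs(a), abs(b)), abs_tol)
```

For example, is_close_sketch(1.0, 1.0 + 1e-10) is True with the default tolerance, while is_close_sketch(0.0, 1e-12, abs_tol=1e-9) exercises the absolute-tolerance path.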

The name, isclose, is selected for consistency with the existing isnan and isinf.

Handling of non-finite numbers

The IEEE 754 special values of NaN, inf, and -inf will be handled according to IEEE rules. Specifically, NaN is not considered close to any other value, including NaN. inf and -inf are only considered close to themselves.

Non-float types

The primary use-case is expected to be floating point numbers. However, users may want to compare other numeric types similarly. In theory, it should work for any type that supports abs(), multiplication, comparisons, and subtraction. However, the implementation in the math module is written in C, and thus cannot (easily) use Python's duck typing. Rather, the values passed into the function will be converted to the float type before the calculation is performed. Passing in types (or values) that cannot be converted to floats will raise an appropriate Exception (TypeError, ValueError, or OverflowError).

The code will be tested to accommodate at least some values of these types:

  • Decimal
  • int
  • Fraction
  • complex: For complex, a companion function will be added to the cmath module. In cmath.isclose(), the tolerances are specified as floats, and the absolute value of the complex values will be used for scaling and comparison. If a complex tolerance is passed in, the absolute value will be used as the tolerance.

NOTE: it may make sense to add a Decimal.isclose() that works properly and completely with the decimal type, but that is not included as part of this PEP.

Behavior near zero

Relative comparison is problematic if either value is zero. By definition, no value is small relative to zero. And computationally, if either value is zero, the difference is the absolute value of the other value, and the computed absolute tolerance will be rel_tol times that value. When rel_tol is less than one, the difference will never be less than the tolerance.

However, while mathematically correct, there are many use cases where a user will need to know if a computed value is "close" to zero. This calls for an absolute tolerance test. If the user needs to call this function inside a loop or comprehension, where some, but not all, of the expected values may be zero, it is important that both a relative tolerance and absolute tolerance can be tested for with a single function with a single set of parameters.

There is a similar issue if the two values to be compared straddle zero: if a is approximately equal to -b, then a and b will never be computed as "close".

To handle this case, an optional parameter, abs_tol, can be used to set a minimum tolerance to be used in the case of very small or zero computed relative tolerance. That is, the values will always be considered close if the difference between them is less than abs_tol.

The default absolute tolerance value is set to zero because there is no value that is appropriate for the general case. It is impossible to know an appropriate value without knowing the likely values expected for a given use case. If all the values tested are on order of one, then a value of about 1e-9 might be appropriate, but that would be far too large if expected values are on order of 1e-9 or smaller.

Any non-zero default might result in users' tests passing totally inappropriately. If, on the other hand, a test against zero fails the first time with the defaults, the user will be prompted to select an appropriate value for the problem at hand in order to get the test to pass.

NOTE: the author of this PEP has resolved to go back over many of his tests that use the numpy allclose() function, which provides a default absolute tolerance, and make sure that the default value is appropriate.

If the user sets the rel_tol parameter to 0.0, then only the absolute tolerance will affect the result. While not the goal of the function, this does allow it to be used as a purely absolute tolerance check as well.

Implementation

A sample implementation in Python is available (as of Jan 22, 2015) on GitHub:

https://github.com/PythonCHB/close_pep/blob/master/is_close.py

This implementation has a flag that lets the user select which relative tolerance test to apply -- this PEP does not suggest that flag be retained, but rather that the weak test be selected.

There are also drafts of this PEP and test code, etc. there:

https://github.com/PythonCHB/close_pep

Relative Difference

There are essentially two ways to think about how close two numbers are to each-other:

Absolute difference: simply abs(a-b)

Relative difference: abs(a-b)/scale_factor [2].

The absolute difference is trivial enough that this proposal focuses on the relative difference.

Usually, the scale factor is some function of the values under consideration, for instance:

  1. The absolute value of one of the input values
  2. The maximum absolute value of the two
  3. The minimum absolute value of the two.
  4. The absolute value of the arithmetic mean of the two

These lead to the following possibilities for determining if two values, a and b, are close to each other.

  1. abs(a-b) <= tol*abs(a)
  2. abs(a-b) <= tol * max( abs(a), abs(b) )
  3. abs(a-b) <= tol * min( abs(a), abs(b) )
  4. abs(a-b) <= tol * abs(a + b)/2

NOTE: (2) and (3) can also be written as:

  2. (abs(a-b) <= abs(tol*a)) or (abs(a-b) <= abs(tol*b))
  3. (abs(a-b) <= abs(tol*a)) and (abs(a-b) <= abs(tol*b))

(Boost refers to these as the "weak" and "strong" formulations [3].) These can be a tiny bit more computationally efficient, and thus are used in the example code.
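As a quick sanity check, these alternate formulations can be verified against the max()/min() forms (the function names here are illustrative, not part of the proposal):

```python
def weak(a, b, tol):
    return (abs(a - b) <= abs(tol * a)) or (abs(a - b) <= abs(tol * b))

def strong(a, b, tol):
    return (abs(a - b) <= abs(tol * a)) and (abs(a - b) <= abs(tol * b))

# the "or" form matches scaling by the larger magnitude (formulation 2),
# the "and" form matches scaling by the smaller magnitude (formulation 3)
for a, b in [(10.0, 9.5), (1e-3, 2e-3), (-5.0, -5.2)]:
    tol = 0.1
    assert weak(a, b, tol) == (abs(a - b) <= tol * max(abs(a), abs(b)))
    assert strong(a, b, tol) == (abs(a - b) <= tol * min(abs(a), abs(b)))
```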

Each of these formulations can lead to slightly different results. However, if the tolerance value is small, the differences are quite small. In fact, often less than available floating point precision.

How much difference does it make?

When selecting a method to determine closeness, one might want to know how much of a difference it could make to use one test or the other -- i.e. how many values are there (or what range of values) that will pass one test, but not the other.

The largest difference is between options (2) and (3) where the allowable absolute difference is scaled by either the larger or smaller of the values.

Define delta to be the difference between the allowable absolute tolerance defined by the larger value and that defined by the smaller value. That is, the amount that the two input values need to be different in order to get a different result from the two tests. tol is the relative tolerance value.

Assume that a is the larger value and that both a and b are positive, to make the analysis a bit easier. delta is therefore:

delta = tol * (a-b)

or:

delta / tol = (a-b)

The largest absolute difference that would pass the test: (a-b), equals the tolerance times the larger value:

(a-b) = tol * a

Substituting into the expression for delta:

delta / tol = tol * a

so:

delta = tol**2 * a

For example, for a = 10, b = 9, tol = 0.1 (10%):

maximum tolerance tol * a == 0.1 * 10 == 1.0

minimum tolerance tol * b == 0.1 * 9.0 == 0.9

delta = (1.0 - 0.9) = 0.1, or tol**2 * a = 0.1**2 * 10 = 0.1
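This arithmetic can be checked directly in Python:

```python
tol, a, b = 0.1, 10.0, 9.0

max_tol = tol * a            # tolerance scaled by the larger value: 1.0
min_tol = tol * b            # tolerance scaled by the smaller value: 0.9
delta = max_tol - min_tol    # how much the two tests differ

# agrees with the closed form delta = tol**2 * a, up to float rounding
assert abs(delta - tol**2 * a) < 1e-12
```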

The absolute difference between the maximum and minimum tolerance tests in this case could be substantial. However, the primary use case for the proposed function is testing the results of computations. In that case a relative tolerance is likely to be selected of much smaller magnitude.

For example, a relative tolerance of 1e-8 is about half the precision available in a python float. In that case, the difference between the two tests is 1e-8**2 * a or 1e-16 * a, which is close to the limit of precision of a python float. If the relative tolerance is set to the proposed default of 1e-9 (or smaller), the difference between the two tests will be lost to the limits of precision of floating point. That is, each of the four methods will yield exactly the same results for all values of a and b.

In addition, in common use, tolerances are defined to 1 significant figure -- that is, 1e-9 is specifying about 9 decimal digits of accuracy. So the difference between the various possible tests is well below the precision to which the tolerance is specified.

Symmetry

A relative comparison can be either symmetric or non-symmetric. For a symmetric algorithm:

isclose(a,b) is always the same as isclose(b,a)

If a relative closeness test uses only one of the values (such as (1) above), then the result is asymmetric, i.e. isclose(a,b) is not necessarily the same as isclose(b,a).

Which approach is most appropriate depends on what question is being asked. If the question is: "are these two numbers close to each other?", there is no obvious ordering, and a symmetric test is most appropriate.

However, if the question is: "Is the computed value within x% of this known value?", then it is appropriate to scale the tolerance to the known value, and an asymmetric test is most appropriate.

From the previous section, it is clear that either approach would yield the same or similar results in the common use cases. Given that, the goal of this proposal is to provide a function that is least likely to produce surprising results.

The symmetric approach provides an appealing consistency -- it mirrors the symmetry of equality, and is less likely to confuse people. A symmetric test also relieves the user of the need to think about the order in which to set the arguments. It was also pointed out that there may be some cases where the order of evaluation may not be well defined, for instance in the case of comparing a set of values all against each other.

There may be cases when a user does need to know that a value is within a particular range of a known value. In that case, it is easy enough to simply write the test directly:

if a-b <= tol*a:

(assuming a > b in this case). There is little need to provide a function for this particular case.

This proposal uses a symmetric test.

Which symmetric test?

There are three symmetric tests considered:

The case that uses the arithmetic mean of the two values (4) requires that the values either be added together before dividing by 2, which could result in overflow to inf for very large numbers, or that each value be divided by two before being added together, which could result in underflow to zero for very small numbers. This effect would only occur at the very limit of float values, but it was decided there was no benefit to the method worth reducing the range of functionality or adding the complexity of checking values to determine the order of computation.

This leaves the Boost "weak" test (2), which uses the larger of the two values to scale the tolerance, and the Boost "strong" test (3), which uses the smaller of the values to scale the tolerance. For small tolerances they yield the same result, but this proposal uses the Boost "weak" test: it is symmetric and provides a more useful result for very large tolerances.

Large Tolerances

The most common use case is expected to be small tolerances -- on the order of the default 1e-9. However, there may be use cases where a user wants to know if two fairly disparate values are within a particular range of each other: "is a within 200% (rel_tol = 2.0) of b?" In this case, the strong test would never indicate that two values are within that range of each other if one of them is zero. The weak case, however, would use the larger (non-zero) value for the test, and thus return True if one value is zero. For example: is 0 within 200% of 10? 200% of ten is 20, so the range within 200% of ten is -10 to +30. Zero falls within that range, so the test will return True.
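A minimal sketch of the two tests (illustrative helper names, not the proposed API) makes this concrete:

```python
def weak(a, b, rel_tol):
    return abs(a - b) <= rel_tol * max(abs(a), abs(b))

def strong(a, b, rel_tol):
    return abs(a - b) <= rel_tol * min(abs(a), abs(b))

# "is 0 within 200% of 10?"
assert weak(0.0, 10.0, rel_tol=2.0)        # scales by the larger value: True
assert not strong(0.0, 10.0, rel_tol=2.0)  # scales by zero, so never close
```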

Defaults

Default values are required for the relative and absolute tolerance.

Relative Tolerance Default

The relative tolerance required for two values to be considered "close" is entirely use-case dependent. Nevertheless, the relative tolerance needs to be greater than 1e-16 (approximate precision of a Python float). The value of 1e-9 was selected because it is the largest relative tolerance for which the various possible methods will yield the same result, and it is also about half of the precision available to a Python float. In the general case, a good numerical algorithm is not expected to lose more than about half of the available digits of accuracy, and if a much larger tolerance is acceptable, the user should be considering the proper value in that case. Thus 1e-9 is expected to "just work" for many cases.

Absolute tolerance default

The absolute tolerance value will be used primarily for comparing to zero. The absolute tolerance required to determine if a value is "close" to zero is entirely use-case dependent. There are also essentially no bounds to the useful range -- expected values could conceivably be anywhere within the limits of a Python float. Thus a default of 0.0 is selected.

If, for a given use case, a user needs to compare to zero, the test will be guaranteed to fail the first time, and the user can select an appropriate value.

It was suggested that comparing to zero is, in fact, a common use case (evidence suggests that the numpy functions are often used with zero). In this case, it would be desirable to have a "useful" default. Values around 1e-8 were suggested, being about half of floating point precision for values of around 1.

However, to quote The Zen: "In the face of ambiguity, refuse the temptation to guess." Guessing that users will most often be concerned with values close to 1.0 would lead to spurious passing tests when used with smaller values -- this is potentially more damaging than requiring the user to thoughtfully select an appropriate value.

Expected Uses

The primary expected use case is various forms of testing -- "are the computed results near what I expect?" This sort of test may or may not be part of a formal unit testing suite. Such testing could be used one-off at the command line, in an IPython notebook, as part of doctests, or in simple asserts in an if __name__ == "__main__" block.

It would also be an appropriate function to use for the termination criteria for a simple iterative solution to an implicit function:

guess = something
while True:
    new_guess = implicit_function(guess, *args)
    if isclose(new_guess, guess):
        break
    guess = new_guess

Inappropriate uses

One use case for floating point comparison is testing the accuracy of a numerical algorithm. However, in this case, the numerical analyst ideally would be doing careful error propagation analysis, and should understand exactly what to test for. It is also likely that ULP (Unit in the Last Place) comparison may be called for. While this function may prove useful in such situations, it is not intended to be used in that way without careful consideration.

Other Approaches

unittest.TestCase.assertAlmostEqual

(https://docs.python.org/3/library/unittest.html#unittest.TestCase.assertAlmostEqual)

Tests that values are approximately (or not approximately) equal by computing the difference, rounding to the given number of decimal places (default 7), and comparing to zero.
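Stripped of the unittest machinery, the check reduces to rounding the difference; the almost_equal helper below is a hypothetical reimplementation, shown only to illustrate that the test is purely absolute:

```python
def almost_equal(a, b, places=7):
    # in essence, the comparison assertAlmostEqual performs
    return round(a - b, places) == 0

assert almost_equal(1.0, 1.00000001)    # diff of 1e-8 rounds to 0 at 7 places
assert not almost_equal(1.0, 1.000001)  # diff of 1e-6 does not
assert almost_equal(1e-9, 2e-9)         # passes despite a 100% relative error
```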

This method is purely an absolute tolerance test, and does not address the need for a relative tolerance test.

numpy isclose()

http://docs.scipy.org/doc/numpy-dev/reference/generated/numpy.isclose.html

The numpy package provides the vectorized functions isclose() and allclose(), for similar use cases as this proposal:

isclose(a, b, rtol=1e-05, atol=1e-08, equal_nan=False)

Returns a boolean array where two arrays are element-wise equal within a tolerance.

The tolerance values are positive, typically very small numbers. The relative difference (rtol * abs(b)) and the absolute difference atol are added together to compare against the absolute difference between a and b.

In this approach, the absolute and relative tolerances are added together, rather than taking the larger of the two (effectively an or of the two tests) as in this proposal. This is computationally simpler, and if the relative tolerance is larger than the absolute tolerance, the addition will have no effect. However, if the absolute and relative tolerances are of similar magnitude, then the allowed difference will be about twice as large as expected.

This makes the function harder to understand, with no computational advantage in this context.

Even more critically, if the values passed in are small compared to the absolute tolerance, then the relative tolerance will be completely swamped, perhaps unexpectedly.
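The additive rule can be reproduced without numpy to show this swamping effect (isclose_additive is an illustrative stand-in, using numpy's documented default tolerances):

```python
def isclose_additive(a, b, rtol=1e-05, atol=1e-08):
    # numpy-style combination: the tolerances are summed, not max()'d
    return abs(a - b) <= atol + rtol * abs(b)

# a 100% relative error "passes" because atol swamps the relative term
assert isclose_additive(1e-10, 2e-10)

# the max()-based rule of this proposal (with abs_tol=0.0) rejects it
assert not (abs(1e-10 - 2e-10) <= max(1e-05 * 2e-10, 0.0))
```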

This is why, in this proposal, the absolute tolerance defaults to zero -- the user will be required to choose a value appropriate for the values at hand.

Boost floating-point comparison

The Boost project ( [3] ) provides a floating point comparison function. It is a symmetric approach, with both "weak" (larger of the two relative errors) and "strong" (smaller of the two relative errors) options. This proposal uses the Boost "weak" approach. There is no need to complicate the API by providing the option to select different methods when the results will be similar in most cases, and the user is unlikely to know which to select in any case.

Alternate Proposals

A Recipe

The primary alternate proposal was to not provide a standard library function at all, but rather, provide a recipe for users to refer to. This would have the advantage that the recipe could provide and explain the various options, and let the user select that which is most appropriate. However, that would require anyone needing such a test to, at the very least, copy the function into their code base, and select the comparison method to use.

zero_tol

One possibility was to provide a zero tolerance parameter, rather than the absolute tolerance parameter. This would be an absolute tolerance that would only be applied in the case of one of the arguments being exactly zero. This would have the advantage of retaining the full relative tolerance behavior for all non-zero values, while allowing tests against zero to work. However, it would also result in the potentially surprising result that a small value could be "close" to zero, but not "close" to an even smaller value. e.g., 1e-10 is "close" to zero, but not "close" to 1e-11.
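A sketch of this rejected behavior (illustrative only) shows the surprising result:

```python
def isclose_zero_tol(a, b, rel_tol=1e-9, zero_tol=1e-8):
    # absolute tolerance applies only when an argument is exactly zero
    if a == 0.0 or b == 0.0:
        return abs(a - b) <= zero_tol
    return abs(a - b) <= rel_tol * max(abs(a), abs(b))

assert isclose_zero_tol(1e-10, 0.0)        # "close" to zero...
assert not isclose_zero_tol(1e-10, 1e-11)  # ...but not to a smaller value
```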

No absolute tolerance

Given the issues with comparing to zero, another possibility would have been to only provide a relative tolerance, and let comparison to zero fail. In this case, the user would need to do a simple absolute test: abs(val) < zero_tol in the case where the comparison involved zero.

However, this would not allow the same call to be used for a sequence of values, such as in a loop or comprehension, making the function far less useful. It is noted that the default abs_tol=0.0 achieves the same effect if the default is not overridden.

Other tests

The other tests considered are all discussed in the Relative Difference section above.

pep-0486 Make the Python Launcher aware of virtual environments

PEP:486
Title:Make the Python Launcher aware of virtual environments
Version:$Revision$
Last-Modified:$Date$
Author:Paul Moore <p.f.moore at gmail.com>
Status:Final
Type:Standards Track
Content-Type:text/x-rst
Created:12-Feb-2015
Python-Version:3.5
Post-History:12-Feb-2015
Resolution:https://mail.python.org/pipermail/python-dev/2015-February/138579.html

Abstract

The Windows installers for Python include a launcher that locates the correct Python interpreter to run (see PEP 397). However, the launcher is not aware of virtual environments (virtualenv [1] or PEP 405 based), and so cannot be used to run commands from the active virtualenv.

This PEP proposes making the launcher "virtualenv aware". This means that when run without specifying an explicit Python interpreter to use, the launcher will use the currently active virtualenv, if any, before falling back to the configured default Python.

Rationale

Windows users with multiple copies of Python installed need a means of selecting which one to use. The Python launcher provides this facility by means of a py command that can be used to run either a configured "default" Python or a specific interpreter, by means of command line arguments. So typical usage would be:

# Run the Python interactive interpreter
py

# Execute an installed module
py -m pip install pytest
py -m pytest

When using virtual environments, the py launcher is unaware that a virtualenv is active, and will continue to use the system Python. So different command invocations are needed to run the same commands in a virtualenv:

# Run the Python interactive interpreter
python

# Execute an installed module (these could use python -m,
# which is longer to type but is a little more similar to the
# launcher approach)
pip install pytest
py.test

Having to use different commands is error-prone, and in many cases the error is difficult to spot immediately. This PEP proposes making the py command usable with virtual environments, so that the first form of command can be used in all cases.

Implementation

Both virtualenv and the core venv module set an environment variable VIRTUAL_ENV when activating a virtualenv. This PEP proposes that the launcher checks for the VIRTUAL_ENV environment variable whenever it would run the "default" Python interpreter for the system (i.e., when no specific version flags such as py -2.7 are used) and if present, run the Python interpreter for the virtualenv rather than the default system Python.

The "default" Python interpreter referred to above is (as per PEP 397) either the latest version of Python installed on the system, or a version configured via the py.ini configuration file. When the user specifies an explicit Python version on the command line, this will always be used (as at present).

Impact on Script Launching

As well as interactive use, the launcher is used as the Windows file association for Python scripts. In that case, a "shebang" (#!) line at the start of the script is used to identify the interpreter to run. A fully-qualified path can be used, or a version-specific Python (python3 or python2, or even python3.5), or the generic python, which means to use the default interpreter.

The launcher also looks for the specific shebang line #!/usr/bin/env python. On Unix, the env program searches for a command on $PATH and runs the command so located. Similarly, with this shebang line, the launcher will look for a copy of python.exe on the user's current %PATH% and will run that copy.

As activating a virtualenv means that it is added to PATH, no special handling is needed to run scripts with the active virtualenv - they just need to use the #!/usr/bin/env python shebang line, exactly as on Unix. (If there is no activated virtualenv, and no python.exe on PATH, the launcher will look for a default Python exactly as if the shebang line had said #!python).

Exclusions

The PEP makes no attempt to promote the use of the launcher for running Python on Windows. Most existing documentation assumes the use of python as the command to run Python, and (for example) pip to run an installed Python command. This documentation is not expected to change, and users who choose to manage their PATH environment variable can continue to use this form. The focus of this PEP is purely on allowing users who prefer to use the launcher with their system Python installations to continue to do so when using virtual environments.

Reference Implementation

A patch implementing the proposed behaviour is available at http://bugs.python.org/issue23465

pep-0487 Simpler customisation of class creation

PEP:487
Title:Simpler customisation of class creation
Version:$Revision$
Last-Modified:$Date$
Author:Martin Teichmann <lkb.teichmann at gmail.com>,
Status:Draft
Type:Standards Track
Content-Type:text/x-rst
Created:27-Feb-2015
Python-Version:3.5
Post-History:27-Feb-2015
Replaces:422

Abstract

Currently, customising class creation requires the use of a custom metaclass. This custom metaclass then persists for the entire lifecycle of the class, creating the potential for spurious metaclass conflicts.

This PEP proposes to instead support a wide range of customisation scenarios through a new namespace parameter in the class header, and a new __init_subclass__ hook in the class body.

The new mechanism should be easier to understand and use than implementing a custom metaclass, and thus should provide a gentler introduction to the full power of Python's metaclass machinery.

Connection to other PEP

This is a competing proposal to PEP 422 by Nick Coghlan and Daniel Urban. It shares most of the PEP text and proposed code, but has major differences in how it achieves its goals.

Background

For an already created class cls, the term "metaclass" has a clear meaning: it is the value of type(cls).

During class creation, it has another meaning: it is also used to refer to the metaclass hint that may be provided as part of the class definition. While in many cases these two meanings end up referring to one and the same object, there are two situations where that is not the case:

  • If the metaclass hint refers to an instance of type, then it is considered as a candidate metaclass along with the metaclasses of all of the parents of the class being defined. If a more appropriate metaclass is found amongst the candidates, then it will be used instead of the one given in the metaclass hint.
  • Otherwise, an explicit metaclass hint is assumed to be a factory function and is called directly to create the class object. In this case, the final metaclass will be determined by the factory function definition. In the typical case (where the factory function just calls type, or, in Python 3.3 or later, types.new_class) the actual metaclass is then determined based on the parent classes.

It is notable that only the actual metaclass is inherited - a factory function used as a metaclass hook sees only the class currently being defined, and is not invoked for any subclasses.

In Python 3, the metaclass hint is provided using the metaclass=Meta keyword syntax in the class header. This allows the __prepare__ method on the metaclass to be used to create the locals() namespace used during execution of the class body (for example, specifying the use of collections.OrderedDict instead of a regular dict).

In Python 2, there was no __prepare__ method (that API was added for Python 3 by PEP 3115). Instead, a class body could set the __metaclass__ attribute, and the class creation process would extract that value from the class namespace to use as the metaclass hint. There is published code [1] that makes use of this feature.

Another new feature in Python 3 is the zero-argument form of the super() builtin, introduced by PEP 3135. This feature uses an implicit __class__ reference to the class being defined to replace the "by name" references required in Python 2. Just as code invoked during execution of a Python 2 metaclass could not call methods that referenced the class by name (as the name had not yet been bound in the containing scope), similarly, Python 3 metaclasses cannot call methods that rely on the implicit __class__ reference (as it is not populated until after the metaclass has returned control to the class creation machinery).

Finally, when a class uses a custom metaclass, it can pose additional challenges to the use of multiple inheritance, as a new class cannot inherit from parent classes with unrelated metaclasses. This means that it is impossible to add a metaclass to an already published class: such an addition is a backwards incompatible change due to the risk of metaclass conflicts.

Proposal

This PEP proposes that a new mechanism to customise class creation be added to Python 3.5 that meets the following criteria:

  1. Integrates nicely with class inheritance structures (including mixins and multiple inheritance),
  2. Integrates nicely with the implicit __class__ reference and zero-argument super() syntax introduced by PEP 3135,
  3. Can be added to an existing base class without a significant risk of introducing backwards compatibility problems, and
  4. Restores the ability for class namespaces to have some influence on the class creation process (above and beyond populating the namespace itself), but potentially without the full flexibility of the Python 2 style __metaclass__ hook.

Those goals can be achieved by adding two functionalities:

  1. A __init_subclass__ hook that initializes all subclasses of a given class, and
  2. A new keyword parameter namespace to the class creation statement, which provides a factory for the initial namespace of the class body.

As an example, the first proposal looks as follows:

 class SpamBase:
     # this is implicitly a @classmethod
     def __init_subclass__(cls, ns, **kwargs):
         # This is invoked after a subclass is created, but before
         # explicit decorators are called.
         # The usual super() mechanisms are used to correctly support
         # multiple inheritance.
         # ns is the class's namespace
         # **kwargs are the keyword arguments to the subclass's
         # class creation statement
         super().__init_subclass__(ns, **kwargs)

 class Spam(SpamBase):
     pass
 # the new hook is called on Spam

To simplify the cooperative multiple inheritance case, object will gain a default implementation of the hook that does nothing:

class object:
    def __init_subclass__(cls, ns):
        pass

Note that this method has no keyword arguments, meaning that all methods which are more specialized have to process all keyword arguments.
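The point of the note above is that keyword arguments given in a class header are forwarded to the hook. A minimal sketch, using the ns-free variant of the hook that Python 3.6+ provides natively (the class names and the primary_key keyword are made up for illustration):

```python
# Sketch: class-header keyword arguments flow into __init_subclass__.
# Uses the native hook from Python 3.6+, which has no ``ns`` parameter.
class Table:
    def __init_subclass__(cls, primary_key=None, **kwargs):
        # consume our own keyword, forward anything else up the MRO
        super().__init_subclass__(**kwargs)
        cls.primary_key = primary_key

class Users(Table, primary_key="id"):
    pass

print(Users.primary_key)  # id
```

Note that the hook is not invoked for Table itself, only for its subclasses, so Table gains no primary_key attribute.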

This general proposal is not a new idea (it was first suggested for inclusion in the language definition more than 10 years ago [2], and a similar mechanism has long been supported by Zope's ExtensionClass [3]), but the situation has changed sufficiently in recent years that the idea is worth reconsidering for inclusion.

The second part of the proposal is to add a namespace keyword argument to the class declaration statement. If present, its value will be called without arguments to initialize the namespace of the class body, very much like a metaclass __prepare__ method would do.

In addition, the introduction of the metaclass __prepare__ method in PEP 3115 allows a further enhancement that was not possible in Python 2: this PEP also proposes that type.__prepare__ be updated to accept a factory function as a namespace keyword-only argument. If present, the value provided as the namespace argument will be called without arguments to create the result of type.__prepare__ instead of using a freshly created dictionary instance. For example, the following will use an ordered dictionary as the class namespace:

class OrderedBase(namespace=collections.OrderedDict):
    pass

class Ordered(OrderedBase):
    # cls.__dict__ is still a read-only proxy to the class namespace,
    # but the underlying storage is an OrderedDict instance
    pass

Note

This PEP, along with the existing ability to use __prepare__ to share a single namespace amongst multiple class objects, highlights a possible issue with the attribute lookup caching: when the underlying mapping is updated by other means, the attribute lookup cache is not invalidated correctly (this is a key part of the reason class __dict__ attributes produce a read-only view of the underlying storage).

Since the optimisation provided by that cache is highly desirable, the use of a preexisting namespace as the class namespace may need to be declared as officially unsupported (since the observed behaviour is rather strange when the caches get out of sync).

Key Benefits

Easier use of custom namespaces for a class

Currently, to use a different type (such as collections.OrderedDict) for a class namespace, or to use a pre-populated namespace, it is necessary to write and use a custom metaclass. With this PEP, using a custom namespace becomes as simple as specifying an appropriate factory function in the class header.

Easier inheritance of definition time behaviour

Understanding Python's metaclasses requires a deep understanding of the type system and the class construction process. This is legitimately seen as challenging, due to the need to keep multiple moving parts (the code, the metaclass hint, the actual metaclass, the class object, instances of the class object) clearly distinct in your mind. Even when you know the rules, it's still easy to make a mistake if you're not being extremely careful.

Understanding the proposed implicit class initialization hook only requires ordinary method inheritance, which isn't quite as daunting a task. The new hook provides a more gradual path towards understanding all of the phases involved in the class definition process.

Reduced chance of metaclass conflicts

One of the big issues that makes library authors reluctant to use metaclasses (even when they would be appropriate) is the risk of metaclass conflicts. These occur whenever two unrelated metaclasses are used by the desired parents of a class definition. This risk also makes it very difficult to add a metaclass to a class that has previously been published without one.

By contrast, adding an __init_subclass__ method to an existing type poses a similar level of risk to adding an __init__ method: technically, there is a risk of breaking poorly implemented subclasses, but when that occurs, it is recognised as a bug in the subclass rather than the library author breaching backwards compatibility guarantees.

Integrates cleanly with PEP 3135

Given that the method is called on already existing classes, the new hook will be able to freely invoke class methods that rely on the implicit __class__ reference introduced by PEP 3135, including methods that use the zero argument form of super().

Replaces many use cases for dynamic setting of __metaclass__

For use cases that don't involve completely replacing the defined class, Python 2 code that dynamically set __metaclass__ can now dynamically set __init_subclass__ instead. For more advanced use cases, introduction of an explicit metaclass (possibly made available as a required base class) will still be necessary in order to support Python 3.
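A sketch of such a dynamic replacement, again using the ns-free variant of the hook that Python 3.6+ provides natively (names are illustrative). The explicit classmethod wrapper is needed because the implicit conversion only applies to functions defined directly in a class body:

```python
class Base:
    registry = []

def _register(cls, **kwargs):
    # cooperative call up the MRO, then record the new subclass
    super(Base, cls).__init_subclass__(**kwargs)
    Base.registry.append(cls)

# installed after the fact, like Python 2 code dynamically
# setting __metaclass__; note the explicit classmethod wrapper
Base.__init_subclass__ = classmethod(_register)

class Child(Base):
    pass

print(Base.registry)
```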

A path of introduction into Python

Most of the benefits of this PEP can already be implemented using a simple metaclass. For the __init_subclass__ hook this works all the way down to Python 2.7, while the namespace parameter needs Python 3.0 to work. Such a class has been uploaded to PyPI [4].

The only drawback of such a metaclass is the problem with metaclasses and multiple inheritance mentioned above: two classes using such a metaclass can only be combined if they use exactly the same metaclass. This fact calls for the inclusion of such a class in the standard library, let's call it SubclassMeta, with a base class using it called SubclassInit. Once all users use this standard library metaclass, classes from different packages can easily be combined.

Still, such classes cannot easily be combined with classes using other metaclasses. Authors of metaclasses should bear that in mind and inherit from the standard metaclass if it seems useful for users of the metaclass to add more functionality. Ultimately, if the need to combine with other metaclasses is strong enough, the proposed functionality may be introduced into Python's type.

These arguments suggest the following procedure for including the proposed functionality in Python:

  1. The metaclass implementing this proposal is put onto PyPI, so that it can be used and scrutinized.
  2. Once the code is properly mature, it can be added to the Python standard library. There should be a new module called metaclass which collects tools for metaclass authors, as well as documentation of best practices for writing metaclasses.
  3. If the need to combine this metaclass with other metaclasses is strong enough, it may be included in Python itself.

New Ways of Using Classes

This proposal enables many use cases, like the following. In the examples, we still inherit from the SubclassInit base class. This would become unnecessary once this PEP is included in Python directly.

Subclass registration

Especially when writing a plugin system, it is useful to register new subclasses of a plugin base class. This can be done as follows:

class PluginBase(SubclassInit):
    subclasses = []

    def __init_subclass__(cls, ns, **kwargs):
        super().__init_subclass__(ns, **kwargs)
        cls.subclasses.append(cls)

One should note that this also works nicely as a mixin class.
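For illustration, here is the same registration pattern used as a mixin next to an unrelated base class (a runnable sketch using the ns-free hook of Python 3.6+; all names are hypothetical):

```python
class Serializer:
    # an ordinary, unrelated base class; no metaclass involved
    pass

class RegistryMixin:
    registry = []

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        RegistryMixin.registry.append(cls)

class JSONSerializer(Serializer, RegistryMixin):
    pass

print(RegistryMixin.registry)
```

No metaclass conflict can arise here, because neither base class needs a custom metaclass.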

Trait descriptors

There are many designs of Python descriptors in the wild which, for example, check the boundaries of values. Often those "traits" need some support from a metaclass to work. This is how it would look with this PEP:

 class Trait:
     def __get__(self, instance, owner):
         return instance.__dict__[self.key]

     def __set__(self, instance, value):
         instance.__dict__[self.key] = value

 class Int(Trait):
     def __set__(self, instance, value):
         # some boundary check code here
         super().__set__(instance, value)

 class HasTraits(SubclassInit):
     def __init_subclass__(cls, ns, **kwargs):
         super().__init_subclass__(ns, **kwargs)
         for k, v in ns.items():
             if isinstance(v, Trait):
                 v.key = k
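A runnable version of the trait example, adapted to the ns-free hook of Python 3.6+ (the class __dict__ is scanned instead of the proposed ns argument; the int boundary check is a made-up example):

```python
class Trait:
    def __get__(self, instance, owner):
        if instance is None:
            return self
        return instance.__dict__[self.key]

    def __set__(self, instance, value):
        instance.__dict__[self.key] = value

class Int(Trait):
    def __set__(self, instance, value):
        if not isinstance(value, int):
            raise TypeError("an int is required")
        super().__set__(instance, value)

class HasTraits:
    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        # tell each trait under which key to store its value
        for k, v in cls.__dict__.items():
            if isinstance(v, Trait):
                v.key = k

class Point(HasTraits):
    x = Int()

p = Point()
p.x = 3
print(p.x)  # 3
```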

The new namespace keyword in the class header enables a number of interesting options for controlling the way a class is initialised, including some aspects of the object models of both JavaScript and Ruby.

Order preserving classes

class OrderedClassBase(namespace=collections.OrderedDict):
    pass

class OrderedClass(OrderedClassBase):
    a = 1
    b = 2
    c = 3

Prepopulated namespaces

seed_data = dict(a=1, b=2, c=3)
class PrepopulatedClass(namespace=seed_data.copy):
    pass

Cloning a prototype class

class NewClass(namespace=Prototype.__dict__.copy):
    pass
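Until the namespace keyword exists in the language, the three examples above can be emulated with a small metaclass whose __prepare__ calls the supplied factory. This is only a sketch (NamespaceMeta is a hypothetical name), shown here with the prepopulated-namespace case:

```python
class NamespaceMeta(type):
    @classmethod
    def __prepare__(mcls, name, bases, namespace=None, **kwargs):
        # call the factory to produce the class-body namespace
        return namespace() if namespace is not None else {}

    def __new__(mcls, name, bases, ns, namespace=None, **kwargs):
        # consume the keyword so it does not reach type.__new__
        return super().__new__(mcls, name, bases, dict(ns), **kwargs)

    def __init__(cls, name, bases, ns, namespace=None, **kwargs):
        super().__init__(name, bases, dict(ns), **kwargs)

seed_data = dict(a=1, b=2, c=3)

class PrepopulatedClass(metaclass=NamespaceMeta, namespace=seed_data.copy):
    pass

print(PrepopulatedClass.a, PrepopulatedClass.b, PrepopulatedClass.c)  # 1 2 3
```

Unlike the proposed class-header keyword, this emulation reintroduces a custom metaclass and hence the conflict problem this PEP is trying to avoid.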

Rejected Design Options

Calling the hook on the class itself

Adding an __autodecorate__ hook that would be called on the class itself was the idea proposed in PEP 422. Most examples work the same way or even better if the hook is called only on subclasses. In general, it is much easier to explicitly call the hook on the class in which it is defined (opting in) than it is to opt out, i.e. to prevent the hook from being called on the very class that defines it.

This becomes most evident if the class in question is designed as a mixin: it is very unlikely that the code of the mixin is to be executed for the mixin class itself, as it is not supposed to be a complete class on its own.

The original proposal also made major changes to the class initialization process, rendering it impossible to back-port the proposal to older Python versions.

Other variants of calling the hook

Other names for the hook were presented, namely __decorate__ or __autodecorate__. This proposal opts for __init_subclass__ as it is very close to the __init__ method, just for the subclass, while it is not very close to decorators, as it does not return the class.

Requiring an explicit decorator on __init_subclass__

One could require the explicit use of @classmethod on the __init_subclass__ method. It was made implicit since there's no sensible interpretation for leaving it out, and that case would need to be detected anyway in order to give a useful error message.

This decision was reinforced after noticing that the user experience of defining __prepare__ and forgetting the @classmethod method decorator is singularly incomprehensible (particularly since PEP 3115 documents it as an ordinary method, and the current documentation doesn't explicitly say anything one way or the other).

Passing in the namespace directly rather than a factory function

At one point, PEP 422 proposed that the class namespace be passed directly as a keyword argument, rather than passing a factory function. However, this encourages an unsupported behaviour (that is, passing the same namespace to multiple classes, or retaining direct write access to a mapping used as a class namespace), so the API was switched to the factory function version.

Possible Extensions

Some extensions to this PEP are imaginable, which are postponed to a later PEP:

  • A __new_subclass__ method could be defined which acts like a __new__ for classes. This would be very close to __autodecorate__ in PEP 422.
  • __subclasshook__ could be made a classmethod in a class instead of a method in the metaclass.

pep-0488 Elimination of PYO files

PEP:488
Title:Elimination of PYO files
Version:$Revision$
Last-Modified:$Date$
Author:Brett Cannon <brett at python.org>
Status:Final
Type:Standards Track
Content-Type:text/x-rst
Created:20-Feb-2015
Python-Version:3.5
Post-History:06-Mar-2015, 13-Mar-2015, 20-Mar-2015

Abstract

This PEP proposes eliminating the concept of PYO files from Python. To continue the support of the separation of bytecode files based on their optimization level, this PEP proposes extending the PYC file name to include the optimization level in the bytecode repository directory when there are optimizations applied.

Rationale

As of today, bytecode files come in two flavours: PYC and PYO. A PYC file is the bytecode file generated and read from when no optimization level is specified at interpreter startup (i.e., -O is not specified). A PYO file represents the bytecode file that is read/written when any optimization level is specified (i.e., when -O or -OO is specified). This means that while PYC files clearly delineate the optimization level used when they were generated -- namely no optimizations beyond the peepholer -- the same is not true for PYO files. To put this in terms of optimization levels and the file extension:

  • 0: .pyc
  • 1 (-O): .pyo
  • 2 (-OO): .pyo

The reuse of the .pyo file extension for both level 1 and 2 optimizations means that there is no clear way to tell what optimization level was used to generate the bytecode file. In terms of reading PYO files, this can lead to an interpreter using a mixture of optimization levels with its code if the user was not careful to make sure all PYO files were generated using the same optimization level (typically done by blindly deleting all PYO files and then using the compileall module to compile all-new PYO files [1]). This issue is only compounded when people optimize Python code beyond what the interpreter natively supports, e.g., using the astoptimizer project [2].

In terms of writing PYO files, the need to delete all PYO files every time one either changes the optimization level they want to use or are unsure of what optimization was used the last time PYO files were generated leads to unnecessary file churn. The change proposed by this PEP also allows for all optimization levels to be pre-compiled for bytecode files ahead of time, something that is currently impossible thanks to the reuse of the .pyo file extension for multiple optimization levels.

As for distributing bytecode-only modules, having to distribute both .pyc and .pyo files is unnecessary for the common use-case of code obfuscation and smaller file deployments. This means that bytecode-only modules will only load from their non-optimized .pyc file name.

Proposal

To eliminate the ambiguity that PYO files present, this PEP proposes eliminating the concept of PYO files and their accompanying .pyo file extension. To allow for the optimization level to be unambiguous as well as to avoid having to regenerate optimized bytecode files needlessly in the __pycache__ directory, the optimization level used to generate the bytecode file will be incorporated into the bytecode file name. When no optimization level is specified, the pre-PEP .pyc file name will be used (i.e., no optimization level will be specified in the file name). For example, a source file named foo.py in CPython 3.5 could have the following bytecode files based on the interpreter's optimization level (none, -O, and -OO):

  • 0: foo.cpython-35.pyc (i.e., no change)
  • 1: foo.cpython-35.opt-1.pyc
  • 2: foo.cpython-35.opt-2.pyc

Currently bytecode file names are created by importlib.util.cache_from_source(), approximately using the following expression defined by PEP 3147 [3], [4], [5]:

'{name}.{cache_tag}.pyc'.format(name=module_name,
                                cache_tag=sys.implementation.cache_tag)

This PEP proposes to change the expression when an optimization level is specified to:

'{name}.{cache_tag}.opt-{optimization}.pyc'.format(
        name=module_name,
        cache_tag=sys.implementation.cache_tag,
        optimization=str(sys.flags.optimize))

The "opt-" prefix was chosen so as to provide a visual separator from the cache tag. The placement of the optimization level after the cache tag was chosen to preserve lexicographic sort order of bytecode file names based on module name and cache tag which will not vary for a single interpreter. The "opt-" prefix was chosen over "o" so as to be somewhat self-documenting. The "opt-" prefix was chosen over "O" so as to not have any confusion in case "0" was the leading prefix of the optimization level.

A period was chosen over a hyphen as a separator so as to distinguish clearly that the optimization level is not part of the interpreter version as specified by the cache tag. It also lends to the use of the period in the file name to delineate semantically different concepts.

For example, if -OO had been passed to the interpreter then instead of importlib.cpython-35.pyo the file name would be importlib.cpython-35.opt-2.pyc.

Leaving out the new opt- tag when no optimization level is applied should increase backwards compatibility. It is also more accommodating of Python implementations which have no use for optimization levels (e.g., PyPy [10]).

It should be noted that this change in no way affects the performance of import. Since the import system looks for a single bytecode file based on the optimization level of the interpreter already and generates a new bytecode file if it doesn't exist, the introduction of potentially more bytecode files in the __pycache__ directory has no effect in terms of stat calls. The interpreter will continue to look for only a single bytecode file based on the optimization level and thus no increase in stat calls will occur.

The only potentially negative result of this PEP is the probable increase in the number of .pyc files and thus increase in storage use. But for platforms where this is an issue, sys.dont_write_bytecode exists to turn off bytecode generation so that it can be controlled offline.

Implementation

An implementation of this PEP is available [11].

importlib

As importlib.util.cache_from_source() is the API that exposes bytecode file paths as well as being directly used by importlib, it requires the most critical change. As of Python 3.4, the function's signature is:

importlib.util.cache_from_source(path, debug_override=None)

This PEP proposes changing the signature in Python 3.5 to:

importlib.util.cache_from_source(path, debug_override=None, *, optimization=None)

The introduced optimization keyword-only parameter will control what optimization level is specified in the file name. If the argument is None then the current optimization level of the interpreter will be assumed (including no optimization). Any argument given for optimization will be passed to str() and must have str.isalnum() be true, else ValueError will be raised (this prevents invalid characters being used in the file name). If the empty string is passed in for optimization then the addition of the optimization will be suppressed, reverting to the file name format which predates this PEP.
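The effect of the optimization parameter can be seen directly (this is the importlib.util API as it shipped in Python 3.5; exact paths depend on the interpreter's cache tag):

```python
import importlib.util

# an empty string suppresses the tag entirely; an int selects a level
plain = importlib.util.cache_from_source("foo.py", optimization="")
opt2 = importlib.util.cache_from_source("foo.py", optimization=2)

print(plain)  # e.g. __pycache__/foo.cpython-35.pyc
print(opt2)   # e.g. __pycache__/foo.cpython-35.opt-2.pyc
```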

It is expected that beyond Python's own two optimization levels, third-party code will use a hash of optimization names to specify the optimization level, e.g. hashlib.sha256(','.join(['no dead code', 'const folding'])).hexdigest(). While this might lead to long file names, it is assumed that most users never look at the contents of the __pycache__ directory and so this won't be an issue.

The debug_override parameter will be deprecated. A False value will be equivalent to optimization=1 while a True value will represent optimization='' (a None argument will continue to mean the same as for optimization). A deprecation warning will be raised when debug_override is given a value other than None, but there are no plans for the complete removal of the parameter at this time (but removal will be no later than Python 4).

The various module attributes for importlib.machinery which relate to bytecode file suffixes will be updated [7]. The DEBUG_BYTECODE_SUFFIXES and OPTIMIZED_BYTECODE_SUFFIXES will both be documented as deprecated and set to the same value as BYTECODE_SUFFIXES (removal of DEBUG_BYTECODE_SUFFIXES and OPTIMIZED_BYTECODE_SUFFIXES is not currently planned, but will be not later than Python 4).

The various finders and loaders will also be updated as necessary, but updating the previously mentioned parts of importlib should be all that is required.

Rest of the standard library

The various functions exposed by the py_compile and compileall modules will be updated as necessary to make sure they follow the new bytecode file name semantics [6], [1]. The CLI for the compileall module will not be directly affected (the -b flag will be implicit as it will no longer generate .pyo files when -O is specified).
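As a quick check of these semantics, py_compile exposes an optimize parameter, and the default cfile it derives follows the naming scheme described above (a sketch; the module content is arbitrary):

```python
import os
import py_compile
import tempfile

with tempfile.TemporaryDirectory() as d:
    src = os.path.join(d, "mod.py")
    with open(src, "w") as f:
        f.write("x = 1\n")
    # default cfile is derived via cache_from_source(..., optimization=2)
    out = py_compile.compile(src, optimize=2)
    print(os.path.basename(out))  # mod.cpython-3X.opt-2.pyc (version-dependent)
```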

Compatibility Considerations

Any code directly manipulating bytecode files from Python 3.2 on will need to consider the impact of this change on their code (prior to Python 3.2 -- including all of Python 2 -- there was no __pycache__ which already necessitates bifurcating bytecode file handling support). If code was setting the debug_override argument to importlib.util.cache_from_source() then care will be needed if they want the path to a bytecode file with an optimization level of 2. Otherwise only code not using importlib.util.cache_from_source() will need updating.

As for people who distribute bytecode-only modules (i.e., use a bytecode file instead of a source file), they will have to choose which optimization level they want their bytecode files compiled at, since distributing a .pyo file alongside a .pyc file will no longer be of any use. Since people typically only distribute bytecode files for code obfuscation purposes or for smaller distribution size, only having to distribute a single .pyc file should actually be beneficial to these use cases. And since the magic number for bytecode files changed in Python 3.5 to support PEP 465, there is no need to support pre-existing .pyo files [8].

Rejected Ideas

Completely dropping optimization levels from CPython

Some have suggested that instead of accommodating the various optimization levels in CPython, we should instead drop them entirely. The argument is that significant performance gains would occur from runtime optimizations through something like a JIT and not through pre-execution bytecode optimizations.

This idea is rejected for this PEP as that ignores the fact that there are people who do find the pre-existing optimization levels for CPython useful. It also assumes that no other Python interpreter would find what this PEP proposes useful.

Alternative formatting of the optimization level in the file name

Using the "opt-" prefix and placing the optimization level between the cache tag and file extension is not critical. All options which have been considered are:

  • importlib.cpython-35.opt-1.pyc
  • importlib.cpython-35.opt1.pyc
  • importlib.cpython-35.o1.pyc
  • importlib.cpython-35.O1.pyc
  • importlib.cpython-35.1.pyc
  • importlib.cpython-35-O1.pyc
  • importlib.O1.cpython-35.pyc
  • importlib.o1.cpython-35.pyc
  • importlib.1.cpython-35.pyc

These were initially rejected either because they would change the sort order of bytecode files, because they introduced possible ambiguity with the cache tag, or because they were not self-documenting enough. An informal poll was taken and people clearly preferred the formatting proposed by the PEP [9]. Since this topic is non-technical and a matter of personal preference, the issue is considered settled.

Embedding the optimization level in the bytecode metadata

Some have suggested that rather than embedding the optimization level of bytecode in the file name that it be included in the file's metadata instead. This would mean every interpreter had a single copy of bytecode at any time. Changing the optimization level would thus require rewriting the bytecode, but there would also only be a single file to care about.

This has been rejected due to the fact that Python is often installed as a root-level application, and thus modifying the bytecode files of modules in the standard library is not always possible. In this situation integrators would need to guess at what a reasonable optimization level was for users for any/all situations. By allowing multiple optimization levels to co-exist simultaneously, this PEP frees integrators from having to guess what users want and allows users to utilize the optimization level they want.

pep-0489 Multi-phase extension module initialization

PEP:489
Title:Multi-phase extension module initialization
Version:$Revision$
Last-Modified:$Date$
Author:Petr Viktorin <encukou at gmail.com>, Stefan Behnel <stefan_ml at behnel.de>, Nick Coghlan <ncoghlan at gmail.com>
BDFL-Delegate:Eric Snow <ericsnowcurrently@gmail.com>
Discussions-To:import-sig at python.org
Status:Final
Type:Standards Track
Content-Type:text/x-rst
Created:11-Aug-2013
Python-Version:3.5
Post-History:23-Aug-2013, 20-Feb-2015, 16-Apr-2015, 7-May-2015, 18-May-2015
Resolution:https://mail.python.org/pipermail/python-dev/2015-May/140108.html

Abstract

This PEP proposes a redesign of the way in which built-in and extension modules interact with the import machinery. This was last revised for Python 3.0 in PEP 3121, but did not solve all problems at the time. The goal is to solve import-related problems by bringing extension modules closer to the way Python modules behave; specifically to hook into the ModuleSpec-based loading mechanism introduced in PEP 451.

This proposal draws inspiration from PyType_Spec of PEP 384 to allow extension authors to only define features they need, and to allow future additions to extension module declarations.

Extension modules are created in a two-step process, fitting better into the ModuleSpec architecture, with parallels to __new__ and __init__ of classes.

Extension modules can safely store arbitrary C-level per-module state in the module that is covered by normal garbage collection and supports reloading and sub-interpreters. Extension authors are encouraged to take these issues into account when using the new API.

The proposal also allows extension modules with non-ASCII names.

Not all problems tackled in PEP 3121 are solved in this proposal. In particular, problems with run-time module lookup (PyState_FindModule) are left to a future PEP.

Motivation

Python modules and extension modules are not set up in the same way. For Python modules, the module object is created and set up first, then the module code is executed (PEP 302). A ModuleSpec object (PEP 451) is used to hold information about the module, and passed to the relevant hooks.

For extensions (i.e. shared libraries) and built-in modules, the module init function is executed straight away and does both the creation and initialization. The initialization function is not passed the ModuleSpec, or any information it contains, such as the __file__ or fully-qualified name. This hinders relative imports and resource loading.

In Py3, modules are also not added to sys.modules, which means that a (potentially transitive) re-import of the module will really try to re-import it and thus run into an infinite loop when it executes the module init function again. Without access to the fully-qualified module name, it is not trivial to correctly add the module to sys.modules either. This is specifically a problem for Cython generated modules, for which it's not uncommon that the module init code has the same level of complexity as that of any 'regular' Python module. Also, the lack of __file__ and __name__ information hinders the compilation of "__init__.py" modules, i.e. packages, especially when relative imports are being used at module init time.

Furthermore, the majority of currently existing extension modules have problems with sub-interpreter support and/or interpreter reloading, and, while it is possible with the current infrastructure to support these features, it is neither easy nor efficient. Addressing these issues was the goal of PEP 3121, but many extensions, including some in the standard library, took the least-effort approach to porting to Python 3, leaving these issues unresolved. This PEP keeps backwards compatibility, which should reduce pressure and give extension authors adequate time to consider these issues when porting.

The current process

Currently, extension and built-in modules export an initialization function named "PyInit_modulename", named after the file name of the shared library. This function is executed by the import machinery and must return a fully initialized module object. The function receives no arguments, so it has no way of knowing about its import context.

During its execution, the module init function creates a module object based on a PyModuleDef object. It then continues to initialize it by adding attributes to the module dict, creating types, etc.

Behind the scenes, the shared library loader keeps a note of the fully qualified module name of the last module it loaded, and when a module gets created that has a matching name, this global variable is used to determine the fully qualified name of the module object. This is not entirely safe as it relies on the module init function creating its own module object first, but this assumption usually holds in practice.

The proposal

The initialization function (PyInit_modulename) will be allowed to return a pointer to a PyModuleDef object. The import machinery will be in charge of constructing the module object, calling hooks provided in the PyModuleDef in the relevant phases of initialization (as described below).

This multi-phase initialization is an additional possibility. Single-phase initialization, the current practice of returning a fully initialized module object, will still be accepted, so existing code will work unchanged, including binary compatibility.

The PyModuleDef structure will be changed to contain a list of slots, similarly to PEP 384's PyType_Spec for types. To keep binary compatibility, and avoid needing to introduce a new structure (which would introduce additional supporting functions and per-module storage), the currently unused m_reload pointer of PyModuleDef will be changed to hold the slots. The structures are defined as:

typedef struct {
    int slot;
    void *value;
} PyModuleDef_Slot;

typedef struct PyModuleDef {
    PyModuleDef_Base m_base;
    const char* m_name;
    const char* m_doc;
    Py_ssize_t m_size;
    PyMethodDef *m_methods;
    PyModuleDef_Slot *m_slots;  /* changed from `inquiry m_reload;` */
    traverseproc m_traverse;
    inquiry m_clear;
    freefunc m_free;
} PyModuleDef;

The m_slots member must be either NULL, or point to an array of PyModuleDef_Slot structures, terminated by a slot with id set to 0 (i.e. {0, NULL}).

To specify a slot, a unique slot ID must be provided. New Python versions may introduce new slot IDs, but slot IDs will never be recycled. Slots may get deprecated, but will continue to be supported throughout Python 3.x.

A slot's value pointer may not be NULL, unless specified otherwise in the slot's documentation.

The following slots are currently available, and described later:

  • Py_mod_create
  • Py_mod_exec

Unknown slot IDs will cause the import to fail with SystemError.

When using multi-phase initialization, the m_name field of PyModuleDef will not be used during importing; the module name will be taken from the ModuleSpec.

Before it is returned from PyInit_*, the PyModuleDef object must be initialized using the newly added PyModuleDef_Init function. This sets the object type (which cannot be done statically on certain compilers), refcount, and internal bookkeeping data (m_index). For example, an extension module "example" would be exported as:

static PyModuleDef example_def = {...};

PyMODINIT_FUNC
PyInit_example(void)
{
    return PyModuleDef_Init(&example_def);
}

The PyModuleDef object must be available for the lifetime of the module created from it – usually, it will be declared statically.

Pseudo-code Overview

Here is an overview of how the modified importers will operate. Details such as logging or handling of errors and invalid states are left out, and C code is presented with a concise Python-like syntax.

The framework that calls the importers is explained in PEP 451 [8].

importlib/_bootstrap.py:

class BuiltinImporter:
    def create_module(self, spec):
        module = _imp.create_builtin(spec)

    def exec_module(self, module):
        _imp.exec_dynamic(module)

    def load_module(self, name):
        # use a backwards compatibility shim
        _load_module_shim(self, name)

importlib/_bootstrap_external.py:

class ExtensionFileLoader:
    def create_module(self, spec):
        module = _imp.create_dynamic(spec)

    def exec_module(self, module):
        _imp.exec_dynamic(module)

    def load_module(self, name):
        # use a backwards compatibility shim
        _load_module_shim(self, name)

Python/import.c (the _imp module):

def create_dynamic(spec):
    name = spec.name
    path = spec.origin

    # Find an already loaded module that used single-phase init.
    # For multi-phase initialization, mod is NULL, so a new module
    # is always created.
    mod = _PyImport_FindExtensionObject(name, name)
    if mod:
        return mod

    return _PyImport_LoadDynamicModuleWithSpec(spec)

def exec_dynamic(module):
    if not isinstance(module, types.ModuleType):
        # non-modules are skipped -- PyModule_GetDef fails on them
        return

    def = PyModule_GetDef(module)
    state = PyModule_GetState(module)
    if state is NULL:
        PyModule_ExecDef(module, def)

def create_builtin(spec):
    name = spec.name

    # Find an already loaded module that used single-phase init.
    # For multi-phase initialization, mod is NULL, so a new module
    # is always created.
    mod = _PyImport_FindExtensionObject(name, name)
    if mod:
        return mod

    for initname, initfunc in PyImport_Inittab:
        if name == initname:
            m = initfunc()
            if isinstance(m, PyModuleDef):
                def = m
                return PyModule_FromDefAndSpec(def, spec)
            else:
                # fall back to single-phase initialization
                module = m
                _PyImport_FixupExtensionObject(module, name, name)
                return module

Python/importdl.c:

def _PyImport_LoadDynamicModuleWithSpec(spec):
    path = spec.origin
    package, dot, name = spec.name.rpartition('.')

    # see the "Non-ASCII module names" section for export_hook_name
    hook_name = export_hook_name(name)

    # call platform-specific function for loading exported function
    # from shared library
    exportfunc = _find_shared_funcptr(hook_name, path)

    m = exportfunc()
    if isinstance(m, PyModuleDef):
        def = m
        return PyModule_FromDefAndSpec(def, spec)

    module = m

    # fall back to single-phase initialization
    ....

Objects/moduleobject.c:

def PyModule_FromDefAndSpec(def, spec):
    name = spec.name
    create = None
    for slot, value in def.m_slots:
        if slot == Py_mod_create:
            create = value
    if create:
        m = create(spec, def)
    else:
        m = PyModule_New(name)

    if isinstance(m, types.ModuleType):
        m.md_state = None
        m.md_def = def

    if def.m_methods:
        PyModule_AddFunctions(m, def.m_methods)
    if def.m_doc:
        PyModule_SetDocString(m, def.m_doc)

def PyModule_ExecDef(module, def):
    if isinstance(module, types.ModuleType):
        if module.md_state is NULL:
            # allocate a block of zeroed-out memory
            module.md_state = _alloc(module.md_size)

    if def.m_slots is NULL:
        return

    for slot, value in def.m_slots:
        if slot == Py_mod_exec:
            value(module)

Module Creation Phase

Creation of the module object – that is, the implementation of ExecutionLoader.create_module – is governed by the Py_mod_create slot.

The Py_mod_create slot

The Py_mod_create slot is used to support custom module subclasses. The value pointer must point to a function with the following signature:

PyObject* (*PyModuleCreateFunction)(PyObject *spec, PyModuleDef *def)

The function receives a ModuleSpec instance, as defined in PEP 451, and the PyModuleDef structure. It should return a new module object, or set an error and return NULL.

This function is not responsible for setting import-related attributes specified in PEP 451 [1] (such as __name__ or __loader__) on the new module.

There is no requirement for the returned object to be an instance of types.ModuleType. Any type can be used, as long as it supports setting and getting attributes, including at least the import-related attributes. However, only ModuleType instances support module-specific functionality such as per-module state and processing of execution slots. If something other than a ModuleType subclass is returned, no execution slots may be defined; if any are, a SystemError is raised.

Note that when this function is called, the module's entry in sys.modules is not populated yet. Attempting to import the same module again (possibly transitively) may lead to an infinite loop. Extension authors are advised to keep Py_mod_create minimal, and in particular not to call user code from it.

Multiple Py_mod_create slots may not be specified. If they are, import will fail with SystemError.

If Py_mod_create is not specified, the import machinery will create a normal module object using PyModule_New. The name is taken from spec.

Post-creation steps

If the Py_mod_create function returns an instance of types.ModuleType or a subclass (or if a Py_mod_create slot is not present), the import machinery will associate the PyModuleDef with the module. This also makes the PyModuleDef accessible to the execution phase, to the PyModule_GetDef function, and to the garbage collection routines (traverse, clear, free).

If the Py_mod_create function does not return a module subclass, then m_size must be 0, and m_traverse, m_clear and m_free must all be NULL. Otherwise, SystemError is raised.

Additionally, initial attributes specified in the PyModuleDef are set on the module object, regardless of its type:

  • The docstring is set from m_doc, if non-NULL.
  • The module's functions are initialized from m_methods, if any.

Module Execution Phase

Module execution -- that is, the implementation of ExecutionLoader.exec_module -- is governed by "execution slots". This PEP only adds one, Py_mod_exec, but others may be added in the future.

The execution phase is done on the PyModuleDef associated with the module object. For objects that are not a subclass of PyModule_Type (for which PyModule_GetDef would fail), the execution phase is skipped.

Execution slots may be specified multiple times, and are processed in the order they appear in the slots array. When using the default import machinery, they are processed after import-related attributes specified in PEP 451 [1] (such as __name__ or __loader__) are set and the module is added to sys.modules.

Pre-Execution steps

Before processing the execution slots, per-module state is allocated for the module. From this point on, per-module state is accessible through PyModule_GetState.

The Py_mod_exec slot

The entry in this slot must point to a function with the following signature:

int (*PyModuleExecFunction)(PyObject* module)

It will be called to initialize a module. Usually, this amounts to setting the module's initial attributes. The "module" argument receives the module object to initialize.

The function must return 0 on success, or, on error, set an exception and return -1.

If PyModuleExec replaces the module's entry in sys.modules, the new object will be used and returned by importlib machinery after all execution slots are processed. This is a feature of the import machinery itself. The slots themselves are all processed using the module returned from the creation phase; sys.modules is not consulted during the execution phase. (Note that for extension modules, implementing Py_mod_create is usually a better solution for using custom module objects.)

Legacy Init

The backwards-compatible single-phase initialization continues to be supported. In this scheme, the PyInit function returns a fully initialized module rather than a PyModuleDef object. In this case, the PyInit hook implements the creation phase, and the execution phase is a no-op.

Modules that need to work unchanged on older versions of Python should stick to single-phase initialization, because the benefits of multi-phase initialization can't be back-ported. Here is an example of a module that supports multi-phase initialization, and falls back to single-phase when compiled for an older version of CPython. It is included mainly as an illustration of the changes needed to enable multi-phase init:

#include <Python.h>

static int spam_exec(PyObject *module) {
    PyModule_AddStringConstant(module, "food", "spam");
    return 0;
}

#ifdef Py_mod_exec
static PyModuleDef_Slot spam_slots[] = {
    {Py_mod_exec, spam_exec},
    {0, NULL}
};
#endif

static PyModuleDef spam_def = {
    PyModuleDef_HEAD_INIT,                      /* m_base */
    "spam",                                     /* m_name */
    PyDoc_STR("Utilities for cooking spam"),    /* m_doc */
    0,                                          /* m_size */
    NULL,                                       /* m_methods */
#ifdef Py_mod_exec
    spam_slots,                                 /* m_slots */
#else
    NULL,
#endif
    NULL,                                       /* m_traverse */
    NULL,                                       /* m_clear */
    NULL,                                       /* m_free */
};

PyMODINIT_FUNC
PyInit_spam(void) {
#ifdef Py_mod_exec
    return PyModuleDef_Init(&spam_def);
#else
    PyObject *module;
    module = PyModule_Create(&spam_def);
    if (module == NULL) return NULL;
    if (spam_exec(module) != 0) {
        Py_DECREF(module);
        return NULL;
    }
    return module;
#endif
}

Built-In modules

Any extension module can be used as a built-in module by linking it into the executable, and including it in the inittab (either at runtime with PyImport_AppendInittab, or at configuration time, using tools like freeze).

To keep this possibility, all changes to extension module loading introduced in this PEP will also apply to built-in modules. The only exception is non-ASCII module names, explained below.

Subinterpreters and Interpreter Reloading

Extensions using the new initialization scheme are expected to support subinterpreters and multiple Py_Initialize/Py_Finalize cycles correctly, avoiding the issues mentioned in Python documentation [9]. The mechanism is designed to make this easy, but care is still required on the part of the extension author. No user-defined functions, methods, or instances may leak to different interpreters. To achieve this, all module-level state should be kept in either the module dict, or in the module object's storage reachable by PyModule_GetState. A simple rule of thumb is: Do not define any static data, except built-in types with no mutable or user-settable class attributes.

Functions incompatible with multi-phase initialization

The PyModule_Create function will fail when used on a PyModuleDef structure with a non-NULL m_slots pointer. The function doesn't have access to the ModuleSpec object necessary for multi-phase initialization.

The PyState_FindModule function will return NULL, and PyState_AddModule and PyState_RemoveModule will also fail on modules with non-NULL m_slots. PyState registration is disabled because multiple module objects may be created from the same PyModuleDef.

Module state and C-level callbacks

Due to the unavailability of PyState_FindModule, any function that needs access to module-level state (including functions, classes or exceptions defined at the module level) must receive a reference to the module object (or the particular object it needs), either directly or indirectly. This is currently difficult in two situations:

  • Methods of classes, which receive a reference to the class, but not to the class's module
  • Libraries with C-level callbacks, unless the callbacks can receive custom data set at callback registration

Fixing these cases is outside of the scope of this PEP, but will be needed for the new mechanism to be useful to all modules. Proper fixes have been discussed on the import-sig mailing list [7].

As a rule of thumb, modules that rely on PyState_FindModule are, at the moment, not good candidates for porting to the new mechanism.

New Functions

A new function and macro implementing the module creation phase will be added. These are similar to PyModule_Create and PyModule_Create2, except they take an additional ModuleSpec argument, and handle module definitions with non-NULL slots:

PyObject * PyModule_FromDefAndSpec(PyModuleDef *def, PyObject *spec)
PyObject * PyModule_FromDefAndSpec2(PyModuleDef *def, PyObject *spec,
                                    int module_api_version)

A new function implementing the module execution phase will be added. This allocates per-module state (if not allocated already), and always processes execution slots. The import machinery calls this method when a module is executed, unless the module is being reloaded:

PyAPI_FUNC(int) PyModule_ExecDef(PyObject *module, PyModuleDef *def)

Another function will be introduced to initialize a PyModuleDef object. This idempotent function fills in the type, refcount, and module index. It returns its argument cast to PyObject*, so it can be returned directly from a PyInit function:

PyObject * PyModuleDef_Init(PyModuleDef *);

Additionally, two helpers will be added for setting the docstring and methods on a module:

int PyModule_SetDocString(PyObject *, const char *)
int PyModule_AddFunctions(PyObject *, PyMethodDef *)

Export Hook Name

As portable C identifiers are limited to ASCII, module names must be encoded to form the PyInit hook name.

For ASCII module names, the import hook is named PyInit_<modulename>, where <modulename> is the name of the module.

For module names containing non-ASCII characters, the import hook is named PyInitU_<encodedname>, where the name is encoded using CPython's "punycode" encoding (Punycode [4] with a lowercase suffix), with hyphens ("-") replaced by underscores ("_").

In Python:

def export_hook_name(name):
    try:
        suffix = b'_' + name.encode('ascii')
    except UnicodeEncodeError:
        suffix = b'U_' + name.encode('punycode').replace(b'-', b'_')
    return b'PyInit' + suffix

Examples:

Module name    Init hook name
spam           PyInit_spam
lančmít        PyInitU_lanmt_2sa6t
スパム         PyInitU_zck5b2b
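As a quick check, the export_hook_name function above reproduces the table (repeated here so the snippet runs on its own, using CPython's "punycode" codec):

```python
def export_hook_name(name):
    # identical to the definition above; repeated so this snippet runs alone
    try:
        suffix = b'_' + name.encode('ascii')
    except UnicodeEncodeError:
        suffix = b'U_' + name.encode('punycode').replace(b'-', b'_')
    return b'PyInit' + suffix

assert export_hook_name('spam') == b'PyInit_spam'
assert export_hook_name('lančmít') == b'PyInitU_lanmt_2sa6t'
assert export_hook_name('スパム') == b'PyInitU_zck5b2b'
```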

For modules with non-ASCII names, single-phase initialization is not supported.

In the initial implementation of this PEP, built-in modules with non-ASCII names will not be supported.

Module Reloading

Reloading an extension module using importlib.reload() will continue to have no effect, except re-setting import-related attributes.

Due to limitations in shared library loading (both dlopen on POSIX and LoadLibraryEx on Windows), it is not generally possible to load a modified library after it has changed on disk.

Use cases for reloading other than trying out a new version of the module are too rare to require all module authors to keep reloading in mind. If reload-like functionality is needed, authors can export a dedicated function for it.

Multiple modules in one library

To support multiple Python modules in one shared library, the library can export additional PyInit* symbols besides the one that corresponds to the library's filename.

Note that this mechanism can currently only be used to load extra modules, but not to find them. (This is a limitation of the loader mechanism, which this PEP does not try to modify.) To work around the lack of a suitable finder, code like the following can be used:

import importlib.machinery
import importlib.util

def load_extra_module(name, path):
    loader = importlib.machinery.ExtensionFileLoader(name, path)
    spec = importlib.util.spec_from_loader(name, loader)
    module = importlib.util.module_from_spec(spec)
    loader.exec_module(module)
    return module

On platforms that support symbolic links, these may be used to install one library under multiple names, exposing all exported modules to normal import machinery.

Testing and initial implementations

For testing, a new built-in module _testmultiphase will be created. The library will export several additional modules using the mechanism described in "Multiple modules in one library".

The _testcapi module will be unchanged, and will use single-phase initialization indefinitely (or until it is no longer supported).

The array and xx* modules will be converted to use multi-phase initialization as part of the initial implementation.

Summary of API Changes and Additions

New functions:

  • PyModule_FromDefAndSpec (macro)
  • PyModule_FromDefAndSpec2
  • PyModule_ExecDef
  • PyModule_SetDocString
  • PyModule_AddFunctions
  • PyModuleDef_Init

New macros:

  • Py_mod_create
  • Py_mod_exec

New types:

  • PyModuleDef_Type will be exposed

New structures:

  • PyModuleDef_Slot

Other changes:

PyModuleDef.m_reload changes to PyModuleDef.m_slots.

BuiltinImporter and ExtensionFileLoader will now implement create_module and exec_module.

The internal _imp module will have backwards incompatible changes: create_builtin, create_dynamic, and exec_dynamic will be added; init_builtin, load_dynamic will be removed.

The undocumented functions imp.load_dynamic and imp.init_builtin will be replaced by backwards-compatible shims.

Backwards Compatibility

Existing modules will continue to be source- and binary-compatible with new versions of Python. Modules that use multi-phase initialization will not be compatible with versions of Python that do not implement this PEP.

The functions init_builtin and load_dynamic will be removed from the _imp module (but not from the imp module).

All changed loaders (BuiltinImporter and ExtensionFileLoader) will remain backwards-compatible; the load_module method will be replaced by a shim.

Internal functions of Python/import.c and Python/importdl.c will be removed. (Specifically, these are _PyImport_GetDynLoadFunc, _PyImport_GetDynLoadWindows, and _PyImport_LoadDynamicModule.)

Possible Future Extensions

The slots mechanism, inspired by PyType_Slot from PEP 384, allows later extensions.

Some extension modules export many constants; for example, _ssl has a long list of calls of the form:

PyModule_AddIntConstant(m, "SSL_ERROR_ZERO_RETURN",
                        PY_SSL_ERROR_ZERO_RETURN);

Converting this to a declarative list, similar to PyMethodDef, would reduce boilerplate, and provide free error-checking which is often missing.

String constants and types can be handled similarly. (Note that non-default bases for types cannot be portably specified statically; this case would need a Py_mod_exec function that runs before the slots are added. The free error-checking would still be beneficial, though.)

Another possibility is providing a "main" function that would be run when the module is given to Python's -m switch. For this to work, the runpy module will need to be modified to take advantage of ModuleSpec-based loading introduced in PEP 451. Also, it will be necessary to add a mechanism for setting up a module according to slots it wasn't originally defined with.

Implementation

A work-in-progress implementation is available in a GitHub repository [5]; a patchset is at [6].

Previous Approaches

Stefan Behnel's initial proto-PEP [2] had a "PyInit_modulename" hook that would create a module class, whose __init__ would then be called to create the module. This proposal did not correspond to the (then nonexistent) PEP 451, where module creation and initialization are broken into distinct steps. It also did not support loading an extension into pre-existing module objects.

Nick Coghlan proposed "Create" and "Exec" hooks, and wrote a prototype implementation [3]. At this time PEP 451 was still not implemented, so the prototype does not use ModuleSpec.

The original version of this PEP used Create and Exec hooks, and allowed loading into arbitrary pre-constructed objects with Exec hook. The proposal made extension module initialization closer to how Python modules are initialized, but it was later recognized that this isn't an important goal. The current PEP describes a simpler solution.

A further iteration used a "PyModuleExport" hook as an alternative to PyInit, where PyInit was used for the existing scheme and PyModuleExport for multi-phase initialization. However, not being able to determine the hook name based on the module name complicated automatic generation of PyImport_Inittab by tools like freeze. Keeping only the PyInit hook name, even if it's not entirely appropriate for exporting a definition, yielded a much simpler solution.

pep-0490 Chain exceptions at C level

PEP:490
Title:Chain exceptions at C level
Version:$Revision$
Last-Modified:$Date$
Author:Victor Stinner <victor.stinner at gmail.com>
Status:Draft
Type:Standards Track
Content-Type:text/x-rst
Created:25-March-2015
Python-Version:3.6

Abstract

Chain exceptions at C level, as already done at Python level.

Rationale

Python 3 introduced a killer feature: exceptions are chained by default (PEP 3134).

Example:

try:
    raise TypeError("err1")
except TypeError:
    raise ValueError("err2")

Output:

Traceback (most recent call last):
  File "test.py", line 2, in <module>
    raise TypeError("err1")
TypeError: err1

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "test.py", line 4, in <module>
    raise ValueError("err2")
ValueError: err2

Exceptions are chained by default in Python code, but not in extensions written in C.
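This implicit chaining is observable from Python through the __context__ and __cause__ attributes, which the interpreter sets automatically; a minimal illustration:

```python
def chained():
    try:
        raise TypeError("err1")
    except TypeError:
        # no 'from' clause: the interpreter records the in-flight
        # TypeError on the new exception by itself
        raise ValueError("err2")

try:
    chained()
except ValueError as exc:
    assert type(exc.__context__) is TypeError  # the implicitly chained exception
    assert exc.__cause__ is None               # only 'raise X from Y' sets __cause__
```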

A new private _PyErr_ChainExceptions() function was introduced in Python 3.4.3 and 3.5 to chain exceptions. Currently, it must be called explicitly to chain exceptions, and its usage is not trivial.

Example of _PyErr_ChainExceptions() usage from the zipimport module to chain the previous OSError to a new ZipImportError exception:

PyObject *exc, *val, *tb;
PyErr_Fetch(&exc, &val, &tb);
PyErr_Format(ZipImportError, "can't open Zip file: %R", archive);
_PyErr_ChainExceptions(exc, val, tb);

This PEP proposes to also chain exceptions automatically at C level to stay consistent and give more information on failures to help debugging. The previous example becomes simply:

PyErr_Format(ZipImportError, "can't open Zip file: %R", archive);

Proposal

Modify PyErr_*() functions to chain exceptions

Modify C functions raising exceptions of the Python C API to automatically chain exceptions: modify PyErr_SetString(), PyErr_Format(), PyErr_SetNone(), etc.

Modify functions to not chain exceptions

Keeping the previous exception is not always useful: sometimes the new exception contains the information of the previous exception, or even more, especially when the two exceptions have the same type.

Example of a useless exception chain with int(str):

TypeError: a bytes-like object is required, not 'type'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: int() argument must be a string, a bytes-like object or a number, not 'type'

The new TypeError exception contains more information than the previous exception. The previous exception should be hidden.

The PyErr_Clear() function can be called to clear the current exception before raising a new exception, to not chain the current exception with a new exception.
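At the Python level, the closest analogue of clearing before raising is raise ... from None (PEP 409), which suppresses the display of the chained exception. A minimal sketch (to_int is a hypothetical helper, not part of this proposal):

```python
def to_int(value):
    try:
        return int(value)
    except ValueError:
        # 'from None' suppresses display of the chained ValueError,
        # much as calling PyErr_Clear() before raising would at the C level
        raise TypeError("expected a number, got %r" % (value,)) from None

try:
    to_int("spam")
except TypeError as exc:
    assert exc.__suppress_context__  # the ValueError will not be printed
```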

Modify functions to chain exceptions

Some functions save and then restore the current exception. If a new exception is raised meanwhile, it is currently displayed to sys.stderr or ignored, depending on the function. Some of these functions should be modified to chain exceptions instead.

Examples of functions ignoring the new exception(s):

  • ptrace_enter_call(): ignore exception
  • subprocess_fork_exec(): ignore exception raised by enable_gc()
  • t_bootstrap() of the _thread module: ignore exception raised by trying to display the bootstrap function to sys.stderr
  • PyDict_GetItem(), _PyDict_GetItem_KnownHash(): ignore exception raised by looking for a key in the dictionary
  • _PyErr_TrySetFromCause(): ignore exception
  • PyFrame_LocalsToFast(): ignore exception raised by dict_to_map()
  • _PyObject_Dump(): ignore exception. _PyObject_Dump() is used to debug, to inspect a running process, it should not modify the Python state.
  • Py_ReprLeave(): ignore exception "because there is no way to report them"
  • type_dealloc(): ignore exception raised by remove_all_subclasses()
  • PyObject_ClearWeakRefs(): ignore exception?
  • call_exc_trace(), call_trace_protected(): ignore exception
  • remove_importlib_frames(): ignore exception
  • do_mktuple(), helper used by Py_BuildValue() for example: ignore exception?
  • flush_io(): ignore exception
  • sys_write(), sys_format(): ignore exception
  • _PyTraceback_Add(): ignore exception
  • PyTraceBack_Print(): ignore exception

Examples of functions displaying the new exception to sys.stderr:

  • atexit_callfuncs(): display exceptions with PyErr_Display() and return the latest exception, the function calls multiple callbacks and only returns the latest exception
  • sock_dealloc(): log the ResourceWarning exception with PyErr_WriteUnraisable()
  • slot_tp_del(): display exception with PyErr_WriteUnraisable()
  • _PyGen_Finalize(): display gen_close() exception with PyErr_WriteUnraisable()
  • slot_tp_finalize(): display exception raised by the __del__() method with PyErr_WriteUnraisable()
  • PyErr_GivenExceptionMatches(): display exception raised by PyType_IsSubtype() with PyErr_WriteUnraisable()

Backward compatibility

A side effect of chaining exceptions is that exceptions store traceback objects, which store frame objects, which store local variables. Local variables are thus kept alive by exceptions. A common issue is a reference cycle between local variables and exceptions: an exception is stored in a local variable, and the frame is indirectly stored in the exception. The cycle only impacts applications storing exceptions.

The reference cycle can now be fixed with the new traceback.TracebackException object introduced in Python 3.5. It stores the information required to format a full textual traceback without storing local variables.
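For example, the exception can be captured as text-only data, after which the original exception (and the frames it references) can be dropped:

```python
import traceback

try:
    1 / 0
except ZeroDivisionError as exc:
    # copies the data needed for formatting; does not keep frames
    # (and thus local variables) alive
    summary = traceback.TracebackException.from_exception(exc)

rendered = ''.join(summary.format())
assert 'ZeroDivisionError' in rendered
```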

The asyncio module is impacted by the reference cycle issue. This module is also maintained outside the Python standard library to provide releases for Python 3.3. traceback.TracebackException may be backported into a private asyncio module to fix the reference cycle issues.

Alternatives

No change

The private _PyErr_ChainExceptions() function is enough to chain exceptions manually.

Exceptions will only be chained explicitly where it makes sense.

New helpers to chain exceptions

Functions like PyErr_SetString() don't chain exceptions automatically. To make the usage of _PyErr_ChainExceptions() easier, new private functions are added:

  • _PyErr_SetStringChain(exc_type, message)
  • _PyErr_FormatChain(exc_type, format, ...)
  • _PyErr_SetNoneChain(exc_type)
  • _PyErr_SetObjectChain(exc_type, exc_value)

Helper functions to raise specific exceptions like _PyErr_SetKeyError(key) or PyErr_SetImportError(message, name, path) don't chain exceptions. The generic _PyErr_ChainExceptions(exc_type, exc_value, exc_tb) should be used to chain exceptions with these helper functions.

Appendix

PEPs

Python C API

The header file Include/pyerror.h declares functions related to exceptions.

Functions raising exceptions:

  • PyErr_SetNone(exc_type)
  • PyErr_SetObject(exc_type, exc_value)
  • PyErr_SetString(exc_type, message)
  • PyErr_Format(exc, format, ...)

Helpers to raise specific exceptions:

  • PyErr_BadArgument()
  • PyErr_BadInternalCall()
  • PyErr_NoMemory()
  • PyErr_SetFromErrno(exc)
  • PyErr_SetFromWindowsErr(err)
  • PyErr_SetImportError(message, name, path)
  • _PyErr_SetKeyError(key)
  • _PyErr_TrySetFromCause(prefix_format, ...)

Manage the current exception:

  • PyErr_Clear(): clear the current exception, like except: pass
  • PyErr_Fetch(exc_type, exc_value, exc_tb)
  • PyErr_Restore(exc_type, exc_value, exc_tb)
  • PyErr_GetExcInfo(exc_type, exc_value, exc_tb)
  • PyErr_SetExcInfo(exc_type, exc_value, exc_tb)

Other functions to handle exceptions:

  • PyErr_ExceptionMatches(exc): check to implement except exc:  ...
  • PyErr_GivenExceptionMatches(exc1, exc2)
  • PyErr_NormalizeException(exc_type, exc_value, exc_tb)
  • _PyErr_ChainExceptions(exc_type, exc_value, exc_tb)

pep-0491 The Wheel Binary Package Format 1.9

PEP:491
Title:The Wheel Binary Package Format 1.9
Version:$Revision$
Last-Modified:$Date$
Author:Daniel Holth <dholth at gmail.com>
Discussions-To:<distutils-sig at python.org>
Status:Draft
Type:Standards Track
Content-Type:text/x-rst
Created:16 April 2015

Abstract

This PEP describes the second version of a built-package format for Python called "wheel". Wheel provides a Python-specific, relocatable package format that allows people to install software more quickly and predictably than re-building from source each time.

A wheel is a ZIP-format archive with a specially formatted file name and the .whl extension. It contains a single distribution nearly as it would be installed according to PEP 376 with a particular installation scheme. Simple wheels can be unpacked onto sys.path and used directly but wheels are usually installed with a specialized installer.

This version of the wheel specification adds support for installing distributions into many different directories, and adds a way to find those files after they have been installed.

Rationale

Wheel 1.0 is best at installing files into site-packages and a few other locations specified by distutils, but users would like to install files from a single distribution into many directories -- perhaps separate locations for docs, data, and code. Unfortunately not everyone agrees on where these install locations should be relative to the root directory. This version of the format adds many more categories, each of which can be installed to a different destination based on policy. Since it might also be important to locate the installed files at runtime, this version of the format also adds a way to record the installed paths in a way that can be read by the installed software.

Details

Installing a wheel 'distribution-1.0-py32-none-any.whl'

Wheel installation notionally consists of two phases:

  • Unpack.
    1. Parse distribution-1.0.dist-info/WHEEL.
    2. Check that installer is compatible with Wheel-Version. Warn if minor version is greater, abort if major version is greater.
    3. If Root-Is-Purelib == 'true', unpack archive into purelib (site-packages).
    4. Else unpack archive into platlib (site-packages).
  • Spread.
    1. Unpacked archive includes distribution-1.0.dist-info/ and (if there is data) distribution-1.0.data/.
    2. Move each subtree of distribution-1.0.data/ onto its destination path. Each subdirectory of distribution-1.0.data/ is a key into a dict of destination directories, such as distribution-1.0.data/(purelib|platlib|headers|scripts|data).
    3. Update scripts starting with #!python to point to the correct interpreter. (Note: Python scripts are usually handled by package metadata, and not included verbatim in wheel.)
    4. Update distribution-1.0.dist-info/RECORD with the installed paths.
    5. If empty, remove the distribution-1.0.data directory.
    6. Compile any installed .py to .pyc. (Uninstallers should be smart enough to remove .pyc even if it is not mentioned in RECORD.)

In practice, installers will usually extract files directly from the archive to their destinations without writing a temporary distribution-1.0.data/ directory.

File Format

File name convention

The wheel filename is {distribution}-{version}(-{build tag})?-{python tag}-{abi tag}-{platform tag}.whl.

distribution
Distribution name, e.g. 'django', 'pyramid'.
version
Distribution version, e.g. 1.0.
build tag
Optional build number. Must start with a digit. A tie breaker if two wheels have the same version. Sort as the empty string if unspecified, else sort the initial digits as a number, and the remainder lexicographically.
language implementation and version tag
E.g. 'py27', 'py2', 'py3'.
abi tag
E.g. 'cp33m', 'abi3', 'none'.
platform tag
E.g. 'linux_x86_64', 'any'.
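
The build-tag sort rule above can be sketched as a key function (a hypothetical helper, not part of the specification):

```python
import re

# Sketch of the build-tag sort key: the empty tag sorts first, then the
# leading digits compare numerically, then the remainder lexicographically.
def build_tag_key(tag):
    if not tag:
        return ()
    m = re.match(r"(\d+)(.*)", tag)
    return (int(m.group(1)), m.group(2))

print(sorted(["2b", "10", "", "2", "2a"], key=build_tag_key))
# ['', '2', '2a', '2b', '10']
```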

For example, distribution-1.0-1-py27-none-any.whl is the first build of a package called 'distribution', and is compatible with Python 2.7 (any Python 2.7 implementation), with no ABI (pure Python), on any CPU architecture.

The last three components of the filename before the extension are called "compatibility tags." The compatibility tags express the package's basic interpreter requirements and are detailed in PEP 425.
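
As an illustration, the filename convention can be parsed with a regular expression; this is a hypothetical sketch, not a normative grammar (the tags themselves are specified in PEP 425):

```python
import re

# Hypothetical parser for {distribution}-{version}(-{build})?-{python}-{abi}-{platform}.whl
WHEEL_RE = re.compile(
    r"^(?P<distribution>.+?)-(?P<version>[^-]+)"
    r"(?:-(?P<build>\d[^-]*))?"          # build tag must start with a digit
    r"-(?P<python>[^-]+)-(?P<abi>[^-]+)-(?P<platform>[^-]+)\.whl$"
)

m = WHEEL_RE.match("distribution-1.0-1-py27-none-any.whl")
print(m.group("distribution", "version", "build", "python", "abi", "platform"))
# ('distribution', '1.0', '1', 'py27', 'none', 'any')
```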

Escaping and Unicode

Each component of the filename is escaped by replacing runs of non-alphanumeric characters with an underscore _:

re.sub(r"[^\w\d.]+", "_", distribution, flags=re.UNICODE)

The archive filename is Unicode. The packaging tools may only support ASCII package names, but Unicode filenames are supported in this specification.

The filenames inside the archive are encoded as UTF-8. Although some ZIP clients in common use do not properly display UTF-8 filenames, the encoding is supported by both the ZIP specification and Python's zipfile.
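
For example, the escaping rule behaves like this (note that re.UNICODE must be passed via the flags keyword, since the fourth positional argument of re.sub is the count):

```python
import re

def escape(component):
    # Replace each run of non-alphanumeric characters with a single underscore.
    return re.sub(r"[^\w\d.]+", "_", component, flags=re.UNICODE)

print(escape("my-package"))         # my_package
print(escape("name with  spaces"))  # name_with_spaces
```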

File contents

The contents of a wheel file, where {distribution} is replaced with the name of the package, e.g. beaglevote and {version} is replaced with its version, e.g. 1.0.0, consist of:

  1. /, the root of the archive, contains all files to be installed in purelib or platlib as specified in WHEEL. purelib and platlib are usually both site-packages.

  2. {distribution}-{version}.dist-info/ contains metadata.

  3. {distribution}-{version}.data/ contains one subdirectory for each non-empty install scheme key not already covered, where the subdirectory name is an index into a dictionary of install paths (e.g. data, scripts, include, purelib, platlib).

  4. Python scripts must appear in scripts and begin with exactly b'#!python' in order to enjoy script wrapper generation and #!python rewriting at install time. They may have any or no extension.

  5. {distribution}-{version}.dist-info/METADATA is Metadata version 1.1 or greater format metadata.

  6. {distribution}-{version}.dist-info/WHEEL is metadata about the archive itself in the same basic key: value format:

    Wheel-Version: 1.9
    Generator: bdist_wheel 1.9
    Root-Is-Purelib: true
    Tag: py2-none-any
    Tag: py3-none-any
    Build: 1
    Install-Paths-To: wheel/_paths.py
    Install-Paths-To: wheel/_paths.json
    
  7. Wheel-Version is the version number of the Wheel specification.

  8. Generator is the name and optionally the version of the software that produced the archive.

  9. Root-Is-Purelib is true if the top level directory of the archive should be installed into purelib; otherwise the root should be installed into platlib.

  10. Tag is the wheel's expanded compatibility tags; in the example the filename would contain py2.py3-none-any.

  11. Build is the build number and is omitted if there is no build number.

  12. Install-Paths-To is a location relative to the archive that will be overwritten with the install-time paths of each category in the install scheme. See the install paths section. May appear 0 or more times.

  13. A wheel installer should warn if Wheel-Version is greater than the version it supports, and must fail if Wheel-Version has a greater major version than the version it supports.

  14. Wheel, being an installation format that is intended to work across multiple versions of Python, does not generally include .pyc files.

  15. Wheel does not contain setup.py or setup.cfg.

The .dist-info directory

  1. Wheel .dist-info directories include at a minimum METADATA, WHEEL, and RECORD.
  2. METADATA is the package metadata, the same format as PKG-INFO as found at the root of sdists.
  3. WHEEL is the wheel metadata specific to a build of the package.
  4. RECORD is a list of (almost) all the files in the wheel and their secure hashes. Unlike PEP 376, every file except RECORD, which cannot contain a hash of itself, must include its hash. The hash algorithm must be sha256 or better; specifically, md5 and sha1 are not permitted, as signed wheel files rely on the strong hashes in RECORD to validate the integrity of the archive.
  5. PEP 376's INSTALLER and REQUESTED are not included in the archive.
  6. RECORD.jws is used for digital signatures. It is not mentioned in RECORD.
  7. RECORD.p7s is allowed as a courtesy to anyone who would prefer to use S/MIME signatures to secure their wheel files. It is not mentioned in RECORD.
  8. During extraction, wheel installers verify all the hashes in RECORD against the file contents. Apart from RECORD and its signatures, installation will fail if any file in the archive is not both mentioned and correctly hashed in RECORD.
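
A sketch of how one RECORD row could be generated, using sha256 and urlsafe base64 with the padding stripped (the helper name is an illustrative assumption):

```python
import base64
import hashlib

def record_line(path, data):
    # One RECORD row: path, "sha256=" + urlsafe-base64 digest with the
    # trailing '=' padding stripped, and the file size in bytes.
    digest = base64.urlsafe_b64encode(hashlib.sha256(data).digest())
    return "{0},sha256={1},{2}".format(path, digest.rstrip(b"=").decode("ascii"), len(data))

print(record_line("pkg/__init__.py", b"print('hi')\n"))
```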

The .data directory

Any file that is not normally installed inside site-packages goes into the .data directory, named as the .dist-info directory but with the .data/ extension:

distribution-1.0.dist-info/

distribution-1.0.data/

The .data directory contains subdirectories with the scripts, headers, documentation and so forth from the distribution. During installation the contents of these subdirectories are moved onto their destination paths.

If a subdirectory is not found in the install scheme, the installer should emit a warning, and it should be installed at distribution-1.0.data/... as if the package were unpacked by a standard unzip tool.

Install paths

In addition to the distutils install paths, wheel now includes the listed categories based on GNU autotools. This expanded scheme should help installers to implement system policy, but installers may root each category at any location.

A UNIX install scheme might map the categories to their installation paths like this:

{
    'bindir': '$eprefix/bin',
    'sbindir': '$eprefix/sbin',
    'libexecdir': '$eprefix/libexec',
    'sysconfdir': '$prefix/etc',
    'sharedstatedir': '$prefix/com',
    'localstatedir': '$prefix/var',
    'libdir': '$eprefix/lib',
    'static_libdir': r'$prefix/lib',
    'includedir': '$prefix/include',
    'datarootdir': '$prefix/share',
    'datadir': '$datarootdir',
    'mandir': '$datarootdir/man',
    'infodir': '$datarootdir/info',
    'localedir': '$datarootdir/locale',
    'docdir': '$datarootdir/doc/$dist_name',
    'htmldir': '$docdir',
    'dvidir': '$docdir',
    'psdir': '$docdir',
    'pdfdir': '$docdir',
    'pkgdatadir': '$datadir/$dist_name'
}
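
The $-references in such a scheme can be expanded with string.Template; a small sketch (the expand helper and its repeated-substitution strategy are illustrative assumptions, not part of the format):

```python
from string import Template

# A small slice of the scheme above, including a category that
# references another category ($datarootdir).
scheme = {
    'bindir': '$eprefix/bin',
    'datarootdir': '$prefix/share',
    'mandir': '$datarootdir/man',
}

def expand(scheme, **values):
    out = dict(scheme)
    for _ in range(len(out)):  # enough passes to resolve chained references
        out = {k: Template(v).safe_substitute(values, **out)
               for k, v in out.items()}
    return out

print(expand(scheme, prefix='/usr', eprefix='/usr')['mandir'])  # /usr/share/man
```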

If a package needs to find its files at runtime, it can request they be written to a specified file or files by the installer and included in those same files inside the archive itself, relative to their location within the archive (so a wheel is still installed correctly if unpacked with a standard unzip tool, or perhaps not unpacked at all).

If the WHEEL metadata contains these files:

Install-Paths-To: wheel/_paths.py
Install-Paths-To: wheel/_paths.json

Then the wheel installer, when it is about to unpack wheel/_paths.py from the archive, replaces it with the actual paths used at install time. The paths may be absolute or relative to the generated file.

If the filename ends with .py then a Python script is written. The script MUST be executed to get the paths, but it will probably look like this:

data='../wheel-0.26.0.dev1.data/data'
headers='../wheel-0.26.0.dev1.data/headers'
platlib='../wheel-0.26.0.dev1.data/platlib'
purelib='../wheel-0.26.0.dev1.data/purelib'
scripts='../wheel-0.26.0.dev1.data/scripts'
# ...

If the filename ends with .json then a JSON document is written:

{ "data": "../wheel-0.26.0.dev1.data/data", ... }

Only the categories actually used by a particular wheel must be written to this file.

These files are designed to be written to a location that can be found by the installed package without introducing any dependency on a packaging library.
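
A package could consume a generated _paths.json like this; load_paths and the file layout are hypothetical, but the relative-path resolution mirrors the description above:

```python
import json
import os
import tempfile

def load_paths(paths_file):
    # Resolve the relative paths recorded by the installer against the
    # directory containing the generated file.
    here = os.path.dirname(os.path.abspath(paths_file))
    with open(paths_file) as f:
        return {k: os.path.normpath(os.path.join(here, v))
                for k, v in json.load(f).items()}

# Demo with a throwaway file standing in for wheel/_paths.json:
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "_paths.json"), "w") as f:
        json.dump({"data": "../wheel-1.0.data/data"}, f)
    paths = load_paths(os.path.join(tmp, "_paths.json"))
    print(os.path.basename(paths["data"]))  # data
```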

Signed wheel files

Wheel files include an extended RECORD that enables digital signatures. PEP 376's RECORD is altered to include a secure hash in the form digestname=urlsafe_b64encode_nopad(digest) (urlsafe base64 encoding with no trailing = characters) as the second column, instead of an md5sum. All possible entries are hashed, including any generated files such as .pyc files, but not RECORD itself, which cannot contain its own hash. For example:

file.py,sha256=AVTFPZpEKzuHr7OvQZmhaU3LvwKz06AJw8mT_pNh2yI,3144
distribution-1.0.dist-info/RECORD,,

The signature file(s) RECORD.jws and RECORD.p7s are not mentioned in RECORD at all since they can only be added after RECORD is generated. Every other file in the archive must have a correct hash in RECORD or the installation will fail.

If JSON web signatures are used, one or more JSON Web Signature JSON Serialization (JWS-JS) signatures is stored in a file RECORD.jws adjacent to RECORD. JWS is used to sign RECORD by including the SHA-256 hash of RECORD as the signature's JSON payload:

{ "hash": "sha256=ADD-r2urObZHcxBW3Cr-vDCu5RJwT4CaRTHiFmbcIYY" }

(The hash value is the same format used in RECORD.)

If RECORD.p7s is used, it must contain a detached S/MIME format signature of RECORD.

A wheel installer is not required to understand digital signatures but MUST verify the hashes in RECORD against the extracted file contents. When the installer checks file hashes against RECORD, a separate signature checker only needs to establish that RECORD matches the signature.

Comparison to .egg

  1. Wheel is an installation format; egg is importable. Wheel archives do not need to include .pyc and are less tied to a specific Python version or implementation. Wheel can install (pure Python) packages built with previous versions of Python so you don't always have to wait for the packager to catch up.
  2. Wheel uses .dist-info directories; egg uses .egg-info. Wheel is compatible with the new world of Python packaging and the new concepts it brings.
  3. Wheel has a richer file naming convention for today's multi-implementation world. A single wheel archive can indicate its compatibility with a number of Python language versions and implementations, ABIs, and system architectures. Historically the ABI has been specific to a CPython release, wheel is ready for the stable ABI.
  4. Wheel is lossless. The first wheel implementation bdist_wheel always generates egg-info, and then converts it to a .whl. It is also possible to convert existing eggs and bdist_wininst distributions.
  5. Wheel is versioned. Every wheel file contains the version of the wheel specification and the implementation that packaged it. Hopefully the next migration can simply be to Wheel 2.0.
  6. Wheel is a reference to the other Python.

FAQ

Wheel defines a .data directory. Should I put all my data there?

This specification does not have an opinion on how you should organize your code. The .data directory is just a place for any files that are not normally installed inside site-packages or on the PYTHONPATH. In other words, you may continue to use pkgutil.get_data(package, resource) even though those files will usually not be distributed in wheel's .data directory.

Why does wheel include attached signatures?

Attached signatures are more convenient than detached signatures because they travel with the archive. Since only the individual files are signed, the archive can be recompressed without invalidating the signature or individual files can be verified without having to download the whole archive.

Why does wheel allow JWS signatures?

The JOSE specifications of which JWS is a part are designed to be easy to implement, a feature that is also one of wheel's primary design goals. JWS yields a useful, concise pure-Python implementation.

Why does wheel also allow S/MIME signatures?

S/MIME signatures are allowed for users who need or want to use existing public key infrastructure with wheel.

Signed packages are only a basic building block in a secure package update system. Wheel only provides the building block.

What's the deal with "purelib" vs. "platlib"?

Wheel preserves the "purelib" vs. "platlib" distinction, which is significant on some platforms. For example, Fedora installs pure Python packages to '/usr/lib/pythonX.Y/site-packages' and platform dependent packages to '/usr/lib64/pythonX.Y/site-packages'.

A wheel with "Root-Is-Purelib: false" with all its files in {name}-{version}.data/purelib is equivalent to a wheel with "Root-Is-Purelib: true" with those same files in the root, and it is legal to have files in both the "purelib" and "platlib" categories.

In practice a wheel should have only one of "purelib" or "platlib", depending on whether it is pure Python or not, and those files should be at the root with the appropriate "Root-Is-Purelib" setting.

Is it possible to import Python code directly from a wheel file?

Technically, due to the combination of supporting installation via simple extraction and using an archive format that is compatible with zipimport, a subset of wheel files do support being placed directly on sys.path. However, while this behaviour is a natural consequence of the format design, actually relying on it is generally discouraged.

Firstly, wheel is designed primarily as a distribution format, so skipping the installation step also means deliberately avoiding any reliance on features that assume full installation (such as being able to use standard tools like pip and virtualenv to capture and manage dependencies in a way that can be properly tracked for auditing and security update purposes, or integrating fully with the standard build machinery for C extensions by publishing header files in the appropriate place).

Secondly, while some Python software is written to support running directly from a zip archive, it is still common for code to be written assuming it has been fully installed. When that assumption is broken by trying to run the software from a zip archive, the failures can often be obscure and hard to diagnose (especially when they occur in third party libraries). The two most common sources of problems with this are the fact that importing C extensions from a zip archive is not supported by CPython (since doing so is not supported directly by the dynamic loading machinery on any platform) and that when running from a zip archive the __file__ attribute no longer refers to an ordinary filesystem path, but to a combination path that includes both the location of the zip archive on the filesystem and the relative path to the module inside the archive. Even when software correctly uses the abstract resource APIs internally, interfacing with external components may still require the availability of an actual on-disk file.

Like metaclasses, monkeypatching and metapath importers, if you're not already sure you need to take advantage of this feature, you almost certainly don't need it. If you do decide to use it anyway, be aware that many projects will require a failure to be reproduced with a fully installed package before accepting it as a genuine bug.

Appendix

Example urlsafe-base64-nopad implementation:

# urlsafe-base64-nopad for Python 3
import base64

def urlsafe_b64encode_nopad(data):
    return base64.urlsafe_b64encode(data).rstrip(b'=')

def urlsafe_b64decode_nopad(data):
    pad = b'=' * (4 - (len(data) & 3))
    return base64.urlsafe_b64decode(data + pad)

pep-0492 Coroutines with async and await syntax

PEP:492
Title:Coroutines with async and await syntax
Version:$Revision$
Last-Modified:$Date$
Author:Yury Selivanov <yselivanov at sprymix.com>
Discussions-To:<python-dev at python.org>
Status:Final
Type:Standards Track
Content-Type:text/x-rst
Created:09-Apr-2015
Python-Version:3.5
Post-History:17-Apr-2015, 21-Apr-2015, 27-Apr-2015, 29-Apr-2015, 05-May-2015

Abstract

The growth of the Internet and general connectivity has triggered a proportionate need for responsive and scalable code. This proposal aims to answer that need by making writing explicitly asynchronous, concurrent Python code easier and more Pythonic.

It is proposed to make coroutines a proper standalone concept in Python, and introduce new supporting syntax. The ultimate goal is to help establish a common, easily approachable, mental model of asynchronous programming in Python and make it as close to synchronous programming as possible.

This PEP assumes that the asynchronous tasks are scheduled and coordinated by an Event Loop similar to that of stdlib module asyncio.events.AbstractEventLoop. While the PEP is not tied to any specific Event Loop implementation, it is relevant only to the kind of coroutine that uses yield as a signal to the scheduler, indicating that the coroutine will be waiting until an event (such as IO) is completed.

We believe that the changes proposed here will help keep Python relevant and competitive in a quickly growing area of asynchronous programming, as many other languages have adopted, or are planning to adopt, similar features: [2], [5], [6], [7], [8], [10].

Rationale and Goals

Current Python supports implementing coroutines via generators (PEP 342), further enhanced by the yield from syntax introduced in PEP 380. This approach has a number of shortcomings:

  • It is easy to confuse coroutines with regular generators, since they share the same syntax; this is especially true for new developers.
  • Whether or not a function is a coroutine is determined by the presence of yield or yield from statements in its body, which can lead to unobvious errors when such statements appear in or disappear from the function body during refactoring.
  • Support for asynchronous calls is limited to expressions where yield is allowed syntactically, limiting the usefulness of syntactic features, such as with and for statements.

This proposal makes coroutines a native Python language feature, and clearly separates them from generators. This removes generator/coroutine ambiguity, and makes it possible to reliably define coroutines without reliance on a specific library. This also enables linters and IDEs to improve static code analysis and refactoring.

Native coroutines and the associated new syntax features make it possible to define context manager and iteration protocols in asynchronous terms. As shown later in this proposal, the new async with statement lets Python programs perform asynchronous calls when entering and exiting a runtime context, and the new async for statement makes it possible to perform asynchronous calls in iterators.

Specification

This proposal introduces new syntax and semantics to enhance coroutine support in Python.

This specification presumes knowledge of the implementation of coroutines in Python (PEP 342 and PEP 380). Motivation for the syntax changes proposed here comes from the asyncio framework (PEP 3156) and the "Cofunctions" proposal (PEP 3152, now rejected in favor of this specification).

From this point in this document we use the word native coroutine to refer to functions declared using the new syntax. generator-based coroutine is used where necessary to refer to coroutines that are based on generator syntax. coroutine is used in contexts where both definitions are applicable.

New Coroutine Declaration Syntax

The following new syntax is used to declare a native coroutine:

async def read_data(db):
    pass

Key properties of coroutines:

  • async def functions are always coroutines, even if they do not contain await expressions.

  • It is a SyntaxError to have yield or yield from expressions in an async function.

  • Internally, two new code object flags were introduced:

    • CO_COROUTINE is used to mark native coroutines (defined with new syntax.)
    • CO_ITERABLE_COROUTINE is used to make generator-based coroutines compatible with native coroutines (set by types.coroutine() function).

    All coroutines have the CO_GENERATOR flag set.

  • Regular generators, when called, return a generator object; similarly, coroutines return a coroutine object.

  • StopIteration exceptions are not propagated out of coroutines, and are replaced with a RuntimeError. For regular generators such behavior requires a future import (see PEP 479).

  • When a coroutine is garbage collected, a RuntimeWarning is raised if it was never awaited on (see also Debugging Features.)

  • See also Coroutine objects section.
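
Several of these properties can be observed without any event loop; a small sketch (read_data is an arbitrary example name):

```python
async def read_data(db):
    # No await needed: async def alone makes this a coroutine function.
    return db

c = read_data('payload')
print(type(c).__name__)  # coroutine

try:
    c.send(None)  # drive the coroutine manually, as an event loop would
except StopIteration as e:
    print(e.value)  # payload
```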

types.coroutine()

A new function coroutine(gen) is added to the types module. It allows interoperability between existing generator-based coroutines in asyncio and native coroutines introduced by this PEP:

@types.coroutine
def process_data(db):
    data = yield from read_data(db)
    ...

The function applies the CO_ITERABLE_COROUTINE flag to the generator function's code object, making it return a coroutine object.

The function can be used as a decorator, since it modifies generator functions in-place and returns them.

Note that the CO_COROUTINE flag is not applied by types.coroutine(), to make it possible to distinguish native coroutines defined with the new syntax from generator-based coroutines.

Await Expression

The following new await expression is used to obtain a result of coroutine execution:

async def read_data(db):
    data = await db.fetch('SELECT ...')
    ...

await, similarly to yield from, suspends execution of read_data coroutine until db.fetch awaitable completes and returns the result data.

It uses the yield from implementation with an extra step of validating its argument. await only accepts an awaitable, which can be one of:

  • A native coroutine object returned from a native coroutine function.

  • A generator-based coroutine object returned from a generator function decorated with types.coroutine().

  • An object with an __await__ method returning an iterator.

    Any yield from chain of calls ends with a yield. This is a fundamental mechanism of how Futures are implemented. Since, internally, coroutines are a special kind of generators, every await is suspended by a yield somewhere down the chain of await calls (please refer to PEP 3156 for a detailed explanation.)

    To enable this behavior for coroutines, a new magic method called __await__ is added. In asyncio, for instance, to enable Future objects in await statements, the only change is to add __await__ = __iter__ line to asyncio.Future class.

    Objects with __await__ method are called Future-like objects in the rest of this PEP.

    Also, please note that __aiter__ method (see its definition below) cannot be used for this purpose. It is a different protocol, and would be like using __iter__ instead of __call__ for regular callables.

    It is a TypeError if __await__ returns anything but an iterator.

  • Objects defined with CPython C API with a tp_as_async->am_await function, returning an iterator (similar to __await__ method).

It is a SyntaxError to use await outside of an async def function (like it is a SyntaxError to use yield outside of def function.)

It is a TypeError to pass anything other than an awaitable object to an await expression.
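
A minimal Future-like object can be sketched as follows; the Ready class is a hypothetical example of the __await__ protocol, driven manually with send() rather than an event loop:

```python
class Ready:
    # A completed "future": awaiting it produces its value immediately.
    def __init__(self, value):
        self.value = value

    def __await__(self):
        if False:
            yield  # never runs; makes __await__ a generator (an iterator)
        return self.value  # becomes the result of the await expression

async def main():
    return await Ready('spam')

c = main()
try:
    c.send(None)  # drive the coroutine to completion without an event loop
except StopIteration as e:
    print(e.value)  # spam
```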

Updated operator precedence table

The await keyword is defined as follows:

power ::=  await ["**" u_expr]
await ::=  ["await"] primary

where "primary" represents the most tightly bound operations of the language. Its syntax is:

primary ::=  atom | attributeref | subscription | slicing | call

See Python Documentation [12] and Grammar Updates section of this proposal for details.

The key difference of await from the yield and yield from operators is that await expressions do not require parentheses around them most of the time.

Also, yield from allows any expression as its argument, including expressions like yield from a() + b(), which would be parsed as yield from (a() + b()) and is almost always a bug. In general, the result of any arithmetic operation is not an awaitable object. To avoid this kind of mistake, it was decided to make await precedence lower than [], (), and ., but higher than ** operators.

Operator Description
yield x, yield from x Yield expression
lambda Lambda expression
if -- else Conditional expression
or Boolean OR
and Boolean AND
not x Boolean NOT
in, not in, is, is not, <, <=, >, >=, !=, == Comparisons, including membership tests and identity tests
| Bitwise OR
^ Bitwise XOR
& Bitwise AND
<<, >> Shifts
+, - Addition and subtraction
*, @, /, //, % Multiplication, matrix multiplication, division, remainder
+x, -x, ~x Positive, negative, bitwise NOT
** Exponentiation
await x Await expression
x[index], x[index:index], x(arguments...), x.attribute Subscription, slicing, call, attribute reference
(expressions...), [expressions...], {key: value...}, {expressions...} Binding or tuple display, list display, dictionary display, set display

Examples of "await" expressions

Valid syntax examples:

Expression Will be parsed as
if await fut: pass if (await fut): pass
if await fut + 1: pass if (await fut) + 1: pass
pair = await fut, 'spam' pair = (await fut), 'spam'
with await fut, open(): pass with (await fut), open(): pass
await foo()['spam'].baz()() await ( foo()['spam'].baz()() )
return await coro() return ( await coro() )
res = await coro() ** 2 res = (await coro()) ** 2
func(a1=await coro(), a2=0) func(a1=(await coro()), a2=0)
await foo() + await bar() (await foo()) + (await bar())
-await foo() -(await foo())

Invalid syntax examples:

Expression Should be written as
await await coro() await (await coro())
await -coro() await (-coro())

Asynchronous Context Managers and "async with"

An asynchronous context manager is a context manager that is able to suspend execution in its enter and exit methods.

To make this possible, a new protocol for asynchronous context managers is proposed. Two new magic methods are added: __aenter__ and __aexit__. Both must return an awaitable.

An example of an asynchronous context manager:

class AsyncContextManager:
    async def __aenter__(self):
        await log('entering context')

    async def __aexit__(self, exc_type, exc, tb):
        await log('exiting context')

New Syntax

A new statement for asynchronous context managers is proposed:

async with EXPR as VAR:
    BLOCK

which is semantically equivalent to:

mgr = (EXPR)
aexit = type(mgr).__aexit__
aenter = type(mgr).__aenter__(mgr)
exc = True

VAR = await aenter
try:
    BLOCK
except:
    if not await aexit(mgr, *sys.exc_info()):
        raise
else:
    await aexit(mgr, None, None, None)

As with regular with statements, it is possible to specify multiple context managers in a single async with statement.

It is an error to pass a regular context manager without __aenter__ and __aexit__ methods to async with. It is a SyntaxError to use async with outside of an async def function.

Example

With asynchronous context managers it is easy to implement proper database transaction managers for coroutines:

async def commit(session, data):
    ...

    async with session.transaction():
        ...
        await session.update(data)
        ...

Code that needs locking also looks lighter:

async with lock:
    ...

instead of:

with (yield from lock):
    ...
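
End to end, the new statement can be exercised with asyncio; a runnable sketch (the Tracker class is hypothetical, and asyncio.run requires Python 3.7+):

```python
import asyncio

class Tracker:
    # A toy asynchronous context manager that records its lifecycle.
    def __init__(self):
        self.events = []

    async def __aenter__(self):
        self.events.append('enter')
        return self

    async def __aexit__(self, exc_type, exc, tb):
        self.events.append('exit')
        return False  # do not suppress exceptions

async def main():
    t = Tracker()
    async with t as cm:
        cm.events.append('body')
    return t.events

print(asyncio.run(main()))  # ['enter', 'body', 'exit']
```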

Asynchronous Iterators and "async for"

An asynchronous iterable is able to call asynchronous code in its iter implementation, and an asynchronous iterator can call asynchronous code in its next method. To support asynchronous iteration:

  1. An object must implement an __aiter__ method returning an awaitable resulting in an asynchronous iterator object.
  2. An asynchronous iterator object must implement an __anext__ method returning an awaitable.
  3. To stop iteration __anext__ must raise a StopAsyncIteration exception.

An example of asynchronous iterable:

class AsyncIterable:
    async def __aiter__(self):
        return self

    async def __anext__(self):
        data = await self.fetch_data()
        if data:
            return data
        else:
            raise StopAsyncIteration

    async def fetch_data(self):
        ...

New Syntax

A new statement for iterating through asynchronous iterators is proposed:

async for TARGET in ITER:
    BLOCK
else:
    BLOCK2

which is semantically equivalent to:

iter = (ITER)
iter = await type(iter).__aiter__(iter)
running = True
while running:
    try:
        TARGET = await type(iter).__anext__(iter)
    except StopAsyncIteration:
        running = False
    else:
        BLOCK
else:
    BLOCK2

It is a TypeError to pass a regular iterable without __aiter__ method to async for. It is a SyntaxError to use async for outside of an async def function.

As with the regular for statement, async for has an optional else clause.

Example 1

With asynchronous iteration protocol it is possible to asynchronously buffer data during iteration:

async for data in cursor:
    ...

Where cursor is an asynchronous iterator that prefetches N rows of data from a database after every N iterations.

The following code illustrates new asynchronous iteration protocol:

class Cursor:
    def __init__(self):
        self.buffer = collections.deque()

    def _prefetch(self):
        ...

    async def __aiter__(self):
        return self

    async def __anext__(self):
        if not self.buffer:
            self.buffer = await self._prefetch()
            if not self.buffer:
                raise StopAsyncIteration
        return self.buffer.popleft()

then the Cursor class can be used as follows:

async for row in Cursor():
    print(row)

which would be equivalent to the following code:

i = await Cursor().__aiter__()
while True:
    try:
        row = await i.__anext__()
    except StopAsyncIteration:
        break
    else:
        print(row)

Example 2

The following is a utility class that transforms a regular iterable to an asynchronous one. While this is not a very useful thing to do, the code illustrates the relationship between regular and asynchronous iterators.

class AsyncIteratorWrapper:
    def __init__(self, obj):
        self._it = iter(obj)

    async def __aiter__(self):
        return self

    async def __anext__(self):
        try:
            value = next(self._it)
        except StopIteration:
            raise StopAsyncIteration
        return value

async for letter in AsyncIteratorWrapper("abc"):
    print(letter)

Why StopAsyncIteration?

Coroutines are still based on generators internally. So, before PEP 479, there was no fundamental difference between

def g1():
    yield from fut
    return 'spam'

and

def g2():
    yield from fut
    raise StopIteration('spam')

And since PEP 479 has been accepted and is enabled by default for coroutines, the following example will have its StopIteration wrapped into a RuntimeError:

async def a1():
    await fut
    raise StopIteration('spam')

The only way to tell the outside code that the iteration has ended is to raise something other than StopIteration. Therefore, a new built-in exception class StopAsyncIteration was added.

Moreover, with semantics from PEP 479, all StopIteration exceptions raised in coroutines are wrapped in RuntimeError.

Coroutine objects

Differences from generators

This section applies only to native coroutines, i.e. those with the CO_COROUTINE flag, defined with the new async def syntax.

The behavior of existing *generator-based coroutines* in asyncio remains unchanged.

Great effort has been made to make sure that coroutines and generators are treated as distinct concepts:

  1. Native coroutine objects do not implement __iter__ and __next__ methods. Therefore, they cannot be iterated over or passed to iter(), list(), tuple() and other built-ins. They also cannot be used in a for..in loop.

    An attempt to use __iter__ or __next__ on a native coroutine object will result in a TypeError.

  2. Plain generators cannot yield from native coroutines: doing so will result in a TypeError.

  3. Generator-based coroutines (which, for asyncio code, must be decorated with @asyncio.coroutine) can yield from native coroutine objects.

  4. inspect.isgenerator() and inspect.isgeneratorfunction() return False for native coroutine objects and native coroutine functions.
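Points 1 and 2 above can be demonstrated directly; the following sketch (the coro and plain_gen names are illustrative) shows both TypeErrors:

```python
async def coro():
    return 'spam'

c = coro()

# 1. Native coroutines are not iterable
try:
    iter(c)
except TypeError:
    print('native coroutines are not iterable')

# 2. A plain generator cannot 'yield from' a native coroutine
def plain_gen():
    yield from coro()   # deliberately invalid

try:
    list(plain_gen())
except TypeError:
    print("plain generators cannot 'yield from' coroutines")

c.close()  # suppress the "never awaited" warning
```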

Coroutine object methods

Coroutines are based on generators internally, and thus share the implementation. Similarly to generator objects, coroutines have throw(), send() and close() methods. StopIteration and GeneratorExit play the same role for coroutines (although PEP 479 is enabled by default for coroutines). See PEP 342, PEP 380, and the Python documentation [11] for details.

The throw() and send() methods for coroutines are used to push values and raise errors into Future-like objects.
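The role of StopIteration as the carrier of a coroutine's return value can be seen by driving a coroutine manually with send(), without any event loop (add_one is an illustrative name):

```python
async def add_one(x):
    # no suspension points; completes on the first send()
    return x + 1

c = add_one(41)
try:
    c.send(None)       # start the coroutine
except StopIteration as exc:
    # the return value travels in the StopIteration exception
    print(exc.value)   # 42
```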

Debugging Features

A common beginner mistake is forgetting to use yield from on coroutines:

@asyncio.coroutine
def useful():
    asyncio.sleep(1) # this will do nothing without 'yield from'

To debug this kind of mistake there is a special debug mode in asyncio, in which the @coroutine decorator wraps all functions with a special object whose destructor logs a warning. Whenever a wrapped generator gets garbage collected, a detailed logging message is generated with information about where exactly the decorated function was defined, the stack trace of where it was collected, etc. The wrapper object also provides a convenient __repr__ with detailed information about the generator.

The only problem is how to enable these debug capabilities. Since debug facilities should be a no-op in production mode, the @coroutine decorator makes the decision of whether to wrap or not based on the PYTHONASYNCIODEBUG environment variable. This way it is possible to run asyncio programs with asyncio's own functions instrumented. EventLoop.set_debug, a different debug facility, has no impact on the @coroutine decorator's behavior.

With this proposal, coroutines become a native concept, distinct from generators. In addition to a RuntimeWarning being raised on coroutines that were never awaited, it is proposed to add two new functions to the sys module: set_coroutine_wrapper and get_coroutine_wrapper. This is to enable advanced debugging facilities in asyncio and other frameworks (such as displaying where exactly a coroutine was created, and a more detailed stack trace of where it was garbage collected).

New Standard Library Functions

  • types.coroutine(gen). See types.coroutine() section for details.
  • inspect.iscoroutine(obj) returns True if obj is a coroutine object.
  • inspect.iscoroutinefunction(obj) returns True if obj is a coroutine function.
  • inspect.isawaitable(obj) returns True if obj can be used in await expression. See Await Expression for details.
  • sys.set_coroutine_wrapper(wrapper) allows interception of coroutine object creation. wrapper must be either a callable that accepts one argument (a coroutine object), or None. None resets the wrapper. If called twice, the new wrapper replaces the previous one. The function is thread-specific. See Debugging Features for more details.
  • sys.get_coroutine_wrapper() returns the current wrapper object. Returns None if no wrapper was set. The function is thread-specific. See Debugging Features for more details.
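The inspect additions can be sketched without an event loop (coro_func and plain_func are illustrative names; sys.set_coroutine_wrapper is omitted here, as it was later deprecated in Python 3.7 and removed in 3.8):

```python
import inspect

async def coro_func():
    pass

def plain_func():
    pass

print(inspect.iscoroutinefunction(coro_func))   # True
print(inspect.iscoroutinefunction(plain_func))  # False

c = coro_func()
print(inspect.iscoroutine(c))    # True
print(inspect.isawaitable(c))    # True: usable in an await expression
c.close()  # suppress the "never awaited" warning
```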

New Abstract Base Classes

In order to allow better integration with existing frameworks (such as Tornado, see [13]) and compilers (such as Cython, see [16]), two new Abstract Base Classes (ABC) are added:

  • collections.abc.Awaitable -- ABC for Future-like classes that implement the __await__ method.
  • collections.abc.Coroutine -- ABC for coroutine objects that implement the send(value), throw(type, exc, tb), close() and __await__() methods.

To allow easy testing if objects support asynchronous iteration, two more ABCs are added:

  • collections.abc.AsyncIterable -- tests for __aiter__ method.
  • collections.abc.AsyncIterator -- tests for __aiter__ and __anext__ methods.
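These ABCs perform structural checks, so any class with the right methods passes; a sketch with a hypothetical Ticker class:

```python
from collections.abc import AsyncIterable, AsyncIterator

class Ticker:
    # Only method presence matters to these ABCs; Ticker never
    # registers with or inherits from them.
    def __aiter__(self):
        return self

    async def __anext__(self):
        raise StopAsyncIteration

print(issubclass(Ticker, AsyncIterable))     # True: has __aiter__
print(issubclass(Ticker, AsyncIterator))     # True: has both methods
print(isinstance(Ticker(), AsyncIterator))   # True
```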

Glossary

Native coroutine function
A coroutine function declared with async def; it uses await and return value. See New Coroutine Declaration Syntax for details.
Native coroutine
Returned from a native coroutine function. See Await Expression for details.
Generator-based coroutine function
Coroutines based on generator syntax. The most common example is functions decorated with @asyncio.coroutine.
Generator-based coroutine
Returned from a generator-based coroutine function.
Coroutine
Either native coroutine or generator-based coroutine.
Coroutine object
Either native coroutine object or generator-based coroutine object.
Future-like object
An object with an __await__ method, or a C object with tp_as_async->am_await function, returning an iterator. Can be consumed by an await expression in a coroutine. A coroutine waiting for a Future-like object is suspended until the Future-like object's __await__ completes, and returns the result. See Await Expression for details.
Awaitable
A Future-like object or a coroutine object. See Await Expression for details.
Asynchronous context manager
An asynchronous context manager has __aenter__ and __aexit__ methods and can be used with async with. See Asynchronous Context Managers and "async with" for details.
Asynchronous iterable
An object with an __aiter__ method, which must return an asynchronous iterator object. Can be used with async for. See Asynchronous Iterators and "async for" for details.
Asynchronous iterator
An asynchronous iterator has an __anext__ method. See Asynchronous Iterators and "async for" for details.

List of functions and methods

Method              Can contain                          Can't contain
------------------  -----------------------------------  -----------------
async def func      await, return value                  yield, yield from
async def __a*__    await, return value                  yield, yield from
def __a*__          return awaitable                     await
def __await__       yield, yield from, return iterable   await
generator           yield, yield from, return value      await

Where:

  • "async def func": native coroutine;
  • "async def __a*__": __aiter__, __anext__, __aenter__, __aexit__ defined with the async keyword;
  • "def __a*__": __aiter__, __anext__, __aenter__, __aexit__ defined without the async keyword, must return an awaitable;
  • "def __await__": __await__ method to implement Future-like objects;
  • generator: a "regular" generator, i.e. a function defined with def that contains at least one yield or yield from expression.

Transition Plan

To avoid backwards compatibility issues with the async and await keywords, it was decided to modify tokenizer.c so that it:

  • recognizes async def NAME tokens combination;
  • keeps track of regular def and async def indented blocks;
  • while tokenizing an async def block, it replaces the 'async' NAME token with ASYNC, and the 'await' NAME token with AWAIT;
  • while tokenizing a def block, it yields 'async' and 'await' NAME tokens as is.

This approach allows for seamless combination of new syntax features (all of them available only in async functions) with any existing code.

An example of having "async def" and an "async" attribute in one piece of code:

class Spam:
    async = 42

async def ham():
    print(getattr(Spam, 'async'))

# The coroutine can be executed and will print '42'

Backwards Compatibility

This proposal preserves 100% backwards compatibility.

asyncio

The asyncio module was adapted and tested to work with coroutines and the new statements. Backwards compatibility is 100% preserved, i.e. all existing code will work as-is.

The required changes are mainly:

  1. Modify @asyncio.coroutine decorator to use new types.coroutine() function.
  2. Add __await__ = __iter__ line to asyncio.Future class.
  3. Add ensure_future() as an alias for async() function. Deprecate async() function.

asyncio migration strategy

Because plain generators cannot yield from native coroutine objects (see Differences from generators section for more details), it is advised to make sure that all generator-based coroutines are decorated with @asyncio.coroutine before starting to use the new syntax.
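The interop that the decorator enables can be sketched without an event loop using types.coroutine, which @asyncio.coroutine builds on (the native, gen_based, and main names are illustrative):

```python
import types

async def native():
    return 'spam'

@types.coroutine
def gen_based():
    # Allowed only because the decorator sets the
    # CO_ITERABLE_COROUTINE flag on the resulting generator.
    result = yield from native()
    return result

async def main():
    # The flag also makes gen_based() awaitable from native coroutines.
    return await gen_based()

try:
    main().send(None)
except StopIteration as exc:
    print(exc.value)   # spam
```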

async/await in CPython code base

There is no use of the await name in CPython.

async is mostly used by asyncio. We are addressing this by renaming the async() function to ensure_future() (see the asyncio section for details).

Another use of the async keyword is in Lib/xml/dom/xmlbuilder.py, which defines an async = False attribute for the DocumentLS class. There is no documentation or tests for it, and it is not used anywhere else in CPython. It is replaced with a getter that raises a DeprecationWarning, advising use of the async_ attribute instead.

Grammar Updates

Grammar changes are fairly minimal:

decorated: decorators (classdef | funcdef | async_funcdef)
async_funcdef: ASYNC funcdef

compound_stmt: (if_stmt | while_stmt | for_stmt | try_stmt | with_stmt
                | funcdef | classdef | decorated | async_stmt)

async_stmt: ASYNC (funcdef | with_stmt | for_stmt)

power: atom_expr ['**' factor]
atom_expr: [AWAIT] atom trailer*

Transition Period Shortcomings

There is just one.

Until async and await become proper keywords, it is not possible (or at least very hard) to fix tokenizer.c to recognize them on the same line as the def keyword:

# async and await will always be parsed as variables

async def outer():                             # 1
    def nested(a=(await fut)):
        pass

async def foo(): return (await fut)            # 2

Since await and async in such cases are parsed as NAME tokens, a SyntaxError will be raised.

To work around these issues, the above examples can easily be rewritten in a more readable form:

async def outer():                             # 1
    a_default = await fut
    def nested(a=a_default):
        pass

async def foo():                               # 2
    return (await fut)

This limitation will go away as soon as async and await are proper keywords.

Deprecation Plans

async and await names will be softly deprecated in CPython 3.5 and 3.6. In 3.7 we will transform them to proper keywords. Making async and await proper keywords before 3.7 might make it harder for people to port their code to Python 3.

Design Considerations

PEP 3152

PEP 3152 by Gregory Ewing proposes a different mechanism for coroutines (called "cofunctions"). Some key points:

  1. A new keyword codef to declare a cofunction. A cofunction is always a generator, even if there are no cocall expressions inside it. Maps to async def in this proposal.

  2. A new keyword cocall to call a cofunction. Can only be used inside a cofunction. Maps to await in this proposal (with some differences, see below.)

  3. It is not possible to call a cofunction without a cocall keyword.

  4. cocall grammatically requires parentheses after it:

    atom: cocall | <existing alternatives for atom>
    cocall: 'cocall' atom cotrailer* '(' [arglist] ')'
    cotrailer: '[' subscriptlist ']' | '.' NAME
    
  5. cocall f(*args, **kwds) is semantically equivalent to yield from f.__cocall__(*args, **kwds).

Differences from this proposal:

  1. There is no equivalent of __cocall__ in this PEP. In PEP 3152, __cocall__ is called and its result is passed to yield from in a cocall expression; the await keyword instead expects an awaitable object, validates its type, and executes yield from on it. The __await__ method is similar to __cocall__, but is used only to define Future-like objects.

  2. await is defined in almost the same way as yield from in the grammar (it is later enforced that await can only be used inside an async def function). It is possible to simply write await future, whereas cocall always requires parentheses.

  3. To make asyncio work with PEP 3152 it would be required to modify @asyncio.coroutine decorator to wrap all functions in an object with a __cocall__ method, or to implement __cocall__ on generators. To call cofunctions from existing generator-based coroutines it would be required to use costart(cofunc, *args, **kwargs) built-in.

  4. Since it is impossible to call a cofunction without a cocall keyword, it automatically prevents the common mistake of forgetting to use yield from on generator-based coroutines. This proposal addresses this problem with a different approach, see Debugging Features.

  5. A shortcoming of requiring a cocall keyword to call a coroutine is that if it is decided to implement coroutine-generators -- coroutines with yield or async yield expressions -- we wouldn't need a cocall keyword to call them. So we would end up having __cocall__ and no __call__ for regular coroutines, and __call__ and no __cocall__ for coroutine-generators.

  6. Requiring parentheses grammatically also introduces a whole lot of new problems.

    The following code:

    await fut
    await function_returning_future()
    await asyncio.gather(coro1(arg1, arg2), coro2(arg1, arg2))
    

    would look like:

    cocall fut()  # or cocall costart(fut)
    cocall (function_returning_future())()
    cocall asyncio.gather(costart(coro1, arg1, arg2),
                          costart(coro2, arg1, arg2))
    
  7. There are no equivalents of async for and async with in PEP 3152.

Coroutine-generators

With the async for keyword it is desirable to have a concept of a coroutine-generator -- a coroutine with yield and yield from expressions. To avoid any ambiguity with regular generators, we would likely require an async keyword before yield, and async yield from would raise a StopAsyncIteration exception.

While it is possible to implement coroutine-generators, we believe they are out of scope for this proposal. They are an advanced concept that should be carefully considered and balanced, with non-trivial changes to the implementation of current generator objects. This is a matter for a separate PEP.

Why "async" and "await" keywords

async/await is not a new concept in programming languages:

  • C# has had it for a long time [5];
  • proposal to add async/await in ECMAScript 7 [2]; see also Traceur project [9];
  • Facebook's Hack/HHVM [6];
  • Google's Dart language [7];
  • Scala [8];
  • proposal to add async/await to C++ [10];
  • and many other less popular languages.

This is a huge benefit: some users already have experience with async/await, and it makes working with many languages in one project easier (Python with ECMAScript 7, for instance).

Why "__aiter__" returns awaitable

In principle, __aiter__ could be a regular function. There are several good reasons to make it a coroutine:

  • since most __anext__, __aenter__, and __aexit__ methods are coroutines, users would often mistakenly define __aiter__ with async anyway;
  • there might be a need to run some asynchronous operations in __aiter__, for instance to prepare DB queries or do some file operations.

Importance of "async" keyword

While it is possible to just implement the await expression and treat all functions with at least one await as coroutines, this approach makes API design, code refactoring, and long-term support harder.

Let's pretend that Python only has await keyword:

def useful():
    ...
    await log(...)
    ...

def important():
    await useful()

If the useful() function is refactored and someone removes all await expressions from it, it would become a regular Python function, and all code that depends on it, including important(), would be broken. To mitigate this issue a decorator similar to @asyncio.coroutine has to be introduced.

Why "async def"

For some people, the bare async name(): pass syntax might look more appealing than async def name(): pass. It is certainly easier to type. But on the other hand, it breaks the symmetry between async def, async with and async for, where async is a modifier stating that the statement is asynchronous. Using async def is also more consistent with the existing grammar.

Why not "await for" and "await with"

async is an adjective, and hence a better choice for a statement-qualifier keyword. await for/with would imply that something is awaiting the completion of a for or with statement.

Why "async def" and not "def async"

The async keyword is a statement qualifier. Good analogies are the "static", "public", and "unsafe" keywords from other languages. "async for" is an asynchronous "for" statement, "async with" is an asynchronous "with" statement, and "async def" is an asynchronous function.

Having "async" after the main statement keyword might introduce some confusion, like "for async item in iterator" can be read as "for each asynchronous item in iterator".

Having the async keyword before def, with and for also makes the language grammar simpler. And "async def" better separates coroutines from regular functions visually.

Why not a __future__ import

The Transition Plan section explains how the tokenizer is modified to treat async and await as keywords only inside async def blocks. Hence async def fills the role that a module-level compiler declaration like from __future__ import async_await would otherwise fill.

Why magic methods start with "a"

The new asynchronous magic methods __aiter__, __anext__, __aenter__, and __aexit__ all start with the same prefix "a". An alternative proposal was to use an "async" prefix, so that __aiter__ would become __async_iter__. However, to align the new magic methods with existing ones, such as __radd__ and __iadd__, it was decided to use the shorter version.

Why not reuse existing magic names

An alternative idea for the new asynchronous iterators and context managers was to reuse existing magic methods by adding an async keyword to their declarations:

class CM:
    async def __enter__(self): # instead of __aenter__
        ...

This approach has the following downsides:

  • it would not be possible to create an object that works in both with and async with statements;
  • it would break backwards compatibility, as nothing prohibits returning Future-like objects from __enter__ and/or __exit__ in Python <= 3.4;
  • one of the main points of this proposal is to make native coroutines as simple and foolproof as possible, hence the clear separation of the protocols.
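The first downside is exactly what the separate __aenter__/__aexit__ names avoid: one object can support both statements. A sketch with a hypothetical DualLock class, driven without an event loop:

```python
class DualLock:
    """Usable in both 'with' and 'async with', which would be
    impossible if the two protocols shared __enter__/__exit__."""
    def __enter__(self):
        return self

    def __exit__(self, *exc):
        return False

    async def __aenter__(self):
        return self

    async def __aexit__(self, *exc):
        return False

with DualLock():
    print('sync path ok')

async def use():
    async with DualLock():
        return 'async path ok'

try:
    use().send(None)
except StopIteration as exc:
    print(exc.value)   # async path ok
```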

Why not reuse existing "for" and "with" statements

The vision behind existing generator-based coroutines and this proposal is to make it easy for users to see where code might be suspended. Making the existing "for" and "with" statements recognize asynchronous iterators and context managers would inevitably create implicit suspension points, making it harder to reason about the code.

Comprehensions

Syntax for asynchronous comprehensions could be provided, but this construct is outside of the scope of this PEP.

Async lambda functions

Syntax for asynchronous lambda functions could be provided, but this construct is outside of the scope of this PEP.

Performance

Overall Impact

This proposal introduces no observable performance impact. Here is an output of python's official set of benchmarks [4]:

python perf.py -r -b default ../cpython/python.exe ../cpython-aw/python.exe

[skipped]

Report on Darwin ysmac 14.3.0 Darwin Kernel Version 14.3.0:
Mon Mar 23 11:59:05 PDT 2015; root:xnu-2782.20.48~5/RELEASE_X86_64
x86_64 i386

Total CPU cores: 8

### etree_iterparse ###
Min: 0.365359 -> 0.349168: 1.05x faster
Avg: 0.396924 -> 0.379735: 1.05x faster
Significant (t=9.71)
Stddev: 0.01225 -> 0.01277: 1.0423x larger

The following not significant results are hidden, use -v to show them:
django_v2, 2to3, etree_generate, etree_parse, etree_process, fastpickle,
fastunpickle, json_dump_v2, json_load, nbody, regex_v8, tornado_http.

Tokenizer modifications

There is no observable slowdown in parsing Python files with the modified tokenizer: parsing one 12 MB file (Lib/test/test_binop.py repeated 1000 times) takes the same amount of time.

async/await

The following micro-benchmark was used to determine performance difference between "async" functions and generators:

import sys
import time

def binary(n):
    if n <= 0:
        return 1
    l = yield from binary(n - 1)
    r = yield from binary(n - 1)
    return l + 1 + r

async def abinary(n):
    if n <= 0:
        return 1
    l = await abinary(n - 1)
    r = await abinary(n - 1)
    return l + 1 + r

def timeit(gen, depth, repeat):
    t0 = time.time()
    for _ in range(repeat):
        list(gen(depth))
    t1 = time.time()
    print('{}({}) * {}: total {:.3f}s'.format(
        gen.__name__, depth, repeat, t1-t0))

The result is that there is no observable performance difference. Minimum timing of 3 runs:

abinary(19) * 30: total 12.985s
binary(19) * 30: total 12.953s

Note that depth of 19 means 1,048,575 calls.

Reference Implementation

The reference implementation can be found here: [3].

List of high-level changes and new protocols

  1. New syntax for defining coroutines: async def and new await keyword.
  2. New __await__ method for Future-like objects, and new tp_as_async->am_await slot in PyTypeObject.
  3. New syntax for asynchronous context managers: async with. And associated protocol with __aenter__ and __aexit__ methods.
  4. New syntax for asynchronous iteration: async for. And the associated protocol with __aiter__, __anext__ and a new built-in exception StopAsyncIteration. New tp_as_async->am_aiter and tp_as_async->am_anext slots in PyTypeObject.
  5. New AST nodes: AsyncFunctionDef, AsyncFor, AsyncWith, Await.
  6. New functions: sys.set_coroutine_wrapper(callback), sys.get_coroutine_wrapper(), types.coroutine(gen), inspect.iscoroutinefunction(func), inspect.iscoroutine(obj), and inspect.isawaitable(obj).
  7. New CO_COROUTINE and CO_ITERABLE_COROUTINE bit flags for code objects.
  8. New ABCs: collections.abc.Awaitable, collections.abc.Coroutine, collections.abc.AsyncIterable, and collections.abc.AsyncIterator.

While the list of changes and new things is not short, it is important to understand that most users will not use these features directly. They are intended to be used in frameworks and libraries to provide users with convenient and unambiguous APIs built on the async def, await, async for and async with syntax.

Working example

All concepts proposed in this PEP are implemented [3] and can be tested.

import asyncio

async def echo_server():
    print('Serving on localhost:8000')
    await asyncio.start_server(handle_connection,
                               'localhost', 8000)

async def handle_connection(reader, writer):
    print('New connection...')

    while True:
        data = await reader.read(8192)

        if not data:
            break

        print('Sending {:.10}... back'.format(repr(data)))
        writer.write(data)

loop = asyncio.get_event_loop()
loop.run_until_complete(echo_server())
try:
    loop.run_forever()
finally:
    loop.close()

Acceptance

PEP 492 was accepted by Guido, Tuesday, May 5, 2015 [14].

Implementation

The implementation is tracked in issue 24017 [15]. It was committed on May 11, 2015.

Acknowledgments

I thank Guido van Rossum, Victor Stinner, Elvis Pranskevichus, Andrew Svetlov, Łukasz Langa, Greg Ewing, Stephen J. Turnbull, Jim J. Jewett, Brett Cannon, Nick Coghlan, Steven D'Aprano, Paul Moore, Nathaniel Smith, Ethan Furman, Stefan Behnel, Paul Sokolovsky, Victor Petrovykh, and many others for their feedback, ideas, edits, criticism, code reviews, and discussions around this PEP.

pep-0493 HTTPS verification recommendations for Python 2.7 redistributors

PEP:493
Title:HTTPS verification recommendations for Python 2.7 redistributors
Version:$Revision$
Last-Modified:$Date$
Author:Nick Coghlan <ncoghlan at gmail.com>, Robert Kuska <rkuska at redhat.com>
Status:Draft
Type:Informational
Content-Type:text/x-rst
Created:10-May-2015

Abstract

PEP 476 updated Python's default handling of HTTPS certificates to be appropriate for communication over the public internet. The Python 2.7 long term maintenance series was judged to be in scope for this change, with the new behaviour introduced in the Python 2.7.9 maintenance release.

This PEP provides recommendations to downstream redistributors wishing to provide a smoother migration experience when helping their users to manage this change in Python's default behaviour.

Note that this PEP is not currently accepted, so it is a *proposed* recommendation, rather than an active one.

Rationale

PEP 476 changed Python's default behaviour to better match the needs and expectations of developers operating over the public internet, a category which appears to include most new Python developers. It is the position of the authors of this PEP that this was a correct decision.

However, it is also the case that this change does cause problems for infrastructure administrators operating private intranets that rely on self-signed certificates, or otherwise encounter problems with the new default certificate verification settings.

The long term answer for such environments is to update their internal certificate management to at least match the standards set by the public internet, but in the meantime, it is desirable to offer these administrators a way to continue receiving maintenance updates to the Python 2.7 series, without having to gate that on upgrades to their certificate management infrastructure.

PEP 476 did attempt to address this question, by covering how to revert the new settings process wide by monkeypatching the ssl module to restore the old behaviour. Unfortunately, the sitecustomize.py based technique proposed to allow system administrators to disable the feature by default in their Standard Operating Environment definition has been determined to be insufficient in at least some cases. The specific case of interest to the authors of this PEP is the one where a Linux distributor aims to provide their users with a smoother migration path than the standard one provided by consuming upstream CPython 2.7 releases directly, but other potential challenges have also been pointed out with updating embedded Python runtimes and other user level installations of Python.

Rather than allowing a plethora of mutually incompatible migration techniques to bloom, this PEP proposes two alternative approaches that redistributors may take when addressing these problems. Redistributors may choose to implement one, both, or neither of these approaches based on their assessment of the needs of their particular userbase.

These designs are being proposed as a recommendation for redistributors, rather than as new upstream features, as they are needed purely to support legacy environments migrating from older versions of Python 2.7. Neither approach is being proposed as an upstream Python 2.7 feature, nor as a feature in any version of Python 3 (whether published directly by the Python Software Foundation or by a redistributor).

Recommendation for an environment variable based security downgrade

Some redistributors may wish to provide a per-application option to disable certificate verification in selected applications that run on or embed CPython without needing to modify the application itself.

In these cases, a configuration mechanism is needed that provides:

  • an opt-out model that allows certificate verification to be selectively turned off for particular applications after upgrading to a version of Python that verifies certificates by default
  • the ability for all users to configure this setting on a per-application basis, rather than on a per-system, or per-Python-installation basis

This approach may be used for any redistributor provided version of Python 2.7, including those that advertise themselves as providing Python 2.7.9 or later.

Example implementation

def _get_https_context_factory():
    # Check for an environment variable override of the default behaviour.
    # This code is intended to live inside the ssl module, where os is
    # imported and both context factories are module-level names.
    config_setting = os.environ.get('PYTHONHTTPSVERIFY')
    if config_setting == '0':
        return _create_unverified_context
    return create_default_context

_create_default_https_context = _get_https_context_factory()

Security Considerations

Relative to an unmodified version of CPython 2.7.9 or later, this approach does introduce a new downgrade attack against the default security settings that potentially allows a sufficiently determined attacker to revert Python to the vulnerable configuration used in CPython 2.7.8 and earlier releases. Such an attack requires the ability to modify the execution environment of a Python process prior to the import of the ssl module.

Redistributors should balance this marginal increase in risk against the ability to offer a smoother migration path to their users when deciding whether or not it is appropriate for them to implement this per-application "opt out" model.

Recommendation for backporting to earlier Python versions

Some redistributors, most notably Linux distributions, may choose to backport the PEP 476 HTTPS verification changes to modified Python versions based on earlier Python 2 maintenance releases. In these cases, a configuration mechanism is needed that provides:

  • an opt-in model that allows the decision to enable HTTPS certificate verification to be made independently of the decision to upgrade to the Python version where the feature was first backported
  • the ability for system administrators to set the default behaviour of Python applications and scripts run directly in the system Python installation
  • the ability for the redistributor to consider changing the default behaviour of new installations at some point in the future without impacting existing installations that have been explicitly configured to skip verifying HTTPS certificates by default

This approach should not be used for any Python installation that advertises itself as providing Python 2.7.9 or later, as most Python users will have the reasonable expectation that all such environments will validate HTTPS certificates by default.

Recommended modifications to the Python standard library

The recommended approach to backporting the PEP 476 modifications to an earlier point release is to implement the following changes relative to the default PEP 476 behaviour implemented in Python 2.7.9+:

  • modify the ssl module to read a system wide configuration file when the module is first imported into a Python process
  • define a platform default behaviour (either verifying or not verifying HTTPS certificates) to be used if this configuration file is not present
  • support selection between the following three modes of operation:
    • ensure HTTPS certificate verification is enabled
    • ensure HTTPS certificate verification is disabled
    • delegate the decision to the redistributor providing this Python version
  • set the ssl._create_default_https_context function to be an alias for either ssl.create_default_context or ssl._create_unverified_context based on the given configuration setting.

Example implementation

def _get_https_context_factory():
    # Check for a system-wide override of the default behaviour
    config_file = '/etc/python/cert-verification.cfg'
    context_factories = {
        'enable': create_default_context,
        'disable': _create_unverified_context,
        'platform_default': _create_unverified_context, # For now :)
    }
    import ConfigParser
    config = ConfigParser.RawConfigParser()
    config.read(config_file)
    try:
        verify_mode = config.get('https', 'verify')
    except (ConfigParser.NoSectionError, ConfigParser.NoOptionError):
        verify_mode = 'platform_default'
    default_factory = context_factories.get('platform_default')
    return context_factories.get(verify_mode, default_factory)

_create_default_https_context = _get_https_context_factory()

Security Considerations

The specific recommendations for the backporting case are designed to work for privileged, security sensitive processes, even those being run in the following locked down configuration:

  • run from a locked down administrator controlled directory rather than a normal user directory (preventing sys.path[0] based privilege escalation attacks)
  • run using the -E switch (preventing PYTHON* environment variable based privilege escalation attacks)
  • run using the -s switch (preventing user site directory based privilege escalation attacks)
  • run using the -S switch (preventing sitecustomize based privilege escalation attacks)

The intent is that the only reason HTTPS verification should be getting turned off system wide when using this approach is because:

  • an end user is running a redistributor provided version of CPython rather than running upstream CPython directly
  • that redistributor has decided to provide a smoother migration path to verifying HTTPS certificates by default than that being provided by the upstream project
  • either the redistributor or the local infrastructure administrator has determined that it is appropriate to override the default upstream behaviour (at least for the time being)

Using an administrator controlled configuration file rather than an environment variable has the essential feature of providing a smoother migration path, even for applications being run with the -E switch.

Combining the recommendations

If a redistributor chooses to implement both recommendations, then the environment variable should take precedence over the system-wide configuration setting. This allows the setting to be changed for a given user, virtual environment or application, regardless of the system-wide default behaviour.

In this case, if the PYTHONHTTPSVERIFY environment variable is defined, and set to anything other than '0', then HTTPS certificate verification should be enabled.

Example implementation

def _get_https_context_factory():
    # Check for an environment variable override of the default behaviour
    config_setting = os.environ.get('PYTHONHTTPSVERIFY')
    if config_setting is not None:
        if config_setting == '0':
            return _create_unverified_context
        return create_default_context

    # Check for a system-wide override of the default behaviour
    config_file = '/etc/python/cert-verification.cfg'
    context_factories = {
        'enable': create_default_context,
        'disable': _create_unverified_context,
        'platform_default': _create_unverified_context, # For now :)
    }
    import ConfigParser
    config = ConfigParser.RawConfigParser()
    config.read(config_file)
    try:
        verify_mode = config.get('https', 'verify')
    except (ConfigParser.NoSectionError, ConfigParser.NoOptionError):
        verify_mode = 'platform_default'
    default_factory = context_factories.get('platform_default')
    return context_factories.get(verify_mode, default_factory)

_create_default_https_context = _get_https_context_factory()

pep-0494 Python 3.6 Release Schedule

PEP:494
Title:Python 3.6 Release Schedule
Version:$Revision$
Last-Modified:$Date$
Author:Ned Deily <nad at acm.org>
Status:Active
Type:Informational
Content-Type:text/x-rst
Created:30-May-2015
Python-Version:3.6

Abstract

This document describes the development and release schedule for Python 3.6. The schedule primarily concerns itself with PEP-sized items.

Release Manager and Crew

  • 3.6 Release Manager: Ned Deily
  • Windows installers: Steve Dower
  • Mac installers: Ned Deily
  • Documentation: Georg Brandl

Release Schedule

The releases:

  • 3.6.0 alpha 1: TBD
  • 3.6.0 beta 1: TBD
  • 3.6.0 candidate 1: TBD
  • 3.6.0 final: TBD (late 2016?)

(Beta 1 is also "feature freeze"--no new features beyond this point.)

Features for 3.6

Proposed changes for 3.6:

  • TBD

pep-0628 Add math.tau

PEP:628
Title:Add math.tau
Version:$Revision$
Last-Modified:$Date$
Author:Nick Coghlan <ncoghlan at gmail.com>
Status:Deferred
Type:Standards Track
Content-Type:text/x-rst
Created:2011-06-28
Python-Version:3.x
Post-History:2011-06-28
Resolution:TBD

Abstract

In honour of Tau Day 2011, this PEP proposes the addition of the circle constant math.tau to the Python standard library.

The concept of tau (τ) is based on the observation that the ratio of a circle's circumference to its radius is far more fundamental and interesting than the ratio between its circumference and diameter. It is simply a matter of assigning a name to the value 2 * pi (2π).

PEP Deferral

The idea in this PEP was first proposed in the auspiciously named issue 12345 [1]. The immediate negative reactions I received from other core developers on that issue made it clear to me that there wasn't likely to be much collective interest in being part of a movement towards greater clarity in the explanation of profound mathematical concepts that are unnecessarily obscured by a historical quirk of notation.

Accordingly, this PEP is being submitted in a Deferred state, in the hope that it may someday be revisited if the mathematical and educational establishment choose to adopt a more enlightened and informative notation for dealing with radians.

Converts to the merits of tau as the more fundamental circle constant should feel free to start their mathematical code with tau = 2 * math.pi.

The Rationale for Tau

pi is defined as the ratio of a circle's circumference to its diameter. However, a circle is defined by its centre point and its radius. This is shown clearly when we note that the parameter of integration to go from a circle's circumference to its area is the radius, not the diameter. If we use the diameter instead, we have to divide by four to get rid of the extraneous multiplier.

When working with radians, it is trivial to convert any given fraction of a circle to a value in radians in terms of tau. A quarter circle is tau/4, a half circle is tau/2, seven 25ths is 7*tau/25, etc. In contrast with the equivalent expressions in terms of pi (pi/2, pi, 14*pi/25), the unnecessary and needlessly confusing multiplication by two is gone.
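The workaround suggested below for converts to tau can already be used to make this arithmetic concrete; here tau is an ordinary module-level name, not an attribute of math at the time of writing:

```python
import math

# The constant this PEP proposes, spelled out by hand for now.
tau = 2 * math.pi

quarter_circle = tau / 4      # equivalent to pi / 2
half_circle = tau / 2         # equivalent to pi
seven_25ths = 7 * tau / 25    # equivalent to 14 * pi / 25

# A quarter turn is where sine peaks.
print(math.sin(quarter_circle))
```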

Other Resources

I've barely skimmed the surface of the many examples put forward to point out just how much easier and more sensible many aspects of mathematics become when conceived in terms of tau rather than pi. If you don't find my specific examples sufficiently persuasive, here are some more resources that may be of interest:

pep-0666 Reject Foolish Indentation

PEP: 666
Title: Reject Foolish Indentation
Version: $Revision$
Last-Modified: $Date$
Author: Laura Creighton <lac at strakt.com>
Status: Rejected
Type: Standards Track
Created: 3-Dec-2001
Python-Version: 2.2
Post-History: 5-Dec-2001

Abstract

    Everybody agrees that mixing tabs and spaces is a bad idea.  Some
    people want more than this.  I propose that we let people define
    whatever Python behaviour they want, so it will only run the way
    they like it, and will not run the way they don't like it.  We
    will do this with a command line switch.  Programs that aren't
    formatted the way the programmer wants things will raise
    IndentationError:

    Python -TNone will refuse to run when there are any tabs.
    Python -Tn will refuse to run when tabs are not exactly n spaces
    Python -TOnly will refuse to run when blocks are indented by anything
            other than tabs

   People who mix tabs and spaces, naturally, will find that their
   programs do not run.  Alas, we haven't found a way to give them an
   electric shock as from a cattle prod remotely.  (Though if somebody
   finds out a way to do this, I will be pleased to add this option to
   the PEP.)
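    As an illustration only, the -TNone behaviour proposed above could
    be sketched as follows (both the switch and the helper function are
    hypothetical; no such check exists in Python):

```python
# Hypothetical sketch of the proposed -TNone behaviour: refuse to run
# any source whose indentation contains a tab character.
def reject_tabs(source):
    for lineno, line in enumerate(source.splitlines(), 1):
        # Leading whitespace is everything lstrip() would remove.
        indent = line[:len(line) - len(line.lstrip())]
        if '\t' in indent:
            raise IndentationError(
                'tab in indentation on line %d' % lineno)

reject_tabs('if x:\n    pass\n')    # spaces only: accepted
```

    A -Tn or -TOnly mode would differ only in which indent strings the
    loop rejects.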

    

Rationale

   Python-list@python.org (a.k.a. comp.lang.python) is periodically
   awash with discussions about tabs and spaces.  This is inevitable,
   given that indentation is syntactically significant in Python.
   This has never solved anything, and just makes various people
   frustrated and angry.  Eventually they start saying rude things to
   each other which is sad for all of us.  And it is also sad that
   they are wasting their valuable time which they could spend
   creating something with Python.  Moreover, for the Python community
   as a whole, from a public relations point of view, this is quite
   unfortunate.  The people who aren't posting about tabs and spaces,
   are, (unsurprisingly) invisible, while the people who are posting
   make the rest of us look somewhat foolish.

   The problem is that there is no polite way to say 'Stop wasting
   your valuable time and mine.'  People who are already in the middle
   of a flame war are not well disposed to believe that you are acting
   out of compassion for them, and quite rightly insist that their own
   time is their own to do with as they please.  They are stuck like
   flies in treacle in this wretched argument, and it is self-evident
   that they cannot disengage or they would have already done so.

   But today I had to spend time cleaning my keyboard because the 'n'
   key is sticking.  So, in addition to feeling compassion for these
   people, I am pretty annoyed.  I figure if I make this PEP, we can
   then ask Guido to quickly reject it, and then when this argument
   next starts up again, we can say 'Guido isn't changing things to
   suit the tab-haters or the only-tabbers, so this conversation is a
   waste of time.'  Then everybody can quietly believe that a) they
   are correct and b) other people are fools and c) they are
   undeniably fortunate to not have to share a lab with idiots, (which
   is something the arguers could do _now_, but apparently have
   forgotten).

   And python-list can go back to worrying if it is too smug, rather
   than whether it is too hostile for newcomers.  Possibly somebody
   could get around to explaining to me what is the difference between
   __getattr__ and __getattribute__ in non-Classic classes in 2.2, a
   question I have foolishly posted in the middle of the current tab
   thread.  I would like to know the answer to that question.[2]
   
   This proposal, if accepted, will probably mean a heck of a lot of
   work for somebody.  But since I don't want it accepted, I don't
   care.


References

    [1] PEP 1, PEP Purpose and Guidelines
        http://www.python.org/dev/peps/pep-0001/

    [2] Tim Peters already has (private correspondence).  My early 2.2 
        didn't have a __getattribute__, and __getattr__ was
        implemented like __getattribute__ now is.  This has been
        fixed.  The important conclusion is that my Decorator Pattern
        is safe and all is right with the world.


Copyright

    This document has been placed in the public domain.



pep-0754 IEEE 754 Floating Point Special Values

PEP:754
Title:IEEE 754 Floating Point Special Values
Version:$Revision$
Last-Modified:$Date$
Author:Gregory R. Warnes <gregory_r_warnes at groton.pfizer.com> (Pfizer, Inc.)
Status:Rejected
Type:Standards Track
Content-Type:text/x-rst
Created:28-Mar-2003
Python-Version:2.3
Post-History:

Rejection Notice

This PEP has been rejected. After sitting open for four years, it has failed to generate sufficient community interest.

Several ideas of this PEP were implemented for Python 2.6. float('inf') and repr(float('inf')) are now guaranteed to work on every supported platform with IEEE 754 semantics. However, the eval(repr(float('inf'))) roundtrip is still not supported unless you define inf and nan yourself:

>>> inf = float('inf')
>>> inf, 1E400
(inf, inf)
>>> neginf = float('-inf')
>>> neginf, -1E400
(-inf, -inf)
>>> nan = float('nan')
>>> nan, inf * 0.
(nan, nan)

The math and sys modules have also gained additional features: sys.float_info, math.isinf, math.isnan, and math.copysign.

Abstract

This PEP proposes an API and provides a reference module that generates and tests for IEEE 754 double-precision special values: positive infinity, negative infinity, and not-a-number (NaN).

Rationale

The IEEE 754 standard defines a set of binary representations and algorithmic rules for floating point arithmetic. Included in the standard is a set of constants for representing special values, including positive infinity, negative infinity, and indeterminate or non-numeric results (NaN). Most modern CPUs implement the IEEE 754 standard, including the (Ultra)SPARC, PowerPC, and x86 processor series.

Currently, the handling of IEEE 754 special values in Python depends on the underlying C library. Unfortunately, there is little consistency between C libraries in how or whether these values are handled. For instance, on some systems "float('Inf')" will properly return the IEEE 754 constant for positive infinity. On many systems, however, this expression will instead generate an error message.

The output string representation for an IEEE 754 special value also varies by platform. For example, the expression "float(1e3000)", which is large enough to generate an overflow, should return a string representation corresponding to IEEE 754 positive infinity. Python 2.1.3 on x86 Debian Linux returns "inf". On Sparc Solaris 8 with Python 2.2.1, this same expression returns "Infinity", and on MS-Windows 2000 with Active Python 2.2.1, it returns "1.#INF".

Adding to the confusion, some platforms generate one string on conversion from floating point and accept a different string for conversion to floating point. On these systems

float(str(x))

will generate an error when "x" is an IEEE special value.

In the past, some have recommended that programmers use expressions like:

PosInf = 1e300**2
NaN = PosInf/PosInf

to obtain positive infinity and not-a-number constants. However, the first expression generates an error on current Python interpreters. A possible alternative is to use:

PosInf = 1e300000
NaN = PosInf/PosInf

While this does not generate an error with current Python interpreters, it is still an ugly and potentially non-portable hack. In addition, defining NaN in this way does not solve the problem of detecting such values. First, the IEEE 754 standard provides for an entire set of constant values for Not-a-Number. Second, the standard requires that

NaN != X

for all possible values of X, including NaN. As a consequence

NaN == NaN

should always evaluate to false. However, this behavior also is not consistently implemented. [e.g. Cygwin Python 2.2.2]
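On a platform with correctly implemented IEEE 754 comparisons (guaranteed in modern Python versions, though not at the time this PEP was written), the required behaviour can be demonstrated directly:

```python
# NaN compares unequal to everything, including itself, while the
# infinities behave like ordinary values under comparison.
inf = float('inf')
nan = float('nan')

print(nan == nan)   # False: the behaviour the standard requires
print(nan != nan)   # True
print(inf == inf)   # True: infinity is equal to itself
```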

Due to the many platform and library inconsistencies in handling IEEE special values, it is impossible to consistently set or detect IEEE 754 floating point values in normal Python code without resorting to directly manipulating bit-patterns.

This PEP proposes a standard Python API and provides a reference module implementation which allows for consistent handling of IEEE 754 special values on all supported platforms.

API Definition

Constants

NaN
Non-signalling IEEE 754 "Not a Number" value
PosInf
IEEE 754 Positive Infinity value
NegInf
IEEE 754 Negative Infinity value

Functions

isNaN(value)
Determine if the argument is an IEEE 754 NaN (Not a Number) value.
isPosInf(value)
Determine if the argument is an IEEE 754 positive infinity value.
isNegInf(value)
Determine if the argument is an IEEE 754 negative infinity value.
isFinite(value)
Determine if the argument is a finite IEEE 754 value (i.e., is not NaN, positive infinity, or negative infinity).
isInf(value)
Determine if the argument is an infinite IEEE 754 value (positive or negative infinity).

Example

(Run under Python 2.2.1 on Solaris 8.)

>>> import fpconst
>>> val = 1e30000 # should cause an overflow and result in "Inf"
>>> val
Infinity
>>> fpconst.isInf(val)
1
>>> fpconst.PosInf
Infinity
>>> nval = val/val # should result in NaN
>>> nval
NaN
>>> fpconst.isNaN(nval)
1
>>> fpconst.isNaN(val)
0

Implementation

The reference implementation is provided in the module "fpconst" [1], which is written in pure Python by taking advantage of the "struct" standard module to directly set or test for the bit patterns that define IEEE 754 special values. Care has been taken to generate proper results on both big-endian and little-endian machines. The current implementation is pure Python, but some efficiency could be gained by translating the core routines into C.
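The bit-pattern technique described above can be sketched as follows. This is an illustrative reimplementation of the stated approach, not the actual fpconst source; function names follow the API above:

```python
import struct

# Packing a double big-endian and unpacking it as an unsigned 64-bit
# integer gives a platform-independent view of the IEEE 754 layout:
# 1 sign bit, 11 exponent bits, 52 mantissa bits.
_EXPONENT_MASK = 0x7FF0000000000000
_MANTISSA_MASK = 0x000FFFFFFFFFFFFF

def _bits(value):
    return struct.unpack('>Q', struct.pack('>d', value))[0]

def isNaN(value):
    # NaN: all exponent bits set and a non-zero mantissa.
    b = _bits(value)
    return (b & _EXPONENT_MASK) == _EXPONENT_MASK and (b & _MANTISSA_MASK) != 0

def isInf(value):
    # Infinity: all exponent bits set and a zero mantissa (either sign).
    b = _bits(value)
    return (b & _EXPONENT_MASK) == _EXPONENT_MASK and (b & _MANTISSA_MASK) == 0

def isFinite(value):
    # Finite: at least one exponent bit clear.
    return (_bits(value) & _EXPONENT_MASK) != _EXPONENT_MASK
```

Because the masks address the bits directly, the results do not depend on how the platform's C library prints or parses these values.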

Patch 1151323 "New fpconst module" [2] on SourceForge adds the fpconst module to the Python standard library.

pep-3000 Python 3000

PEP:3000
Title:Python 3000
Version:$Revision$
Last-Modified:$Date$
Author:Guido van Rossum <guido at python.org>
Status:Final
Type:Process
Content-Type:text/x-rst
Created:05-Apr-2006
Post-History:

Abstract

This PEP sets guidelines for Python 3000 development. Ideally, we first agree on the process, and start discussing features only after the process has been decided and specified. In practice, we'll be discussing features and process simultaneously; often the debate about a particular feature will prompt a process discussion.

Naming

Python 3000, Python 3.0 and Py3K are all names for the same thing. The project is called Python 3000, abbreviated Py3k. The actual Python release will be referred to as Python 3.0, and that's what "python3.0 -V" will print; the actual file names will use the same naming convention we use for Python 2.x. I don't want to pick a new name for the executable or change the suffix for Python source files.

PEP Numbering

Python 3000 PEPs are numbered starting at PEP 3000. PEPs 3000-3099 are meta-PEPs -- these can be either process or informational PEPs. PEPs 3100-3999 are feature PEPs. PEP 3000 itself (this PEP) is special; it is the meta-PEP for Python 3000 meta-PEPs (IOW it describes the process used to define processes). PEP 3100 is also special; it's a laundry list of features that were selected for (hopeful) inclusion in Python 3000 before we started the Python 3000 process for real. PEP 3099, finally, is a list of features that will not change.

Timeline

See PEP 361 [3], which contains the release schedule for Python 2.6 and 3.0. These versions will be released in lockstep.

Note: standard library development is expected to ramp up after 3.0a1 is released.

I expect that there will be parallel Python 2.x and 3.x releases for some time; the Python 2.x releases will continue for a longer time than the traditional 2.x.y bugfix releases. Typically, we stop releasing bugfix versions for 2.x once version 2.(x+1) has been released. But I expect there to be at least one or two new 2.x releases even after 3.0 (final) has been released, probably well into 3.1 or 3.2. This will to some extent depend on community demand for continued 2.x support, acceptance and stability of 3.0, and volunteer stamina.

I expect that Python 3.1 and 3.2 will be released much sooner after 3.0 than has been customary for the 2.x series. The 3.x release pattern will stabilize once the community is happy with 3.x.

Compatibility and Transition

Python 3.0 will break backwards compatibility with Python 2.x.

There is no requirement that Python 2.6 code will run unmodified on Python 3.0. Not even a subset. (Of course there will be a tiny subset, but it will be missing major functionality.)

Python 2.6 will support forward compatibility in the following two ways:

  • It will support a "Py3k warnings mode" which will warn dynamically (i.e. at runtime) about features that will stop working in Python 3.0, e.g. assuming that range() returns a list.
  • It will contain backported versions of many Py3k features, either enabled through __future__ statements or simply by allowing old and new syntax to be used side-by-side (if the new syntax would be a syntax error in 2.x).

Instead, and complementary to the forward compatibility features in 2.6, there will be a separate source code conversion tool [1]. This tool can do a context-free source-to-source translation. For example, it can translate apply(f, args) into f(*args). However, the tool cannot do data flow analysis or type inferencing, so it simply assumes that apply in this example refers to the old built-in function.
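The apply() example can be illustrated with a deliberately naive string-level rewrite. The real tool operates on a parse tree rather than on text, and the helper name here is invented for the sketch:

```python
import re

# Toy, context-free rewrite in the spirit of the tool described above:
# apply(f, args) becomes f(*args), with no data flow analysis, so any
# name spelled "apply" is assumed to be the old built-in.
def rewrite_apply(source):
    return re.sub(r'\bapply\((\w+),\s*(\w+)\)', r'\1(*\2)', source)

print(rewrite_apply('result = apply(f, args)'))
# result = f(*args)
```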

The recommended development model for a project that needs to support Python 2.6 and 3.0 simultaneously is as follows:

  1. You should have excellent unit tests with close to full coverage.
  2. Port your project to Python 2.6.
  3. Turn on the Py3k warnings mode.
  4. Test and edit until no warnings remain.
  5. Use the 2to3 tool to convert this source code to 3.0 syntax. Do not manually edit the output!
  6. Test the converted source code under 3.0.
  7. If problems are found, make corrections to the 2.6 version of the source code and go back to step 3.
  8. When it's time to release, release separate 2.6 and 3.0 tarballs (or whatever archive form you use for releases).

It is recommended not to edit the 3.0 source code until you are ready to reduce 2.6 support to pure maintenance (i.e. the moment when you would normally move the 2.6 code to a maintenance branch anyway).

PS. We need a meta-PEP to describe the transitional issues in detail.

Implementation Language

Python 3000 will be implemented in C, and the implementation will be derived as an evolution of the Python 2 code base. This reflects my views (which I share with Joel Spolsky [2]) on the dangers of complete rewrites. Since Python 3000 as a language is a relatively mild improvement on Python 2, we can gain a lot by not attempting to reimplement the language from scratch. I am not against parallel from-scratch implementation efforts, but my own efforts will be directed at the language and implementation that I know best.

Meta-Contributions

Suggestions for additional text for this PEP are gracefully accepted by the author. Draft meta-PEPs for the topics above and additional topics are even more welcome!

References

[1]The 2to3 tool, in the subversion sandbox http://svn.python.org/view/sandbox/trunk/2to3/
[2]Joel on Software: Things You Should Never Do, Part I http://www.joelonsoftware.com/articles/fog0000000069.html
[3]PEP 361 (Python 2.6 and 3.0 Release Schedule) http://www.python.org/dev/peps/pep-0361

pep-3001 Procedure for reviewing and improving standard library modules

PEP:3001
Title:Procedure for reviewing and improving standard library modules
Version:$Revision$
Last-Modified:$Date$
Author:Georg Brandl <georg at python.org>
Status:Withdrawn
Type:Process
Content-Type:text/x-rst
Created:05-Apr-2006
Post-History:

Abstract

This PEP describes a procedure for reviewing and improving standard library modules, especially those written in Python, making them ready for Python 3000. There can be different steps of refurbishing, each of which is described in a section below. Of course, not every step has to be performed for every module.

Removal of obsolete modules

All modules marked as deprecated in 2.x versions should be removed for Python 3000. The same applies to modules which are seen as obsolete today, but are too widely used to be deprecated or removed. Python 3000 is the big occasion to get rid of them.

There will have to be a document listing all removed modules, together with information on possible substitutes or alternatives. This information will also have to be provided by the python3warn.py porting helper script mentioned in PEP XXX.

Renaming modules

There are proposals for a "great stdlib renaming" introducing a hierarchic library namespace or a top-level package from which to import standard modules. That possibility aside, some modules' names are known to have been chosen unwisely, a mistake which could never be corrected in the 2.x series. Examples are names like "StringIO" or "Cookie". For Python 3000, there will be the possibility to give those modules less confusing and more conforming names.

Of course, each rename will have to be stated in the documentation of the respective module and perhaps in the global document of Step 1. Additionally, the python3warn.py script will recognize the old module names and notify the user accordingly.

If the name change is made in time for another release of the Python 2.x series, it is worth considering to introduce the new name in the 2.x branch to ease transition.

Code cleanup

As most library modules written in Python have not been touched except for bug fixes, following the policy of never changing a running system, many of them may contain code that is not up to the newest language features and could be rewritten in a more concise, modern Python.

PyChecker should run cleanly over the library. With a carefully tuned configuration file, PyLint should also emit as few warnings as possible.

As long as these changes don't change the module's interface and behavior, no documentation updates are necessary.

Enhancement of test and documentation coverage

Code coverage by unit tests varies greatly between modules. Each test suite should be checked for completeness, and the remaining classic tests should be converted to PyUnit (or whatever new shiny testing framework comes with Python 3000, perhaps py.test?).

It should also be verified that each publicly visible function has a meaningful docstring which ideally contains several doctests.

No documentation changes are necessary for enhancing test coverage.

Unification of module metadata

This is a small and probably not very important step. There have been various attempts at providing author, version and similar metadata in modules (such as a "__version__" global). Those could be standardized and used throughout the library.

No documentation changes are necessary for this step, either.

Backwards incompatible bug fixes

Over the years, many bug reports have been filed which complained about bugs in standard library modules, but have subsequently been closed as "Won't fix" since a fix would have introduced a major incompatibility which was not acceptable in the Python 2.x series. In Python 3000, the fix can be applied if the interface per se is still acceptable.

Each slight behavioral change caused by such fixes must be mentioned in the documentation, perhaps in a "Changed in Version 3.0" paragraph.

Interface changes

The last and most disruptive change is the overhaul of a module's public interface. If a module's interface is to be changed, a justification should be made beforehand, or a PEP should be written.

The change must be fully documented as "New in Version 3.0", and the python3warn.py script must know about it.

References

None yet.

pep-3002 Procedure for Backwards-Incompatible Changes

PEP:3002
Title:Procedure for Backwards-Incompatible Changes
Version:$Revision$
Last-Modified:$Date$
Author:Steven Bethard <steven.bethard at gmail.com>
Status:Final
Type:Process
Content-Type:text/x-rst
Created:27-Mar-2006
Post-History:27-Mar-2006, 13-Apr-2006

Abstract

This PEP describes the procedure for changes to Python that are backwards-incompatible between the Python 2.X series and Python 3000. All such changes must be documented by an appropriate Python 3000 PEP and must be accompanied by code that can identify when pieces of Python 2.X code may be problematic in Python 3000.

Rationale

Python 3000 will introduce a number of backwards-incompatible changes to Python, mainly to streamline the language and to remove some previous design mistakes. But Python 3000 is not intended to be a new and completely different language from the Python 2.X series, and it is expected that much of the Python user community will make the transition to Python 3000 when it becomes available.

To encourage this transition, it is crucial to provide a clear and complete guide on how to upgrade Python 2.X code to Python 3000 code. Thus, for any backwards-incompatible change, two things are required:

  • An official Python Enhancement Proposal (PEP)
  • Code that can identify pieces of Python 2.X code that may be problematic in Python 3000

Python Enhancement Proposals

Every backwards-incompatible change must be accompanied by a PEP. This PEP should follow the usual PEP guidelines and explain the purpose and reasoning behind the backwards incompatible change. In addition to the usual PEP sections, all PEPs proposing backwards-incompatible changes must include an additional section: Compatibility Issues. This section should describe what is backwards incompatible about the proposed change to Python, and the major sorts of breakage to be expected.

While PEPs must still be evaluated on a case-by-case basis, a PEP may be inappropriate for Python 3000 if its Compatibility Issues section implies any of the following:

  • Most or all instances of a Python 2.X construct are incorrect in Python 3000, and most or all instances of the Python 3000 construct are incorrect in Python 2.X.

    So for example, changing the meaning of the for-loop else-clause from "executed when the loop was not broken out of" to "executed when the loop had zero iterations" would mean that all Python 2.X for-loop else-clauses would be broken, and there would be no way to use a for-loop else-clause in a Python-3000-appropriate manner. Thus a PEP for such an idea would likely be rejected.

  • Many instances of a Python 2.X construct are incorrect in Python 3000 and the PEP fails to demonstrate real-world use-cases for the changes.

    Backwards incompatible changes are allowed in Python 3000, but not to excess. A PEP that proposes backwards-incompatible changes should provide good examples of code that visibly benefits from the changes.
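For reference, the existing else-clause semantics that the hypothetical change above would have broken can be shown directly:

```python
# The else clause of a for loop runs only when the loop completes
# without hitting break -- the meaning the rejected change would
# have replaced with "executed on zero iterations".
def contains(needle, haystack):
    for item in haystack:
        if item == needle:
            break
    else:
        return False    # loop ran to completion: needle not found
    return True         # loop was broken out of: needle found

print(contains(2, [1, 2, 3]))   # True
print(contains(9, [1, 2, 3]))   # False
```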

PEP-writing is time-consuming, so when a number of backwards-incompatible changes are closely related, they should be proposed in the same PEP. Such PEPs will likely have longer Compatibility Issues sections, however, since they must now describe the sorts of breakage expected from all the proposed changes.

Identifying Problematic Code

In addition to the PEP requirement, backwards incompatible changes to Python must also be accompanied by code to issue warnings for pieces of Python 2.X code that will behave differently in Python 3000. Such warnings will be enabled in Python 2.X using a new command-line switch: -3. All backwards incompatible changes should be accompanied by a patch for Python 2.X that, when -3 is specified, issues warnings for each construct that is being changed.

For example, if dict.keys() returns an iterator in Python 3000, the patch to the Python 2.X branch should do something like:

If -3 was specified, change dict.keys() to return a subclass of list that issues warnings whenever you use any methods other than __iter__().

Such a patch would mean that warnings are only issued when features that will not be present in Python 3000 are used, and almost all existing code should continue to work. (Code that relies on dict.keys() always returning a list and not a subclass should be pretty much non-existent.)
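A minimal Python 3 sketch of the idea can make it concrete. The class name and warning message below are hypothetical (the real patch would target the Python 2.X branch behind the -3 switch), but the mechanism is the one described: iteration stays silent, while list-only operations warn.

```python
import warnings

class WarningKeysList(list):
    """Hypothetical sketch of the list subclass a -3 patch could return
    from dict.keys(): iterating stays silent, while list-only methods
    emit a warning about the Python 3000 behavior change."""

    def _warn(self, method_name):
        warnings.warn(
            "dict.keys() will return an iterator in Python 3000; "
            "%s() will not be available" % method_name,
            DeprecationWarning, stacklevel=3)

    def __getitem__(self, index):
        self._warn("__getitem__")
        return list.__getitem__(self, index)

    def sort(self, *args, **kwargs):
        self._warn("sort")
        return list.sort(self, *args, **kwargs)

    # __iter__ is deliberately not overridden, so plain iteration --
    # the only behavior kept in Python 3000 -- issues no warning.
```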

pep-3003 Python Language Moratorium

PEP:3003
Title:Python Language Moratorium
Version:$Revision$
Last-Modified:$Date$
Author:Brett Cannon, Jesse Noller, Guido van Rossum
Status:Final
Type:Process
Content-Type:text/x-rst
Created:21-Oct-2009
Post-History:03-Nov-2009

Abstract

This PEP proposes a temporary moratorium (suspension) of all changes to the Python language syntax, semantics, and built-ins for a period of at least two years from the release of Python 3.1. In particular, the moratorium would include Python 3.2 (to be released 18-24 months after 3.1) but allow Python 3.3 (assuming it is not released prematurely) to once again include language changes.

This suspension of features is designed to allow non-CPython implementations to "catch up" to the core implementation of the language, help ease adoption of Python 3.x, and provide a more stable base for the community.

Rationale

This idea was proposed by Guido van Rossum on the python-ideas [1] mailing list. The premise of his email was to slow the alteration of the Python core syntax, builtins and semantics to allow non-CPython implementations to catch up to the current state of Python, both 2.x and 3.x.

Python, as a language, is more than its core implementation, CPython; it has a rich, mature and vibrant community of implementations, such as Jython [2], IronPython [3] and PyPy [4], that benefit not only the community but the language itself.

Still others, such as Unladen Swallow [5] (a branch of CPython), seek not to create an alternative implementation but rather to enhance the performance and implementation of CPython itself.

Python 3.x was a large part of the last several years of Python's development. Its release, along with the bevy of language changes introduced by it and the preceding 2.6.x releases, puts alternative implementations at a severe disadvantage in "keeping pace" with core Python development.

Additionally, many of the changes put into the recent releases of the language as implemented by CPython have not yet seen widespread usage by the general user population. For example, most users are limited to the version of the interpreter (typically CPython) which comes pre-installed with their operating system. Most OS vendors are just barely beginning to ship Python 2.6 -- even fewer are shipping Python 3.x.

As Python 2.7 is expected to be the effective "end of life" of the Python 2.x code line, with Python 3.x being the future, it is in the best interest of Python core development to temporarily suspend alteration of the language itself, allowing all of these external entities to catch up and assisting in the adoption of, and migration to, Python 3.x.

Finally, the moratorium is intended to free up cycles within core development to focus on other issues, such as the CPython interpreter and improvements therein, the standard library, etc.

This moratorium does not allow for exceptions -- once accepted, any pending changes to the syntax or semantics of the language will be postponed until the moratorium is lifted.

This moratorium does not attempt to apply to any other Python implementation, meaning that, if desired, other implementations may add features which deviate from the standard implementation.

Details

Cannot Change

  • New built-ins

  • Language syntax

    The grammar file essentially becomes immutable apart from ambiguity fixes.

  • General language semantics

    The language operates as-is with only specific exemptions (see below).

  • New __future__ imports

    These are explicitly forbidden, as they effectively change the language syntax and/or semantics (albeit using a compiler directive).

Case-by-Case Exemptions

  • New methods on built-ins

    The case for adding a method to a built-in object can be made.

  • Incorrect language semantics

    If the language semantics turn out to be ambiguous or improperly implemented based on the intention of the original design then the semantics may change.

  • Language semantics that are difficult to implement

    Because other VMs have not begun implementing Python 3.x semantics there is a possibility that certain semantics are too difficult to replicate. In those cases they can be changed to ease adoption of Python 3.x by the other VMs.

Allowed to Change

  • C API

    It is entirely acceptable to change the underlying C code of CPython as long as other restrictions of this moratorium are not broken. E.g. removing the GIL would be fine assuming certain operations that are currently atomic remain atomic.

  • The standard library

    As the standard library is not directly tied to the language definition it is not covered by this moratorium.

  • Backports of 3.x features to 2.x

    The moratorium only affects features that would be new in 3.x.

  • Import semantics

    For example, PEP 382. After all, import semantics vary between Python implementations anyway.

Retroactive

It is important to note that the moratorium covers all changes since the release of Python 3.1. This rule is intended to avoid features being rushed or smuggled into the CPython source tree while the moratorium is being discussed. A review of the NEWS file for the py3k development branch showed that no commits would need to be rolled back in order to meet this goal.

Extensions

The time period of the moratorium can only be extended through a new PEP.

pep-3099 Things that will Not Change in Python 3000

PEP:3099
Title:Things that will Not Change in Python 3000
Version:$Revision$
Last-Modified:$Date$
Author:Georg Brandl <georg at python.org>
Status:Final
Type:Process
Content-Type:text/x-rst
Created:04-Apr-2006
Post-History:

Abstract

Some ideas are just bad. While some thoughts on Python evolution are constructive, some go against the basic tenets of Python so egregiously that it would be like asking someone to run in a circle: it gets you nowhere, even for Python 3000, where extraordinary proposals are allowed. This PEP tries to list all BDFL pronouncements on Python 3000 that refer to changes that will not happen and new features that will not be introduced, sorted by topics, along with a short explanation or a reference to the relevant thread on the python-3000 mailing list.

If you think you should suggest any of the listed ideas, it would be better to just step away from the computer, go outside, and enjoy yourself. Being active outdoors by napping in a nice patch of grass is more productive than bringing up a beating-a-dead-horse idea and having people tell you how dead the idea is. Consider yourself warned.

Core language

Builtins

Standard types

Coding style

Interactive Interpreter

pep-3100 Miscellaneous Python 3.0 Plans

PEP:3100
Title:Miscellaneous Python 3.0 Plans
Version:$Revision$
Last-Modified:$Date$
Author:Brett Cannon <brett at python.org>
Status:Final
Type:Process
Content-Type:text/x-rst
Created:20-Aug-2004
Post-History:

Abstract

This PEP, previously known as PEP 3000, describes smaller scale changes and new features for which no separate PEP has been written yet, all targeted for Python 3000.

The list of features included in this document is subject to change and isn't binding on the Python development community; features may be added, removed, and modified at any time. The purpose of this list is to focus our language development effort on changes that are steps to 3.0, and to encourage people to invent ways to smooth the transition.

This document is not a wish-list that anyone can extend. While there are two authors of this PEP, we're just supplying the text; the decisions for which changes are listed in this document are made by Guido van Rossum, who has chosen them as goals for Python 3.0.

Guido's pronouncements on things that will not change in Python 3.0 are recorded in PEP 3099. [43]

General goals

A general goal is to reduce feature duplication by removing old ways of doing things. A general principle of the design will be that one obvious way of doing something is enough. [1]

Influencing PEPs

Style changes

  • The C style guide will be updated to use 4-space indents, never tabs. This style should be used for all new files; existing files can be updated only if there is no hope to ever merge a particular file from the Python 2 HEAD. Within a file, the indentation style should be consistent. No other style guide changes are planned ATM.

Core language

  • True division becomes default behavior [34] [done]

  • exec as a statement is not worth it -- make it a function [done]

  • Add optional declarations for static typing [45] [10] [done]

  • Support only new-style classes; classic classes will be gone [1] [done]

  • Replace print by a function [14] [44] [done]

  • The softspace attribute of files goes away. [done]

  • Use except E1, E2, E3 as err: if you want the error variable. [3] [done]

  • None becomes a keyword [4]; also True and False [done]

  • ... to become a general expression element [16] [done]

  • as becomes a keyword [5] (starting in 2.6 already) [done]

  • Have list comprehensions be syntactic sugar for passing an equivalent generator expression to list(); as a consequence the loop variable will no longer be exposed [36] [done]

  • Comparisons other than == and != between disparate types will raise an exception unless explicitly supported by the type [6] [done]

  • floats will not be acceptable as arguments in place of ints for operations where floats are inadvertently accepted (PyArg_ParseTuple() i & l formats)

  • Remove from ... import * at function scope. [done] This means that functions can always be optimized and support for unoptimized functions can go away.

  • Imports [39]
    • Imports will be absolute by default. [done]
    • Relative imports must be explicitly specified. [done]
    • Indirection entries in sys.modules (i.e., a value of None for A.string means to use the top-level string module) will not be supported.
  • __init__.py might become optional in sub-packages? __init__.py will still be required for top-level packages.

  • Cleanup the Py_InitModule() variants {,3,4} (also import and parser APIs)

  • Cleanup the APIs exported in pythonrun, etc.

  • Some expressions will require parentheses that didn't in 2.x:

    • List comprehensions will require parentheses around the iterables. This will make list comprehensions more similar to generator comprehensions. [x for x in 1, 2] will need to be: [x for x in (1, 2)] [done]
    • Lambdas may have to be parenthesized [38] [NO]
  • In order to get rid of the confusion between __builtin__ and __builtins__, it was decided to rename __builtin__ (the module) to builtins, and to leave __builtins__ (the sandbox hook) alone. [47] [48] [done]

  • Attributes on functions of the form func_whatever will be renamed __whatever__ [17] [done]

  • Set literals and comprehensions [19] [20] [done] {x} means set([x]); {x, y} means set([x, y]). {F(x) for x in S if P(x)} means set(F(x) for x in S if P(x)). NB. {range(x)} means set([range(x)]), NOT set(range(x)). There's no literal for an empty set; use set() (or {1}&{2} :-). There's no frozenset literal; they are too rarely needed.

  • The __nonzero__ special method will be renamed to __bool__ and have to return a bool. The typeobject slot will be called tp_bool [23] [done]

  • Dict comprehensions, as first proposed in [35] [done] {K(x): V(x) for x in S if P(x)} means dict((K(x), V(x)) for x in S if P(x)).
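The set and dict literal/comprehension rules described in the list above can be checked directly in Python 3:

```python
# Set literals and comprehensions desugar as described above.
assert {1, 2} == set([1, 2])
assert {x * x for x in range(4) if x % 2 == 0} == set([0, 4])
# {range(3)} is a one-element set containing a range object,
# NOT set(range(3)).
assert {range(3)} != set(range(3))
# There is no empty-set literal: {} is an empty dict; use set().
assert {} == dict()
# Dict comprehensions desugar to dict() over a generator of pairs.
assert {x: x * x for x in range(3)} == dict((x, x * x) for x in range(3))
```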

To be removed:

  • String exceptions: use instances of an Exception class [2] [done]

  • raise Exception, "message": use raise Exception("message") [12] [done]

  • `x`: use repr(x) [2] [done]

  • The <> operator: use != instead [3] [done]

  • The __mod__ and __divmod__ special methods on float. [they should stay] [21]

  • Drop unbound methods [7] [25] [done]

  • METH_OLDARGS [done]

  • WITH_CYCLE_GC [done]

  • __getslice__, __setslice__, __delslice__ [32]; remove slice opcodes and use slice objects. [done]

  • __oct__, __hex__: use __index__ in oct() and hex() instead. [done]

  • __methods__ and __members__ [done]

  • C APIs (see code): PyFloat_AsString, PyFloat_AsReprString, PyFloat_AsStringEx, PySequence_In, PyEval_EvalFrame, PyEval_CallObject, _PyObject_Del, _PyObject_GC_Del, _PyObject_GC_Track, _PyObject_GC_UnTrack PyString_AsEncodedString, PyString_AsDecodedString PyArg_NoArgs, PyArg_GetInt, intargfunc, intintargfunc

    PyImport_ReloadModule ?

Atomic Types

  • Remove distinction between int and long types; 'long' built-in type and literals with 'L' or 'l' suffix disappear [1] [done]
  • Make all strings be Unicode, and have a separate bytes() type [1] The new string type will be called 'str'. See PEP 3137. [done]
  • Return iterable views instead of lists where appropriate for atomic type methods (e.g. dict.keys(), dict.values(), dict.items(), etc.); iter* methods will be removed. [done]
  • Make string.join() stringify its arguments? [18] [NO]
  • Fix open() so it returns a ValueError if the mode is bad rather than IOError. [done]
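The "iterable views" change above is visible in Python 3, where dict.keys() returns a live view rather than a list:

```python
# dict.keys()/values()/items() return views, not lists; wrap a view
# in list() or sorted() when a concrete snapshot is needed.
d = {"a": 1}
keys = d.keys()
assert not isinstance(keys, list)
d["b"] = 2
assert sorted(keys) == ["a", "b"]  # the view tracks later mutation
assert sorted(d.items()) == [("a", 1), ("b", 2)]
```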

To be removed:

  • basestring.find() and basestring.rfind(); use basestring.index() or basestring.[r]partition() or basestring.rindex() in a try/except block??? [13] [UNLIKELY]
  • file.xreadlines() method [31] [done]
  • dict.setdefault()? [15] [UNLIKELY]
  • dict.has_key() method; use in operator [done]
  • list.sort() and builtin.sorted() methods: eliminate cmp parameter [27] [done]

Built-in Namespace

  • Make built-ins return an iterator where appropriate (e.g. range(), zip(), map(), filter(), etc.) [done]
  • Remove input() and rename raw_input() to input(). If you need the old input(), use eval(input()). [done]
  • Introduce trunc(), which would call the __trunc__() method on its argument; suggested use is for objects like float where calling __int__() has data loss, but an integral representation is still desired? [8] [done]
  • Exception hierarchy changes [41] [done]
  • Add a bin() function for a binary representation of integers [done]
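The iterator-returning built-ins and bin() noted above behave as follows in Python 3:

```python
# range(), map(), zip(), and filter() return iterators/lazy objects
# rather than lists; call list() when a list is actually needed.
r = range(3)
assert not isinstance(r, list)
assert list(r) == [0, 1, 2]
assert list(map(abs, [-1, -2])) == [1, 2]
assert list(zip("ab", [1, 2])) == [("a", 1), ("b", 2)]
assert list(filter(None, [0, 1, 2])) == [1, 2]
# bin() gives the binary representation of an integer.
assert bin(10) == "0b1010"
```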

To be removed:

  • apply(): use f(*args, **kw) instead [2] [done]

  • buffer(): must die (use a bytes() type instead) (?) [2] [done]

  • callable(): just use isinstance(x, collections.Callable) (?) [2] [done]

  • compile(): put in sys (or perhaps in a module of its own) [2]

  • coerce(): no longer needed [2] [done]

  • execfile(), reload(): use exec() [2] [done]

  • intern(): put in sys [2], [22] [done]

  • reduce(): put in functools, a loop is more readable most of the times [2], [9] [done]

  • xrange(): use range() instead [1] [See range() above] [done]

  • StandardError: this is a relic from the original exception hierarchy;

    subclass Exception instead. [done]

Standard library

  • Reorganize the standard library to not be as shallow?
  • Move test code to where it belongs, there will be no more test() functions in the standard library
  • Convert all tests to use either doctest or unittest.
  • For the procedures of standard library improvement, see PEP 3001 [42]

To be removed:

  • The sets module. [done]

  • stdlib modules to be removed
    • see docstrings and comments in the source
      • macfs [to do]
      • new, reconvert, stringold, xmllib, pcre, pypcre, strop [all done]
    • see PEP 4 [33]
      • buildtools, mimetools, multifile, rfc822, [to do]
      • mpz, posixfile, regsub, rgbimage, sha, statcache, sv, TERMIOS, timing [done]
      • cfmfile, gopherlib, md5, MimeWriter, mimify [done]
      • cl, sets, xreadlines, rotor, whrandom [done]
    • Everything in lib-old [33] [done]
      • Para, addpack, cmp, cmpcache, codehack, dircmp, dump, find, fmt, grep, lockfile, newdir, ni, packmail, poly, rand, statcache, tb, tzparse, util, whatsound, whrandom, zmod
  • sys.exitfunc: use atexit module instead [28], [49] [done]

  • sys.exc_type, sys.exc_values, sys.exc_traceback: not thread-safe; use sys.exc_info() or an attribute of the exception [2] [11] [28] [done]

  • sys.exc_clear: Python 3's except statements provide the same functionality [24] [46] [28] [done]

  • array.read, array.write [30]

  • operator.isCallable : callable() built-in is being removed [29] [50] [done]

  • operator.sequenceIncludes : redundant thanks to operator.contains [29] [50] [done]

  • In the thread module, the aquire_lock() and release_lock() aliases for the acquire() and release() methods on lock objects. (Probably also just remove the thread module as a public API, in favor of always using threading.py.)

  • UserXyz classes, in favour of XyzMixins.

  • Remove the unreliable empty() and full() methods from Queue.py?

  • Remove jumpahead() from the random API?

  • Make the primitive for random be something generating random bytes rather than random floats?

  • Get rid of Cookie.SerialCookie and Cookie.SmartCookie?

  • Modify the heapq.heapreplace() API to compare the new value to the top of the heap?

Outstanding Issues

  • Require C99, so we can use // comments, named initializers, declare variables without introducing a new scope, among other benefits. (Also better support for IEEE floating point issues like NaN and infinities?)
  • Remove support for old systems, including: BeOS, RISCOS, (SGI) Irix, Tru64

References

[1](1, 2, 3, 4, 5) PyCon 2003 State of the Union: http://www.python.org/doc/essays/ppt/pycon2003/pycon2003.ppt
[2](1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11) Python Regrets: http://www.python.org/doc/essays/ppt/regrets/PythonRegrets.pdf
[3](1, 2) Python Wiki: http://www.python.org/moin/Python3.0
[4]python-dev email ("Constancy of None") http://mail.python.org/pipermail/python-dev/2004-July/046294.html
[5]python-dev email (' "as" to be a keyword?') http://mail.python.org/pipermail/python-dev/2004-July/046316.html
[6]python-dev email ("Comparing heterogeneous types") http://mail.python.org/pipermail/python-dev/2004-June/045111.html
[7]python-dev email ("Let's get rid of unbound methods") http://mail.python.org/pipermail/python-dev/2005-January/050625.html
[8]python-dev email ("Fixing _PyEval_SliceIndex so that integer-like objects can be used") http://mail.python.org/pipermail/python-dev/2005-February/051674.html
[9]Guido's blog ("The fate of reduce() in Python 3000") http://www.artima.com/weblogs/viewpost.jsp?thread=98196
[10]Guido's blog ("Python Optional Typechecking Redux") http://www.artima.com/weblogs/viewpost.jsp?thread=89161
[11]python-dev email ("anonymous blocks") http://mail.python.org/pipermail/python-dev/2005-April/053060.html
[12]python-dev email ("PEP 8: exception style") http://mail.python.org/pipermail/python-dev/2005-August/055190.html
[13]python-dev email (Remove str.find in 3.0?) http://mail.python.org/pipermail/python-dev/2005-August/055705.html
[14]python-dev email (Replacement for print in Python 3.0) http://mail.python.org/pipermail/python-dev/2005-September/056154.html
[15]python-dev email ("defaultdict") http://mail.python.org/pipermail/python-dev/2006-February/061261.html
[16]python-3000 email http://mail.python.org/pipermail/python-3000/2006-April/000996.html
[17]python-3000 email ("Pronouncement on parameter lists") http://mail.python.org/pipermail/python-3000/2006-April/001175.html
[18]python-3000 email ("More wishful thinking") http://mail.python.org/pipermail/python-3000/2006-April/000810.html
[19]python-3000 email ("sets in P3K?") http://mail.python.org/pipermail/python-3000/2006-April/001286.html
[20]python-3000 email ("sets in P3K?") http://mail.python.org/pipermail/python-3000/2006-May/001666.html
[21]python-3000 email ("bug in modulus?") http://mail.python.org/pipermail/python-3000/2006-May/001735.html
[22]SF patch "sys.id() and sys.intern()" http://www.python.org/sf/1601678
[23]python-3000 email ("__nonzero__ vs. __bool__") http://mail.python.org/pipermail/python-3000/2006-November/004524.html
[24]python-3000 email ("Pre-peps on raise and except changes") http://mail.python.org/pipermail/python-3000/2007-February/005672.html
[25]python-3000 email ("Py3.0 Library Ideas") http://mail.python.org/pipermail/python-3000/2007-February/005726.html
[26]python-dev email ("Should we do away with unbound methods in Py3k?") http://mail.python.org/pipermail/python-dev/2007-November/075279.html
[27]python-dev email ("Mutable sequence .sort() signature") http://mail.python.org/pipermail/python-dev/2008-February/076818.html
[28](1, 2, 3) Python docs (sys -- System-specific parameters and functions) http://docs.python.org/library/sys.html
[29](1, 2) Python docs (operator -- Standard operators as functions) http://docs.python.org/library/operator.html
[30]Python docs (array -- Efficient arrays of numeric values) http://docs.python.org/library/array.html
[31]Python docs (File objects) http://docs.python.org/library/stdtypes.html
[32]Python docs (Additional methods for emulation of sequence types) http://docs.python.org/reference/datamodel.html#additional-methods-for-emulation-of-sequence-types
[33](1, 2) PEP 4 ("Deprecation of Standard Modules") http://www.python.org/dev/peps/pep-0004
[34](1, 2) PEP 238 (Changing the Division Operator) http://www.python.org/dev/peps/pep-0238
[35]PEP 274 (Dict Comprehensions) http://www.python.org/dev/peps/pep-0274
[36]PEP 289 ("Generator Expressions") http://www.python.org/dev/peps/pep-0289
[37]PEP 299 ("Special __main__() function in modules") http://www.python.org/dev/peps/pep-0299
[38]PEP 308 ("Conditional Expressions") http://www.python.org/dev/peps/pep-0308
[39](1, 2) PEP 328 (Imports: Multi-Line and Absolute/Relative) http://www.python.org/dev/peps/pep-0328
[40]PEP 343 (The "with" Statement) http://www.python.org/dev/peps/pep-0343
[41](1, 2) PEP 352 (Required Superclass for Exceptions) http://www.python.org/dev/peps/pep-0352
[42]PEP 3001 (Process for reviewing and improving standard library modules) http://www.python.org/dev/peps/pep-3001
[43]PEP 3099 (Things that will Not Change in Python 3000) http://www.python.org/dev/peps/pep-3099
[44]PEP 3105 (Make print a function) http://www.python.org/dev/peps/pep-3105
[45]PEP 3107 (Function Annotations) http://www.python.org/dev/peps/pep-3107
[46]PEP 3110 (Catching Exceptions in Python 3000) http://www.python.org/dev/peps/pep-3110/#semantic-changes
[47]Approach to resolving __builtin__ vs __builtins__ http://mail.python.org/pipermail/python-3000/2007-March/006161.html
[48]New name for __builtins__ http://mail.python.org/pipermail/python-dev/2007-November/075388.html
[49]Patch to remove sys.exitfunc http://www.python.org/sf/1680961
[50](1, 2) Remove deprecated functions from operator http://www.python.org/sf/1516309

pep-3101 Advanced String Formatting

PEP: 3101
Title: Advanced String Formatting
Version: $Revision$
Last-Modified: $Date$
Author: Talin <talin at acm.org>
Status: Final
Type: Standards Track
Content-Type: text/plain
Created: 16-Apr-2006
Python-Version: 3.0
Post-History: 28-Apr-2006, 6-May-2006, 10-Jun-2007, 14-Aug-2007, 14-Sep-2008

Abstract

    This PEP proposes a new system for built-in string formatting
    operations, intended as a replacement for the existing '%' string
    formatting operator.


Rationale

    Python currently provides two methods of string interpolation:

    - The '%' operator for strings. [1]

    - The string.Template module. [2]

    The primary scope of this PEP concerns proposals for built-in
    string formatting operations (in other words, methods of the
    built-in string type).

    The '%' operator is primarily limited by the fact that it is a
    binary operator, and therefore can take at most two arguments.
    One of those arguments is already dedicated to the format string,
    leaving all other variables to be squeezed into the remaining
    argument.  The current practice is to use either a dictionary or a
    tuple as the second argument, but as many people have commented
    [3], this lacks flexibility.  The "all or nothing" approach
    (meaning that one must choose between only positional arguments,
    or only named arguments) is felt to be overly constraining.

    While there is some overlap between this proposal and
    string.Template, it is felt that each serves a distinct need,
    and that one does not obviate the other.  This proposal is for
    a mechanism which, like '%', is efficient for small strings
    which are only used once, so, for example, compilation of a
    string into a template is not contemplated in this proposal,
    although the proposal does take care to define format strings
    and the API in such a way that an efficient template package
    could reuse the syntax and even some of the underlying
    formatting code.


Specification

    The specification will consist of the following parts:

    - Specification of a new formatting method to be added to the
      built-in string class.

    - Specification of functions and flag values to be added to
      the string module, so that the underlying formatting engine
      can be used with additional options.

    - Specification of a new syntax for format strings.

    - Specification of a new set of special methods to control the
      formatting and conversion of objects.

    - Specification of an API for user-defined formatting classes.

    - Specification of how formatting errors are handled.

    Note on string encodings: When discussing this PEP in the context
    of Python 3.0, it is assumed that all strings are unicode strings,
    and that the use of the word 'string' in the context of this
    document will generally refer to a Python 3.0 string, which is
    the same as the Python 2.x unicode object.

    In the context of Python 2.x, the use of the word 'string' in this
    document refers to an object which may either be a regular string
    or a unicode object.  All of the function call interfaces
    described in this PEP can be used for both strings and unicode
    objects, and in all cases there is sufficient information
    to be able to properly deduce the output string type (in
    other words, there is no need for two separate APIs).
    In all cases, the type of the format string dominates - that
    is, the conversion will always produce an object that contains
    the same representation of characters as the input format
    string.


String Methods

    The built-in string class (and also the unicode class in 2.6) will
    gain a new method, 'format', which takes an arbitrary number of
    positional and keyword arguments:

        "The story of {0}, {1}, and {c}".format(a, b, c=d)

    Within a format string, each positional argument is identified
    with a number, starting from zero, so in the above example, 'a' is
    argument 0 and 'b' is argument 1.  Each keyword argument is
    identified by its keyword name, so in the above example, 'c' is
    used to refer to the third argument.
    
    There is also a global built-in function, 'format' which formats
    a single value:
    
       print(format(10.0, "7.3g"))
       
    This function is described in a later section.
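Both entry points behave as the examples above suggest; the results below are checked against the behavior this PEP specifies:

```python
# str.format() with mixed positional and keyword arguments.
assert ("The story of {0}, {1}, and {c}".format("a", "b", c="d")
        == "The story of a, b, and d")
# The format() builtin applies a format spec to a single value:
# width 7, 3 significant digits, general ('g') format.
assert format(10.0, "7.3g") == "     10"
```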


Format Strings

    Format strings consist of intermingled character data and markup.

    Character data is data which is transferred unchanged from the
    format string to the output string; markup is not transferred from
    the format string directly to the output, but instead is used to
    define 'replacement fields' that describe to the format engine
    what should be placed in the output string in place of the markup.

    Brace characters ('curly braces') are used to indicate a
    replacement field within the string:

        "My name is {0}".format('Fred')

    The result of this is the string:

        "My name is Fred"

    Braces can be escaped by doubling:

        "My name is {0} :-{{}}".format('Fred')

    Which would produce:

        "My name is Fred :-{}"

    The element within the braces is called a 'field'.  Fields consist
    of a 'field name', which can either be simple or compound, and an
    optional 'format specifier'.


Simple and Compound Field Names

    Simple field names are either names or numbers.  If numbers, they
    must be valid base-10 integers; if names, they must be valid
    Python identifiers.  A number is used to identify a positional
    argument, while a name is used to identify a keyword argument.

    A compound field name is a combination of multiple simple field
    names in an expression:

        "My name is {0.name}".format(open('out.txt', 'w'))

    This example shows the use of the 'getattr' or 'dot' operator
    in a field expression.  The dot operator allows an attribute of
    an input value to be specified as the field value.

    Unlike some other programming languages, you cannot embed arbitrary
    expressions in format strings.  This is by design - the types of
    expressions that you can use are deliberately limited.  Only two
    operators are supported: the '.' (getattr) operator, and the '[]'
    (getitem) operator.  The reason for allowing these operators is
    that they don't normally have side effects in non-pathological
    code.

    An example of the 'getitem' syntax:

        "My name is {0[name]}".format(dict(name='Fred'))

    It should be noted that the use of 'getitem' within a format string
    is much more limited than its conventional usage.  In the above example,
    the string 'name' really is the literal string 'name', not a variable
    named 'name'.  The rules for parsing an item key are very simple.
    If it starts with a digit, then it is treated as a number, otherwise
    it is used as a string.
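The item-key parsing rule above is easy to demonstrate: a key made of digits is treated as an integer index, anything else as a literal string key.

```python
# '1' is parsed as an integer index into the list.
assert "{0[1]}".format(["a", "b"]) == "b"
# 'name' is a literal (unquoted) string key, not a variable.
assert "{0[name]}".format({"name": "Fred"}) == "Fred"
```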

    Because keys are not quote-delimited, it is not possible to
    specify arbitrary dictionary keys (e.g., the strings "10" or
    ":-]") from within a format string.

    Implementation note: The implementation of this proposal is
    not required to enforce the rule about a simple or dotted name
    being a valid Python identifier.  Instead, it will rely on the
    getattr function of the underlying object to throw an exception if
    the identifier is not legal.  The str.format() function will have
    a minimalist parser which only attempts to figure out when it is
    "done" with an identifier (by finding a '.' or a ']', or '}',
    etc.).


Format Specifiers

    Each field can also specify an optional set of 'format
    specifiers' which can be used to adjust the format of that field.
    Format specifiers follow the field name, with a colon (':')
    character separating the two:

        "My name is {0:8}".format('Fred')

    The meaning and syntax of the format specifiers depends on the
    type of object that is being formatted, but there is a standard
    set of format specifiers used for any object that does not
    override them.

    Format specifiers can themselves contain replacement fields.
    For example, a field whose field width is itself a parameter
    could be specified via:

        "{0:{1}}".format(a, b)

    These 'internal' replacement fields can only occur in the format
    specifier part of the replacement field.  Internal replacement fields
    cannot themselves have format specifiers.  This implies also that
    replacement fields cannot be nested to arbitrary levels.
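The "{0:{1}}" form above can be exercised directly; internal replacement fields may supply the width (or other spec components) as arguments:

```python
# Field width passed as a parameter via an internal replacement field.
assert "{0:{1}}".format("hi", 6) == "hi    "
# Fill, alignment, and width can all be parameterized the same way.
assert ("{0:{fill}{align}{width}}".format("x", fill="*", align="^", width=5)
        == "**x**")
```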

    Note that the doubled '}' at the end, which would normally be
    escaped, is not escaped in this case.  The reason is because
    the '{{' and '}}' syntax for escapes is only applied when used
    *outside* of a format field.  Within a format field, the brace
    characters always have their normal meaning.

    The syntax for format specifiers is open-ended, since a class
    can override the standard format specifiers.  In such cases,
    the str.format() method merely passes all of the characters between
    the first colon and the matching brace to the relevant underlying
    formatting method.


Standard Format Specifiers

    If an object does not define its own format specifiers, a standard
    set of format specifiers is used.  These are similar in concept to
    the format specifiers used by the existing '%' operator, however
    there are also a number of differences.

    The general form of a standard format specifier is:

        [[fill]align][sign][#][0][minimumwidth][.precision][type]

    The brackets ([]) indicate an optional element.

    The optional 'align' flag can be one of the following:

        '<' - Forces the field to be left-aligned within the available
              space (This is the default.)
        '>' - Forces the field to be right-aligned within the
              available space.
        '=' - Forces the padding to be placed after the sign (if any)
              but before the digits.  This is used for printing fields
              in the form '+000000120'. This alignment option is only
              valid for numeric types.
        '^' - Forces the field to be centered within the available
              space.

    Note that unless a minimum field width is defined, the field
    width will always be the same size as the data to fill it, so
    that the alignment option has no meaning in this case.

    The optional 'fill' character defines the character to be used to
    pad the field to the minimum width.  The fill character, if present,
    must be followed by an alignment flag.
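    The fill/align combination can be illustrated as follows (behavior as
    implemented in current Python; note that '^' places any odd leftover
    padding on the right):

```python
# Fill character '*' immediately followed by an alignment flag.
print("{0:*<10}".format("x"))   # x*********  (left-aligned)
print("{0:*>10}".format("x"))   # *********x  (right-aligned)
print("{0:*^10}".format("x"))   # ****x*****  (centered)
```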

    The 'sign' option is only valid for numeric types, and can be one
    of the following:

        '+'  - indicates that a sign should be used for both
               positive as well as negative numbers
        '-'  - indicates that a sign should be used only for negative
               numbers (this is the default behavior)
        ' '  - indicates that a leading space should be used on
               positive numbers
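    The three sign options compare as follows (output as produced by
    current Python):

```python
print("{0:+}".format(42))    # +42  (sign on positive numbers too)
print("{0:+}".format(-42))   # -42
print("{0:-}".format(42))    # 42   (the default: sign only if negative)
print("{0: }".format(42))    #  42  (leading space on positive numbers)
```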

    If the '#' character is present, integers use the 'alternate form'
    for formatting.  This means that binary, octal, and hexadecimal
    output will be prefixed with '0b', '0o', and '0x', respectively.

    'width' is a decimal integer defining the minimum field width.  If
    not specified, then the field width will be determined by the
    content.
    
    If the width field is preceded by a zero ('0') character, this enables
    zero-padding.  This is equivalent to an alignment type of '=' and a
    fill character of '0'.
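    The stated equivalence can be checked directly (in current Python,
    which implements this rule):

```python
# '0' fill with '=' alignment versus the '0'-prefixed width shorthand.
a = "{0:0=8}".format(-120)   # explicit: fill '0', align '='
b = "{0:08}".format(-120)    # shorthand zero-padding
print(a, b)                  # -0000120 -0000120
```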

    The 'precision' is a decimal number indicating how many digits
    should be displayed after the decimal point in a floating point
    conversion.  For non-numeric types the field indicates the maximum
    field size - in other words, how many characters will be used from
    the field content.  The precision is ignored for integer conversions.

    Finally, the 'type' determines how the data should be presented.
    
    The available integer presentation types are:

        'b' - Binary. Outputs the number in base 2.
        'c' - Character. Converts the integer to the corresponding
              Unicode character before printing.
        'd' - Decimal Integer. Outputs the number in base 10.
        'o' - Octal format. Outputs the number in base 8.
        'x' - Hex format. Outputs the number in base 16, using lower-
              case letters for the digits above 9.
        'X' - Hex format. Outputs the number in base 16, using upper-
              case letters for the digits above 9.
        'n' - Number. This is the same as 'd', except that it uses the
              current locale setting to insert the appropriate
              number separator characters.
        '' (None) - the same as 'd'
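    The integer presentation types, as implemented in today's Python,
    behave like this (using the built-in format() function described later
    in this PEP):

```python
n = 255
print(format(n, 'b'))    # 11111111
print(format(n, 'o'))    # 377
print(format(n, 'x'))    # ff
print(format(n, 'X'))    # FF
print(format(n, '#x'))   # 0xff  (alternate form adds the prefix)
print(format(97, 'c'))   # a     (corresponding Unicode character)
```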

    The available floating point presentation types are:

        'e' - Exponent notation. Prints the number in scientific
              notation using the letter 'e' to indicate the exponent.
        'E' - Exponent notation. Same as 'e' except it converts the
              number to uppercase.
        'f' - Fixed point. Displays the number as a fixed-point
              number.
        'F' - Fixed point. Same as 'f' except it converts the number
              to uppercase.
        'g' - General format. This prints the number as a fixed-point
              number, unless the number is too large, in which case
              it switches to 'e' exponent notation.
        'G' - General format. Same as 'g' except switches to 'E'
              if the number gets too large.
        'n' - Number. This is the same as 'g', except that it uses the
              current locale setting to insert the appropriate
              number separator characters.
        '%' - Percentage. Multiplies the number by 100 and displays
              in fixed ('f') format, followed by a percent sign.
        '' (None) - similar to 'g', except that it prints at least one
              digit after the decimal point.
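    A few of the floating point types side by side (output as produced by
    current Python, combining each type with a precision):

```python
x = 1234.5678
print(format(x, '.2e'))     # 1.23e+03  (exponent notation)
print(format(x, '.2f'))     # 1234.57   (fixed point)
print(format(x, '.3g'))     # 1.23e+03  (general: switches to 'e' here)
print(format(0.25, '.1%'))  # 25.0%     (multiplied by 100, 'f' format)
```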

    Objects are able to define their own format specifiers to
    replace the standard ones.  An example is the 'datetime' class,
    whose format specifiers might look something like the
    arguments to the strftime() function:

        "Today is: {0:%a %b %d %H:%M:%S %Y}".format(datetime.now())

    For all built-in types, an empty format specification will produce
    the equivalent of str(value).  It is recommended that objects
    defining their own format specifiers follow this convention as
    well.


Explicit Conversion Flag

    The explicit conversion flag is used to transform the format field value
    before it is formatted.  This can be used to override the type-specific
    formatting behavior, and format the value as if it were a more
    generic type.  Currently, two explicit conversion flags are
    recognized:

        !r - convert the value to a string using repr().
        !s - convert the value to a string using str().

    These flags are placed before the format specifier:

        "{0!r:20}".format("Hello")

    In the preceding example, the string "Hello" will be printed, with quotes,
    in a field of at least 20 characters width.
    
    A custom Formatter class can define additional conversion flags.
    The built-in formatter will raise a ValueError if an invalid
    conversion flag is specified.
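    The example above can be verified in current Python; note that the
    field width applies to the repr() result, quotes included:

```python
s = "{0!r:20}".format("Hello")
print(repr(s))   # "'Hello'             " -- repr() output padded to 20
print(len(s))    # 20
```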


Controlling Formatting on a Per-Type Basis

    Each Python type can control formatting of its instances by defining
    a __format__ method.  The __format__ method is responsible for
    interpreting the format specifier, formatting the value, and
    returning the resulting string.
    
    The new, global built-in function 'format' simply calls this special
    method, similar to how len() and str() simply call their respective
    special methods:
    
        def format(value, format_spec):
            return value.__format__(format_spec)
            
    It is safe to call this function with a value of "None" (because the
    "None" value in Python is an object and can have methods.)

    Several built-in types, including 'str', 'int', 'float', and 'object'
    define __format__ methods.  This means that if you derive from any of
    those types, your class will know how to format itself.
    
    The object.__format__ method is the simplest: It simply converts the
    object to a string, and then calls format again:
    
        class object:
            def __format__(self, format_spec):
                return format(str(self), format_spec)
                
    The __format__ methods for 'int' and 'float' will do numeric formatting
    based on the format specifier.  In some cases, these formatting
    operations may be delegated to other types.  So for example, in the case
    where the 'int' formatter sees a format type of 'f' (meaning 'float')
    it can simply cast the value to a float and call format() again.
    
    Any class can override the __format__ method to provide custom
    formatting for that type:

        class AST:
            def __format__(self, format_spec):
                ...

    Note for Python 2.x: The 'format_spec' argument will be either
    a string object or a unicode object, depending on the type of the
    original format string.  The __format__ method should test the type
    of the specifiers parameter to determine whether to return a string or
    unicode object.  It is the responsibility of the __format__ method
    to return an object of the proper type.
    
    Note that the 'explicit conversion' flag mentioned above is not passed
    to the __format__ method.  Rather, it is expected that the conversion
    specified by the flag will be performed before calling __format__.
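    A minimal sketch of a class defining its own __format__ method (the
    Temperature class and its 'C'/'F' specifiers are hypothetical, invented
    for illustration):

```python
class Temperature:
    """Hypothetical class with custom format specifiers 'C' and 'F'."""
    def __init__(self, celsius):
        self.celsius = celsius

    def __format__(self, format_spec):
        if format_spec == 'F':                  # custom specifier
            return '{0:.1f}F'.format(self.celsius * 9 / 5 + 32)
        if format_spec in ('', 'C'):            # empty spec acts like str()
            return '{0:.1f}C'.format(self.celsius)
        raise ValueError('unknown format specifier: ' + format_spec)

print("{0:C} is {0:F}".format(Temperature(20)))   # 20.0C is 68.0F
```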


User-Defined Formatting

    There will be times when customizing the formatting of fields
    on a per-type basis is not enough.  An example might be a
    spreadsheet application, which displays hash marks '#' when a value
    is too large to fit in the available space.

    For more powerful and flexible formatting, access to the underlying
    format engine can be obtained through the 'Formatter' class that
    lives in the 'string' module.  This class takes additional options
    which are not accessible via the normal str.format method.
    
    An application can subclass the Formatter class to create its own
    customized formatting behavior.

    The PEP does not attempt to exactly specify all methods and
    properties defined by the Formatter class; instead, those will be
    defined and documented in the initial implementation.  However, this
    PEP will specify the general requirements for the Formatter class,
    which are listed below.

    Although string.format() does not directly use the Formatter class
    to do formatting, both use the same underlying implementation.  The
    reason that string.format() does not use the Formatter class directly
    is because "string" is a built-in type, which means that all of its
    methods must be implemented in C, whereas Formatter is a Python
    class.  Formatter provides an extensible wrapper around the same
    C functions as are used by string.format().


Formatter Methods

    The Formatter class takes no initialization arguments:
    
        fmt = Formatter()

    The public API methods of class Formatter are as follows:

        -- format(format_string, *args, **kwargs)
        -- vformat(format_string, args, kwargs)
        
    'format' is the primary API method.  It takes a format template,
    and an arbitrary set of positional and keyword arguments.
    'format' is just a wrapper that calls 'vformat'.

    'vformat' is the function that does the actual work of formatting.  It
    is exposed as a separate function for cases where you want to pass in
    a predefined dictionary of arguments, rather than unpacking and
    repacking the dictionary as individual arguments using the '*args' and
    '**kwds' syntax.  'vformat' does the work of breaking up the format
    template string into character data and replacement fields.  It calls
    the 'get_value' and other overridable methods as appropriate
    (described below.)
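    Using the string.Formatter class as it exists in today's standard
    library, the two entry points behave as follows:

```python
from string import Formatter

fmt = Formatter()
# 'format' unpacks its arguments...
print(fmt.format("{0}, {name}!", "Hello", name="world"))
# ...while 'vformat' takes them pre-packed as a sequence and a mapping.
print(fmt.vformat("{0}, {name}!", ("Hello",), {"name": "world"}))
```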

    Formatter defines the following overridable methods:
        
        -- get_value(key, args, kwargs)
        -- check_unused_args(used_args, args, kwargs)
        -- format_field(value, format_spec)

    'get_value' is used to retrieve a given field value.  The 'key' argument
    will be either an integer or a string.  If it is an integer, it represents
    the index of the positional argument in 'args'; If it is a string, then
    it represents a named argument in 'kwargs'.
    
    The 'args' parameter is set to the list of positional arguments to
    'vformat', and the 'kwargs' parameter is set to the dictionary of
    keyword arguments.
    
    For compound field names, these functions are only called for the
    first component of the field name; subsequent components are handled
    through normal attribute and indexing operations.
    
    So for example, the field expression '0.name' would cause 'get_value'
    to be called with a 'key' argument of 0.  The 'name' attribute will be
    looked up after 'get_value' returns by calling the built-in 'getattr'
    function.

    If the index or keyword refers to an item that does not exist, then an
    IndexError/KeyError should be raised.
    
    'check_unused_args' is used to implement checking for unused arguments
    if desired.  The arguments to this function are the set of all argument
    keys that were actually referred to in the format string (integers for
    positional arguments, and strings for named arguments), and a reference
    to the args and kwargs that were passed to vformat.  The set of unused
    args can be calculated from these parameters.  'check_unused_args'
    is assumed to throw an exception if the check fails.
    
    'format_field' simply calls the global 'format' built-in.  The method
    is provided so that subclasses can override it.
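    As a sketch of how these hooks compose, here is a subclass (written
    against the string.Formatter API in today's standard library) that
    rejects arguments never referenced by the format string, assuming
    used_args contains integer keys for positional arguments and string
    keys for named ones:

```python
from string import Formatter

class StrictFormatter(Formatter):
    """Sketch: raise if any supplied argument goes unused."""
    def check_unused_args(self, used_args, args, kwargs):
        supplied = set(range(len(args))) | set(kwargs)
        unused = supplied - set(used_args)
        if unused:
            raise ValueError("unused arguments: %r" % sorted(unused, key=str))

fmt = StrictFormatter()
print(fmt.format("{0}", "ok"))        # ok -- every argument was used
try:
    fmt.format("{0}", "ok", extra=1)  # 'extra' is never referenced
except ValueError as e:
    print(e)
```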

    To get a better understanding of how these functions relate to each
    other, here is pseudocode that explains the general operation of
    vformat.
    
        def vformat(self, format_string, args, kwargs):
        
          # Output buffer and set of used args
          buffer = StringIO.StringIO()
          used_args = set()
          
          # Tokens are either format fields or literal strings
          for token in self.parse(format_string):
            if is_format_field(token):
              # Split the token into field value and format spec
              field_spec, _, format_spec = token.partition(":")
              
              # Check for explicit type conversion
              explicit, _, field_spec  = field_spec.rpartition("!")
              
              # 'first_part' is the part before the first '.' or '['
              # Assume that 'get_first_part' returns either an int or
              # a string, depending on the syntax.
              first_part = get_first_part(field_spec)
              value = self.get_value(first_part, args, kwargs)
              
              # Record the fact that we used this arg
              used_args.add(first_part)
              
              # Handle [subfield] or .subfield. Assume that 'components'
              # returns an iterator of the various subfields, not including
              # the first part.
              for comp in components(field_spec):
                value = resolve_subfield(value, comp)

              # Handle explicit type conversion
              if explicit == 'r':
                value = repr(value)
              elif explicit == 's':
                value = str(value)

              # Call the global 'format' function and write out the converted
              # value.
              buffer.write(self.format_field(value, format_spec))
              
            else:
              buffer.write(token)
              
          self.check_unused_args(used_args, args, kwargs)
          return buffer.getvalue()
          
    Note that the actual algorithm of the Formatter class (which will be
    implemented in C) may not be the one presented here.  (It's likely
    that the actual implementation won't be a 'class' at all - rather,
    vformat may just call a C function which accepts the other overridable
    methods as arguments.)  The primary purpose of this code example is to
    illustrate the order in which overridable methods are called.


Customizing Formatters

    This section describes some typical ways that Formatter objects
    can be customized.

    To support alternative format-string syntax, the 'vformat' method
    can be overridden to alter the way format strings are parsed.

    One common desire is to support a 'default' namespace, so that
    you don't need to pass in keyword arguments to the format()
    method, but can instead use values in a pre-existing namespace.
    This can easily be done by overriding get_value() as follows:

       class NamespaceFormatter(Formatter):
          def __init__(self, namespace={}):
              Formatter.__init__(self)
              self.namespace = namespace

          def get_value(self, key, args, kwds):
              if isinstance(key, str):
                  try:
                      # Check explicitly passed arguments first
                      return kwds[key]
                  except KeyError:
                      return self.namespace[key]
              else:
                  return Formatter.get_value(self, key, args, kwds)

    One can use this to easily create a formatting function that allows
    access to global variables, for example:

        fmt = NamespaceFormatter(globals())

        greeting = "hello"
        print(fmt.format("{greeting}, world!"))

    A similar technique can be used with the locals() dictionary to
    gain access to local variables.

    It would also be possible to create a 'smart' namespace formatter
    that could automatically access both locals and globals through
    snooping of the calling stack.  Due to the need for compatibility
    with the different versions of Python, such a capability will not
    be included in the standard library, however it is anticipated
    that someone will create and publish a recipe for doing this.

    Another type of customization is to change the way that built-in
    types are formatted by overriding the 'format_field' method.  (For
    non-built-in types, you can simply define a __format__ special
    method on that type.)  So for example, you could override the
    formatting of numbers to output scientific notation when needed.
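    A sketch of that last idea, overriding format_field so that large
    floats with no explicit specifier fall back to scientific notation
    (the threshold and class name are invented for illustration):

```python
from string import Formatter

class SciFormatter(Formatter):
    """Sketch: format large bare floats in scientific notation."""
    def format_field(self, value, format_spec):
        if isinstance(value, float) and not format_spec and abs(value) >= 1e6:
            format_spec = 'e'            # switch to exponent notation
        return Formatter.format_field(self, value, format_spec)

fmt = SciFormatter()
print(fmt.format("{0}", 2500000.0))   # 2.500000e+06
print(fmt.format("{0}", 2.5))         # 2.5 -- small values unchanged
```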


Error handling

    There are two classes of exceptions which can occur during formatting:
    exceptions generated by the formatter code itself, and exceptions
    generated by user code (such as a field object's 'getattr' function).

    In general, exceptions generated by the formatter code itself are
    of the "ValueError" variety -- there is an error in the actual "value"
    of the format string.  (This is not always true; for example, the
    string.format() function might be passed a non-string as its first
    parameter, which would result in a TypeError.)

    The text associated with these internally generated ValueError
    exceptions will indicate the location of the exception inside
    the format string, as well as the nature of the exception.

    For exceptions generated by user code, a trace record and
    dummy frame will be added to the traceback stack to help
    in determining the location in the string where the exception
    occurred.  The inserted traceback will indicate that the
    error occurred at:

        File "<format_string>", line XX, in column_YY

    where XX and YY represent the line and character position
    information in the string, respectively.


Alternate Syntax

    Naturally, one of the most contentious issues is the syntax of the
    format strings, and in particular the markup conventions used to
    indicate fields.

    Rather than attempting to exhaustively list all of the various
    proposals, I will cover the ones that are most widely used
    already.

    - Shell variable syntax: $name and $(name) (or in some variants,
      ${name}).  This is probably the oldest convention out there, and
      is used by Perl and many others.  When used without the braces,
      the length of the variable is determined by lexically scanning
      until an invalid character is found.

      This scheme is generally used in cases where interpolation is
      implicit - that is, in environments where any string can contain
      interpolation variables, and no special substitution function
      need be invoked.  In such cases, it is important to prevent the
      interpolation behavior from occurring accidentally, so the '$'
      (which is otherwise a relatively uncommonly-used character) is
      used to signal when the behavior should occur.

      It is the author's opinion, however, that in cases where the
      formatting is explicitly invoked, that less care needs to be
      taken to prevent accidental interpolation, in which case a
      lighter and less unwieldy syntax can be used.

    - printf and its cousins ('%'), including variations that add a
      field index, so that fields can be interpolated out of order.

    - Other bracket-only variations.  Various MUDs (Multi-User
      Dungeons) such as MUSH have used brackets (e.g. [name]) to do
      string interpolation.  The Microsoft .Net libraries use braces
      ({}), and a syntax which is very similar to the one in this
      proposal, although the syntax for format specifiers is quite
      different. [4]

    - Backquoting.  This method has the benefit of minimal syntactical
      clutter, however it lacks many of the benefits of a function
      call syntax (such as complex expression arguments, custom
      formatters, etc.).

    - Other variations include Ruby's #{}, PHP's {$name}, and so
      on.

    Some specific aspects of the syntax warrant additional comments:

    1) Backslash character for escapes.  The original version of
    this PEP used backslash rather than doubling to escape a bracket.
    This worked because backslashes in Python string literals that
    don't conform to a standard backslash sequence such as '\n'
    are left unmodified.  However, this caused a certain amount
    of confusion, and led to potential situations of multiple
    recursive escapes, i.e. '\\\\{' to place a literal backslash
    in front of a bracket.

    2) The use of the colon character (':') as a separator for
    format specifiers.  This was chosen simply because that's
    what .Net uses.


Alternate Feature Proposals

    Restricting attribute access: An earlier version of the PEP
    restricted the ability to access attributes beginning with a
    leading underscore, for example "{0}._private".  However, this
    is a useful ability to have when debugging, so the feature
    was dropped.
    
    Some developers suggested that the ability to do 'getattr' and
    'getitem' access should be dropped entirely.  However, this
    is in conflict with the needs of another set of developers who
    strongly lobbied for the ability to pass in a large dict as a
    single argument (without flattening it into individual keyword
    arguments using the **kwargs syntax) and then have the format
    string refer to dict entries individually.
    
    There have also been suggestions to expand the set of expressions
    that are allowed in a format string.  However, this was seen
    to go against the spirit of TOOWTDI, since the same effect can
    be achieved in most cases by executing the same expression on
    the parameter before it's passed in to the formatting function.
    For cases where the format string is being used to do arbitrary
    formatting in a data-rich environment, it's recommended to use
    a template engine specialized for this purpose, such as
    Genshi [5] or Cheetah [6].
    
    Many other features were considered and rejected because they
    could easily be achieved by subclassing Formatter instead of
    building the feature into the base implementation.  This includes
    alternate syntax, comments in format strings, and many others.
    

Security Considerations

    Historically, string formatting has been a common source of
    security holes in web-based applications, particularly if the
    string formatting system allows arbitrary expressions to be
    embedded in format strings.

    The best way to use string formatting in a way that does not
    create potential security holes is to never use format strings
    that come from an untrusted source.
    
    Barring that, the next best approach is to ensure that string
    formatting has no side effects.  Because of the open nature of
    Python, it is impossible to guarantee that any non-trivial
    operation has this property.  What this PEP does is limit the
    types of expressions in format strings to those in which visible
    side effects are both rare and strongly discouraged by the
    culture of Python developers.  So for example, attribute access
    is allowed because it would be considered pathological to write
    code where the mere access of an attribute has visible side
    effects (whether the code has *invisible* side effects - such
    as creating a cache entry for faster lookup - is irrelevant.)


Sample Implementation

    An implementation of an earlier version of this PEP was created by
    Patrick Maupin and Eric V. Smith, and can be found in the pep3101
    sandbox at:

       http://svn.python.org/view/sandbox/trunk/pep3101/


Backwards Compatibility

    Backwards compatibility can be maintained by leaving the existing
    mechanisms in place.  The new system does not collide with any of
    the method names of the existing string formatting techniques, so
    both systems can co-exist until it comes time to deprecate the
    older system.


References

    [1] Python Library Reference - String formatting operations
        http://docs.python.org/library/stdtypes.html#string-formatting-operations

    [2] Python Library References - Template strings
        http://docs.python.org/library/string.html#string.Template

    [3] [Python-3000] String formating operations in python 3k
        http://mail.python.org/pipermail/python-3000/2006-April/000285.html

    [4] Composite Formatting - [.Net Framework Developer's Guide]
        http://msdn.microsoft.com/library/en-us/cpguide/html/cpconcompositeformatting.asp?frame=true
        
    [5] Genshi templating engine.
        http://genshi.edgewall.org/

    [6] Cheetah - The Python-Powered Template Engine.
        http://www.cheetahtemplate.org/


Copyright

    This document has been placed in the public domain.


pep-3102 Keyword-Only Arguments

PEP: 3102
Title: Keyword-Only Arguments
Version: $Revision$
Last-Modified: $Date$
Author: Talin <talin at acm.org>
Status: Final
Type: Standards Track
Content-Type: text/plain
Created: 22-Apr-2006
Python-Version: 3.0
Post-History: 28-Apr-2006, 19-May-2006

Abstract

    This PEP proposes a change to the way that function arguments are
    assigned to named parameter slots.  In particular, it enables the
    declaration of "keyword-only" arguments: arguments that can only
    be supplied by keyword and which will never be automatically
    filled in by a positional argument.


Rationale

    The current Python function-calling paradigm allows arguments to
    be specified either by position or by keyword.  An argument can be
    filled in either explicitly by name, or implicitly by position.

    There are often cases where it is desirable for a function to take
    a variable number of arguments.  The Python language supports this
    using the 'varargs' syntax ('*name'), which specifies that any
    'left over' arguments be passed into the varargs parameter as a
    tuple.

    One limitation on this is that currently, all of the regular
    argument slots must be filled before the vararg slot can be.

    This is not always desirable.  One can easily envision a function
    which takes a variable number of arguments, but also takes one
    or more 'options' in the form of keyword arguments.  Currently,
    the only way to do this is to define both a varargs argument,
    and a 'keywords' argument (**kwargs), and then manually extract
    the desired keywords from the dictionary.


Specification

    Syntactically, the proposed changes are fairly simple.  The first
    change is to allow regular arguments to appear after a varargs
    argument:

        def sortwords(*wordlist, case_sensitive=False):
           ...

    This function accepts any number of positional arguments, and it
    also accepts a keyword option called 'case_sensitive'.  This
    option will never be filled in by a positional argument, but
    must be explicitly specified by name.
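    With a hypothetical body filled in, the declaration above works like
    this in Python 3:

```python
def sortwords(*wordlist, case_sensitive=False):
    # Illustrative body: sort the collected positional arguments.
    key = None if case_sensitive else str.lower
    return sorted(wordlist, key=key)

print(sortwords('Pear', 'apple', 'Banana'))
# ['apple', 'Banana', 'Pear']
# The option must be passed by keyword:
print(sortwords('Pear', 'apple', 'Banana', case_sensitive=True))
# ['Banana', 'Pear', 'apple']
```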

    Keyword-only arguments are not required to have a default value.
    Since Python requires that all arguments be bound to a value,
    and since the only way to bind a value to a keyword-only argument
    is via keyword, such arguments are therefore 'required keyword'
    arguments.  Such arguments must be supplied by the caller, and
    they must be supplied via keyword.

    The second syntactical change is to allow the argument name to
    be omitted for a varargs argument. The meaning of this is to
    allow for keyword-only arguments for functions that would not
    otherwise take a varargs argument:

        def compare(a, b, *, key=None):
            ...

    The reasoning behind this change is as follows.  Imagine for a
    moment a function which takes several positional arguments, as
    well as a keyword argument:

        def compare(a, b, key=None):
            ...

    Now, suppose you wanted to have 'key' be a keyword-only argument.
    Under the above syntax, you could accomplish this by adding a
    varargs argument immediately before the keyword argument:

        def compare(a, b, *ignore, key=None):
            ...

    Unfortunately, the 'ignore' argument will also suck up any
    erroneous positional arguments that may have been supplied by the
    caller.  Given that we'd prefer any unwanted arguments to raise an
    error, we could do this:

        def compare(a, b, *ignore, key=None):
            if ignore:  # If ignore is not empty
                raise TypeError

    As a convenient shortcut, we can simply omit the 'ignore' name,
    meaning 'don't allow any positional arguments beyond this point'.
    
    (Note: After much discussion of alternative syntax proposals, the
    BDFL has pronounced in favor of this 'single star' syntax for
    indicating the end of positional parameters.)
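    The bare-star form can be demonstrated directly (behavior as shipped
    in Python 3):

```python
def compare(a, b, *, key=None):
    # Compare the keyed values if a key function was given.
    return (key(a) if key else a) < (key(b) if key else b)

print(compare(1, -2, key=abs))   # True -- compares abs(1) < abs(-2)
try:
    compare(1, -2, abs)          # a third positional argument is rejected
except TypeError as e:
    print(type(e).__name__)      # TypeError
```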
    

Function Calling Behavior

    The previous section describes the difference between the old
    behavior and the new.  However, it is also useful to have a
    description of the new behavior that stands by itself, without
    reference to the previous model.  So this next section will
    attempt to provide such a description.

    When a function is called, the input arguments are assigned to
    formal parameters as follows:

      - For each formal parameter, there is a slot which will be used
        to contain the value of the argument assigned to that
        parameter.

      - Slots which have had values assigned to them are marked as
        'filled'.  Slots which have no value assigned to them yet are
        considered 'empty'.

      - Initially, all slots are marked as empty.

      - Positional arguments are assigned first, followed by keyword
        arguments.

      - For each positional argument:

         o Attempt to bind the argument to the first unfilled
           parameter slot.  If the slot is not a vararg slot, then
           mark the slot as 'filled'.

         o If the next unfilled slot is a vararg slot, and it does
           not have a name, then it is an error.

         o Otherwise, if the next unfilled slot is a vararg slot then
           all remaining non-keyword arguments are placed into the
           vararg slot.

      - For each keyword argument:

         o If there is a parameter with the same name as the keyword,
           then the argument value is assigned to that parameter slot.
           However, if the parameter slot is already filled, then that
           is an error.

         o Otherwise, if there is a 'keyword dictionary' argument,
           the argument is added to the dictionary using the keyword
           name as the dictionary key, unless there is already an
           entry with that key, in which case it is an error.

         o Otherwise, if there is no keyword dictionary, and no
           matching named parameter, then it is an error.

      - Finally:

         o If the vararg slot is not yet filled, assign an empty tuple
           as its value.

         o For each remaining empty slot: if there is a default value
           for that slot, then fill the slot with the default value.
           If there is no default value, then it is an error.

    In accordance with the current Python implementation, any errors
    encountered will be signaled by raising TypeError.  (If you want
    something different, that's a subject for a different PEP.)
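The binding rules above can be exercised with a small pure-Python simulation. The function name 'bind_args' and its parameters are illustrative inventions for this sketch, not any real CPython API; the unnamed-vararg ("bare star") error case is omitted for brevity.

```python
def bind_args(posnames, varargname, kwargname, defaults, args, kwargs):
    """Assign args/kwargs to parameter slots per the rules above.

    posnames   -- names of the formal (non-vararg) parameters
    varargname -- name of the *args slot, or None
    kwargname  -- name of the **kwargs slot, or None
    defaults   -- dict mapping parameter name -> default value
    """
    slots = {}                     # filled slots; an absent key means 'empty'
    # Positional arguments are assigned first.
    for i, value in enumerate(args):
        if i < len(posnames):
            slots[posnames[i]] = value
        elif varargname is not None:
            # All remaining non-keyword arguments go into the vararg slot.
            slots[varargname] = tuple(args[i:])
            break
        else:
            raise TypeError("too many positional arguments")
    # Keyword arguments follow.
    for name, value in kwargs.items():
        if name in posnames:
            if name in slots:
                raise TypeError("duplicate value for %r" % name)
            slots[name] = value
        elif kwargname is not None:
            slots.setdefault(kwargname, {})
            if name in slots[kwargname]:
                raise TypeError("duplicate keyword %r" % name)
            slots[kwargname][name] = value
        else:
            raise TypeError("unexpected keyword %r" % name)
    # Finally: an unfilled vararg slot becomes the empty tuple, and
    # defaults fill any remaining empty slots.
    if varargname is not None:
        slots.setdefault(varargname, ())
    if kwargname is not None:
        slots.setdefault(kwargname, {})
    for name in posnames:
        if name not in slots:
            if name not in defaults:
                raise TypeError("missing argument %r" % name)
            slots[name] = defaults[name]
    return slots
```

For example, `bind_args(['a'], 'rest', None, {}, (1, 2, 3), {})` fills slot 'a' with 1 and places the remaining positionals (2, 3) into the vararg slot.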


Backwards Compatibility

    The function calling behavior specified in this PEP is a superset
    of the existing behavior - that is, it is expected that any
    existing programs will continue to work.


Copyright

    This document has been placed in the public domain.


pep-3103 A Switch/Case Statement

PEP:3103
Title:A Switch/Case Statement
Version:$Revision$
Last-Modified:$Date$
Author:guido at python.org (Guido van Rossum)
Status:Rejected
Type:Standards Track
Content-Type:text/x-rst
Created:25-Jun-2006
Python-Version:3.0
Post-History:26-Jun-2006

Rejection Notice

A quick poll during my keynote presentation at PyCon 2007 shows this proposal has no popular support. I therefore reject it.

Abstract

Python-dev has recently seen a flurry of discussion on adding a switch statement. In this PEP I'm trying to extract my own preferences from the smorgasbord of proposals, discussing alternatives and explaining my choices where I can. I'll also indicate how strongly I feel about alternatives I discuss.

This PEP should be seen as an alternative to PEP 275. My views are somewhat different from that PEP's author, but I'm grateful for the work done in that PEP.

This PEP introduces canonical names for the many variants that have been discussed for different aspects of the syntax and semantics, such as "alternative 1", "school II", "option 3" and so on. Hopefully these names will help the discussion.

Rationale

A common programming idiom is to consider an expression and do different things depending on its value. This is usually done with a chain of if/elif tests; I'll refer to this form as the "if/elif chain". There are two main motivations to want to introduce new syntax for this idiom:

  • It is repetitive: the variable and the test operator, usually '==' or 'in', are repeated in each if/elif branch.
  • It is inefficient: when an expression matches the last test value (or no test value at all) it is compared to each of the preceding test values.

Both of these complaints are relatively mild; there isn't a lot of readability or performance to be gained by writing this differently. Yet, some kind of switch statement is found in many languages and it is not unreasonable to expect that its addition to Python will allow us to write up certain code more cleanly and efficiently than before.
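The two forms of the idiom can be put side by side; this sketch uses invented names and HTTP status codes purely for illustration. The if/elif chain repeats the variable and the '==' operator in every branch, while the dict version tests the expression only once.

```python
# The if/elif chain: repetitive, and the last case pays for all the
# comparisons before it.
def describe_chain(code):
    if code == 200:
        return "ok"
    elif code == 404:
        return "not found"
    elif code == 500:
        return "server error"
    else:
        return "unknown"

# The same dispatch via a precomputed dict: one hash lookup, no repetition.
_RESPONSES = {200: "ok", 404: "not found", 500: "server error"}

def describe_dict(code):
    return _RESPONSES.get(code, "unknown")
```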

There are forms of dispatch that are not suitable for the proposed switch statement; for example, when the number of cases is not statically known, or when it is desirable to place the code for different cases in different classes or files.

Basic Syntax

I'm considering several variants of the syntax first proposed in PEP 275 here. There are lots of other possibilities, but I don't see that they add anything.

I've recently been converted to alternative 1.

I should note that all alternatives here have the "implicit break" property: at the end of the suite for a particular case, the control flow jumps to the end of the whole switch statement. There is no way to pass control from one case to another. This in contrast to C, where an explicit 'break' statement is required to prevent falling through to the next case.

In all alternatives, the else-suite is optional. It is more Pythonic to use 'else' here rather than introducing a new reserved word, 'default', as in C.

Semantics are discussed in the next top-level section.

Alternative 1

This is the preferred form in PEP 275:

switch EXPR:
    case EXPR:
        SUITE
    case EXPR:
        SUITE
    ...
    else:
        SUITE

The main downside is that the suites, where all the action is, are indented two levels deep; this can be remedied by indenting the cases "half a level" (e.g. 2 spaces if the general indentation level is 4).

Alternative 2

This is Fredrik Lundh's preferred form; it differs by not indenting the cases:

switch EXPR:
case EXPR:
    SUITE
case EXPR:
    SUITE
...
else:
    SUITE

Some reasons not to choose this include expected difficulties for auto-indenting editors, folding editors, and the like; and confused users. There are no situations currently in Python where a line ending in a colon is followed by an unindented line.

Alternative 3

This is the same as alternative 2 but leaves out the colon after the switch:

switch EXPR
case EXPR:
    SUITE
case EXPR:
    SUITE
...
else:
    SUITE

The hope of this alternative is that it will upset the auto-indent logic of the average Python-aware text editor less. But it looks strange to me.

Alternative 4

This leaves out the 'case' keyword on the basis that it is redundant:

switch EXPR:
    EXPR:
        SUITE
    EXPR:
        SUITE
    ...
    else:
        SUITE

Unfortunately now we are forced to indent the case expressions, because otherwise (at least in the absence of an 'else' keyword) the parser would have a hard time distinguishing between an unindented case expression (which continues the switch statement) or an unrelated statement that starts like an expression (such as an assignment or a procedure call). The parser is not smart enough to backtrack once it sees the colon. This is my least favorite alternative.

Extended Syntax

There is one additional concern that needs to be addressed syntactically. Often two or more values need to be treated the same. In C, this is done by writing multiple case labels together without any code between them. The "fall through" semantics then mean that these are all handled by the same code. Since the Python switch will not have fall-through semantics (which have yet to find a champion) we need another solution. Here are some alternatives.

Alternative A

Use:

case EXPR:

to match on a single expression; use:

case EXPR, EXPR, ...:

to match on multiple expressions. This is interpreted so that if EXPR is a parenthesized tuple or another expression whose value is a tuple, the switch expression must equal that tuple, not one of its elements. This means that we cannot use a variable to indicate multiple cases. While this is also true in C's switch statement, it is a relatively common occurrence in Python (see for example sre_compile.py).

Alternative B

Use:

case EXPR:

to match on a single expression; use:

case in EXPR_LIST:

to match on multiple expressions. If EXPR_LIST is a single expression, the 'in' forces its interpretation as an iterable (or something supporting __contains__, in a minority semantics alternative). If it is multiple expressions, each of those is considered for a match.

Alternative C

Use:

case EXPR:

to match on a single expression; use:

case EXPR, EXPR, ...:

to match on multiple expressions (as in alternative A); and use:

case *EXPR:

to match on the elements of an expression whose value is an iterable. The latter two cases can be combined, so that the true syntax is more like this:

case [*]EXPR, [*]EXPR, ...:

The * notation is similar to the use of prefix * already in use for variable-length parameter lists and for passing computed argument lists, and often proposed for value-unpacking (e.g. a, b, *c = X as an alternative to (a, b), c = X[:2], X[2:]).

Alternative D

This is a mixture of alternatives B and C; the syntax is like alternative B but instead of the 'in' keyword it uses '*'. This is more limited, but still allows the same flexibility. It uses:

case EXPR:

to match on a single expression and:

case *EXPR:

to match on the elements of an iterable. If one wants to specify multiple matches in one case, one can write this:

case *(EXPR, EXPR, ...):

or perhaps this (although it's a bit strange because the relative priority of '*' and ',' is different than elsewhere):

case * EXPR, EXPR, ...:

Discussion

Alternatives B, C and D are motivated by the desire to specify multiple cases with the same treatment using a variable representing a set (usually a tuple) rather than spelling them out. The motivation for this is usually that if one has several switches over the same set of cases it's a shame to have to spell out all the alternatives each time. An additional motivation is to be able to specify ranges to be matched easily and efficiently, similar to Pascal's "1..1000:" notation. At the same time we want to prevent the kind of mistake that is common in exception handling (and which will be addressed in Python 3000 by changing the syntax of the except clause): writing "case 1, 2:" where "case (1, 2):" was meant, or vice versa.
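The "case 1, 2:" versus "case (1, 2):" confusion has an exact analogue in ordinary Python comparisons today, which this small illustration makes concrete:

```python
x = 1

# "One of 1 or 2" and "the tuple (1, 2)" are very different tests:
matches_either = x in (1, 2)   # membership test: true for x == 1 or x == 2
matches_tuple = x == (1, 2)    # equality with the tuple itself: false here
```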

The case could be made that the need is insufficient for the added complexity; C doesn't have a way to express ranges either, and it's used a lot more than Pascal these days. Also, if a dispatch method based on dict lookup is chosen as the semantics, large ranges could be inefficient (consider range(1, sys.maxint)).

All in all my preferences are (from most to least favorite) B, A, D', C, where D' is D without the third possibility.

Semantics

There are several issues to review before we can choose the right semantics.

If/Elif Chain vs. Dict-based Dispatch

There are several main schools of thought about the switch statement's semantics:

  • School I wants to define the switch statement in terms of an equivalent if/elif chain (possibly with some optimization thrown in).
  • School II prefers to think of it as a dispatch on a precomputed dict. There are different choices for when the precomputation happens.
  • There's also school III, which agrees with school I that the definition of a switch statement should be in terms of an equivalent if/elif chain, but concedes to the optimization camp that all expressions involved must be hashable.

We need to further separate school I into school Ia and school Ib:

  • School Ia has a simple position: a switch statement is translated to an equivalent if/elif chain, and that's that. It should not be linked to optimization at all. That is also my main objection against this school: without any hint of optimization, the switch statement isn't attractive enough to warrant new syntax.
  • School Ib has a more complex position: it agrees with school II that optimization is important, and is willing to concede the compiler certain liberties to allow this. (For example, PEP 275 Solution 1.) In particular, hash() of the switch and case expressions may or may not be called (so it should be side-effect-free); and the case expressions may not be evaluated each time as expected by the if/elif chain behavior, so the case expressions should also be side-effect-free. My objection to this (elaborated below) is that if either the hash() or the case expressions aren't side-effect-free, optimized and unoptimized code may behave differently.

School II grew out of the realization that optimization of commonly found cases isn't so easy, and that it's better to face this head on. This will become clear below.

The differences between school I (mostly school Ib) and school II are fourfold:

  • When optimizing using a dispatch dict, if either the switch expression or the case expressions are unhashable (in which case hash() raises an exception), school Ib requires catching the hash() failure and falling back to an if/elif chain. School II simply lets the exception happen. The problem with catching an exception in hash() as required by school Ib, is that this may hide a genuine bug. A possible way out is to only use a dispatch dict if all case expressions are ints, strings or other built-ins with known good hash behavior, and to only attempt to hash the switch expression if it is also one of those types. Type objects should probably also be supported here. This is the (only) problem that school III addresses.
  • When optimizing using a dispatch dict, if the hash() function of any expression involved returns an incorrect value, under school Ib, optimized code will not behave the same as unoptimized code. This is a well-known problem with optimization-related bugs, and wastes lots of developer time. Under school II, in this situation incorrect results are produced at least consistently, which should make debugging a bit easier. The way out proposed for the previous bullet would also help here.
  • School Ib doesn't have a good optimization strategy if the case expressions are named constants. The compiler cannot know their values for sure, and it cannot know whether they are truly constant. As a way out, it has been proposed to re-evaluate the expression corresponding to the case once the dict has identified which case should be taken, to verify that the value of the expression didn't change. But strictly speaking, all the case expressions occurring before that case would also have to be checked, in order to preserve the true if/elif chain semantics, thereby completely killing the optimization. Another proposed solution is to have callbacks notifying the dispatch dict of changes in the value of variables or attributes involved in the case expressions. But this is not likely implementable in the general case, and would require many namespaces to bear the burden of supporting such callbacks, which currently don't exist at all.
  • Finally, there's a difference of opinion regarding the treatment of duplicate cases (i.e. two or more cases with match expressions that evaluate to the same value). School I wants to treat this the same as an if/elif chain would treat it (i.e. the first match wins and the code for the second match is silently unreachable); school II wants this to be an error at the time the dispatch dict is frozen (so dead code doesn't go undiagnosed).

School I sees trouble in school II's approach of pre-freezing a dispatch dict because it places a new and unusual burden on programmers to understand exactly what kinds of case values are allowed to be frozen and when the case values will be frozen, or they might be surprised by the switch statement's behavior.

School II doesn't believe that school Ia's unoptimized switch is worth the effort, and it sees trouble in school Ib's proposal for optimization, which can cause optimized and unoptimized code to behave differently.

In addition, school II sees little value in allowing cases involving unhashable values; after all if the user expects such values, they can just as easily write an if/elif chain. School II also doesn't believe that it's right to allow dead code due to overlapping cases to occur unflagged, when the dict-based dispatch implementation makes it so easy to trap this.

However, there are some use cases for overlapping/duplicate cases. Suppose you're switching on some OS-specific constants (e.g. exported by the os module or some module like that). You have a case for each. But on some OS, two different constants have the same value (since on that OS they are implemented the same way -- like O_TEXT and O_BINARY on Unix). If duplicate cases are flagged as errors, your switch wouldn't work at all on that OS. It would be much better if you could arrange the cases so that one case has preference over another.

There's also the (more likely) use case where you have a set of cases to be treated the same, but one member of the set must be treated differently. It would be convenient to put the exception in an earlier case and be done with it.

(Yes, it seems a shame not to be able to diagnose dead code due to accidental case duplication. Maybe that's less important, and pychecker can deal with it? After all we don't diagnose duplicate method definitions either.)

This suggests school IIb: like school II but redundant cases must be resolved by choosing the first match. This is trivial to implement when building the dispatch dict (skip keys already present).
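The "skip keys already present" rule is indeed trivial to implement; this sketch uses invented constant values (two OS-level flags that happen to be equal, as in the earlier O_TEXT/O_BINARY example) to show the first case winning:

```python
# Hypothetical equal constants, standing in for two OS flags that are
# implemented identically on some platform.
O_TEXT, O_BINARY = 0x4000, 0x4000

cases = [
    (O_TEXT, "text mode"),
    (O_BINARY, "binary mode"),   # duplicate key: silently loses to the first
]

# School IIb's freezing rule: keys already present are skipped, so the
# earlier case takes precedence instead of raising an error.
dispatch = {}
for key, action in cases:
    dispatch.setdefault(key, action)
```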

(An alternative would be to introduce new syntax to indicate "okay to have overlapping cases" or "ok if this case is dead code" but I find that overkill.)

Personally, I'm in school II: I believe that the dict-based dispatch is the one true implementation for switch statements and that we should face the limitations up front, so that we can reap maximal benefits. I'm leaning towards school IIb -- duplicate cases should be resolved by the ordering of the cases instead of flagged as errors.

When to Freeze the Dispatch Dict

For the supporters of school II (dict-based dispatch), the next big dividing issue is when to create the dict used for switching. I call this "freezing the dict".

The main problem that makes this interesting is the observation that Python doesn't have named compile-time constants. What is conceptually a constant, such as re.IGNORECASE, is a variable to the compiler, and there's nothing to stop crooked code from modifying its value.

Option 1

The most limiting option is to freeze the dict in the compiler. This would require that the case expressions are all literals or compile-time expressions involving only literals and operators whose semantics are known to the compiler, since with the current state of Python's dynamic semantics and single-module compilation, there is no hope for the compiler to know with sufficient certainty the values of any variables occurring in such expressions. This is widely though not universally considered too restrictive.

Raymond Hettinger is the main advocate of this approach. He proposes a syntax where only a single literal of certain types is allowed as the case expression. It has the advantage of being unambiguous and easy to implement.

My main complaint about this is that by disallowing "named constants" we force programmers to give up good habits. Named constants are introduced in most languages to solve the problem of "magic numbers" occurring in the source code. For example, sys.maxint is a lot more readable than 2147483647. Raymond proposes to use string literals instead of named "enums", observing that the string literal's content can be the name that the constant would otherwise have. Thus, we could write "case 'IGNORECASE':" instead of "case re.IGNORECASE:". However, if there is a spelling error in the string literal, the case will silently be ignored, and who knows when the bug is detected. If there is a spelling error in a NAME, however, the error will be caught as soon as it is evaluated. Also, sometimes the constants are externally defined (e.g. when parsing a file format like JPEG) and we can't easily choose appropriate string values. Using an explicit mapping dict sounds like a poor hack.
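The failure modes can be contrasted directly. The handler names below are invented for illustration; the point is that a typo in a string key silently falls through to the default, while a typo in a dotted NAME fails loudly the moment it is evaluated:

```python
import re

# A misspelled string case goes unnoticed and takes the default path...
handlers = {"IGNORECASE": "case-insensitive", "MULTILINE": "multi-line"}
result = handlers.get("IGNORECSAE", "default")   # typo: silently 'default'

# ...whereas a misspelled attribute NAME raises immediately.
try:
    _ = re.IGNORECSAE            # typo: no such attribute on the re module
    misspelling_caught = False
except AttributeError:
    misspelling_caught = True
```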

Option 2

The oldest proposal to deal with this is to freeze the dispatch dict the first time the switch is executed. At this point we can assume that all the named "constants" (constant in the programmer's mind, though not to the compiler) used as case expressions are defined -- otherwise an if/elif chain would have little chance of success either. Assuming the switch will be executed many times, doing some extra work the first time pays back quickly by very quick dispatch times later.
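First-use freezing can be emulated today with a closure; everything here ('make_switch', 'const', the recording list) is an invented sketch, not proposed syntax. The case expressions are evaluated exactly once, on the first execution, and the frozen dict is reused afterwards:

```python
def make_switch(case_exprs):
    frozen = {}                        # empty until the switch first runs

    def switch(value, default=None):
        if not frozen:
            # First execution: evaluate every case expression once.
            for expr, action in case_exprs:
                frozen[expr()] = action
        return frozen.get(value, default)

    return switch

evaluations = []                       # records how often case exprs run

def const(v):
    def expr():
        evaluations.append(v)
        return v
    return expr

switch = make_switch([(const(1), "one"), (const(2), "two")])
first = switch(1)
second = switch(2)                     # dict already frozen; no re-evaluation
```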

An objection to this option is that there is no obvious object where the dispatch dict can be stored. It can't be stored on the code object, which is supposed to be immutable; it can't be stored on the function object, since many function objects may be created for the same function (e.g. for nested functions). In practice, I'm sure that something can be found; it could be stored in a section of the code object that's not considered when comparing two code objects or when pickling or marshalling a code object; or all switches could be stored in a dict indexed by weak references to code objects. The solution should also be careful not to leak switch dicts between multiple interpreters.

Another objection is that the first-use rule allows obfuscated code like this:

def foo(x, y):
    switch x:
    case y:
        print 42

To the untrained eye (not familiar with Python) this code would be equivalent to this:

def foo(x, y):
    if x == y:
        print 42

but that's not what it does (unless it is always called with the same value as the second argument). This has been addressed by suggesting that the case expressions should not be allowed to reference local variables, but this is somewhat arbitrary.

A final objection is that in a multi-threaded application, the first-use rule requires intricate locking in order to guarantee the correct semantics. (The first-use rule suggests a promise that side effects of case expressions are incurred exactly once.) This may be as tricky as the import lock has proved to be, since the lock has to be held while all the case expressions are being evaluated.

Option 3

A proposal that has been winning support (including mine) is to freeze a switch's dict when the innermost function containing it is defined. The switch dict is stored on the function object, just as parameter defaults are, and in fact the case expressions are evaluated at the same time and in the same scope as the parameter defaults (i.e. in the scope containing the function definition).

This option has the advantage of avoiding many of the finesses needed to make option 2 work: there's no need for locking, no worry about immutable code objects or multiple interpreters. It also provides a clear explanation for why locals can't be referenced in case expressions.

This option works just as well for situations where one would typically use a switch; case expressions involving imported or global named constants work exactly the same way as in option 2, as long as they are imported or defined before the function definition is encountered.
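Option 3's timing can be approximated with today's default-argument trick, since parameter defaults are exactly the precedent being invoked: the dict is evaluated once, when the def statement executes, in the enclosing scope. The names below are invented for illustration:

```python
RED, GREEN = "red", "green"     # "named constants" visible at definition time

def paint(color, _dispatch={RED: "stop", GREEN: "go"}):
    # _dispatch was frozen when the def statement ran, like any default.
    return _dispatch.get(color, "caution")

RED = "crimson"                 # later rebinding cannot affect the frozen dict
```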

A downside however is that the dispatch dict for a switch inside a nested function must be recomputed each time the nested function is defined. For certain "functional" styles of programming this may make switch unattractive in nested functions. (Unless all case expressions are compile-time constants; then the compiler is of course free to optimize away the switch freezing code and make the dispatch table part of the code object.)

Another downside is that under this option, there's no clear moment when the dispatch dict is frozen for a switch that doesn't occur inside a function. There are a few pragmatic choices for how to treat a switch outside a function:

  a. Disallow it.
  b. Translate it into an if/elif chain.
  c. Allow only compile-time constant expressions.
  d. Compute the dispatch dict each time the switch is reached.
  e. Like (b) but tests that all expressions evaluated are hashable.

Of these, (a) seems too restrictive: it's uniformly worse than (c); and (d) has poor performance for little or no benefits compared to (b). It doesn't make sense to have a performance-critical inner loop at the module level, as all local variable references are slow there; hence (b) is my (weak) favorite. Perhaps I should favor (e), which attempts to prevent atypical use of a switch; examples that work interactively but not in a function are annoying. In the end I don't think this issue is all that important (except it must be resolved somehow) and am willing to leave it up to whoever ends up implementing it.

When a switch occurs in a class but not in a function, we can freeze the dispatch dict at the same time the temporary function object representing the class body is created. This means the case expressions can reference module globals but not class variables. Alternatively, if we choose (b) above, we could choose this implementation inside a class definition as well.

Option 4

There are a number of proposals to add a construct to the language that makes the concept of a value pre-computed at function definition time generally available, without tying it either to parameter default values or case expressions. Some keywords proposed include 'const', 'static', 'only' or 'cached'. The associated syntax and semantics vary.

These proposals are out of scope for this PEP, except to suggest that if such a proposal is accepted, there are two ways for the switch to benefit: we could require case expressions to be either compile-time constants or pre-computed values; or we could make pre-computed values the default (and only) evaluation mode for case expressions. The latter would be my preference, since I don't see a use for more dynamic case expressions that isn't addressed adequately by writing an explicit if/elif chain.

Conclusion

It is too early to decide. I'd like to see at least one completed proposal for pre-computed values before deciding. In the mean time, Python is fine without a switch statement, and perhaps those who claim it would be a mistake to add one are right.

pep-3104 Access to Names in Outer Scopes

PEP:3104
Title:Access to Names in Outer Scopes
Version:$Revision$
Last-Modified:$Date$
Author:Ka-Ping Yee <ping at zesty.ca>
Status:Final
Type:Standards Track
Content-Type:text/x-rst
Created:12-Oct-2006
Python-Version:3.0
Post-History:

Abstract

In most languages that support nested scopes, code can refer to or rebind (assign to) any name in the nearest enclosing scope. Currently, Python code can refer to a name in any enclosing scope, but it can only rebind names in two scopes: the local scope (by simple assignment) or the module-global scope (using a global declaration).

This limitation has been raised many times on the Python-Dev mailing list and elsewhere, and has led to extended discussion and many proposals for ways to remove this limitation. This PEP summarizes the various alternatives that have been suggested, together with advantages and disadvantages that have been mentioned for each.

Rationale

Before version 2.1, Python's treatment of scopes resembled that of standard C: within a file there were only two levels of scope, global and local. In C, this is a natural consequence of the fact that function definitions cannot be nested. But in Python, though functions are usually defined at the top level, a function definition can be executed anywhere. This gave Python the syntactic appearance of nested scoping without the semantics, and yielded inconsistencies that were surprising to some programmers -- for example, a recursive function that worked at the top level would cease to work when moved inside another function, because the recursive function's own name would no longer be visible in its body's scope. This violates the intuition that a function should behave consistently when placed in different contexts. Here's an example:

def enclosing_function():
    def factorial(n):
        if n < 2:
            return 1
        return n * factorial(n - 1)  # fails with NameError
    print factorial(5)

Python 2.1 moved closer to static nested scoping by making visible the names bound in all enclosing scopes (see PEP 227). This change makes the above code example work as expected. However, because any assignment to a name implicitly declares that name to be local, it is impossible to rebind a name in an outer scope (except when a global declaration forces the name to be global). Thus, the following code, intended to display a number that can be incremented and decremented by clicking buttons, doesn't work as someone familiar with lexical scoping might expect:

def make_scoreboard(frame, score=0):
    label = Label(frame)
    label.pack()
    for i in [-10, -1, 1, 10]:
        def increment(step=i):
            score = score + step  # fails with UnboundLocalError
            label['text'] = score
        button = Button(frame, text='%+d' % i, command=increment)
        button.pack()
    return label

Python syntax doesn't provide a way to indicate that the name score mentioned in increment refers to the variable score bound in make_scoreboard, not a local variable in increment. Users and developers of Python have expressed an interest in removing this limitation so that Python can have the full flexibility of the Algol-style scoping model that is now standard in many programming languages, including JavaScript, Perl, Ruby, Scheme, Smalltalk, C with GNU extensions, and C# 2.0.

It has been argued that such a feature isn't necessary, because a rebindable outer variable can be simulated by wrapping it in a mutable object:

class Namespace:
    pass

def make_scoreboard(frame, score=0):
    ns = Namespace()
    ns.score = 0
    label = Label(frame)
    label.pack()
    for i in [-10, -1, 1, 10]:
        def increment(step=i):
            ns.score = ns.score + step
            label['text'] = ns.score
        button = Button(frame, text='%+d' % i, command=increment)
        button.pack()
    return label

However, this workaround only highlights the shortcomings of existing scopes: the purpose of a function is to encapsulate code in its own namespace, so it seems unfortunate that the programmer should have to create additional namespaces to make up for missing functionality in the existing local scopes, and then have to decide whether each name should reside in the real scope or the simulated scope.
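The workaround can be reduced to a GUI-free sketch (the names 'make_counter' and 'bump' are invented for this illustration): the counter state lives on a namespace object, so the inner function only mutates an attribute and never rebinds an outer name.

```python
class Namespace:
    pass

def make_counter(start=0):
    ns = Namespace()
    ns.score = start

    def increment(step):
        ns.score = ns.score + step   # mutates ns; no rebinding of 'ns' itself
        return ns.score

    return increment

bump = make_counter()
```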

Another common objection is that the desired functionality can be written as a class instead, albeit somewhat more verbosely. One rebuttal to this objection is that the existence of a different implementation style is not a reason to leave a supported programming construct (nested scopes) functionally incomplete. Python is sometimes called a "multi-paradigm language" because it derives so much strength, practical flexibility, and pedagogical power from its support and graceful integration of multiple programming paradigms.

A proposal for scoping syntax appeared on Python-Dev as far back as 1994 [1], long before PEP 227's support for nested scopes was adopted. At the time, Guido's response was:

This is dangerously close to introducing CSNS [classic static nested scopes]. If you were to do so, your proposed semantics of scoped seem allright. I still think there is not enough need for CSNS to warrant this kind of construct ...

After PEP 227, the "outer name rebinding discussion" has reappeared on Python-Dev enough times that it has become a familiar event, having recurred in its present form since at least 2003 [2]. Although none of the language changes proposed in these discussions have yet been adopted, Guido has acknowledged that a language change is worth considering [12].

Other Languages

To provide some background, this section describes how some other languages handle nested scopes and rebinding.

JavaScript, Perl, Scheme, Smalltalk, GNU C, C# 2.0

These languages use variable declarations to indicate scope. In JavaScript, a lexically scoped variable is declared with the var keyword; undeclared variable names are assumed to be global. In Perl, a lexically scoped variable is declared with the my keyword; undeclared variable names are assumed to be global. In Scheme, all variables must be declared (with define or let, or as formal parameters). In Smalltalk, any block can begin by declaring a list of local variable names between vertical bars. C and C# require type declarations for all variables. For all these cases, the variable belongs to the scope containing the declaration.

Ruby (as of 1.8)

Ruby is an instructive example because it appears to be the only other currently popular language that, like Python, tries to support statically nested scopes without requiring variable declarations, and thus has to come up with an unusual solution. Functions in Ruby can contain other function definitions, and they can also contain code blocks enclosed in curly braces. Blocks have access to outer variables, but nested functions do not. Within a block, an assignment to a name implies a declaration of a local variable only if it would not shadow a name already bound in an outer scope; otherwise assignment is interpreted as rebinding of the outer name. Ruby's scoping syntax and rules have also been debated at great length, and changes seem likely in Ruby 2.0 [28].

Overview of Proposals

There have been many different proposals on Python-Dev for ways to rebind names in outer scopes. They all fall into two categories: new syntax in the scope where the name is bound, or new syntax in the scope where the name is used.

New Syntax in the Binding (Outer) Scope

Scope Override Declaration

The proposals in this category all suggest a new kind of declaration statement similar to JavaScript's var. A few possible keywords have been proposed for this purpose:

  • scope x [4]
  • var x [4] [9]
  • my x [13]

In all these proposals, a declaration such as var x in a particular scope S would cause all references to x in scopes nested within S to refer to the x bound in S.

The primary objection to this category of proposals is that the meaning of a function definition would become context-sensitive. Moving a function definition inside some other block could cause any of the local name references in the function to become nonlocal, due to declarations in the enclosing block. For blocks in Ruby 1.8, this is actually the case; in the following example, the two setters have different effects even though they look identical:

setter1 = proc { | x | y = x }      # y is local here
y = 13
setter2 = proc { | x | y = x }      # y is nonlocal here
setter1.call(99)
puts y                              # prints 13
setter2.call(77)
puts y                              # prints 77

Note that although this proposal resembles declarations in JavaScript and Perl, the effect on the language is different because in those languages undeclared variables are global by default, whereas in Python undeclared variables are local by default. Thus, moving a function inside some other block in JavaScript or Perl can only reduce the scope of a previously global name reference, whereas in Python with this proposal, it could expand the scope of a previously local name reference.
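
Python today behaves in exactly the opposite, context-independent way: an assignment inside a function body always creates a local, regardless of what is bound in any enclosing context. A minimal sketch:

```python
y = 13

def setter(x):
    y = x          # always local: the module-level y is untouched

setter(99)
assert y == 13     # the module-level binding is unchanged
```

Under the scope-override proposals above, wrapping setter in a block that declares y would silently change the meaning of this assignment.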

Required Variable Declaration

A more radical proposal [21] suggests removing Python's scope-guessing convention altogether and requiring that all names be declared in the scope where they are to be bound, much like Scheme. With this proposal, var x = 3 would both declare x to belong to the local scope and bind it, whereas x = 3 would rebind the existing visible x. In a context without an enclosing scope containing a var x declaration, the statement x = 3 would be statically determined to be illegal.

This proposal yields a simple and consistent model, but it would be incompatible with all existing Python code.

New Syntax in the Referring (Inner) Scope

There are three kinds of proposals in this category.

Outer Reference Expression

This type of proposal suggests a new way of referring to a variable in an outer scope when using the variable in an expression. One syntax that has been suggested for this is .x [7], which would refer to x without creating a local binding for it. A concern with this proposal is that in many contexts x and .x could be used interchangeably, which would confuse the reader. A closely related idea is to use multiple dots to specify the number of scope levels to ascend [8], but most consider this too error-prone [17].

Rebinding Operator

This proposal suggests a new assignment-like operator that rebinds a name without declaring the name to be local [2]. Whereas the statement x = 3 both declares x a local variable and binds it to 3, the statement x := 3 would change the existing binding of x without declaring it local.

This is a simple solution, but according to PEP 3099 it has been rejected (perhaps because it would be too easy to miss or to confuse with =).

Scope Override Declaration

The proposals in this category suggest a new kind of declaration statement in the inner scope that prevents a name from becoming local. This statement would be similar in nature to the global statement, but instead of making the name refer to a binding in the top module-level scope, it would make the name refer to the binding in the nearest enclosing scope.

This approach is attractive due to its parallel with a familiar Python construct, and because it retains context-independence for function definitions.

This approach also has advantages from a security and debugging perspective. The resulting Python would not only match the functionality of other nested-scope languages but would do so with a syntax that is arguably even better for defensive programming. In most other languages, a declaration contracts the scope of an existing name, so inadvertently omitting the declaration could yield farther-reaching (i.e. more dangerous) effects than expected. In Python with this proposal, the extra effort of adding the declaration is aligned with the increased risk of non-local effects (i.e. the path of least resistance is the safer path).

Many spellings have been suggested for such a declaration:

  • scoped x [1]
  • global x in f [3] (explicitly specify which scope)
  • free x [5]
  • outer x [6]
  • use x [9]
  • global x [10] (change the meaning of global)
  • nonlocal x [11]
  • global x outer [18]
  • global in x [18]
  • not global x [18]
  • extern x [20]
  • ref x [22]
  • refer x [22]
  • share x [22]
  • sharing x [22]
  • common x [22]
  • using x [22]
  • borrow x [22]
  • reuse x [23]
  • scope f x [25] (explicitly specify which scope)

The most commonly discussed choices appear to be outer, global, and nonlocal. outer is already used as both a variable name and an attribute name in the standard library. The word global has a conflicting meaning, because "global variable" is generally understood to mean a variable with top-level scope [27]. In C, the keyword extern means that a name refers to a variable in a different compilation unit. While nonlocal is a bit long and less pleasant-sounding than some of the other options, it does have precisely the correct meaning: it declares a name not local.

Proposed Solution

The solution proposed by this PEP is to add a scope override declaration in the referring (inner) scope. Guido has expressed a preference for this category of solution on Python-Dev [14] and has shown approval for nonlocal as the keyword [19].

The proposed declaration:

nonlocal x

prevents x from becoming a local name in the current scope. All occurrences of x in the current scope will refer to the x bound in an outer enclosing scope. As with global, multiple names are permitted:

nonlocal x, y, z

If there is no pre-existing binding in an enclosing scope, the compiler raises a SyntaxError. (It may be a bit of a stretch to call this a syntax error, but so far SyntaxError is used for all compile-time errors, including, for example, __future__ import with an unknown feature name.) Guido has said that this kind of declaration in the absence of an outer binding should be considered an error [16].

If a nonlocal declaration collides with the name of a formal parameter in the local scope, the compiler raises a SyntaxError.
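
The declaration as specified here (minus the shorthand form discussed next) is what eventually shipped in Python 3; its effect can be sketched with a small closure:

```python
def make_counter():
    count = 0
    def increment():
        nonlocal count    # all uses of count refer to make_counter's binding
        count += 1
        return count
    return increment

inc = make_counter()
inc()   # 1
inc()   # 2
```

Without the nonlocal line, count += 1 would make count local to increment and raise UnboundLocalError; at module level, where no enclosing function scope exists, the declaration is a compile-time error.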

A shorthand form is also permitted, in which nonlocal is prepended to an assignment or augmented assignment:

nonlocal x = 3

The above has exactly the same meaning as nonlocal x; x = 3. (Guido supports a similar form of the global statement [24].)

On the left side of the shorthand form, only identifiers are allowed, not target expressions like x[0]. Otherwise, all forms of assignment are allowed. The proposed grammar of the nonlocal statement is:

nonlocal_stmt ::=
    "nonlocal" identifier ("," identifier)*
               ["=" (target_list "=")+ expression_list]
  | "nonlocal" identifier augop expression_list

The rationale for allowing all these forms of assignment is that it simplifies understanding of the nonlocal statement. Separating the shorthand form into a declaration and an assignment is sufficient to understand what it means and whether it is valid.

Backward Compatibility

This PEP targets Python 3000, as suggested by Guido [19]. However, others have noted that some options considered in this PEP may be small enough changes to be feasible in Python 2.x [26], in which case this PEP could possibly be moved to be a 2.x series PEP.

As a (very rough) measure of the impact of introducing a new keyword, here is the number of times that some of the proposed keywords appear as identifiers in the standard library, according to a scan of the Python SVN repository on November 5, 2006:

nonlocal    0
use         2
using       3
reuse       4
free        8
outer     147

global appears 214 times as an existing keyword. As a measure of the impact of using global as the outer-scope keyword, there are 18 files in the standard library that would break as a result of such a change (because a function declares a variable global before that variable has been introduced in the global scope):

cgi.py
dummy_thread.py
mhlib.py
mimetypes.py
idlelib/PyShell.py
idlelib/run.py
msilib/__init__.py
test/inspect_fodder.py
test/test_compiler.py
test/test_decimal.py
test/test_descr.py
test/test_dummy_threading.py
test/test_fileinput.py
test/test_global.py (not counted: this tests the keyword itself)
test/test_grammar.py (not counted: this tests the keyword itself)
test/test_itertools.py
test/test_multifile.py
test/test_scope.py (not counted: this tests the keyword itself)
test/test_threaded_import.py
test/test_threadsignals.py
test/test_warnings.py

References

[1] Scoping (was Re: Lambda binding solved?) (Rafael Bracho) http://www.python.org/search/hypermail/python-1994q1/0301.html
[2] Extended Function syntax (Just van Rossum) http://mail.python.org/pipermail/python-dev/2003-February/032764.html
[3] Closure semantics (Guido van Rossum) http://mail.python.org/pipermail/python-dev/2003-October/039214.html
[4] Better Control of Nested Lexical Scopes (Almann T. Goo) http://mail.python.org/pipermail/python-dev/2006-February/061568.html
[5] PEP for Better Control of Nested Lexical Scopes (Jeremy Hylton) http://mail.python.org/pipermail/python-dev/2006-February/061602.html
[6] PEP for Better Control of Nested Lexical Scopes (Almann T. Goo) http://mail.python.org/pipermail/python-dev/2006-February/061603.html
[7] Using and binding relative names (Phillip J. Eby) http://mail.python.org/pipermail/python-dev/2006-February/061636.html
[8] Using and binding relative names (Steven Bethard) http://mail.python.org/pipermail/python-dev/2006-February/061749.html
[9] Lexical scoping in Python 3k (Ka-Ping Yee) http://mail.python.org/pipermail/python-dev/2006-July/066862.html
[10] Lexical scoping in Python 3k (Greg Ewing) http://mail.python.org/pipermail/python-dev/2006-July/066889.html
[11] Lexical scoping in Python 3k (Ka-Ping Yee) http://mail.python.org/pipermail/python-dev/2006-July/066942.html
[12] Lexical scoping in Python 3k (Guido van Rossum) http://mail.python.org/pipermail/python-dev/2006-July/066950.html
[13] Explicit Lexical Scoping (pre-PEP?) (Talin) http://mail.python.org/pipermail/python-dev/2006-July/066978.html
[14] Explicit Lexical Scoping (pre-PEP?) (Guido van Rossum) http://mail.python.org/pipermail/python-dev/2006-July/066991.html
[15] Explicit Lexical Scoping (pre-PEP?) (Guido van Rossum) http://mail.python.org/pipermail/python-dev/2006-July/066995.html
[16] Lexical scoping in Python 3k (Guido van Rossum) http://mail.python.org/pipermail/python-dev/2006-July/066968.html
[17] Explicit Lexical Scoping (pre-PEP?) (Guido van Rossum) http://mail.python.org/pipermail/python-dev/2006-July/067004.html
[18] Explicit Lexical Scoping (pre-PEP?) (Andrew Clover) http://mail.python.org/pipermail/python-dev/2006-July/067007.html
[19] Explicit Lexical Scoping (pre-PEP?) (Guido van Rossum) http://mail.python.org/pipermail/python-dev/2006-July/067067.html
[20] Explicit Lexical Scoping (pre-PEP?) (Matthew Barnes) http://mail.python.org/pipermail/python-dev/2006-July/067221.html
[21] Sky pie: a "var" keyword (a thread started by Neil Toronto) http://mail.python.org/pipermail/python-3000/2006-October/003968.html
[22] Alternatives to 'outer' (Talin) http://mail.python.org/pipermail/python-3000/2006-October/004021.html
[23] Alternatives to 'outer' (Jim Jewett) http://mail.python.org/pipermail/python-3000/2006-November/004153.html
[24] Draft PEP for outer scopes (Guido van Rossum) http://mail.python.org/pipermail/python-3000/2006-November/004166.html
[25] Draft PEP for outer scopes (Talin) http://mail.python.org/pipermail/python-3000/2006-November/004190.html
[26] Draft PEP for outer scopes (Nick Coghlan) http://mail.python.org/pipermail/python-3000/2006-November/004237.html
[27] Global variable (version 2006-11-01T01:23:16) http://en.wikipedia.org/wiki/Global_variable
[28] Ruby 2.0 block local variable http://redhanded.hobix.com/inspect/ruby20BlockLocalVariable.html

Acknowledgements

The ideas and proposals mentioned in this PEP are gleaned from countless Python-Dev postings. Thanks to Jim Jewett, Mike Orr, Jason Orendorff, and Christian Tanzer for suggesting specific edits to this PEP.

pep-3105 Make print a function

PEP:3105
Title:Make print a function
Version:$Revision$
Last-Modified:$Date$
Author:Georg Brandl <georg at python.org>
Status:Final
Type:Standards Track
Content-Type:text/x-rst
Created:19-Nov-2006
Python-Version:3.0
Post-History:

Abstract

The title says it all -- this PEP proposes a new print() builtin that replaces the print statement and suggests a specific signature for the new function.

Rationale

The print statement has long appeared on lists of dubious language features that are to be removed in Python 3000, such as Guido's "Python Regrets" presentation [1]. As such, the objective of this PEP is not new, though it might become much disputed among Python developers.

The following arguments for a print() function are distilled from a python-3000 message by Guido himself [2]:

  • print is the only application-level functionality that has a statement dedicated to it. Within Python's world, syntax is generally used as a last resort, when something can't be done without help from the compiler. Print doesn't qualify for such an exception.
  • At some point in application development one quite often feels the need to replace print output by something more sophisticated, like logging calls or calls into some other I/O library. With a print() function, this is a straightforward string replacement; today it is a mess of adding all those parentheses and possibly converting >>stream style syntax.
  • Having special syntax for print puts up a much larger barrier for evolution, e.g. a hypothetical new printf() function is not too far-fetched when it will coexist with a print() function.
  • There's no easy way to convert print statements into another call if one needs a different separator, not spaces, or none at all. Also, there's no easy way at all to conveniently print objects with some other separator than a space.
  • If print() is a function, it would be much easier to replace it within one module (just def print(*args):...) or even throughout a program (e.g. by putting a different function in __builtin__.print). As it is, one can do this by writing a class with a write() method and assigning that to sys.stdout -- that's not bad, but definitely a much larger conceptual leap, and it works at a different level than print.
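
The last point can be sketched in Python 3 terms (where the module __builtin__ became builtins): a module-level def shadows the builtin for that module only, and can still delegate to it.

```python
import builtins

captured = []

def print(*args, **kwargs):
    # Module-local replacement: record the arguments, then delegate
    # to the real builtin so output still appears.
    captured.append(args)
    builtins.print(*args, **kwargs)

print("hello", "world")
# captured is now [("hello", "world")]
```

Assigning a replacement to builtins.print instead would affect every module in the program.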

Specification

The signature for print(), taken from various mailings and recently posted on the python-3000 list [3] is:

def print(*args, sep=' ', end='\n', file=None)

A call like:

print(a, b, c, file=sys.stderr)

will be equivalent to today's:

print >>sys.stderr, a, b, c

while the optional sep and end arguments specify what is printed between and after the arguments, respectively.
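
With the signature above (as it shipped in Python 3), sep and end behave as follows; writing to an io.StringIO here makes the output easy to inspect:

```python
import io

buf = io.StringIO()
print("a", "b", "c", sep="-", end="!\n", file=buf)  # custom separator and ending
print("second line", file=buf)                      # defaults: sep=' ', end='\n'
buf.getvalue()   # 'a-b-c!\nsecond line\n'
```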

The softspace feature (a semi-secret attribute on files currently used to tell print whether to insert a space before the first item) will be removed. Therefore, there will not be a direct translation for today's:

print "a",
print

which will not print a space between the "a" and the newline.
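
In Python 3 the nearest spelling uses end, which (as noted) leaves a trailing space that the 2.x softspace machinery would have suppressed before the newline:

```python
import io

buf = io.StringIO()
print("a", end=" ", file=buf)   # roughly: print "a",
print(file=buf)                 # roughly: print
buf.getvalue()   # 'a \n' -- note the space before the newline
```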

Backwards Compatibility

The changes proposed in this PEP will render most of today's print statements invalid. Only those which incidentally feature parentheses around all of their arguments will continue to be valid Python syntax in version 3.0, and of those, only the ones printing a single parenthesized value will continue to do the same thing. For example, in 2.x:

>>> print ("Hello")
Hello
>>> print ("Hello", "world")
('Hello', 'world')

whereas in 3.0:

>>> print ("Hello")
Hello
>>> print ("Hello", "world")
Hello world

Luckily, as it is a statement in Python 2, print can be detected and replaced reliably and non-ambiguously by an automated tool, so there should be no major porting problems (provided someone writes the mentioned tool).

Implementation

The proposed changes were implemented in the Python 3000 branch in the Subversion revisions 53685 to 53704. Most of the legacy code in the library has been converted too, but it is an ongoing effort to catch every print statement that may be left in the distribution.

pep-3106 Revamping dict.keys(), .values() and .items()

PEP:3106
Title:Revamping dict.keys(), .values() and .items()
Version:$Revision$
Last-Modified:$Date$
Author:Guido van Rossum
Status:Final
Type:Standards Track
Content-Type:text/x-rst
Created:19-Dec-2006
Post-History:

Abstract

This PEP proposes to change the .keys(), .values() and .items() methods of the built-in dict type to return a set-like or unordered container object whose contents are derived from the underlying dictionary rather than a list which is a copy of the keys, etc.; and to remove the .iterkeys(), .itervalues() and .iteritems() methods.

The approach is inspired by that taken in the Java Collections Framework [1].

Introduction

It has long been the plan to change the .keys(), .values() and .items() methods of the built-in dict type to return a more lightweight object than a list, and to get rid of .iterkeys(), .itervalues() and .iteritems(). The idea is that code that currently (in 2.x) reads:

for k, v in d.iteritems(): ...

should be rewritten as:

for k, v in d.items(): ...

(and similar for .itervalues() and .iterkeys(), except the latter is redundant since we can write that loop as for k in d.)

Code that currently reads:

a = d.keys()    # assume we really want a list here

(etc.) should be rewritten as

a = list(d.keys())

There are (at least) two ways to accomplish this. The original plan was to simply let .keys(), .values() and .items() return an iterator, i.e. exactly what iterkeys(), itervalues() and iteritems() return in Python 2.x. However, the Java Collections Framework [1] suggests that a better solution is possible: the methods return objects with set behavior (for .keys() and .items()) or multiset (== bag) behavior (for .values()) that do not contain copies of the keys, values or items, but rather reference the underlying dict and pull their values out of the dict as needed.

The advantage of this approach is that one can still write code like this:

a = d.items()
for k, v in a: ...
# And later, again:
for k, v in a: ...

Effectively, iter(d.keys()) (etc.) in Python 3.0 will do what d.iterkeys() (etc.) does in Python 2.x; but in most contexts we don't have to write the iter() call because it is implied by a for-loop.

The objects returned by the .keys() and .items() methods behave like sets. The object returned by the values() method behaves like a much simpler unordered collection -- it cannot be a set because duplicate values are possible.

Because of the set behavior, it will be possible to check whether two dicts have the same keys by simply testing:

if a.keys() == b.keys(): ...

and similarly for .items().
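
Using the view objects as released in Python 3, the set behaviour described above looks like this:

```python
a = {'x': 1, 'y': 2}
b = {'y': 3, 'z': 4}

a.keys() == b.keys()       # False
a.keys() & b.keys()        # {'y'} -- a new, real set object
a.keys() | b.keys()        # {'x', 'y', 'z'}
a.items() & b.items()      # set() -- the values bound to 'y' differ
```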

These operations are thread-safe only to the extent that using them in a thread-unsafe way may cause an exception but will not cause corruption of the internal representation.

As in Python 2.x, mutating a dict while iterating over it using an iterator has an undefined effect and will in most cases raise a RuntimeError exception. (This is similar to the guarantees made by the Java Collections Framework.)

The objects returned by .keys() and .items() are fully interoperable with instances of the built-in set and frozenset types; for example:

set(d.keys()) == d.keys()

is guaranteed to be True (except when d is being modified simultaneously by another thread).

Specification

I'm using pseudo-code to specify the semantics:

class dict:

    # Omitting all other dict methods for brevity.
    # The .iterkeys(), .itervalues() and .iteritems() methods
    # will be removed.

    def keys(self):
        return d_keys(self)

    def items(self):
        return d_items(self)

    def values(self):
        return d_values(self)

class d_keys:

    def __init__(self, d):
        self.__d = d

    def __len__(self):
        return len(self.__d)

    def __contains__(self, key):
        return key in self.__d

    def __iter__(self):
        for key in self.__d:
            yield key

    # The following operations should be implemented to be
    # compatible with sets; this can be done by exploiting
    # the above primitive operations:
    #
    #   <, <=, ==, !=, >=, > (returning a bool)
    #   &, |, ^, - (returning a new, real set object)
    #
    # as well as their method counterparts (.union(), etc.).
    #
    # To specify the semantics, we can specify x == y as:
    #
    #   set(x) == set(y)   if both x and y are d_keys instances
    #   set(x) == y        if x is a d_keys instance
    #   x == set(y)        if y is a d_keys instance
    #
    # and so on for all other operations.

class d_items:

    def __init__(self, d):
        self.__d = d

    def __len__(self):
        return len(self.__d)

    def __contains__(self, (key, value)):
        return key in self.__d and self.__d[key] == value

    def __iter__(self):
        for key in self.__d:
            yield key, self.__d[key]

    # As well as the set operations mentioned for d_keys above.
    # However the specifications suggested there will not work if
    # the values aren't hashable.  Fortunately, the operations can
    # still be implemented efficiently.  For example, this is how
    # intersection can be specified:

    def __and__(self, other):
        if isinstance(other, (set, frozenset, d_keys)):
            result = set()
            for item in other:
                if item in self:
                    result.add(item)
            return result
        if not isinstance(other, d_items):
            return NotImplemented
        d = {}
        if len(other) < len(self):
            self, other = other, self
        for item in self:
            if item in other:
                key, value = item
                d[key] = value
        return d.items()

    # And here is equality:

    def __eq__(self, other):
        if isinstance(other, (set, frozenset, d_keys)):
            if len(self) != len(other):
                return False
            for item in other:
                if item not in self:
                    return False
            return True
        if not isinstance(other, d_items):
            return NotImplemented
        # XXX We could also just compare the underlying dicts...
        if len(self) != len(other):
            return False
        for item in self:
            if item not in other:
                return False
        return True

    def __ne__(self, other):
        # XXX Perhaps object.__ne__() should be defined this way.
        result = self.__eq__(other)
        if result is not NotImplemented:
            result = not result
        return result

class d_values:

    def __init__(self, d):
        self.__d = d

    def __len__(self):
        return len(self.__d)

    def __contains__(self, value):
        # This is slow, and it's what "x in y" uses as a fallback
        # if __contains__ is not defined; but I'd rather make it
        # explicit that it is supported.
        for v in self:
            if v == value:
                return True
        return False

    def __iter__(self):
        for key in self.__d:
            yield self.__d[key]

    def __eq__(self, other):
        if not isinstance(other, d_values):
            return NotImplemented
        if len(self) != len(other):
            return False
        # XXX Sometimes this could be optimized, but these are the
        # semantics: we can't depend on the values to be hashable
        # or comparable.
        olist = list(other)
        for x in self:
            try:
                olist.remove(x)
            except ValueError:
                return False
        assert olist == []
        return True

    def __ne__(self, other):
        result = self.__eq__(other)
        if result is not NotImplemented:
            result = not result
        return result

Notes:

The view objects are not directly mutable, but don't implement __hash__(); their value can change if the underlying dict is mutated.

The only requirements on the underlying dict are that it implements __getitem__(), __contains__(), __iter__(), and __len__().

We don't implement .copy() -- the presence of a .copy() method suggests that the copy has the same type as the original, but that's not feasible without copying the underlying dict. If you want a copy of a specific type, like list or set, you can just pass one of the above to the list() or set() constructor.

The specification implies that the order in which items are returned by .keys(), .values() and .items() is the same (just as it was in Python 2.x), because the order is all derived from the dict iterator (which is presumably arbitrary but stable as long as a dict isn't modified). This can be expressed by the following invariant:

list(d.items()) == list(zip(d.keys(), d.values()))
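
The invariant is easy to check directly against Python 3 as released:

```python
d = {'a': 1, 'b': 2, 'c': 3}
# All three views iterate in the same (dict-determined) order.
list(d.items()) == list(zip(d.keys(), d.values()))   # True
```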

Open Issues

Do we need more of a motivation? I would think that being able to do set operations on keys and items without having to copy them should speak for itself.

I've left out the implementation of various set operations. These could still present small surprises.

It would be okay if multiple calls to d.keys() (etc.) returned the same object, since the object's only state is the dict to which it refers. Is this worth having extra slots in the dict object for? Should that be a weak reference or should the d_keys (etc.) object live forever once created? Strawman: probably not worth the extra slots in every dict.

Should d_keys, d_values and d_items have a public instance variable or method through which one can retrieve the underlying dict? Strawman: yes (but what should it be called?).

I'm soliciting better names than d_keys, d_values and d_items. These classes could be public so that their implementations could be reused by the .keys(), .values() and .items() methods of other mappings. Or should they?

Should the d_keys, d_values and d_items classes be reusable? Strawman: yes.

Should they be subclassable? Strawman: yes (but see below).

A particularly nasty issue is whether operations that are specified in terms of other operations (e.g. .discard()) must really be implemented in terms of those other operations; this may appear irrelevant but it becomes relevant if these classes are ever subclassed. Historically, Python has a really poor track record of specifying the semantics of highly optimized built-in types clearly in such cases; my strawman is to continue that trend. Subclassing may still be useful to add new methods, for example.

I'll leave the decisions (especially about naming) up to whoever submits a working implementation.

pep-3107 Function Annotations

PEP:3107
Title:Function Annotations
Version:$Revision$
Last-Modified:$Date$
Author:Collin Winter <collinwinter at google.com>, Tony Lownds <tony at lownds.com>
Status:Final
Type:Standards Track
Content-Type:text/x-rst
Created:2-Dec-2006
Python-Version:3.0
Post-History:

Abstract

This PEP introduces a syntax for adding arbitrary metadata annotations to Python functions [1].

Rationale

Because Python's 2.x series lacks a standard way of annotating a function's parameters and return values, a variety of tools and libraries have appeared to fill this gap. Some utilise the decorators introduced in PEP 318, while others parse a function's docstring, looking for annotations there.

This PEP aims to provide a single, standard way of specifying this information, reducing the confusion caused by the wide variation in mechanism and syntax that has existed until this point.

Fundamentals of Function Annotations

Before launching into a discussion of the precise ins and outs of Python 3.0's function annotations, let's first talk broadly about what annotations are and are not:

  1. Function annotations, both for parameters and return values, are completely optional.

  2. Function annotations are nothing more than a way of associating arbitrary Python expressions with various parts of a function at compile-time.

    By itself, Python does not attach any particular meaning or significance to annotations. Left to its own, Python simply makes these expressions available as described in Accessing Function Annotations below.

    The only way that annotations take on meaning is when they are interpreted by third-party libraries. These annotation consumers can do anything they want with a function's annotations. For example, one library might use string-based annotations to provide improved help messages, like so:

    def compile(source: "something compilable",
                filename: "where the compilable thing comes from",
                mode: "is this a single statement or a suite?"):
        ...
    

    Another library might be used to provide typechecking for Python functions and methods. This library could use annotations to indicate the function's expected input and return types, possibly something like:

    def haul(item: Haulable, *vargs: PackAnimal) -> Distance:
        ...
    

    However, neither the strings in the first example nor the type information in the second example have any meaning on their own; meaning comes from third-party libraries alone.

  3. Following from point 2, this PEP makes no attempt to introduce any kind of standard semantics, even for the built-in types. This work will be left to third-party libraries.

Syntax

Parameters

Annotations for parameters take the form of optional expressions that follow the parameter name:

def foo(a: expression, b: expression = 5):
    ...

In pseudo-grammar, parameters now look like identifier [: expression] [= expression]. That is, annotations always precede a parameter's default value and both annotations and default values are optional. Just like how equal signs are used to indicate a default value, colons are used to mark annotations. All annotation expressions are evaluated when the function definition is executed, just like default values.

Annotations for excess parameters (i.e., *args and **kwargs) are indicated similarly:

def foo(*args: expression, **kwargs: expression):
    ...

Annotations for nested parameters always follow the name of the parameter, not the last parenthesis. Annotating all parameters of a nested parameter is not required:

def foo((x1, y1: expression),
        (x2: expression, y2: expression)=(None, None)):
    ...

Return Values

The examples thus far have omitted examples of how to annotate the type of a function's return value. This is done like so:

def sum() -> expression:
    ...

That is, the parameter list can now be followed by a literal -> and a Python expression. Like the annotations for parameters, this expression will be evaluated when the function definition is executed.

The grammar for function definitions [11] is now:

decorator: '@' dotted_name [ '(' [arglist] ')' ] NEWLINE
decorators: decorator+
funcdef: [decorators] 'def' NAME parameters ['->' test] ':' suite
parameters: '(' [typedargslist] ')'
typedargslist: ((tfpdef ['=' test] ',')*
                ('*' [tname] (',' tname ['=' test])* [',' '**' tname]
                 | '**' tname)
                | tfpdef ['=' test] (',' tfpdef ['=' test])* [','])
tname: NAME [':' test]
tfpdef: tname | '(' tfplist ')'
tfplist: tfpdef (',' tfpdef)* [',']

Lambda

lambda's syntax does not support annotations. The syntax of lambda could be changed to support annotations, by requiring parentheses around the parameter list. However, it was decided [12] not to make this change because:

  1. It would be an incompatible change.
  2. Lambdas are neutered anyway.
  3. The lambda can always be changed to a function.

Accessing Function Annotations

Once compiled, a function's annotations are available via the function's func_annotations attribute. This attribute is a mutable dictionary, mapping parameter names to an object representing the evaluated annotation expression.

There is a special key in the func_annotations mapping, "return". This key is present only if an annotation was supplied for the function's return value.

For example, the following annotation:

def foo(a: 'x', b: 5 + 6, c: list) -> max(2, 9):
    ...

would result in a func_annotations mapping of

{'a': 'x',
 'b': 11,
 'c': list,
 'return': 9}

The return key was chosen because it cannot conflict with the name of a parameter; any attempt to use return as a parameter name would result in a SyntaxError.

func_annotations is an empty, mutable dictionary if there are no annotations on the function or if the function was created from a lambda expression.
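
In Python 3 as ultimately released, this attribute was spelled __annotations__ rather than func_annotations, but the semantics described here apply. A sketch using the released spelling:

```python
def foo(a: 'x', b: 5 + 6, c: list) -> max(2, 9):
    ...

# In released Python 3 the attribute is __annotations__, not
# func_annotations; its contents match the PEP's description.
assert foo.__annotations__ == {'a': 'x', 'b': 11, 'c': list, 'return': 9}

# Unannotated functions and lambdas get an empty, mutable mapping.
def bar(x):
    ...

assert bar.__annotations__ == {}
assert (lambda x: x).__annotations__ == {}
```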

Use Cases

In the course of discussing annotations, a number of use-cases have been raised. Some of these are presented here, grouped by what kind of information they convey. Also included are examples of existing products and packages that could make use of annotations.

  • Providing typing information
    • Type checking ([3], [4])
    • Let IDEs show what types a function expects and returns ([17])
    • Function overloading / generic functions ([22])
    • Foreign-language bridges ([18], [19])
    • Adaptation ([21], [20])
    • Predicate logic functions
    • Database query mapping
    • RPC parameter marshaling ([23])
  • Other information
    • Documentation for parameters and return values ([24])

Standard Library

pydoc and inspect

The pydoc module should display the function annotations when displaying help for a function. The inspect module should change to support annotations.

Relation to Other PEPs

Function Signature Objects [13]

Function Signature Objects should expose the function's annotations. The Parameter object may change or other changes may be warranted.

Implementation

A reference implementation has been checked into the p3yk branch as revision 53170 [10].

Rejected Proposals

  • The BDFL rejected the author's idea for a special syntax for adding annotations to generators as being "too ugly" [2].
  • Though discussed early on ([5], [6]), including special objects in the stdlib for annotating generator functions and higher-order functions was ultimately rejected as being more appropriate for third-party libraries; including them in the standard library raised too many thorny issues.
  • Despite considerable discussion about a standard type parameterisation syntax, it was decided that this should also be left to third-party libraries. ([7], [8], [9]).
  • Despite yet more discussion, it was decided not to standardize a mechanism for annotation interoperability. Standardizing interoperability conventions at this point would be premature. We would rather let these conventions develop organically, based on real-world usage and necessity, than try to force all users into some contrived scheme. ([14], [15], [16]).

References and Footnotes

[1]Unless specifically stated, "function" is generally used as a synonym for "callable" throughout this document.
[2]http://mail.python.org/pipermail/python-3000/2006-May/002103.html
[3]http://oakwinter.com/code/typecheck/
[4]http://maxrepo.info/taxonomy/term/3,6/all
[5]http://mail.python.org/pipermail/python-3000/2006-May/002091.html
[6]http://mail.python.org/pipermail/python-3000/2006-May/001972.html
[7]http://mail.python.org/pipermail/python-3000/2006-May/002105.html
[8]http://mail.python.org/pipermail/python-3000/2006-May/002209.html
[9]http://mail.python.org/pipermail/python-3000/2006-June/002438.html
[10]http://svn.python.org/view?rev=53170&view=rev
[11]http://docs.python.org/reference/compound_stmts.html#function-definitions
[12]http://mail.python.org/pipermail/python-3000/2006-May/001613.html
[13]http://www.python.org/dev/peps/pep-0362/
[14]http://mail.python.org/pipermail/python-3000/2006-August/002895.html
[15]http://mail.python.org/pipermail/python-ideas/2007-January/000032.html
[16]http://mail.python.org/pipermail/python-list/2006-December/420645.html
[17]http://www.python.org/idle/doc/idle2.html#Tips
[18]http://www.jython.org/Project/index.html
[19]http://www.codeplex.com/Wiki/View.aspx?ProjectName=IronPython
[20]http://peak.telecommunity.com/PyProtocols.html
[21]http://www.artima.com/weblogs/viewpost.jsp?thread=155123
[22]http://www-128.ibm.com/developerworks/library/l-cppeak2/
[23]http://rpyc.wikispaces.com/
[24]http://docs.python.org/library/pydoc.html

pep-3108 Standard Library Reorganization

PEP:3108
Title:Standard Library Reorganization
Version:$Revision$
Last-Modified:$Date$
Author:Brett Cannon <brett at python.org>
Status:Final
Type:Standards Track
Content-Type:text/x-rst
Created:01-Jan-2007
Python-Version:3.0
Post-History:28-Apr-2008

Note

The merging of profile/cProfile did not occur as of Python 3.3, and is thus considered abandoned (although it would be acceptable to do in the future).

Abstract

Just like the language itself, Python's standard library (stdlib) has grown over the years to be very rich. But over time some modules have lost their need to be included with Python. A naming convention for modules has also been introduced since Python's inception, and not all modules follow it.

Python 3.0 presents a chance to remove modules that do not have long term usefulness. This chance also allows for the renaming of modules so that they follow the Python style guide [8]. This PEP lists modules that should not be included in Python 3.0 or which need to be renamed.

Modules to Remove

Guido pronounced that "silly old stuff" is to be deleted from the stdlib for Py3K [12]. This is open-ended on purpose. Each module to be removed needs to have a justification as to why it should no longer be distributed with Python. This can range from the module being deprecated in Python 2.x to being for a platform that is no longer widely used.

This section of the PEP lists the various modules to be removed. Each subsection represents a different reason for modules to be removed. Each module must have a specific justification on top of being listed in a specific subsection so as to make sure only modules that truly deserve to be removed are in fact removed.

When a reason mentions how long it has been since a module has been "uniquely edited", it is in reference to how long it has been since a checkin was done specifically for the module and not for a change that applied universally across the entire stdlib. If an edit time is not denoted as "unique" then it is the last time the file was edited, period.

Previously deprecated [done]

PEP 4 lists all modules that have been deprecated in the stdlib [7]. The specified motivations mirror those listed in PEP 4. All modules listed in the PEP at the time of the first alpha release of Python 3.0 will be removed.

The entire contents of lib-old will also be removed. These modules are no longer importable by default but are kept in the Python distribution for users who rely upon the code.

  • cfmfile

    • Documented as deprecated since Python 2.4 without an explicit reason.
  • cl

    • Documented as obsolete since Python 2.0 or earlier.
    • Interface to SGI hardware.
  • md5

    • Supplanted by the hashlib module.
  • mimetools

    • Documented as obsolete in a previous version.
    • Supplanted by the email package.
  • MimeWriter

    • Supplanted by the email package.
  • mimify

    • Supplanted by the email package.
  • multifile

    • Supplanted by the email package.
  • posixfile

    • Locking is better done by fcntl.lockf().
  • rfc822

    • Supplanted by the email package.
  • sha

    • Supplanted by the hashlib module.
  • sv

    • Documented as obsolete since Python 2.0 or earlier.
    • Interface to obsolete SGI Indigo hardware.
  • timing

    • Documented as obsolete since Python 2.0 or earlier.
    • time.clock() gives better time resolution.

Platform-specific with minimal use [done]

Python supports many platforms, some of which are not widely used or maintained. And on some of these platforms there are modules that have limited use to people on those platforms. Because of their limited usefulness it would be better to no longer burden the Python development team with their maintenance.

The modules mentioned below are documented. All undocumented modules for the specified platforms will also be removed.

IRIX

The IRIX operating system is no longer produced [19]. Removing all modules from the plat-irix5 and plat-irix6 directories has been deemed reasonable because of this fact.

  • AL/al
    • Provides sound support on Indy and Indigo workstations.
    • Both workstations are no longer available.
    • Code has not been uniquely edited in three years.
  • cd/CD
    • CD drive control for SGI systems.
    • SGI no longer sells machines with IRIX on them.
    • Code has not been uniquely edited in 14 years.
  • cddb
    • Undocumented.
  • cdplayer
    • Undocumented.
  • cl/CL/CL_old
    • Compression library for SGI systems.
    • SGI no longer sells machines with IRIX on them.
    • Code has not been uniquely edited in 14 years.
  • DEVICE/GL/gl/cgen/cgensuport
    • GL access, which is the predecessor to OpenGL.
    • Has not been edited in at least eight years.
    • Third-party libraries provide better support (PyOpenGL [16]).
  • ERRNO
    • Undocumented.
  • FILE
    • Undocumented.
  • FL/fl/flp
    • Wrapper for the FORMS library [20]
    • FORMS has not been edited in 12 years.
    • Library is not widely used.
    • First eight hits on Google are for Python docs for fl.
  • fm
    • Wrapper to the IRIS Font Manager library.
    • Only available on SGI machines which no longer come with IRIX.
  • GET
    • Undocumented.
  • GLWS
    • Undocumented.
  • imgfile
    • Wrapper for SGI libimage library for imglib image files (.rgb files).
    • Python Imaging Library provides read-only support [17].
    • Not uniquely edited in 13 years.
  • IN
    • Undocumented.
  • IOCTL
    • Undocumented.
  • jpeg
    • Wrapper for JPEG (de)compressor.
    • Code not uniquely edited in nine years.
    • Third-party libraries provide better support (Python Imaging Library [17]).
  • panel
    • Undocumented.
  • panelparser
    • Undocumented.
  • readcd
    • Undocumented.
  • SV
    • Undocumented.
  • torgb
    • Undocumented.
  • WAIT
    • Undocumented.

Mac-specific modules

The Mac-specific modules are not well-maintained (e.g., the bgen tool used to auto-generate many of the modules has never been updated to support UCS-4). It is also not Python's place to maintain such a large number of OS-specific modules. Thus all modules under Lib/plat-mac and Mac are to be removed.

A stub module for proxy access will be provided for use by urllib.

  • _builtinSuites

    • Undocumented.
    • Package under lib-scriptpackages.
  • Audio_mac

    • Undocumented.
  • aepack

    • OSA support is better through third-party modules.

    • Hard-coded endianness which breaks on Intel Macs.

    • Might need to rename if Carbon package dependent.

  • aetools

    • See aepack.
  • aetypes

    • See aepack.
  • applesingle

    • Undocumented.
    • AppleSingle is a binary file format for A/UX.
    • A/UX no longer distributed.
  • appletrawmain

    • Undocumented.
  • appletrunner

    • Undocumented.
  • argvemulator

    • Undocumented.
  • autoGIL

    • Very bad model for using Python with the CFRunLoop.
  • bgenlocations

    • Undocumented.
  • buildtools

    • Documented as deprecated since Python 2.3 without an explicit reason.
  • bundlebuilder

    • Undocumented.
  • Carbon

    • Carbon development has stopped.
    • Does not support 64-bit systems completely.
    • Dependent on bgen which has never been updated to support UCS-4 Unicode builds of Python.
  • CodeWarrior

    • Undocumented.
    • Package under lib-scriptpackages.
  • ColorPicker

    • Better to use Cocoa for GUIs.
  • EasyDialogs

    • Better to use Cocoa for GUIs.
  • Explorer

    • Undocumented.
    • Package under lib-scriptpackages.
  • Finder

    • Undocumented.
    • Package under lib-scriptpackages.
  • findertools

    • No longer useful.
  • FrameWork

    • Poorly documented.
    • Not updated to support Carbon Events.
  • gensuitemodule

    • See aepack.
  • ic

  • icglue

  • icopen

    • Not needed on OS X.
    • Meant to replace 'open' which is usually a bad thing to do.
  • macerrors

    • Undocumented.
  • MacOS

    • Would also mean the removal of binhex.
  • macostools

  • macresource

    • Undocumented.
  • MiniAEFrame

    • See aepack.
  • Nav

    • Undocumented.
  • Netscape

    • Undocumented.
    • Package under lib-scriptpackages.
  • OSATerminology

  • pimp

    • Undocumented.
  • PixMapWrapper

    • Undocumented.
  • StdSuites

    • Undocumented.
    • Package under lib-scriptpackages.
  • SystemEvents

    • Undocumented.
    • Package under lib-scriptpackages.
  • Terminal

    • Undocumented.
    • Package under lib-scriptpackages.
  • terminalcommand

    • Undocumented.
  • videoreader

    • No longer used.
  • W

    • No longer distributed with Python.

Solaris

  • SUNAUDIODEV/sunaudiodev
    • Access to the sound card on Sun machines.
    • Code not uniquely edited in over eight years.

Hardly used [done]

Some platform-independent modules are rarely used. There are a number of possible explanations for this, including ease of reimplementation, a very small audience, or lack of adherence to more modern standards.

  • audiodev
    • Undocumented.
    • Not edited in five years.
  • imputil
    • Undocumented.
    • Never updated to support absolute imports.
  • mutex
    • Easy to implement using a semaphore and a queue.
    • Cannot block on a lock attempt.
    • Not uniquely edited since its addition 15 years ago.
    • Only useful with the 'sched' module.
    • Not thread-safe.
  • stringold
    • Function versions of the methods on string objects.
    • Obsolete since Python 1.6.
    • Any functionality not in the string object or module will be moved to the string module (mostly constants).
  • sunaudio
    • Undocumented.
    • Not edited in over seven years.
    • The sunau module provides similar abilities.
  • toaiff
    • Undocumented.
    • Requires sox library to be installed on the system.
  • user
    • Easily handled by allowing the application to specify its own module name, check for existence, and import it if found.
  • new
    • Just a rebinding of names from the 'types' module.
    • Can also call type built-in to get most types easily.
    • Docstring states the module is no longer useful as of revision 27241 (2002-06-15).
  • pure
    • Written before Pure Atria was bought by Rational which was then bought by IBM (in other words, very old).
  • test.testall
    • From the days before regrtest.

Obsolete

Becoming obsolete signifies that either another module in the stdlib or a widely distributed third-party library provides a better solution for what the module is meant for.

  • Bastion/rexec [done]

    • Restricted execution / security.
    • Turned off in Python 2.3.
    • Modules deemed unsafe.
  • bsddb185 [done]

    • Superseded by bsddb3.
    • Not built by default.
    • Documentation specifies that the "module should never be used directly in new code".
    • Available externally from PyPI [27].
  • Canvas [done]

  • commands [done]

    • subprocess module replaces it [9].
    • Remove getstatus(), move rest to subprocess.
  • compiler [done]

    • Having to maintain both the built-in compiler and the stdlib package is redundant [24].
    • The AST created by the compiler is available [23].
    • Mechanism to compile from an AST needs to be added.
  • dircache [done]

    • Negligible use.
    • Easily replicated.
  • dl [done]

    • ctypes provides better support for same functionality.
  • fpformat [done]

    • All functionality is supported by string interpolation.
  • htmllib [done]

    • Superseded by HTMLParser.
  • ihooks [done]

    • Undocumented.
    • For use with rexec which has been turned off since Python 2.3.
  • imageop [done]

    • Better support by third-party libraries (Python Imaging Library [17]).

    • Unit tests relied on rgbimg and imgfile.
      • rgbimg was removed in Python 2.6.
      • imgfile slated for removal in this PEP.
  • linuxaudiodev [done]

    • Replaced by ossaudiodev.
  • mhlib [done]

    • Should be removed as an individual module; use mailbox instead.
  • popen2 [done]

    • subprocess module replaces it [9].
  • sgmllib [done]

    • Does not fully parse SGML.
    • In the stdlib for support to htmllib which is slated for removal.
  • sre [done]

    • Previously deprecated; import re instead.
  • stat [TODO need to move all uses over to os.stat()]

    • os.stat() now returns a tuple with attributes.
    • Functions in the module should be made into methods for the object returned by os.stat.
  • statvfs [done]

    • os.statvfs now returns a tuple with attributes.
  • thread [done]

    • People should use 'threading' instead.
      • Rename 'thread' to _thread.
      • Deprecate dummy_thread and rename _dummy_thread.
      • Move thread.get_ident over to threading.
    • Guido has previously supported the deprecation [13].
  • urllib [done]

    • Superseded by urllib2.
    • Functionality unique to urllib will be kept in the urllib package.
  • UserDict [done: 3.0] [TODO handle 2.6]

    • Not as useful since types can be a superclass.
    • Useful bits moved to the 'collections' module.
  • UserList/UserString [done]

    • Not useful since types can be a superclass.
    • Moved to the 'collections' module.

Maintenance Burden

Over the years, certain modules have become a heavy burden upon python-dev to maintain. In situations like this, it is better for the module to be given to the community to maintain, freeing python-dev to focus more on language support and other modules in the standard library that do not take up an undue amount of time and effort.

  • bsddb3
    • Externally maintained at http://www.jcea.es/programacion/pybsddb.htm .
    • Consistent testing instability.
    • Berkeley DB follows a different release schedule than Python, leading to the bindings not necessarily being in sync with what is available.

Modules to Rename

Many modules existed in the stdlib before PEP 8 came into existence [8]. This has led to some naming inconsistencies and namespace bloat that should be addressed.

PEP 8 violations [done]

PEP 8 specifies that modules "should have short, all-lowercase names" where "underscores can be used ... if it improves readability" [8]. The use of underscores is discouraged in package names. The following modules violate PEP 8 and are not otherwise being renamed as part of a move into a package.

Current Name Replacement Name
_winreg winreg
ConfigParser configparser
copy_reg copyreg
Queue queue
SocketServer socketserver
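
During the transition, code that must run on both 2.x and 3.0 can try the new PEP 8 name first and fall back to the old spelling; a sketch using the Queue rename:

```python
# Try the new PEP 8 name first; fall back to the Python 2 spelling.
# On a 3.x interpreter the first import always succeeds.
try:
    import queue
except ImportError:
    import Queue as queue  # the Python 2 name

q = queue.Queue()
q.put("item")
assert q.get() == "item"
```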

Merging C and Python implementations of the same interface

Several interfaces have both a Python and C implementation. While it is great to have a C implementation for speed with a Python implementation as fallback, there is no need to expose the two implementations independently in the stdlib. For Python 3.0 all interfaces with two implementations will be merged into a single public interface.

The C module is to be given a leading underscore to delineate the fact that it is not the reference implementation (the Python implementation is). This means that any semantic difference between the C and Python versions must be dealt with before Python 3.0 or else the C implementation will be removed until it can be fixed.
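
The resulting layout can be sketched with the usual accelerator idiom: the Python module defines the reference implementation and then shadows it with the C version when that extension is importable. The extension name _hypothetical_speedups below is, as the name says, hypothetical:

```python
# Pure-Python module: the public, reference implementation.
def dumps(obj):
    # Placeholder reference implementation.
    return repr(obj)

try:
    # If the C accelerator was built, its names shadow the Python ones.
    from _hypothetical_speedups import dumps
except ImportError:
    # No accelerator on this platform; the Python version stands.
    pass
```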

One interface that is not listed below is xml.etree.ElementTree. This is an externally maintained module and thus is not under the direct control of the Python development team for renaming. See Open Issues for a discussion on this.

  • pickle/cPickle [done]
    • Rename cPickle to _pickle.
    • Semantic completeness of C implementation not verified.
  • profile/cProfile [TODO]
    • Rename cProfile to _profile.
    • Semantic completeness of C implementation not verified.
  • StringIO/cStringIO [done]
    • Add the class to the 'io' module.

No public, documented interface [done]

There are several modules in the stdlib that have no defined public interface. These modules exist as support code for other modules that are exposed. Because they are not meant to be used directly they should be renamed to reflect this fact.

Current Name Replacement Name
markupbase _markupbase

Poorly chosen names [done]

A few modules have names that were poorly chosen in hindsight. They should be renamed so as to prevent their bad name from perpetuating beyond the 2.x series.

Current Name Replacement Name
repr reprlib
test.test_support test.support

Grouping of modules [done]

As the stdlib has grown, several areas within it have expanded to include multiple modules (e.g., support for database files). It thus makes sense to group related modules into packages.

dbm package

Current Name Replacement Name
anydbm dbm.__init__ [1]
dbhash dbm.bsd
dbm dbm.ndbm
dumbdbm dbm.dumb
gdbm dbm.gnu
whichdb dbm.__init__ [1]
[1](1, 2) dbm.__init__ can combine anydbm and whichdb since the public API for both modules has no name conflict and the two modules have closely related usage.

html package

Current Name Replacement Name
HTMLParser html.parser
htmlentitydefs html.entities

http package

Current Name Replacement Name
httplib http.client
BaseHTTPServer http.server [2]
CGIHTTPServer http.server [2]
SimpleHTTPServer http.server [2]
Cookie http.cookies
cookielib http.cookiejar
[2](1, 2, 3) The http.server module can combine the specified modules safely as they have no naming conflicts.

tkinter package

Current Name Replacement Name
Dialog tkinter.dialog
FileDialog tkinter.filedialog [4]
FixTk tkinter._fix
ScrolledText tkinter.scrolledtext
SimpleDialog tkinter.simpledialog [5]
Tix tkinter.tix
Tkconstants tkinter.constants
Tkdnd tkinter.dnd
Tkinter tkinter.__init__
tkColorChooser tkinter.colorchooser
tkCommonDialog tkinter.commondialog
tkFileDialog tkinter.filedialog [4]
tkFont tkinter.font
tkMessageBox tkinter.messagebox
tkSimpleDialog tkinter.simpledialog [5]
turtle tkinter.turtle
[4](1, 2) tkinter.filedialog can safely combine FileDialog and tkFileDialog as there are no naming conflicts.
[5](1, 2) tkinter.simpledialog can safely combine SimpleDialog and tkSimpleDialog as they have no naming conflicts.

urllib package

Originally this new package was to be named url, but because of the common use of the name as a variable, it has been deemed better to keep the name urllib and instead shift existing modules around into a new package.

Current Name Replacement Name
urllib2 urllib.request, urllib.error
urlparse urllib.parse
urllib urllib.parse, urllib.request, urllib.error [6]
robotparser urllib.robotparser
[6]The quoting-related functions from urllib will be added to urllib.parse. urllib.URLopener and urllib.FancyURLopener will be added to urllib.request as long as the documentation for both modules is updated.

xmlrpc package

Current Name Replacement Name
xmlrpclib xmlrpc.client
DocXMLRPCServer xmlrpc.server [3]
SimpleXMLRPCServer xmlrpc.server [3]
[3](1, 2) The modules being combined into xmlrpc.server have no naming conflicts and thus can safely be merged.

Transition Plan

Issues

Issues related to this PEP:

For modules to be removed

For module removals, it is easiest to remove the module first in Python 3.0 to see where dependencies exist. This makes finding code that (possibly) requires the suppression of the DeprecationWarning easier.

In Python 3.0

  1. Remove the module.
  2. Remove related tests.
  3. Remove all documentation (typically the module's documentation file and its entry in a file for the Library Reference).
  4. Edit Modules/Setup.dist and setup.py if needed.
  5. Run the regression test suite (using -uall); watch out for tests that are skipped because an import failed for the removed module.
  6. Check in the change (with an appropriate Misc/NEWS entry).
  7. Update this PEP noting that the 3.0 step is done.

In Python 2.6

  1. If the deprecated module is implemented in Python, add the following code as the first piece of executed code (adjusting the module name and the warnings import as needed):

    from warnings import warnpy3k
    warnpy3k("the XXX module has been removed in Python 3.0",
             stacklevel=2)
    del warnpy3k
    

    or the following if it is an extension module:

    if (PyErr_WarnPy3k("the XXX module has been removed in "
                          "Python 3.0", 2) < 0)
           return;
    

    (the Python-Dev TextMate bundle, available from Misc/TextMate, contains a command that will generate all of this for you).

  2. Update the documentation. For modules with their own documentation file, use the :deprecated: option with the module directive along with the deprecated directive, stating that the deprecation occurs in 2.6 in preparation for the module's removal in 3.0:

    .. deprecated:: 2.6
       The :mod:`XXX` module has been removed in Python 3.0.
    

    For modules simply listed in a file (e.g., undoc.rst), use the warning directive.

  3. Add the module to the module deletion test in test_py3kwarn.

  4. Suppress the warning in the module's test code using

    test.test_support.import_module(name, deprecated=True).

  5. Check in the change with an appropriate Misc/NEWS entry (block this checkin in py3k!).

  6. Update this PEP noting that the 2.6 step is done.

Renaming of modules

Support in the 2to3 refactoring tool for renames will be used to help people transition to new module names [15]. Import statements will be rewritten so that only the import statement, and none of the rest of the code, needs to be touched. This is accomplished by using the as keyword in the import statement to bind the module to its old name in the namespace while importing it under its new name (when as is not already in use; otherwise the bound name is left alone and only the name of the imported module is changed). The fix_imports fixer is an example of how to approach this.
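
For instance, a Python 2 source that imports ConfigParser could be rewritten along these lines (a sketch of the approach described above: only the import line changes, and the old name stays bound for the rest of the module):

```python
# Python 2 source before the fixer runs:
#
#     import ConfigParser
#     parser = ConfigParser.ConfigParser()
#
# After the rewrite, the new module is imported under its new name but
# bound back to the old one, so the rest of the code is untouched:
import configparser as ConfigParser

parser = ConfigParser.ConfigParser()
assert parser.sections() == []
```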

Python 3.0

  1. Update 2to3 in the sandbox to support the rename.
  2. Use svn move to rename the module.
  3. Update all import statements in the stdlib to use the new name (use 2to3's fix_imports fixer for the easiest solution).
  4. Rename the module in its own documentation.
  5. Update all references in the documentation from the old name to the new name.
  6. Run regrtest.py -uall to verify the rename worked.
  7. Add an entry in Misc/NEWS.
  8. Commit the changes.

Python 2.6

  1. In the module's documentation, add a note mentioning that the module is renamed in Python 3.0:

    .. note::
       The :mod:`OLDNAME` module has been renamed to :mod:`NEWNAME` in
       Python 3.0.
    
  2. Commit the documentation change.

  3. Block the revision in py3k.

Open Issues

Renaming of modules maintained outside of the stdlib

xml.etree.ElementTree not only does not meet PEP 8 naming standards but it also has an exposed C implementation [8]. It is an externally maintained package, though [10]. A request will be made for the maintainer to change the name so that it matches PEP 8 and hides the C implementation.

Rejected Ideas

Modules that were originally suggested for removal

  • asynchat/asyncore

    • Josiah Carlson has said he will maintain the modules.
  • audioop/sunau/aifc

    • Audio modules where the formats are still used.
  • base64/quopri/uu

    • All still widely used.
    • 'codecs' module does not provide as nice of an API for basic usage.
  • fileinput

    • Useful when having to work with stdin.
  • linecache

    • Used internally in several places.
  • nis

    • Testimonials from users indicate that new installations of NIS are still occurring.
  • getopt

    • Simpler than optparse.
  • repr

    • Useful as a basis for overriding.
    • Used internally.
  • sched

    • Useful for simulations.
  • symtable/_symtable

    • Docs were written.
  • telnetlib

    • Really handy for quick-and-dirty remote access.
    • Some hardware supports using telnet for configuration and querying.
  • Tkinter

    • Would prevent IDLE from existing.
    • No GUI toolkit would be available out of the box.

Introducing a new top-level package

It has been suggested that the entire stdlib be placed within its own package. This PEP will not address this issue as it has its own design issues (naming, does it deserve special consideration in import semantics, etc.). Everything within this PEP can easily be handled if a new top-level package is introduced.

References

[7]PEP 4: Deprecation of Standard Modules (http://www.python.org/dev/peps/pep-0004/)
[8](1, 2, 3, 4) PEP 8: Style Guide for Python Code (http://www.python.org/dev/peps/pep-0008/)
[9](1, 2) PEP 324: subprocess -- New process module (http://www.python.org/dev/peps/pep-0324/)
[10]PEP 360: Externally Maintained Packages (http://www.python.org/dev/peps/pep-0360/)
[11]Python Documentation: Global Module Index (http://docs.python.org/modindex.html)
[12]Python-Dev email: "Py3k release schedule worries" (http://mail.python.org/pipermail/python-3000/2006-December/005130.html)
[13]Python-Dev email: Autoloading? (http://mail.python.org/pipermail/python-dev/2005-October/057244.html)
[14]Python-Dev Summary: 2004-11-01 (http://www.python.org/dev/summary/2004-11-01_2004-11-15/#id10)
[15]2to3 refactoring tool (http://svn.python.org/view/sandbox/trunk/2to3/)
[16]PyOpenGL (http://pyopengl.sourceforge.net/)
[17](1, 2, 3) Python Imaging Library (PIL) (http://www.pythonware.com/products/pil/)
[18]Twisted (http://twistedmatrix.com/trac/)
[19]SGI Press Release: End of General Availability for MIPS IRIX Products -- December 2006 (http://www.sgi.com/support/mips_irix.html)
[20]FORMS Library by Mark Overmars (ftp://ftp.cs.ruu.nl/pub/SGI/FORMS)
[21]Wikipedia: Au file format (http://en.wikipedia.org/wiki/Au_file_format)
[22]appscript (http://appscript.sourceforge.net/)
[23]_ast module (http://docs.python.org/library/ast.html)
[24]python-dev email: getting compiler package failures (http://mail.python.org/pipermail/python-3000/2007-May/007615.html)
[25]http://bugs.python.org/issue2775
[26]http://bugs.python.org/issue2828
[27]http://pypi.python.org/

pep-3109 Raising Exceptions in Python 3000

PEP:3109
Title:Raising Exceptions in Python 3000
Version:$Revision$
Last-Modified:$Date$
Author:Collin Winter <collinwinter at google.com>
Status:Final
Type:Standards Track
Content-Type:text/x-rst
Created:19-Jan-2006
Python-Version:3.0
Post-History:

Abstract

This PEP introduces changes to Python's mechanisms for raising exceptions intended to reduce both line noise and the size of the language.

Rationale

One of Python's guiding maxims is "there should be one -- and preferably only one -- obvious way to do it" [1]. Python 2.x's raise statement violates this principle, permitting multiple ways of expressing the same thought. For example, these statements are equivalent:

raise E, V

raise E(V)

There is a third form of the raise statement, allowing arbitrary tracebacks to be attached to an exception [2]:

raise E, V, T

where T is a traceback. As specified in PEP 344 [4], exception objects in Python 3.x will possess a __traceback__ attribute, admitting this translation of the three-expression raise statement:

raise E, V, T

is translated to

e = E(V)
e.__traceback__ = T
raise e

Using these translations, we can reduce the raise statement from four forms to two:

  1. raise (with no arguments) is used to re-raise the active exception in an except suite.

  2. raise EXCEPTION is used to raise a new exception. This form has two sub-variants: EXCEPTION may be an exception class or an instance of an exception class; valid exception classes are BaseException and its subclasses [5]. If EXCEPTION is a subclass, it will be called with no arguments to obtain an exception instance.

    To raise anything else is an error.
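The two surviving forms behave as sketched below in modern Python 3, where this consolidation landed (the helper names are ours, not part of the PEP):

```python
def args_of(raiser):
    """Call raiser() and return the args of the exception it raises."""
    try:
        raiser()
    except BaseException as exc:
        return exc.args

# raise with a class: the class is called with no arguments
def raise_class():
    raise ValueError

# raise with an instance
def raise_instance():
    raise ValueError("bad")

# a bare raise re-raises the active exception unchanged
def reraise():
    try:
        raise KeyError("missing")
    except KeyError:
        raise

assert args_of(raise_class) == ()
assert args_of(raise_instance) == ("bad",)
assert args_of(reraise) == ("missing",)
```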

There is a further, more tangible benefit to be obtained through this consolidation, as noted by A.M. Kuchling [6].

PEP 8 doesn't express any preference between the
two forms of raise statements:
raise ValueError, 'blah'
raise ValueError("blah")

I like the second form better, because if the exception arguments
are long or include string formatting, you don't need to use line
continuation characters because of the containing parens.

The BDFL has concurred [7] and endorsed the consolidation of the several raise forms.

Grammar Changes

In Python 3, the grammar for raise statements will change from [2]

raise_stmt: 'raise' [test [',' test [',' test]]]

to

raise_stmt: 'raise' [test]

Changes to Builtin Types

Because of its relation to exception raising, the signature for the throw() method on generator objects will change, dropping the optional second and third parameters. The signature thus changes from [3]

generator.throw(E, [V, [T]])

to

generator.throw(EXCEPTION)

Where EXCEPTION is either a subclass of BaseException or an instance of a subclass of BaseException.
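A minimal sketch of the single-argument form as it works in Python 3 today (the generator is ours, for illustration only):

```python
def echo():
    try:
        yield "running"
    except ValueError as exc:
        yield "caught: %s" % exc

g = echo()
assert next(g) == "running"
# single-argument throw(): an instance (or a class, instantiated with no
# arguments) is raised at the point where the generator is paused
assert g.throw(ValueError("boom")) == "caught: boom"
```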

Semantic Changes

In Python 2, the following raise statement is legal

raise ((E1, (E2, E3)), E4), V

The interpreter will take the tuple's first element as the exception type (recursively), making the above fully equivalent to

raise E1, V

As of Python 3.0, support for raising tuples like this will be dropped. This change will bring raise statements into line with the throw() method on generator objects, which already disallows this.
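In Python 3 as released, raising a tuple fails at runtime rather than searching recursively for an exception class, which can be checked directly:

```python
def raise_tuple():
    # a parenthesized tuple still parses as a single expression,
    # but it is not a valid exception in Python 3
    raise (ValueError, "bad")

try:
    raise_tuple()
except TypeError:
    tuple_raise_rejected = True

assert tuple_raise_rejected
```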

Compatibility Issues

All two- and three-expression raise statements will require modification, as will all two- and three-expression throw() calls on generators. Fortunately, the translation from Python 2.x to Python 3.x in this case is simple and can be handled mechanically by Guido van Rossum's 2to3 utility [8] using the raise and throw fixers ([9], [10]).

The following translations will be performed:

  1. Zero- and one-expression raise statements will be left intact.

  2. Two-expression raise statements will be converted from

    raise E, V
    

    to

    raise E(V)
    

    Two-expression throw() calls will be converted from

    generator.throw(E, V)
    

    to

    generator.throw(E(V))
    

    See point #5 for a caveat to this transformation.

  3. Three-expression raise statements will be converted from

    raise E, V, T
    

    to

    e = E(V)
    e.__traceback__ = T
    raise e
    

    Three-expression throw() calls will be converted from

    generator.throw(E, V, T)
    

    to

    e = E(V)
    e.__traceback__ = T
    generator.throw(e)
    

    See point #5 for a caveat to this transformation.

  4. Two- and three-expression raise statements where E is a tuple literal can be converted automatically using 2to3's raise fixer. raise statements where E is a non-literal tuple, e.g., the result of a function call, will need to be converted manually.

  5. Two- and three-expression raise statements where E is an exception class and V is an exception instance will need special attention. These cases break down into two camps:

    1. raise E, V as a long-hand version of the zero-argument raise statement. As an example, assuming F is a subclass of E

      try:
          something()
      except F as V:
          raise F(V)
      except E as V:
          handle(V)
      

      This would be better expressed as

      try:
          something()
      except F:
          raise
      except E as V:
          handle(V)
      
    2. raise E, V as a way of "casting" an exception to another class. Taking an example from distutils.compiler.unixcompiler

      try:
          self.spawn(pp_args)
      except DistutilsExecError as msg:
          raise CompileError(msg)
      

      This would be better expressed as

      try:
          self.spawn(pp_args)
      except DistutilsExecError as msg:
          raise CompileError from msg
      

      Using the raise ... from ... syntax introduced in PEP 344.
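A runnable sketch of this "casting" pattern, substituting standard exceptions for distutils' DistutilsExecError and CompileError; the original exception ends up chained on __cause__:

```python
def compile_step():
    try:
        raise OSError("spawn failed")          # stand-in for DistutilsExecError
    except OSError as msg:
        # stand-in for CompileError; "from" records msg as the cause
        raise RuntimeError("compile error") from msg

try:
    compile_step()
except RuntimeError as exc:
    cause_text = str(exc.__cause__)

assert cause_text == "spawn failed"
```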

Implementation

This PEP was implemented in revision 57783 [11].

pep-3110 Catching Exceptions in Python 3000

PEP:3110
Title:Catching Exceptions in Python 3000
Version:$Revision$
Last-Modified:$Date$
Author:Collin Winter <collinwinter at google.com>
Status:Final
Type:Standards Track
Content-Type:text/x-rst
Created:16-Jan-2006
Python-Version:3.0
Post-History:

Abstract

This PEP introduces changes intended to help eliminate ambiguities in Python's grammar, simplify exception classes, simplify garbage collection for exceptions and reduce the size of the language in Python 3.0.

Rationale

  1. except clauses in Python 2.x present a syntactic ambiguity where the parser cannot differentiate whether

    except <expression>, <expression>:
    

    should be interpreted as

    except <type>, <type>:
    

    or

    except <type>, <name>:
    

    Python 2 opts for the latter semantic, at the cost of requiring the former to be parenthesized, like so

    except (<type>, <type>):
    
  2. As specified in PEP 352 [1], the ability to treat exceptions as tuples will be removed, meaning this code will no longer work

    except os.error, (errno, errstr):
    

    Because the automatic unpacking will no longer be possible, it is desirable to remove the ability to use tuples as except targets.

  3. As specified in PEP 344 [5], exception instances in Python 3 will possess a __traceback__ attribute. The Open Issues section of that PEP includes a paragraph on garbage collection difficulties caused by this attribute, namely an "exception -> traceback -> stack frame -> exception" reference cycle, whereby all locals are kept in scope until the next GC run. This PEP intends to resolve this issue by adding a cleanup semantic to except clauses in Python 3 whereby the target name is deleted at the end of the except suite.

  4. In the spirit of "there should be one -- and preferably only one -- obvious way to do it" [2], it is desirable to consolidate duplicate functionality. To this end, the exc_value, exc_type and exc_traceback attributes of the sys module [3] will be removed in favor of sys.exc_info(), which provides the same information. These attributes are already listed in PEP 3100 [4] as targeted for removal.

Grammar Changes

In Python 3, the grammar for except statements will change from [8]

except_clause: 'except' [test [',' test]]

to

except_clause: 'except' [test ['as' NAME]]

The use of as in place of the comma token means that

except (AttributeError, os.error):

can be clearly understood as a tuple of exception classes. This new syntax was first proposed by Greg Ewing [6] and endorsed ([6], [7]) by the BDFL.

Further, the restriction of the token following as from test to NAME means that only valid identifiers can be used as except targets.

Note that the grammar above always requires parenthesized tuples as exception classes. That way, the ambiguous

except A, B:

which would mean different things in Python 2.x and 3.x -- leading to hard-to-catch bugs -- cannot legally occur in 3.x code.

Semantic Changes

In order to resolve the garbage collection issue related to PEP 344, except statements in Python 3 will generate additional bytecode to delete the target, thus eliminating the reference cycle. The source-to-source translation, as suggested by Phillip J. Eby [9], is

try:
    try_body
except E as N:
    except_body
...

gets translated to (in Python 2.5 terms)

try:
    try_body
except E, N:
    try:
        except_body
    finally:
        N = None
        del N
...

An implementation has already been checked into the p3yk branch [10].
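The cleanup semantic can be observed in any released Python 3: the except target is unbound at the end of the suite, so a later reference to it raises NameError (this helper function is ours, for illustration):

```python
def handler():
    try:
        raise ValueError("boom")
    except ValueError as exc:
        message = str(exc)
    # exc has been unbound at the end of the except suite
    try:
        exc
    except NameError:          # UnboundLocalError is a subclass of NameError
        return message, "deleted"
    return message, "still bound"

assert handler() == ("boom", "deleted")
```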

Compatibility Issues

Nearly all except clauses will need to be changed. except clauses with identifier targets will be converted from

except E, N:

to

except E as N:

except clauses with non-tuple, non-identifier targets (e.g., a.b.c[d]) will need to be converted from

except E, T:

to

except E as t:
    T = t

Both of these cases can be handled by Guido van Rossum's 2to3 utility [11] using the except fixer [12].

except clauses with tuple targets will need to be converted manually, on a case-by-case basis. These changes will usually need to be accompanied by changes to the exception classes themselves. While these changes generally cannot be automated, the 2to3 utility is able to point out cases where the target of an except clause is a tuple, simplifying conversion.

Situations where it is necessary to keep an exception instance around past the end of the except suite can be easily translated like so

try:
    ...
except E as N:
    ...
...

becomes

try:
    ...
except E as N:
    n = N
    ...
...

This way, when N is deleted at the end of the block, n will persist and can be used as normal.

Lastly, all uses of the sys module's exc_type, exc_value and exc_traceback attributes will need to be removed. They can be replaced with sys.exc_info()[0], sys.exc_info()[1] and sys.exc_info()[2] respectively, a transformation that can be performed by 2to3's sysexcattrs fixer.
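A short sketch of the replacement: sys.exc_info() returns the same (type, value, traceback) triple the removed attributes provided (the capture helper is ours):

```python
import sys

def capture():
    try:
        1 / 0
    except ZeroDivisionError:
        # sys.exc_info() replaces sys.exc_type, sys.exc_value, sys.exc_traceback
        exc_type, exc_value, exc_tb = sys.exc_info()
        return exc_type, exc_value, exc_tb

exc_type, exc_value, exc_tb = capture()
assert exc_type is ZeroDivisionError
assert isinstance(exc_value, ZeroDivisionError)
assert exc_tb is not None
```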

2.6 - 3.0 Compatibility

In order to facilitate forwards compatibility between Python 2.6 and 3.0, the except ... as ...: syntax will be backported to the 2.x series. The grammar will thus change from:

except_clause: 'except' [test [',' test]]

to:

except_clause: 'except' [test [('as' | ',') test]]

The end-of-suite cleanup semantic for except statements will not be included in the 2.x series of releases.

Open Issues

Replacing or Dropping "sys.exc_info()"

The idea of dropping sys.exc_info() or replacing it with a sys.exception attribute or a sys.get_exception() function has been raised several times on python-3000 ([13], [14]) and mentioned in PEP 344's "Open Issues" section.

While a 2to3 fixer to replace calls to sys.exc_info() and some attribute accesses would be trivial, it would be far more difficult for static analysis to find and fix functions that expect the values from sys.exc_info() as arguments. Similarly, this does not address the need to rewrite the documentation for all APIs that are defined in terms of sys.exc_info().

Implementation

This PEP was implemented in revisions 53342 [15] and 53349 [16]. Support for the new except syntax in 2.6 was implemented in revision 55446 [17].

pep-3111 Simple input built-in in Python 3000

PEP:3111
Title:Simple input built-in in Python 3000
Version:$Revision$
Last-Modified:$Date$
Author:Andre Roberge <andre.roberge at gmail.com >
Status:Final
Type:Standards Track
Content-Type:text/x-rst
Created:13-Sep-2006
Python-Version:3.0
Post-History:22-Dec-2006

Abstract

Input and output are core features of computer programs. Currently, Python provides a simple means of output through the print keyword and two simple means of interactive input through the input() and raw_input() built-in functions.

Python 3.0 will introduce various incompatible changes with previous Python versions [1]. Among the proposed changes, print will become a built-in function, print(), while input() and raw_input() would be removed completely from the built-in namespace, requiring the import of some module to provide even the most basic input capability.

This PEP proposes that Python 3.0 retains some simple interactive user input capability, equivalent to raw_input(), within the built-in namespace.

It was accepted by the BDFL in December 2006 [5].

Motivation

With its easy readability and its support for many programming styles (e.g. procedural, object-oriented, etc.), Python is perhaps the best computer language to use in introductory programming classes. Simple programs often need to provide information to the user (output) and to obtain information from the user (interactive input). Any computer language intended to be used in an educational setting should provide straightforward methods for both output and interactive input.

The current proposals for Python 3.0 [1] include a simple output pathway via a built-in function named print(), but a more complicated method for input [e.g. via sys.stdin.readline()], one that requires importing an external module. Current versions of Python (pre-3.0) include raw_input() as a built-in function. With the availability of such a function, programs that require simple input/output can be written from day one, without requiring discussions of importing modules, streams, etc.

Rationale

Current built-in functions, like input() and raw_input(), are found to be extremely useful in traditional teaching settings. (For more details, see [2] and the discussion that followed.) While the BDFL has clearly stated [3] that input() was not to be kept in Python 3000, he has also stated that he was not against revising the decision of killing raw_input().

raw_input() provides a simple means to ask a question and obtain a response from a user. The proposed plans for Python 3.0 would require the replacement of the single statement:

name = raw_input("What is your name?")

by the more complicated:

import sys
print("What is your name?")
name = sys.stdin.readline()

However, from the point of view of many Python beginners and educators, the use of sys.stdin.readline() presents the following problems:

1. Compared to the name "raw_input", the name "sys.stdin.readline()" is clunky and inelegant.

2. The names "sys" and "stdin" have no meaning for most beginners, who are mainly interested in what the function does, and not where in the package structure it is located. The lack of meaning also makes it difficult to remember: is it "sys.stdin.readline()" or "stdin.sys.readline()"? To a programming novice, there is no obvious reason to prefer one over the other. In contrast, simple and direct function names like print, input, raw_input, and open are easier to remember.

3. The use of "." notation is unmotivated and confusing to many beginners. For example, it may lead some beginners to think "." is a standard character that could be used in any identifier.

4. There is an asymmetry with the print function: why is print not called sys.stdout.print()?

Specification

The existing raw_input() function will be renamed to input().

The Python 2 to 3 conversion tool will replace calls to input() with eval(input()) and raw_input() with input().
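The renamed built-in can be exercised non-interactively by substituting an in-memory stream for standard input (the stream swap is our test scaffolding, not part of the PEP):

```python
import io
import sys

# simulate a user typing a line on standard input
real_stdin = sys.stdin
sys.stdin = io.StringIO("Guido\n")
try:
    name = input("What is your name? ")   # Python 3 spelling of raw_input()
finally:
    sys.stdin = real_stdin

assert name == "Guido"   # the trailing newline is stripped, as with raw_input()
```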

Naming Discussion

With input() effectively removed from the language, the name raw_input() makes much less sense and alternatives should be considered. The possibilities mentioned in various forums include:

ask()
ask_user()
get_string()
input()  # initially rejected by BDFL, later accepted
prompt()
read()
user_input()
get_response()

While it was initially rejected by the BDFL, it has been suggested that the most direct solution would be to rename "raw_input" to "input" in Python 3000. The main objection is that Python 2.x already has a function named "input", and, even though it is not going to be included in Python 3000, having a built-in function with the same name but different semantics may confuse programmers migrating from 2.x to 3000. Certainly, this is no problem for beginners, and the scope of the problem is unclear for more experienced programmers, since raw_input(), while popular with many, is not in universal use. In this instance, the good it does for beginners could be seen to outweigh the harm it does to experienced programmers - although it could cause confusion for people reading older books or tutorials.

The rationale for accepting the renaming can be found here [4].

pep-3112 Bytes literals in Python 3000

PEP:3112
Title:Bytes literals in Python 3000
Version:$Revision$
Last-Modified:$Date$
Author:Jason Orendorff <jason.orendorff at gmail.com>
Status:Final
Type:Standards Track
Content-Type:text/x-rst
Requires:358
Created:23-Feb-2007
Python-Version:3.0
Post-History:23-Feb-2007

Abstract

This PEP proposes a literal syntax for the bytes objects introduced in PEP 358. The purpose is to provide a convenient way to spell ASCII strings and arbitrary binary data.

Motivation

Existing spellings of an ASCII string in Python 3000 include:

bytes('Hello world', 'ascii')
'Hello world'.encode('ascii')

The proposed syntax is:

b'Hello world'

Existing spellings of an 8-bit binary sequence in Python 3000 include:

bytes([0x7f, 0x45, 0x4c, 0x46, 0x01, 0x01, 0x01, 0x00])
bytes('\x7fELF\x01\x01\x01\0', 'latin-1')
'7f454c4601010100'.decode('hex')

The proposed syntax is:

b'\x7f\x45\x4c\x46\x01\x01\x01\x00'
b'\x7fELF\x01\x01\x01\0'

In both cases, the advantages of the new syntax are brevity, some small efficiency gain, and the detection of encoding errors at compile time rather than at runtime. The brevity benefit is especially felt when using the string-like methods of bytes objects:

lines = bdata.split(bytes('\n', 'ascii'))  # existing syntax
lines = bdata.split(b'\n')  # proposed syntax

And when converting code from Python 2.x to Python 3000:

sok.send('EXIT\r\n')  # Python 2.x
sok.send('EXIT\r\n'.encode('ascii'))  # Python 3000 existing
sok.send(b'EXIT\r\n')  # proposed
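These brevity benefits can be sketched with the literal syntax as it shipped in Python 3 (the sample data is ours):

```python
bdata = b"GET /index.html\r\nHost: example.org\r\n"
lines = bdata.split(b"\r\n")          # bytes literal used with a bytes method
assert lines == [b"GET /index.html", b"Host: example.org", b""]

# hex escapes and ASCII characters spell the same bytes
assert b"\x7f\x45\x4c\x46" == b"\x7fELF"
```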

Grammar Changes

The proposed syntax is an extension of the existing string syntax [1].

The new syntax for strings, including the new bytes literal, is:

stringliteral: [stringprefix] (shortstring | longstring)
stringprefix: "b" | "r" | "br" | "B" | "R" | "BR" | "Br" | "bR"
shortstring: "'" shortstringitem* "'" | '"' shortstringitem* '"'
longstring: "'''" longstringitem* "'''" | '"""' longstringitem* '"""'
shortstringitem: shortstringchar | escapeseq
longstringitem: longstringchar | escapeseq
shortstringchar:
  <any source character except "\" or newline or the quote>
longstringchar: <any source character except "\">
escapeseq: "\" NL
  | "\\" | "\'" | '\"'
  | "\a" | "\b" | "\f" | "\n" | "\r" | "\t" | "\v"
  | "\ooo" | "\xhh"
  | "\uxxxx" | "\Uxxxxxxxx" | "\N{name}"

The following additional restrictions apply only to bytes literals (stringliteral tokens with b or B in the stringprefix):

  • Each shortstringchar or longstringchar must be a character whose code point is between 1 and 127 inclusive, regardless of any encoding declaration [2] in the source file.
  • The Unicode-specific escape sequences \uxxxx, \Uxxxxxxxx, and \N{name} are unrecognized in Python 2.x and forbidden in Python 3000.

Adjacent bytes literals are subject to the same concatenation rules as adjacent string literals [3]. A bytes literal adjacent to a string literal is an error.
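A short illustration of the concatenation rule as it behaves in Python 3 (example values are ours):

```python
# adjacent bytes literals concatenate at compile time, like string literals
magic = b"\x7fELF" b"\x01\x01\x01\x00"
assert magic == b"\x7fELF\x01\x01\x01\x00"
assert len(magic) == 8
# mixing kinds, e.g. b"ab" "cd", is a SyntaxError in Python 3
```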

Semantics

Each evaluation of a bytes literal produces a new bytes object. The bytes in the new object are the bytes represented by the shortstringitem or longstringitem parts of the literal, in the same order.

Rationale

The proposed syntax provides a cleaner migration path from Python 2.x to Python 3000 for most code involving 8-bit strings. Preserving the old 8-bit meaning of a string literal is usually as simple as adding a b prefix. The one exception is Python 2.x strings containing bytes >127, which must be rewritten using escape sequences. Transcoding a source file from one encoding to another, and fixing up the encoding declaration, should preserve the meaning of the program. Python 2.x non-Unicode strings violate this principle; Python 3000 bytes literals shouldn't.

A string literal with a b in the prefix is always a syntax error in Python 2.5, so this syntax can be introduced in Python 2.6, along with the bytes type.

A bytes literal produces a new object each time it is evaluated, like list displays and unlike string literals. This is necessary because bytes literals, like lists and unlike strings, are mutable [4].

Reference Implementation

Thomas Wouters has checked an implementation into the Py3K branch, r53872.

pep-3113 Removal of Tuple Parameter Unpacking

PEP:3113
Title:Removal of Tuple Parameter Unpacking
Version:$Revision$
Last-Modified:$Date$
Author:Brett Cannon <brett at python.org>
Status:Final
Type:Standards Track
Content-Type:text/x-rst
Created:02-Mar-2007
Python-Version:3.0
Post-History:

Abstract

Tuple parameter unpacking is the use of a tuple as a parameter in a function signature so as to have a sequence argument automatically unpacked. An example is:

def fxn(a, (b, c), d):
    pass

The use of (b, c) in the signature requires that the second argument to the function be a sequence of length two (e.g., [42, -13]). When such a sequence is passed it is unpacked and has its values assigned to the parameters, just as if the statement b, c = [42, -13] had been executed at the start of the function body.

Unfortunately this feature of Python's rich function signature abilities, while handy in some situations, causes more issues than it is worth. Thus this PEP proposes its removal from the language in Python 3.0.

Why They Should Go

Introspection Issues

Python has very powerful introspection capabilities. These extend to function signatures. There are no hidden details as to what a function's call signature is. In general it is fairly easy to figure out various details about a function's signature by viewing the function object and various attributes on it (including the function's func_code attribute).

But there is great difficulty when it comes to tuple parameters. The existence of a tuple parameter is denoted by its name being made of a . and a number in the co_varnames attribute of the function's code object. This allows the tuple argument to be bound to a name that only the bytecode is aware of and cannot be typed in Python source. But this does not specify the format of the tuple: its length, whether there are nested tuples, etc.

In order to get all of the details about the tuple from the function one must analyse the bytecode of the function. This is because the first bytecode in the function literally translates into the tuple argument being unpacked. Assuming the tuple parameter is named .1 and is expected to unpack to variables spam and monty (meaning it is the tuple (spam, monty)), the first bytecode in the function will be for the statement spam, monty = .1. This means that to know all of the details of the tuple parameter one must look at the initial bytecode of the function to detect tuple unpacking for parameters formatted as \.\d+ and deduce any and all information about the expected argument. Bytecode analysis is how the inspect.getargspec function is able to provide information on tuple parameters. This is not easy to do and is burdensome on introspection tools as they must know how Python bytecode works (an otherwise unneeded burden as all other types of parameters do not require knowledge of Python bytecode).

The difficulty of analysing bytecode notwithstanding, there is another issue with the dependency on using Python bytecode. IronPython [3] does not use Python's bytecode. Because it is based on the .NET framework, it instead stores MSIL [4] in the func_code.co_code attribute of the function. This fact prevents the inspect.getargspec function from working when run under IronPython. It is unknown whether other Python implementations are affected, but it is reasonable to assume they are if the implementation is not just a re-implementation of the Python virtual machine.

No Loss of Abilities If Removed

As mentioned in Introspection Issues, to handle tuple parameters the function's bytecode starts with the bytecode required to unpack the argument into the proper parameter names. This means that there is no special support required to implement tuple parameters and thus there is no loss of abilities if they were to be removed, only a possible convenience (which is addressed in Why They Should (Supposedly) Stay).

The example function at the beginning of this PEP could easily be rewritten as:

def fxn(a, b_c, d):
    b, c = b_c
    pass

and in no way lose functionality.
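The rewritten form runs unchanged in Python 3, where tuple parameters are gone (the return statement replaces the PEP's placeholder pass so the result can be checked):

```python
def fxn(a, b_c, d):
    b, c = b_c            # explicit unpacking replaces the tuple parameter
    return a + b + c + d

assert fxn(1, (2, 3), 4) == 10
assert fxn(1, [2, 3], 4) == 10   # any length-2 sequence still unpacks
```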

Exception To The Rule

When looking at the various types of parameters that a Python function can have, one will notice that tuple parameters tend to be an exception rather than the rule.

Consider PEP 3102 (keyword-only arguments) and PEP 3107 (function annotations) [5] [6]. Both PEPs have been accepted and introduce new functionality within a function's signature. And yet for both PEPs the new feature cannot be applied to tuple parameters as a whole. PEP 3102 has no support for tuple parameters at all (which makes sense as there is no way to reference a tuple parameter by name). PEP 3107 allows annotations for each item within the tuple (e.g., (x:int, y:int)), but not the whole tuple (e.g., (x, y):int).

The existence of tuple parameters also places sequence objects separately from mapping objects in a function signature. There is no way to pass in a mapping object (e.g., a dict) as a parameter and have it unpack in the same fashion as a sequence does into a tuple parameter.

Uninformative Error Messages

Consider the following function:

def fxn((a, b), (c, d)):
    pass

If called as fxn(1, (2, 3)) one is given the error message TypeError: unpack non-sequence. This error message in no way tells you which tuple was not unpacked properly. There is also no indication that this was a result that occurred because of the arguments. Other error messages regarding arguments to functions explicitly state its relation to the signature: TypeError: fxn() takes exactly 2 arguments (0 given), etc.

Little Usage

While an informal poll of the handful of Python programmers I know personally and from the PyCon 2007 sprint indicates that a huge majority of people do not know of this feature and the rest just do not use it, some hard numbers are needed to back up the claim that the feature is not heavily used.

Iterating over every line in Python's code repository in the Lib/ directory using the regular expression ^\s*def\s*\w+\s*\( to detect function and method definitions, there were 22,252 matches in the trunk.

Tacking on .*,\s*\( to find def statements that contained a tuple parameter, only 41 matches were found. This means that for def statements, only 0.18% of them seem to use a tuple parameter.
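The counting method described above can be reconstructed as a sketch; the patterns come from the PEP, while the sample lines are hypothetical stand-ins for repository source:

```python
import re

def_re = re.compile(r"^\s*def\s*\w+\s*\(")            # any def statement
tuple_re = re.compile(r"^\s*def\s*\w+\s*\(.*,\s*\(")  # def with a tuple parameter

sample = [
    "def fxn(a, (b, c), d):",
    "def plain(a, b):",
    "    def method(self, x):",
]
defs = [line for line in sample if def_re.match(line)]
tuple_defs = [line for line in defs if tuple_re.match(line)]
assert len(defs) == 3
assert tuple_defs == ["def fxn(a, (b, c), d):"]
```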

Why They Should (Supposedly) Stay

Practical Use

In certain instances tuple parameters can be useful. A common example is code that expects a two-item tuple that represents a Cartesian point. While it is true that it is nice to have the x and y coordinates unpacked for you, the argument is that this small amount of practical usefulness is heavily outweighed by other issues pertaining to tuple parameters. And as shown in No Loss Of Abilities If Removed, their use is purely a convenience that in no way provides a unique ability which cannot be handled in other ways very easily.

Self-Documentation For Parameters

It has been argued that tuple parameters provide a way of self-documentation for parameters that are expected to be of a certain sequence format. Using our Cartesian point example from Practical Use, seeing (x, y) as a parameter in a function makes it obvious that a tuple of length two is expected as an argument for that parameter.

But Python provides several other ways to document what parameters are for. Documentation strings are meant to provide enough information to explain what arguments are expected. While tuple parameters might tell you the expected length of a sequence argument, they do not tell you what that data will be used for. One must also read the docstring to know what other arguments are expected if not all parameters are tuple parameters.

Function annotations (which do not work with tuple parameters) can also supply documentation. Because annotations can be of any form, what was once a tuple parameter can be a single argument parameter with an annotation of tuple, tuple(2), Cartesian point, (x, y), etc. Annotations provide great flexibility for documenting what an argument is expected to be for a parameter, including being a sequence of a certain length.

Transition Plan

To transition Python 2.x code to 3.x where tuple parameters are removed, two steps are suggested. First, the proper warning is to be emitted when Python's compiler comes across a tuple parameter in Python 2.6. This will be treated like any other syntactic change that is to occur in Python 3.0 compared to Python 2.6.

Second, the 2to3 refactoring tool [1] will gain a fixer [2] for translating tuple parameters to being a single parameter that is unpacked as the first statement in the function. The name of the new parameter will be changed. The new parameter will then be unpacked into the names originally used in the tuple parameter. This means that the following function:

def fxn((a, (b, c))):
    pass

will be translated into:

def fxn(a_b_c):
    (a, (b, c)) = a_b_c
    pass

As tuple parameters are used by lambdas because of the single expression limitation, they must also be supported. This is done by having the expected sequence argument bound to a single parameter and then indexing on that parameter:

lambda (x, y): x + y

will be translated into:

lambda x_y: x_y[0] + x_y[1]
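The translated lambda works in Python 3 as-is (the surrounding names are ours):

```python
add = lambda x_y: x_y[0] + x_y[1]   # translated form: index instead of unpack
assert add((2, 3)) == 5
assert list(map(add, [(1, 1), (10, 20)])) == [2, 30]
```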

pep-3114 Renaming iterator.next() to iterator.__next__()

PEP:3114
Title:Renaming iterator.next() to iterator.__next__()
Version:$Revision$
Last-Modified:$Date$
Author:Ka-Ping Yee <ping at zesty.ca>
Status:Final
Type:Standards Track
Content-Type:text/x-rst
Created:04-Mar-2007
Python-Version:3.0
Post-History:

Abstract

The iterator protocol in Python 2.x consists of two methods: __iter__() called on an iterable object to yield an iterator, and next() called on an iterator object to yield the next item in the sequence. Using a for loop to iterate over an iterable object implicitly calls both of these methods. This PEP proposes that the next method be renamed to __next__, consistent with all the other protocols in Python in which a method is implicitly called as part of a language-level protocol, and that a built-in function named next be introduced to invoke the __next__ method, consistent with the manner in which other protocols are explicitly invoked.

Names With Double Underscores

In Python, double underscores before and after a name are used to distinguish names that belong to the language itself. Attributes and methods that are implicitly used or created by the interpreter employ this naming convention; some examples are:

  • __file__ - an attribute automatically created by the interpreter
  • __dict__ - an attribute with special meaning to the interpreter
  • __init__ - a method implicitly called by the interpreter

Note that this convention applies to methods such as __init__ that are explicitly defined by the programmer, as well as attributes such as __file__ that can only be accessed by naming them explicitly, so it includes names that are used or created by the interpreter.

(Not all things that are called "protocols" are made of methods with double-underscore names. For example, the __contains__ method has double underscores because the language construct x in y implicitly calls __contains__. But even though the read method is part of the file protocol, it does not have double underscores because there is no language construct that implicitly invokes x.read().)

The use of double underscores creates a separate namespace for names that are part of the Python language definition, so that programmers are free to create variables, attributes, and methods that start with letters, without fear of silently colliding with names that have a language-defined purpose. (Colliding with reserved keywords is still a concern, but at least this will immediately yield a syntax error.)

The naming of the next method on iterators is an exception to this convention. Code that nowhere contains an explicit call to a next method can nonetheless be silently affected by the presence of such a method. Therefore, this PEP proposes that iterators should have a __next__ method instead of a next method (with no change in semantics).
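The renamed protocol looks like this in Python 3, where the proposal was adopted (the Countdown class is our illustrative example):

```python
class Countdown:
    """Minimal iterator using the double-underscore protocol names."""
    def __init__(self, start):
        self.n = start
    def __iter__(self):
        return self
    def __next__(self):          # implicitly invoked by for loops and next()
        if self.n <= 0:
            raise StopIteration
        self.n -= 1
        return self.n + 1

assert list(Countdown(3)) == [3, 2, 1]
```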

Double-Underscore Methods and Built-In Functions

The Python language defines several protocols that are implemented or customized by defining methods with double-underscore names. In each case, the protocol is provided by an internal method implemented as a C function in the interpreter. For objects defined in Python, this C function supports customization by implicitly invoking a Python method with a double-underscore name (it often does a little bit of additional work beyond just calling the Python method).

Sometimes the protocol is invoked by a syntactic construct:

  • x[y] --> internal tp_getitem --> x.__getitem__(y)
  • x + y --> internal nb_add --> x.__add__(y)
  • -x --> internal nb_negative --> x.__neg__()

Sometimes there is no syntactic construct, but it is still useful to be able to explicitly invoke the protocol. For such cases Python offers a built-in function of the same name but without the double underscores.

  • len(x) --> internal sq_length --> x.__len__()
  • hash(x) --> internal tp_hash --> x.__hash__()
  • iter(x) --> internal tp_iter --> x.__iter__()

Following this pattern, the natural way to handle next is to add a next built-in function that behaves in exactly the same fashion.

  • next(x) --> internal tp_iternext --> x.__next__()

Further, it is proposed that the next built-in function accept a sentinel value as an optional second argument, following the style of the getattr and iter built-in functions. When called with two arguments, next catches the StopIteration exception and returns the sentinel value instead of propagating the exception. This creates a nice duality between iter and next:

iter(function, sentinel) <--> next(iterator, sentinel)
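The duality above can be sketched as follows (an illustrative snippet; the variable names are invented):

```python
import itertools

# next(iterator, sentinel): the sentinel is returned instead of
# propagating StopIteration once the iterator is exhausted.
it = iter([1, 2])
assert next(it) == 1
assert next(it) == 2
assert next(it, 'done') == 'done'

# The dual form, iter(callable, sentinel): the callable is invoked
# repeatedly until it returns the sentinel value.
counter = itertools.count().__next__
assert list(iter(counter, 3)) == [0, 1, 2]
```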

Previous Proposals

This proposal is not a new idea. The idea proposed here was supported by the BDFL on python-dev [1] and is even mentioned in the original iterator PEP, PEP 234:

(In retrospect, it might have been better to go for __next__()
and have a new built-in, next(it), which calls it.__next__().
But alas, it's too late; this has been deployed in Python 2.2
since December 2001.)

Objections

There have been a few objections to the addition of more built-ins. In particular, Martin von Loewis writes [2]:

I dislike the introduction of more builtins unless they have a true
generality (i.e. are likely to be needed in many programs). For this
one, I think the normal usage of __next__ will be with a for loop, so
I don't think one would often need an explicit next() invocation.

It is also not true that most protocols are explicitly invoked through
builtin functions. Instead, most protocols can be explicitly invoked
through methods in the operator module. So following tradition, it
should be operator.next.

...

As an alternative, I propose that object grows a .next() method,
which calls __next__ by default.

Transition Plan

Two additional transformations will be added to the 2to3 translation tool [3]:

  • Method definitions named next will be renamed to __next__.
  • Explicit calls to the next method will be replaced with calls to the built-in next function. For example, x.next() will become next(x).
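The effect of these two transformations can be sketched on a hypothetical Python 2-style class (a before/after illustration only, not verbatim 2to3 output):

```python
# Before (Python 2 style):
#     class Squares:
#         def __iter__(self):
#             return self
#         def next(self):        # old protocol method name
#             ...
#     s.next()                   # explicit method call
#
# After applying both transformations:
class Squares:
    def __init__(self):
        self.i = 0

    def __iter__(self):
        return self

    def __next__(self):          # definition renamed next -> __next__
        self.i += 1
        return self.i ** 2

s = Squares()
assert next(s) == 1              # s.next() rewritten to next(s)
assert next(s) == 4
```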

Collin Winter looked into the possibility of automatically deciding whether to perform the second transformation depending on the presence of a module-level binding to next [4] and found that it would be "ugly and slow". Instead, the translation tool will emit warnings upon detecting such a binding. Collin has proposed warnings for the following conditions [5]:

  • Module-level assignments to next.
  • Module-level definitions of a function named next.
  • Module-level imports of the name next.
  • Assignments to __builtin__.next.

Approval

This PEP was accepted by Guido on March 6, 2007 [6].

Implementation

A patch with the necessary changes (except the 2to3 tool) was written by Georg Brandl and committed as revision 54910.

References

[1]Single- vs. Multi-pass iterability (Guido van Rossum) http://mail.python.org/pipermail/python-dev/2002-July/026814.html
[2]PEP: rename it.next() to it.__next__()... (Martin von Loewis) http://mail.python.org/pipermail/python-3000/2007-March/005965.html
[3]2to3 refactoring tool http://svn.python.org/view/sandbox/trunk/2to3/
[4]PEP: rename it.next() to it.__next__()... (Collin Winter) http://mail.python.org/pipermail/python-3000/2007-March/006020.html
[5]PEP 3113 transition plan http://mail.python.org/pipermail/python-3000/2007-March/006044.html
[6]PEP: rename it.next() to it.__next__()... (Guido van Rossum) http://mail.python.org/pipermail/python-3000/2007-March/006027.html

pep-3115 Metaclasses in Python 3000

PEP: 3115
Title: Metaclasses in Python 3000
Version: $Revision$
Last-Modified: $Date$
Author: Talin <talin at acm.org>
Status: Final
Type: Standards Track
Content-Type: text/plain
Created: 07-Mar-2007
Python-Version: 3.0
Post-History: 11-March-2007, 14-March-2007

Abstract

     This PEP proposes changing the syntax for declaring metaclasses,
     and alters the semantics for how classes with metaclasses are
     constructed.


Rationale

     There are two rationales for this PEP, both of which are somewhat
     subtle.

     The primary reason for changing the way metaclasses work is that
     there are a number of interesting use cases that require the
     metaclass to get involved earlier in the class construction process
     than is currently possible. Currently, the metaclass mechanism is
     essentially a post-processing step. With the advent of class
     decorators, many of these post-processing chores can be taken over
     by the decorator mechanism.

     In particular, there is an important body of use cases where it
     would be useful to preserve the order in which class members are
     declared. Ordinary Python objects store their members in a
     dictionary, in which ordering is unimportant, and members are
     accessed strictly by name. However, Python is often used to
     interface with external systems in which the members are organized
     according to an implicit ordering. Examples include declaration of C
     structs; COM objects; Automatic translation of Python classes into
     IDL or database schemas, such as used in an ORM; and so on.

     In such cases, it would be useful for a Python programmer to specify
     such ordering directly using the declaration order of class members.
     Currently, such orderings must be specified explicitly, using some
     other mechanism (see the ctypes module for an example.)

     Unfortunately, the current method for declaring a metaclass does
     not allow for this, since the ordering information has already been
     lost by the time the metaclass comes into play. By allowing the
     metaclass to get involved in the class construction process earlier,
     the new system allows the ordering or other early artifacts of
     construction to be preserved and examined.

     The proposed metaclass mechanism also supports a number of other
     interesting use cases beyond preserving the ordering of declarations.
     One use case is to insert symbols into the namespace of the class
     body which are only valid during class construction. An example of
     this might be "field constructors", small functions that are used in
     the creation of class members. Another interesting possibility is
     supporting forward references, i.e. references to Python
     symbols that are declared further down in the class body.

     The other, weaker, rationale is purely cosmetic: The current method
     for specifying a metaclass is by assignment to the special variable
     __metaclass__, which is considered by some to be aesthetically less
     than ideal. Others disagree strongly with that opinion. This PEP
     will not address this issue, other than to note it, since aesthetic
     debates cannot be resolved via logical proofs.


Specification

     In the new model, the syntax for specifying a metaclass is via a
     keyword argument in the list of base classes:

       class Foo(base1, base2, metaclass=mymeta):
         ...

     Additional keywords will also be allowed here, and will be passed to
     the metaclass, as in the following example:

       class Foo(base1, base2, metaclass=mymeta, private=True):
         ...

     Note that this PEP makes no attempt to define what these other
     keywords might be - that is up to metaclass implementors to
     determine.

     More generally, the parameter list passed to a class definition will
     now support all of the features of a function call, meaning that you
     can now use *args and **kwargs-style arguments in the class base
     list:

        class Foo(*bases, **kwds):
           ...
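As a minimal sketch of the keyword syntax (the metaclass name and the 'private' keyword are invented for illustration), extra keywords in the class header are delivered to the metaclass:

```python
class MyMeta(type):
    # Hypothetical metaclass consuming an extra 'private' keyword.
    def __new__(mcls, name, bases, namespace, private=False):
        cls = super().__new__(mcls, name, bases, namespace)
        cls._private = private
        return cls

    def __init__(cls, name, bases, namespace, private=False):
        # Accept (and discard) the extra keyword here as well.
        super().__init__(name, bases, namespace)

class Foo(metaclass=MyMeta, private=True):
    pass

assert Foo._private is True
```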

Invoking the Metaclass

     In the current metaclass system, the metaclass object can be any
     callable type. This does not change; however, in order to fully
     exploit all of the new features, the metaclass will need to have an
     extra attribute which is used during class pre-construction.

     This attribute is named __prepare__, which is invoked as a function
     before the evaluation of the class body. The __prepare__ function
     takes two positional arguments, and an arbitrary number of keyword
     arguments. The two positional arguments are:

       'name' - the name of the class being created.
       'bases' - the list of base classes.

     The interpreter always tests for the existence of __prepare__ before
     calling it; if it is not present, then a regular dictionary is used,
     as illustrated in the following Python snippet.

       def prepare_class(name, *bases, metaclass=None, **kwargs):
          if metaclass is None:
             metaclass = compute_default_metaclass(bases)
          prepare = getattr(metaclass, '__prepare__', None)
          if prepare is not None:
             return prepare(name, bases, **kwargs)
          else:
             return dict()

     The example above illustrates how the arguments to 'class' are
     interpreted. The class name is the first argument, followed by
     an arbitrary length list of base classes. After the base classes,
     there may be one or more keyword arguments, one of which can be
     'metaclass'. Note that the 'metaclass' argument is not included
     in kwargs, since it is filtered out by the normal parameter
     assignment algorithm. (Note also that 'metaclass' is a keyword-
     only argument as per PEP 3102 [6].)

     Even though __prepare__ is not required, the default metaclass
     ('type') implements it, for the convenience of subclasses calling
     it via super().
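The cooperative pattern described above can be sketched like this (the metaclass and the 'VERSION' entry are invented for illustration; behavior shown is that of Python 3 as shipped):

```python
# type itself provides __prepare__, returning a plain empty dict ...
ns = type.__prepare__('C', ())
assert isinstance(ns, dict) and ns == {}

# ... so a metaclass can extend it via super() and pre-seed the
# namespace in which the class body will execute.
class Meta(type):
    @classmethod
    def __prepare__(metacls, name, bases, **kwds):
        ns = super().__prepare__(name, bases, **kwds)
        ns['VERSION'] = 1      # visible during class body evaluation
        return ns

class C(metaclass=Meta):
    pass

assert C.VERSION == 1
```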

     __prepare__ returns a dictionary-like object which is used to store
     the class member definitions during evaluation of the class body.
     In other words, the class body is evaluated as a function block
     (just like it is now), except that the local variables dictionary
     is replaced by the dictionary returned from __prepare__. This
     dictionary object can be a regular dictionary or a custom mapping
     type.

     This dictionary-like object is not required to support the full
     dictionary interface. A dictionary which supports a limited set of
     dictionary operations will restrict what kinds of actions can occur
     during evaluation of the class body. A minimal implementation might
     only support adding and retrieving values from the dictionary - most
     class bodies will do no more than that during evaluation. For some
     classes, it may be desirable to support deletion as well. Many
     metaclasses will need to make a copy of this dictionary afterwards,
     so iteration or other means for reading out the dictionary contents
     may also be useful.

     The __prepare__ method will most often be implemented as a class
     method rather than an instance method because it is called before
     the metaclass instance (i.e. the class itself) is created.

     Once the class body has finished evaluating, the metaclass will be
     called (as a callable) with the class dictionary, which is no
     different from the current metaclass mechanism.

     Typically, a metaclass will create a custom dictionary - either a
     subclass of dict, or a wrapper around it - that will contain
     additional properties that are set either before or during the
     evaluation of the class body. Then in the second phase, the
     metaclass can use these additional properties to further customize
     the class.

     An example would be a metaclass that uses information about the
     ordering of member declarations to create a C struct. The metaclass
     would provide a custom dictionary that simply keeps a record of the
     order of insertions. This does not need to be a full 'ordered dict'
     implementation, but rather just a Python list of (key,value) pairs
     that is appended to for each insertion.

     Note that in such a case, the metaclass would be required to deal
     with the possibility of duplicate keys, but in most cases that is
     trivial. The metaclass can use the first declaration, the last,
     combine them in some fashion, or simply throw an exception. It's up
     to the metaclass to decide how it wants to handle that case.

Example:

     Here's a simple example of a metaclass which creates a list of
     the names of all class members, in the order that they were
     declared:

     # The custom dictionary
     class member_table(dict):
        def __init__(self):
           self.member_names = []

        def __setitem__(self, key, value):
           # if the key is not already defined, add to the
           # list of keys.
           if key not in self:
              self.member_names.append(key)

           # Call superclass
           dict.__setitem__(self, key, value)

     # The metaclass
     class OrderedClass(type):

         # The prepare function
         @classmethod
         def __prepare__(metacls, name, bases): # No keywords in this case
            return member_table()

         # The metaclass invocation
         def __new__(cls, name, bases, classdict):
            # Note that we replace the classdict with a regular
            # dict before passing it to the superclass, so that we
            # don't continue to record member names after the class
            # has been created.
            result = type.__new__(cls, name, bases, dict(classdict))
            result.member_names = classdict.member_names
            return result

     class MyClass(metaclass=OrderedClass):
        # method1 goes in array element 0
        def method1(self):
           pass

        # method2 goes in array element 1
        def method2(self):
           pass
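Restating the example compactly so it can be run as-is: note that in modern CPython the interpreter also writes __module__ and __qualname__ into the class namespace before the user's declarations, so only the tail of the recorded list is the programmer's members.

```python
class member_table(dict):
    def __init__(self):
        self.member_names = []

    def __setitem__(self, key, value):
        if key not in self:
            self.member_names.append(key)
        dict.__setitem__(self, key, value)

class OrderedClass(type):
    @classmethod
    def __prepare__(metacls, name, bases):
        return member_table()

    def __new__(cls, name, bases, classdict):
        result = type.__new__(cls, name, bases, dict(classdict))
        result.member_names = classdict.member_names
        return result

class MyClass(metaclass=OrderedClass):
    def method1(self): pass
    def method2(self): pass

# The declaration order of the user's members is preserved.
assert MyClass.member_names[-2:] == ['method1', 'method2']
```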

Sample Implementation:

     Guido van Rossum has created a patch which implements the new
     functionality:
     
        http://python.org/sf/1681101      

Alternate Proposals

     Josiah Carlson proposed using the name 'type' instead of
     'metaclass', on the theory that what is really being specified is
     the type of the type. While this is technically correct, it is also
     confusing from the point of view of a programmer creating a new
     class. From the application programmer's point of view, the 'type'
     that they are interested in is the class that they are writing; the
     type of that type is the metaclass.

     There were some objections in the discussion to the 'two-phase'
     creation process, where the metaclass is invoked twice, once to
     create the class dictionary and once to 'finish' the class. Some
     people felt that these two phases should be completely separate, in
     that there ought to be separate syntax for specifying the custom
     dict as for specifying the metaclass. However, in most cases, the
     two will be intimately tied together, and the metaclass will most
     likely have an intimate knowledge of the internal details of the
     class dict. Requiring the programmer to insure that the correct dict
     type and the correct metaclass type are used together creates an
     additional and unneeded burden on the programmer.

     Another good suggestion was to simply use an ordered dict for all
     classes, and skip the whole 'custom dict' mechanism. This was based
     on the observation that most use cases for a custom dict were for
     the purposes of preserving order information. However, this idea has
     several drawbacks, first because it means that an ordered dict
     implementation would have to be added to the set of built-in types
     in Python, and second because it would impose a slight speed (and
     complexity) penalty on all class declarations. Later, several people
     came up with ideas for use cases for custom dictionaries other
     than preserving field orderings, so this idea was dropped.


Backwards Compatibility

     It would be possible to leave the existing __metaclass__ syntax in
     place. Alternatively, it would not be too difficult to modify the
     syntax rules of the Py3K translation tool to convert from the old to
     the new syntax.


References

     [1] [Python-3000] Metaclasses in Py3K (original proposal)
         http://mail.python.org/pipermail/python-3000/2006-December/005030.html

     [2] [Python-3000] Metaclasses in Py3K (Guido's suggested syntax)
         http://mail.python.org/pipermail/python-3000/2006-December/005033.html

     [3] [Python-3000] Metaclasses in Py3K (Objections to two-phase init)
         http://mail.python.org/pipermail/python-3000/2006-December/005108.html

     [4] [Python-3000] Metaclasses in Py3K (Always use an ordered dict)
         http://mail.python.org/pipermail/python-3000/2006-December/005118.html

     [5] PEP 359: The 'make' statement -
         http://www.python.org/dev/peps/pep-0359/

     [6] PEP 3102: Keyword-only arguments -
         http://www.python.org/dev/peps/pep-3102/

Copyright

     This document has been placed in the public domain.


pep-3116 New I/O

PEP:3116
Title:New I/O
Version:$Revision$
Last-Modified:$Date$
Author:Daniel Stutzbach <daniel at stutzbachenterprises.com>, Guido van Rossum <guido at python.org>, Mike Verdone <mike.verdone at gmail.com>
Status:Final
Type:Standards Track
Content-Type:text/x-rst
Created:26-Feb-2007
Python-Version:3.0
Post-History:26-Feb-2007

Rationale and Goals

Python allows for a variety of stream-like (a.k.a. file-like) objects that can be used via read() and write() calls. Anything that provides read() and write() is stream-like. However, more exotic and extremely useful functions like readline() or seek() may or may not be available on every stream-like object. Python needs a specification for basic byte-based I/O streams to which we can add buffering and text-handling features.

Once we have a defined raw byte-based I/O interface, we can add buffering and text handling layers on top of any byte-based I/O class. The same buffering and text handling logic can be used for files, sockets, byte arrays, or custom I/O classes developed by Python programmers. Developing a standard definition of a stream lets us separate stream-based operations like read() and write() from implementation specific operations like fileno() and isatty(). It encourages programmers to write code that uses streams as streams and not require that all streams support file-specific or socket-specific operations.

The new I/O spec is intended to be similar to the Java I/O libraries, but generally less confusing. Programmers who don't want to muck about in the new I/O world can expect that the open() factory method will produce an object backwards-compatible with old-style file objects.

Specification

The Python I/O Library will consist of three layers: a raw I/O layer, a buffered I/O layer, and a text I/O layer. Each layer is defined by an abstract base class, which may have multiple implementations. The raw I/O and buffered I/O layers deal with units of bytes, while the text I/O layer deals with units of characters.

Raw I/O

The abstract base class for raw I/O is RawIOBase. It has several methods which are wrappers around the appropriate operating system calls. If one of these functions would not make sense on the object, the implementation must raise an IOError exception. For example, if a file is opened read-only, the .write() method will raise an IOError. As another example, if the object represents a socket, then .seek(), .tell(), and .truncate() will raise an IOError. Generally, a call to one of these functions maps to exactly one operating system call.

.read(n: int) -> bytes

Read up to n bytes from the object and return them. Fewer than n bytes may be returned if the operating system call returns fewer than n bytes. If 0 bytes are returned, this indicates end of file. If the object is in non-blocking mode and no bytes are available, the call returns None.

.readinto(b: bytes) -> int

Read up to len(b) bytes from the object and store them in b, returning the number of bytes read. Like .read, fewer than len(b) bytes may be read, and 0 indicates end of file. None is returned if a non-blocking object has no bytes available. The length of b is never changed.

.write(b: bytes) -> int

Returns number of bytes written, which may be < len(b).

.seek(pos: int, whence: int = 0) -> int

.tell() -> int

.truncate(n: int = None) -> int

.close() -> None

Additionally, it defines a few other methods:

.readable() -> bool

Returns True if the object was opened for reading, False otherwise. If False, .read() will raise an IOError if called.

.writable() -> bool

Returns True if the object was opened for writing, False otherwise. If False, .write() and .truncate() will raise an IOError if called.

.seekable() -> bool

Returns True if the object supports random access (such as disk files), or False if the object only supports sequential access (such as sockets, pipes, and ttys). If False, .seek(), .tell(), and .truncate() will raise an IOError if called.

.__enter__() -> ContextManager

Context management protocol. Returns self.

.__exit__(...) -> None

Context management protocol. Same as .close().

If and only if a RawIOBase implementation operates on an underlying file descriptor, it must additionally provide a .fileno() member function. This could be defined specifically by the implementation, or a mix-in class could be used (need to decide about this).

.fileno() -> int

Returns the underlying file descriptor (an integer)

Initially, three implementations will be provided that implement the RawIOBase interface: FileIO, SocketIO (in the socket module), and BytesIO. Each implementation must determine whether the object supports random access, as the information provided by the user may not be sufficient (consider open("/dev/tty", "rw") or open("/tmp/named-pipe", "rw")). As an example, FileIO can determine this by calling the seek() system call; if it returns an error, the object does not support random access. Each implementation may provide additional methods appropriate to its type. The BytesIO object is analogous to Python 2's cStringIO library, but operates on the new bytes type instead of strings.
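The byte-level calls above can be exercised with the in-memory bytes stream as it eventually shipped in Python 3's io module (a sketch, not part of the PEP; note that io.BytesIO as shipped actually derives from BufferedIOBase, but it exposes the same byte-oriented calls):

```python
import io

raw = io.BytesIO(b'hello world')     # in-memory byte stream
buf = bytearray(5)
assert raw.readinto(buf) == 5 and bytes(buf) == b'hello'
assert raw.read(6) == b' world'
assert raw.read(1) == b''            # empty result signals end-of-file
# An in-memory stream supports the whole interface.
assert raw.seekable() and raw.readable() and raw.writable()
```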

Buffered I/O

The next layer is the Buffered I/O layer which provides more efficient access to file-like objects. The abstract base class for all Buffered I/O implementations is BufferedIOBase, which provides similar methods to RawIOBase:

.read(n: int = -1) -> bytes

Returns the next n bytes from the object. It may return fewer than n bytes if end-of-file is reached or the object is non-blocking. 0 bytes indicates end-of-file. This method may make multiple calls to RawIOBase.read() to gather the bytes, or may make no calls to RawIOBase.read() if all of the needed bytes are already buffered.

.readinto(b: bytes) -> int

.write(b: bytes) -> int

Write b bytes to the buffer. The bytes are not guaranteed to be written to the Raw I/O object immediately; they may be buffered. Returns len(b).

.seek(pos: int, whence: int = 0) -> int

.tell() -> int

.truncate(pos: int = None) -> int

.flush() -> None

.close() -> None

.readable() -> bool

.writable() -> bool

.seekable() -> bool

.__enter__() -> ContextManager

.__exit__(...) -> None

Additionally, the abstract base class provides one member variable:

.raw

A reference to the underlying RawIOBase object.

The BufferedIOBase method signatures are mostly identical to those of RawIOBase (exceptions: read()'s argument is optional, and write() never returns a short count), but may have different semantics. In particular, BufferedIOBase implementations may read more data than requested or delay writing data using buffers. For the most part, this will be transparent to the user (unless, for example, they open the same file through a different descriptor). Also, raw reads may return a short read without any particular reason; buffered reads will only return a short read if EOF is reached; and raw writes may return a short count (even when non-blocking I/O is not enabled!), while buffered writes will raise IOError when not all bytes could be written or buffered.
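The buffering layer and the .raw member can be seen with the classes as shipped in Python 3's io module (an illustrative sketch; the byte values are arbitrary):

```python
import io

raw = io.BytesIO(b'abcdef')
br = io.BufferedReader(raw, buffer_size=4)   # buffered wrapper over raw
assert br.read(3) == b'abc'   # may be served from the internal buffer
assert br.raw is raw          # the .raw member variable
assert br.read() == b'def'    # buffered read only short at EOF
```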

There are four implementations of the BufferedIOBase abstract base class, described below.

BufferedReader

The BufferedReader implementation is for sequential-access read-only objects. Its .flush() method is a no-op.

BufferedWriter

The BufferedWriter implementation is for sequential-access write-only objects. Its .flush() method forces all cached data to be written to the underlying RawIOBase object.

BufferedRWPair

The BufferedRWPair implementation is for sequential-access read-write objects such as sockets and ttys. As the read and write streams of these objects are completely independent, it could be implemented by simply incorporating a BufferedReader and BufferedWriter instance. It provides a .flush() method that has the same semantics as a BufferedWriter's .flush() method.

BufferedRandom

The BufferedRandom implementation is for all random-access objects, whether they are read-only, write-only, or read-write. Compared to the previous classes that operate on sequential-access objects, the BufferedRandom class must contend with the user calling .seek() to reposition the stream. Therefore, an instance of BufferedRandom must keep track of both the logical and true position within the object. It provides a .flush() method that forces all cached write data to be written to the underlying RawIOBase object and all cached read data to be forgotten (so that future reads are forced to go back to the disk).

Q: Do we want to mandate in the specification that switching between reading and writing on a read-write object implies a .flush()? Or is that an implementation convenience that users should not rely on?

For a read-only BufferedRandom object, .writable() returns False and the .write() and .truncate() methods throw IOError.

For a write-only BufferedRandom object, .readable() returns False and the .read() method throws IOError.

Text I/O

The text I/O layer provides functions to read and write strings from streams. Some new features include universal newlines and character set encoding and decoding. The Text I/O layer is defined by a TextIOBase abstract base class. It provides several methods that are similar to the BufferedIOBase methods, but operate on a per-character basis instead of a per-byte basis. These methods are:

.read(n: int = -1) -> str

.write(s: str) -> int

.tell() -> object

Return a cookie describing the current file position. The only supported use for the cookie is with .seek() with whence set to 0 (i.e. absolute seek).

.seek(pos: object, whence: int = 0) -> int

Seek to position pos. If pos is non-zero, it must be a cookie returned from .tell() and whence must be zero.

.truncate(pos: object = None) -> int

Like BufferedIOBase.truncate(), except that pos (if not None) must be a cookie previously returned by .tell().

Unlike with raw I/O, the units for .seek() are not specified - some implementations (e.g. StringIO) use characters and others (e.g. TextIOWrapper) use bytes. The special case for zero is to allow going to the start or end of a stream without a prior .tell(). An implementation could include stream encoder state in the cookie returned from .tell().
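The cookie discipline can be demonstrated with TextIOWrapper as shipped in Python 3's io module (a sketch; the file contents are arbitrary):

```python
import io

data = 'first\nsecond\nthird\n'.encode('utf-8')
t = io.TextIOWrapper(io.BytesIO(data), encoding='utf-8')

t.readline()                     # consume 'first\n'
cookie = t.tell()                # opaque cookie for this position
t.readline()                     # move elsewhere
t.seek(cookie)                   # whence defaults to 0 (absolute)
assert t.readline() == 'second\n'

assert t.seek(0) == 0            # zero is special-cased: start of stream
assert t.readline() == 'first\n'
```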

TextIOBase implementations also provide several methods that are pass-throughs to the underlying BufferedIOBase objects:

.flush() -> None

.close() -> None

.readable() -> bool

.writable() -> bool

.seekable() -> bool

TextIOBase class implementations additionally provide the following methods:

.readline() -> str

Read until newline or EOF and return the line, or "" if EOF hit immediately.

.__iter__() -> Iterator

Returns an iterator that returns lines from the file (which happens to be self).

.__next__() -> str

Same as readline() except raises StopIteration if EOF hit immediately.

Two implementations will be provided by the Python library. The primary implementation, TextIOWrapper, wraps a Buffered I/O object. Each TextIOWrapper object has a property named ".buffer" that provides a reference to the underlying BufferedIOBase object. Its initializer has the following signature:

.__init__(self, buffer, encoding=None, errors=None, newline=None, line_buffering=False)

buffer is a reference to the BufferedIOBase object to be wrapped with the TextIOWrapper.

encoding refers to an encoding to be used for translating between the byte-representation and character-representation. If it is None, then the system's locale setting will be used as the default.

errors is an optional string indicating error handling. It may be set whenever encoding may be set. It defaults to 'strict'.

newline can be None, '', '\n', '\r', or '\r\n'; all other values are illegal. It controls the handling of line endings. It works as follows:

  • On input, if newline is None, universal newlines mode is enabled. Lines in the input can end in '\n', '\r', or '\r\n', and these are translated into '\n' before being returned to the caller. If it is '', universal newline mode is enabled, but line endings are returned to the caller untranslated. If it has any of the other legal values, input lines are only terminated by the given string, and the line ending is returned to the caller untranslated. (In other words, translation to '\n' only occurs if newline is None.)
  • On output, if newline is None, any '\n' characters written are translated to the system default line separator, os.linesep. If newline is '', no translation takes place. If newline is any of the other legal values, any '\n' characters written are translated to the given string. (Note that the rules guiding translation are different for output than for input.)

line_buffering, if True, causes write() calls to imply a flush() if the string written contains at least one '\n' or '\r' character. This is set by open() when it detects that the underlying stream is a TTY device, or when a buffering argument of 1 is passed.

Further notes on the newline parameter:

  • '\r' support is still needed for some OSX applications that produce files using '\r' line endings; Excel (when exporting to text) and Adobe Illustrator EPS files are the most common examples.
  • If translation is enabled, it happens regardless of which method is called for reading or writing. For example, f.read() will always produce the same result as ''.join(f.readlines()).
  • If universal newlines without translation are requested on input (i.e. newline=''), if a system read operation returns a buffer ending in '\r', another system read operation is done to determine whether it is followed by '\n' or not. In universal newlines mode with translation, the second system read operation may be postponed until the next read request, and if the following system read operation returns a buffer starting with '\n', that character is simply discarded.
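The input and output rules above can be checked against the io module as shipped in Python 3 (an illustrative sketch; the byte values are arbitrary):

```python
import io

raw = b'a\r\nb\rc\n'

# newline=None: universal newlines, all endings translated to '\n'.
t = io.TextIOWrapper(io.BytesIO(raw), encoding='ascii', newline=None)
assert t.read() == 'a\nb\nc\n'

# newline='': endings recognized for line splitting, left untranslated.
t = io.TextIOWrapper(io.BytesIO(raw), encoding='ascii', newline='')
assert t.read() == 'a\r\nb\rc\n'

# On output with newline='\r\n', written '\n' becomes '\r\n'.
out = io.TextIOWrapper(io.BytesIO(), encoding='ascii', newline='\r\n')
out.write('x\n')
out.flush()
assert out.buffer.getvalue() == b'x\r\n'
```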

Another implementation, StringIO, creates a file-like TextIO implementation without an underlying Buffered I/O object. While similar functionality could be provided by wrapping a BytesIO object in a TextIOWrapper, the StringIO object allows for much greater efficiency, as it does not need to actually perform encoding and decoding. A StringIO object can just store the string as-is. The StringIO object's __init__ signature takes an optional string specifying the initial value; the initial position is always 0. It does not support encodings or newline translations; you always read back exactly the characters you wrote.
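A quick sketch using StringIO as it shipped in Python 3's io module (illustrative only; the strings are arbitrary):

```python
import io

s = io.StringIO('hello\nworld\n')    # no underlying buffered object
assert s.readline() == 'hello\n'
assert s.read() == 'world\n'
assert s.seek(0) == 0                # initial position is 0; seekable
assert s.getvalue() == 'hello\nworld\n'
```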

Unicode encoding/decoding Issues

We should allow changing the encoding and error-handling setting later. The behavior of Text I/O operations in the face of Unicode problems and ambiguities (e.g. diacritics, surrogates, invalid bytes in an encoding) should be the same as that of the unicode encode()/decode() methods. UnicodeError may be raised.

Implementation note: we should be able to reuse much of the infrastructure provided by the codecs module. If it doesn't provide the exact APIs we need, we should refactor it to avoid reinventing the wheel.

Non-blocking I/O

Non-blocking I/O is fully supported on the Raw I/O level only. If a raw object is in non-blocking mode and an operation would block, then .read() and .readinto() return None, while .write() returns 0. In order to put an object in non-blocking mode, the user must extract the fileno and do it by hand.
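A sketch of this Raw I/O behavior on a POSIX pipe; os.set_blocking is a later convenience for what the PEP expects the user to do "by hand" via the fileno:

```python
import os

r, w = os.pipe()
os.set_blocking(r, False)              # put the read end in non-blocking mode

raw = os.fdopen(r, "rb", buffering=0)  # buffering=0 yields a Raw I/O object
empty = raw.read(10)                   # nothing available: returns None
                                       # (b"" would instead mean EOF)
os.write(w, b"data")
ready = raw.read(10)                   # now returns b'data'

raw.close()
os.close(w)
```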

At the Buffered I/O and Text I/O layers, if a read or write fails due to a non-blocking condition, they raise an IOError with errno set to EAGAIN.

Originally, we considered propagating up the Raw I/O behavior, but many corner cases and problems were raised. To address these issues, significant changes would need to have been made to the Buffered I/O and Text I/O layers. For example, what should .flush() do on a Buffered non-blocking object? How would the user instruct the object to "Write as much as you can from your buffer, but don't block"? A non-blocking .flush() that doesn't necessarily flush all available data is counter-intuitive. Since non-blocking and blocking objects would have such different semantics at these layers, it was agreed to abandon efforts to combine them into a single type.

The open() Built-in Function

The open() built-in function is specified by the following pseudo-code:

def open(filename, mode="r", buffering=None, *,
         encoding=None, errors=None, newline=None):
    assert isinstance(filename, (str, int))
    assert isinstance(mode, str)
    assert buffering is None or isinstance(buffering, int)
    assert encoding is None or isinstance(encoding, str)
    assert newline in (None, "", "\n", "\r", "\r\n")
    modes = set(mode)
    if modes - set("arwb+t") or len(mode) > len(modes):
        raise ValueError("invalid mode: %r" % mode)
    reading = "r" in modes
    writing = "w" in modes
    binary = "b" in modes
    appending = "a" in modes
    updating = "+" in modes
    text = "t" in modes or not binary
    if text and binary:
        raise ValueError("can't have text and binary mode at once")
    if reading + writing + appending > 1:
        raise ValueError("can't have read/write/append mode at once")
    if not (reading or writing or appending):
        raise ValueError("must have exactly one of read/write/append mode")
    if binary and encoding is not None:
        raise ValueError("binary mode doesn't take an encoding arg")
    if binary and errors is not None:
        raise ValueError("binary mode doesn't take an errors arg")
    if binary and newline is not None:
        raise ValueError("binary mode doesn't take a newline arg")
    # XXX Need to spec the signature for FileIO()
    raw = FileIO(filename, mode)
    line_buffering = (buffering == 1 or buffering is None and raw.isatty())
    if line_buffering or buffering is None:
        buffering = 8*1024  # International standard buffer size
        # XXX Try setting it to fstat().st_blksize
    if buffering < 0:
        raise ValueError("invalid buffering size")
    if buffering == 0:
        if binary:
            return raw
        raise ValueError("can't have unbuffered text I/O")
    if updating:
        buffer = BufferedRandom(raw, buffering)
    elif writing or appending:
        buffer = BufferedWriter(raw, buffering)
    else:
        assert reading
        buffer = BufferedReader(raw, buffering)
    if binary:
        return buffer
    assert text
    return TextIOWrapper(buffer, encoding, errors, newline, line_buffering)
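The checks in the pseudo-code correspond to what the built-in open() enforces; a quick sketch against os.devnull:

```python
import os

# Mode strings are validated first: 'r' and 'w' together are rejected.
try:
    open(os.devnull, "rw")
except ValueError as exc:
    mode_error = str(exc)

# Binary mode takes no encoding argument.
try:
    open(os.devnull, "rb", encoding="utf-8")
except ValueError as exc:
    encoding_error = str(exc)

# Unbuffered I/O is only allowed in binary mode and returns the raw object.
raw = open(os.devnull, "rb", buffering=0)
is_raw = type(raw).__name__ == "FileIO"
raw.close()
```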

pep-3117 Postfix type declarations

PEP:3117
Title:Postfix type declarations
Version:$Revision$
Last-Modified:$Date$
Author:Georg Brandl <georg at python.org>
Status:Rejected
Type:Standards Track
Content-Type:text/x-rst
Created:01-Apr-2007
Python-Version:3.0
Post-History:

Abstract

This PEP proposes the addition of a postfix type declaration syntax to Python. It also specifies a new typedef statement which is used to create new mappings between types and declarators.

Its acceptance will greatly enhance the Python user experience as well as eliminate one of the warts that deter users of other programming languages from switching to Python.

Rationale

Python has long suffered from the lack of explicit type declarations. Being one of the few aspects in which the language deviates from its Zen, this wart has sparked many a discussion between Python heretics and members of the PSU (for a few examples, see [EX1], [EX2] or [EX3]), and it has also made large-scale enterprise success unlikely.

However, if one wants to put an end to this misery, a decent Pythonic syntax must be found. In almost all languages that have them, type declarations lack this quality: they are verbose, often needing multiple words for a single type, or they are hard to comprehend (e.g., a certain language uses completely unrelated [1] adjectives like dim for type declaration).

Therefore, this PEP combines the move to type declarations with another bold move that will once again prove that Python is not only future-proof but future-embracing: the introduction of Unicode characters as an integral constituent of source code.

Unicode makes it possible to express much more with far fewer characters, which is in accordance with the Zen ("Readability counts.") [ZEN]. Additionally, it eliminates the need for a separate type declaration statement, and last but not least, it makes Python measure up to Perl 6, which already uses Unicode for its operators. [2]

Specification

When the type declaration mode is in operation, the grammar is changed so that each NAME must consist of two parts: a name and a type declarator, which is exactly one Unicode character.

The declarator uniquely specifies the type of the name, and if it occurs on the left hand side of an expression, this type is enforced: an InquisitionError exception is raised if the returned type doesn't match the declared type. [3]

Also, function call result types have to be specified. If the result of the call does not have the declared type, an InquisitionError is raised. Caution: the declarator for the result should not be confused with the declarator for the function object (see the example below).

Type declarators after names that are only read, not assigned to, are not strictly necessary but enforced anyway (see the Python Zen: "Explicit is better than implicit.").

The mapping between types and declarators is not static. It can be completely customized by the programmer, but for convenience there are some predefined mappings for some built-in types:

Type Declarator
object � (REPLACEMENT CHARACTER)
int ℕ (DOUBLE-STRUCK CAPITAL N)
float ℮ (ESTIMATED SYMBOL)
bool ✓ (CHECK MARK)
complex ℂ (DOUBLE-STRUCK CAPITAL C)
str ✎ (LOWER RIGHT PENCIL)
unicode ✒ (BLACK NIB)
tuple ⒯ (PARENTHESIZED LATIN SMALL LETTER T)
list ♨ (HOT SPRINGS)
dict ⧟ (DOUBLE-ENDED MULTIMAP)
set ∅ (EMPTY SET) (Note: this is also for full sets)
frozenset ☃ (SNOWMAN)
datetime ⌚ (WATCH)
function ƛ (LATIN SMALL LETTER LAMBDA WITH STROKE)
generator ⚛ (ATOM SYMBOL)
Exception ⌁ (ELECTRIC ARROW)

The declarator for the None type is a zero-width space.

These characters should be obvious and easy to remember and type for every programmer.

Unicode replacement units

Since even in our modern, globalized world there are still some old-fashioned rebels who can't or don't want to use Unicode in their source code, and since Python is a forgiving language, a fallback is provided for those:

Instead of the single Unicode character, they can type name${UNICODE NAME OF THE DECLARATOR}$. For example, these two function definitions are equivalent:

def fooƛ(xℂ):
    return None

and

def foo${LATIN SMALL LETTER LAMBDA WITH STROKE}$(x${DOUBLE-STRUCK CAPITAL C}$):
    return None${ZERO WIDTH NO-BREAK SPACE}$

This is still easy to read and makes the full power of type-annotated Python available to ASCII believers.

The typedef statement

The mapping between types and declarators can be extended with this new statement.

The syntax is as follows:

typedef_stmt  ::=  "typedef" expr DECLARATOR

where expr resolves to a type object. For convenience, the typedef statement can also be mixed with the class statement for new classes, like so:

typedef class Foo☺(object�):
    pass

Example

This is the standard os.path.normpath function, converted to type declaration syntax:

def normpathƛ(path✎)✎:
    """Normalize path, eliminating double slashes, etc."""
    if path✎ == '':
        return '.'
    initial_slashes✓ = path✎.startswithƛ('/')✓
    # POSIX allows one or two initial slashes, but treats three or more
    # as single slash.
    if (initial_slashes✓ and
        path✎.startswithƛ('//')✓ and not path✎.startswithƛ('///')✓)✓:
        initial_slashesℕ = 2
    comps♨ = path✎.splitƛ('/')♨
    new_comps♨ = []♨
    for comp✎ in comps♨:
        if comp✎ in ('', '.')⒯:
            continue
        if (comp✎ != '..' or (not initial_slashesℕ and not new_comps♨)✓ or
             (new_comps♨ and new_comps♨[-1]✎ == '..')✓)✓:
            new_comps♨.appendƛ(comp✎)
        elif new_comps♨:
            new_comps♨.popƛ()✎
    comps♨ = new_comps♨
    path✎ = '/'.join(comps♨)✎
    if initial_slashesℕ:
        path✎ = '/'*initial_slashesℕ + path✎
    return path✎ or '.'

As you can clearly see, the type declarations add expressiveness, while at the same time they make the code look much more professional.

Compatibility issues

To enable type declaration mode, one has to write:

from __future__ import type_declarations

which enables Unicode parsing of the source [4], makes typedef a keyword and enforces correct types for all assignments and function calls.

Rejection

After careful consideration, much soul-searching, gnashing of teeth and rending of garments, it has been decided to reject this PEP.

References

[EX1]http://mail.python.org/pipermail/python-list/2003-June/210588.html
[EX2]http://mail.python.org/pipermail/python-list/2000-May/034685.html
[EX3]http://groups.google.com/group/comp.lang.python/browse_frm/thread/6ae8c6add913635a/de40d4ffe9bd4304?lnk=gst&q=type+declarations&rnum=6
[1]Though, if you know the language in question, it may not be that unrelated.
[ZEN]http://www.python.org/dev/peps/pep-0020/
[2]Well, it would, if there was a Perl 6.
[3]Since the name TypeError is already in use, this name has been chosen for obvious reasons.
[4]The encoding in which the code is written is read from a standard coding cookie. There will also be an autodetection mechanism, invoked by from __future__ import encoding_hell.

Acknowledgements

Many thanks go to Armin Ronacher, Alexander Schremmer and Marek Kubica who helped find the most suitable and mnemonic declarator for built-in types.

Thanks also to the Unicode Consortium for including all those useful characters in the Unicode standard.

pep-3118 Revising the buffer protocol

PEP:3118
Title:Revising the buffer protocol
Version:$Revision$
Last-Modified:$Date$
Author:Travis Oliphant <oliphant at ee.byu.edu>, Carl Banks <pythondev at aerojockey.com>
Status:Final
Type:Standards Track
Content-Type:text/x-rst
Created:28-Aug-2006
Python-Version:3000
Post-History:

Abstract

This PEP proposes re-designing the buffer interface (PyBufferProcs function pointers) to improve the way Python allows memory sharing in Python 3.0.

In particular, it is proposed that the character buffer portion of the API be eliminated and the multiple-segment portion be re-designed in conjunction with allowing for strided memory to be shared. In addition, the new buffer interface will allow sharing of the multi-dimensional structure of the memory and of the data format the memory contains.

This interface will allow any extension module to either create objects that share memory or create algorithms that use and manipulate raw memory from arbitrary objects that export the interface.

Rationale

The Python 2.X buffer protocol allows different Python types to exchange a pointer to a sequence of internal buffers. This functionality is extremely useful for sharing large segments of memory between different high-level objects, but it is too limited and has issues:

  1. There is the little-used "sequence-of-segments" option (bf_getsegcount), which is not well motivated.

  2. There is the apparently redundant character-buffer option (bf_getcharbuffer).

  3. There is no way for a consumer to tell the buffer-API-exporting object it is "finished" with its view of the memory and therefore no way for the exporting object to be sure that it is safe to reallocate the pointer to the memory that it owns (for example, the array object reallocating its memory after sharing it with the buffer object which held the original pointer led to the infamous buffer-object problem).

  4. Memory is just a pointer with a length. There is no way to describe what is "in" the memory (float, int, C-structure, etc.)

  5. There is no shape information provided for the memory. But, several array-like Python types could make use of a standard way to describe the shape-interpretation of the memory (wxPython, GTK, pyQT, CVXOPT, PyVox, Audio and Video Libraries, ctypes, NumPy, data-base interfaces, etc.)

  6. There is no way to share discontiguous memory (except through the sequence of segments notion).

    There are two widely used libraries that use the concept of discontiguous memory: PIL and NumPy. Their views of discontiguous arrays differ, though. The proposed buffer interface allows sharing of either memory model. Exporters will typically use only one approach, and consumers may support either or both.

    NumPy uses the notion of constant striding in each dimension as its basic concept of an array. With this concept, a simple sub-region of a larger array can be described without copying the data. Thus, stride information is the additional information that must be shared.

    The PIL uses a more opaque memory representation. Sometimes an image is contained in a contiguous segment of memory, but sometimes it is contained in an array of pointers to the contiguous segments (usually lines) of the image. The PIL is where the idea of multiple buffer segments in the original buffer interface came from.

    NumPy's strided memory model is used more often in computational libraries and because it is so simple it makes sense to support memory sharing using this model. The PIL memory model is sometimes used in C-code where a 2-d array can then be accessed using double pointer indirection: e.g. image[i][j].

    The buffer interface should allow the object to export either of these memory models. Consumers are free to either require contiguous memory or write code to handle one or both of these memory models.

Proposal Overview

  • Eliminate the char-buffer and multiple-segment sections of the buffer-protocol.
  • Unify the read/write versions of getting the buffer.
  • Add a new function to the interface that should be called when the consumer object is "done" with the memory area.
  • Add a new variable to allow the interface to describe what is in memory (unifying what is currently done in the struct and array modules)
  • Add a new variable to allow the protocol to share shape information
  • Add a new variable for sharing stride information
  • Add a new mechanism for sharing arrays that must be accessed using pointer indirection.
  • Fix all objects in the core and the standard library to conform to the new interface
  • Extend the struct module to handle more format specifiers
  • Extend the buffer object into a new memory object which places a Python veneer around the buffer interface.
  • Add a few functions to make it easy to copy contiguous data in and out of objects supporting the buffer interface.

Specification

While the new specification allows for complicated memory sharing, simple contiguous buffers of bytes can still be obtained from an object. In fact, the new protocol allows a standard mechanism for doing this even if the original object is not represented as a contiguous chunk of memory.

The easiest way to obtain a simple contiguous chunk of memory is through the provided C-API.

Change the PyBufferProcs structure to

typedef struct {
     getbufferproc bf_getbuffer;
     releasebufferproc bf_releasebuffer;
} PyBufferProcs;

Both of these routines are optional for a type object

typedef int (*getbufferproc)(PyObject *obj, PyBuffer *view, int flags)

This function returns 0 on success and -1 on failure (and raises an error). The first variable is the "exporting" object. The second argument is the address to a bufferinfo structure. Both arguments must never be NULL.

The third argument indicates what kind of buffer the consumer is prepared to deal with and therefore what kind of buffer the exporter is allowed to return. The new buffer interface allows for much more complicated memory sharing possibilities. Some consumers may not be able to handle all the complexity but may want to see if the exporter will let them take a simpler view of its memory.

In addition, some exporters may not be able to share memory in every possible way and may need to raise errors to signal to some consumers that something is just not possible. These errors should be PyErr_BufferError unless there is another error that is actually causing the problem. The exporter can use flags information to simplify how much of the PyBuffer structure is filled in with non-default values and/or raise an error if the object can't support a simpler view of its memory.

The exporter should always fill in all elements of the buffer structure (with defaults or NULLs if nothing else is requested). The PyBuffer_FillInfo function can be used for simple cases.

Access flags

Some flags are useful for requesting a specific kind of memory segment, while others indicate to the exporter what kind of information the consumer can deal with. If certain information is not asked for by the consumer, but the exporter cannot share its memory without that information, then a PyErr_BufferError should be raised.

PyBUF_SIMPLE

This is the default flag state (0). The returned buffer may or may not have writable memory. The format will be assumed to be unsigned bytes. This is a "stand-alone" flag constant. It never needs to be |'d to the others. The exporter will raise an error if it cannot provide such a contiguous buffer of bytes.

PyBUF_WRITABLE

The returned buffer must be writable. If it is not writable, then raise an error.

PyBUF_FORMAT

The returned buffer must have true format information if this flag is provided. This would be used when the consumer is going to be checking for what 'kind' of data is actually stored. An exporter should always be able to provide this information if requested. If format is not explicitly requested then the format must be returned as NULL (which means "B", or unsigned bytes)

PyBUF_ND

The returned buffer must provide shape information. The memory will be assumed C-style contiguous (last dimension varies the fastest). The exporter may raise an error if it cannot provide this kind of contiguous buffer. If this is not given then shape will be NULL.

PyBUF_STRIDES (implies PyBUF_ND)

The returned buffer must provide strides information (i.e. the strides cannot be NULL). This would be used when the consumer can handle strided, discontiguous arrays. Handling strides automatically assumes you can handle shape. The exporter may raise an error if it cannot provide a strided-only representation of the data (i.e. without the suboffsets).

PyBUF_C_CONTIGUOUS
PyBUF_F_CONTIGUOUS
PyBUF_ANY_CONTIGUOUS

These flags indicate that the returned buffer must be, respectively, C-contiguous (last dimension varies the fastest), Fortran-contiguous (first dimension varies the fastest), or either one. All of these flags imply PyBUF_STRIDES and guarantee that the strides buffer info structure will be filled in correctly.
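These contiguity notions are also visible from Python through memoryview attributes (a later convenience layered on this C interface, shown here only as an illustration):

```python
# A 3x4 C-contiguous view of 12 bytes: the last dimension varies fastest.
m = memoryview(bytes(12)).cast("B", (3, 4))
flags = (m.c_contiguous, m.f_contiguous, m.contiguous)  # (True, False, True)

# A flat 1-D buffer is trivially both C- and Fortran-contiguous.
flat = memoryview(b"abcd")
both = flat.c_contiguous and flat.f_contiguous          # True
```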

PyBUF_INDIRECT (implies PyBUF_STRIDES)

The returned buffer must have suboffsets information (which can be NULL if no suboffsets are needed). This would be used when the consumer can handle indirect array referencing implied by these suboffsets.

Specialized combinations of flags are provided for specific kinds of memory sharing.

Multi-dimensional (but contiguous)

PyBUF_CONTIG (PyBUF_ND | PyBUF_WRITABLE)
PyBUF_CONTIG_RO (PyBUF_ND)

Multi-dimensional using strides but aligned

PyBUF_STRIDED (PyBUF_STRIDES | PyBUF_WRITABLE)
PyBUF_STRIDED_RO (PyBUF_STRIDES)

Multi-dimensional using strides and not necessarily aligned

PyBUF_RECORDS (PyBUF_STRIDES | PyBUF_WRITABLE | PyBUF_FORMAT)
PyBUF_RECORDS_RO (PyBUF_STRIDES | PyBUF_FORMAT)

Multi-dimensional using sub-offsets

PyBUF_FULL (PyBUF_INDIRECT | PyBUF_WRITABLE | PyBUF_FORMAT)
PyBUF_FULL_RO (PyBUF_INDIRECT | PyBUF_FORMAT)

Thus, the consumer simply wanting a contiguous chunk of bytes from the object would use PyBUF_SIMPLE, while a consumer that understands how to make use of the most complicated cases could use PyBUF_FULL.

The format information is only guaranteed to be non-NULL if PyBUF_FORMAT is in the flag argument, otherwise it is expected the consumer will assume unsigned bytes.

There is a C-API that simple exporting objects can use to fill-in the buffer info structure correctly according to the provided flags if a contiguous chunk of "unsigned bytes" is all that can be exported.

The Py_buffer struct

The bufferinfo structure is:

struct bufferinfo {
     void *buf;
     Py_ssize_t len;
     int readonly;
     const char *format;
     int ndim;
     Py_ssize_t *shape;
     Py_ssize_t *strides;
     Py_ssize_t *suboffsets;
     Py_ssize_t itemsize;
     void *internal;
} Py_buffer;

Before calling the bf_getbuffer function, the bufferinfo structure can be filled with whatever, but the buf field must be NULL when requesting a new buffer. Upon return from bf_getbuffer, the bufferinfo structure is filled in with relevant information about the buffer. This same bufferinfo structure must be passed to bf_releasebuffer (if available) when the consumer is done with the memory. The caller is responsible for keeping a reference to obj until releasebuffer is called (i.e. the call to bf_getbuffer does not alter the reference count of obj).

The members of the bufferinfo structure are:

buf
a pointer to the start of the memory for the object
len
the total bytes of memory the object uses. This should be the same as the product of the shape array multiplied by the number of bytes per item of memory.
readonly
an integer variable to hold whether or not the memory is readonly. 1 means the memory is readonly, zero means the memory is writable.
format
a NULL-terminated format-string (following the struct-style syntax including extensions) indicating what is in each element of memory. The number of elements is len / itemsize, where itemsize is the number of bytes implied by the format. This can be NULL which implies standard unsigned bytes ("B").
ndim
a variable storing the number of dimensions the memory represents. Must be >=0. A value of 0 means that shape and strides and suboffsets must be NULL (i.e. the memory represents a scalar).
shape
an array of Py_ssize_t of length ndim indicating the shape of the memory as an N-D array. Note that ((*shape)[0] * ... * (*shape)[ndim-1])*itemsize = len. If ndim is 0 (indicating a scalar), then this must be NULL.
strides
address of a Py_ssize_t* variable that will be filled with a pointer to an array of Py_ssize_t of length ndim (or NULL if ndim is 0), indicating the number of bytes to skip to get to the next element in each dimension. If this is not requested by the caller (PyBUF_STRIDES is not set), then this should be set to NULL, which indicates a C-style contiguous array, or a PyExc_BufferError raised if this is not possible.
suboffsets

address of a Py_ssize_t * variable that will be filled with a pointer to an array of Py_ssize_t of length ndim. If these suboffset numbers are >= 0, then the value stored along the indicated dimension is a pointer and the suboffset value dictates how many bytes to add to the pointer after de-referencing. A suboffset value that is negative indicates that no de-referencing should occur (striding in a contiguous memory block). If all suboffsets are negative (i.e. no de-referencing is needed), then this must be NULL (the default value). If this is not requested by the caller (PyBUF_INDIRECT is not set), then this should be set to NULL or a PyExc_BufferError raised if this is not possible.

For clarity, here is a function that returns a pointer to the element in an N-D array pointed to by an N-dimensional index when there are both non-NULL strides and suboffsets:

void *get_item_pointer(int ndim, void *buf, Py_ssize_t *strides,
                       Py_ssize_t *suboffsets, Py_ssize_t *indices) {
    char *pointer = (char*)buf;
    int i;
    for (i = 0; i < ndim; i++) {
        pointer += strides[i] * indices[i];
        if (suboffsets[i] >= 0) {
            pointer = *((char**)pointer) + suboffsets[i];
        }
    }
    return (void*)pointer;
}

Notice the suboffset is added "after" the dereferencing occurs. Thus slicing in the ith dimension would add to the suboffsets in the (i-1)st dimension. Slicing in the first dimension would change the location of the starting pointer directly (i.e. buf would be modified).

itemsize
This is a storage for the itemsize (in bytes) of each element of the shared memory. It is technically un-necessary as it can be obtained using PyBuffer_SizeFromFormat, however an exporter may know this information without parsing the format string and it is necessary to know the itemsize for proper interpretation of striding. Therefore, storing it is more convenient and faster.
internal
This is for use internally by the exporting object. For example, this might be re-cast as an integer by the exporter and used to store flags about whether or not the shape, strides, and suboffsets arrays must be freed when the buffer is released. The consumer should never alter this value.

The exporter is responsible for making sure that any memory pointed to by buf, format, shape, strides, and suboffsets is valid until releasebuffer is called. If the exporter wants to be able to change an object's shape, strides, and/or suboffsets before releasebuffer is called then it should allocate those arrays when getbuffer is called (pointing to them in the buffer-info structure provided) and free them when releasebuffer is called.

Releasing the buffer

The same bufferinfo struct should be used in the release-buffer interface call. The caller is responsible for the memory of the Py_buffer structure itself.

typedef void (*releasebufferproc)(PyObject *obj, Py_buffer *view)

Callers of getbufferproc must make sure that this function is called when memory previously acquired from the object is no longer needed. The exporter of the interface must make sure that any memory pointed to in the bufferinfo structure remains valid until releasebuffer is called.

If the bf_releasebuffer function is not provided (i.e. it is NULL), then it does not ever need to be called.

Exporters will need to define a bf_releasebuffer function if they can re-allocate their memory, strides, shape, suboffsets, or format variables which they might share through the struct bufferinfo. Several mechanisms could be used to keep track of how many getbuffer calls have been made and shared. Either a single variable could be used to keep track of how many "views" have been exported, or a linked-list of bufferinfo structures filled in could be maintained in each object.

All that is specifically required by the exporter, however, is to ensure that any memory shared through the bufferinfo structure remains valid until releasebuffer is called on the bufferinfo structure exporting that memory.

New C-API calls are proposed

int PyObject_CheckBuffer(PyObject *obj)

Return 1 if the getbuffer function is available, otherwise 0.

int PyObject_GetBuffer(PyObject *obj, Py_buffer *view,
                       int flags)

This is a C-API version of the getbuffer function call. It checks to make sure the object has the required function pointer and issues the call. Returns -1 and raises an error on failure and returns 0 on success.

void PyBuffer_Release(PyObject *obj, Py_buffer *view)

This is a C-API version of the releasebuffer function call. It checks to make sure the object has the required function pointer and issues the call. This function always succeeds even if there is no releasebuffer function for the object.

PyObject *PyObject_GetMemoryView(PyObject *obj)

Return a memory-view object from an object that defines the buffer interface.

A memory-view object is an extended buffer object that could replace the buffer object (but doesn't have to as that could be kept as a simple 1-d memory-view object). Its C-structure is

typedef struct {
    PyObject_HEAD
    PyObject *base;
    Py_buffer view;
} PyMemoryViewObject;

This is functionally similar to the current buffer object except a reference to base is kept and the memory view is not re-grabbed. Thus, this memory view object holds on to the memory of base until it is deleted.

This memory-view object will support multi-dimensional slicing and be the first object provided with Python to do so. Slices of the memory-view object are other memory-view objects with the same base but with a different view of the base object.

When an "element" from the memory-view is returned it is always a bytes object whose format should be interpreted by the format attribute of the memoryview object. The struct module can be used to "decode" the bytes in Python if desired. Or the contents can be passed to a NumPy array or other object consuming the buffer protocol.
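A sketch of that decoding step using the struct module against a typed buffer (array.array is used here only as a convenient exporter):

```python
import struct
from array import array

doubles = array("d", [1.0, 2.5, -3.0])
view = memoryview(doubles)

fmt = view.format        # 'd', following the struct-style syntax
# Unpack elements at byte offsets that are multiples of itemsize.
first = struct.unpack_from(fmt, view, 0)[0]
second = struct.unpack_from(fmt, view, view.itemsize)[0]
```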

The Python name will be

__builtin__.memoryview

Methods:

__getitem__ (will support multi-dimensional slicing)
__setitem__ (will support multi-dimensional slicing)
tobytes (obtain a new bytes-object of a copy of the memory).
tolist (obtain a "nested" list of the memory. Everything is interpreted into standard Python objects as the struct module unpack would do -- in fact it uses struct.unpack to accomplish it).

Attributes (taken from the memory of the base object):

  • format
  • itemsize
  • shape
  • strides
  • suboffsets
  • readonly
  • ndim
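These attributes mirror the Py_buffer fields directly; a sketch inspecting a small typed buffer:

```python
from array import array

buf = array("i", range(6))
m = memoryview(buf)

info = {
    "format": m.format,      # 'i': struct-style code for a C int
    "itemsize": m.itemsize,  # size of one element in bytes
    "ndim": m.ndim,          # 1: a flat view
    "shape": m.shape,        # (6,)
    "strides": m.strides,    # (itemsize,) for a contiguous 1-D view
    "readonly": m.readonly,  # False: array exports writable memory
}
```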

Py_ssize_t PyBuffer_SizeFromFormat(const char *)

Return the implied itemsize of the data-format area from a struct-style description.

PyObject * PyMemoryView_GetContiguous(PyObject *obj,  int buffertype,
                                      char fortran)

Return a memoryview object to a contiguous chunk of memory represented by obj. If a copy must be made (because the memory pointed to by obj is not contiguous), then a new bytes object will be created and become the base object for the returned memory view object.

The buffertype argument can be PyBUF_READ, PyBUF_WRITE, or PyBUF_UPDATEIFCOPY to determine whether the returned buffer should be readable, writable, or set to update the original buffer if a copy must be made. If buffertype is PyBUF_WRITE and the buffer is not contiguous an error will be raised. In this circumstance, the user can use PyBUF_UPDATEIFCOPY to ensure that a writable temporary contiguous buffer is returned. The contents of this contiguous buffer will be copied back into the original object after the memoryview object is deleted as long as the original object is writable. If this is not allowed by the original object, then a BufferError is raised.

If the object is multi-dimensional, then if fortran is 'F', the first dimension of the underlying array will vary the fastest in the buffer. If fortran is 'C', then the last dimension will vary the fastest (C-style contiguous). If fortran is 'A', then it does not matter and you will get whatever the object decides is more efficient. If a copy is made, then the memory must be freed by calling PyMem_Free.

You receive a new reference to the memoryview object.

int PyObject_CopyToObject(PyObject *obj, void *buf, Py_ssize_t len,
                          char fortran)

Copy len bytes of data from the contiguous chunk of memory pointed to by buf into the buffer exported by obj. Return 0 on success; return -1 and raise an error on failure. If the object does not have a writable buffer, an error is raised. If fortran is 'F' and the object is multi-dimensional, the data will be copied into the array in Fortran-style order (first dimension varies the fastest). If fortran is 'C', the data will be copied in C-style order (last dimension varies the fastest). If fortran is 'A', the copy will be made in whichever way is more efficient.

int PyObject_CopyData(PyObject *dest, PyObject *src)

These last three C-API calls provide a standard way of moving data between Python objects and contiguous memory areas, regardless of how the data is actually stored. These calls use the extended buffer interface to perform their work.

int PyBuffer_IsContiguous(Py_buffer *view, char fortran)

Return 1 if the memory defined by the view object is C-style (fortran = 'C') or Fortran-style (fortran = 'F') contiguous or either one (fortran = 'A'). Return 0 otherwise.
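The same three-way contiguity query later became visible at the Python level: the memoryview type (from Python 3.3 on) exposes it as the attributes c_contiguous, f_contiguous, and contiguous (the 'A' case). A small demonstration:

```python
# memoryview exposes the C-level contiguity checks as attributes.
mv = memoryview(b"abcdef")   # 1-D, so it is both C- and F-contiguous
assert mv.c_contiguous       # fortran = 'C' case
assert mv.f_contiguous       # fortran = 'F' case
assert mv.contiguous         # fortran = 'A' case: either one
```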

void PyBuffer_FillContiguousStrides(int ndim, Py_ssize_t *shape,
                                    Py_ssize_t *strides, Py_ssize_t itemsize,
                                    char fortran)

Fill the strides array with the byte-strides of a contiguous (C-style if fortran is 'C', Fortran-style if fortran is 'F') array of the given shape with the given number of bytes per element.
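The stride rule itself is simple; here is a Python sketch of the same computation (a hypothetical helper mirroring the C signature, not part of the proposal):

```python
def fill_contiguous_strides(shape, itemsize, fortran):
    """Compute byte strides for a contiguous array of the given shape.

    'C': last dimension varies fastest; 'F': first dimension varies
    fastest.  Mirrors what PyBuffer_FillContiguousStrides does in C.
    """
    ndim = len(shape)
    strides = [0] * ndim
    stride = itemsize
    # Walk dimensions from fastest-varying to slowest-varying.
    order = reversed(range(ndim)) if fortran == 'C' else range(ndim)
    for i in order:
        strides[i] = stride
        stride *= shape[i]
    return strides

# A (16, 4) array of 8-byte doubles:
assert fill_contiguous_strides((16, 4), 8, 'C') == [32, 8]
assert fill_contiguous_strides((16, 4), 8, 'F') == [8, 128]
```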

int PyBuffer_FillInfo(Py_buffer *view, void *buf,
                      Py_ssize_t len, int readonly, int infoflags)

Fills in a buffer-info structure correctly for an exporter that can only share a contiguous chunk of memory of "unsigned bytes" of the given length. Returns 0 on success and -1 on error (with an exception raised).

PyExc_BufferError

A new exception for buffer errors which arise because an exporter cannot provide the kind of buffer that a consumer expects. It is also raised when a consumer requests a buffer from an object that does not provide the protocol.

Additions to the struct string-syntax

The struct string-syntax is missing some characters to fully implement data-format descriptions already available elsewhere (in ctypes and NumPy for example). The Python 2.5 specification is at http://docs.python.org/library/struct.html.

Here are the proposed additions:

Character Description
't' bit (number before states how many bits)
'?' platform _Bool type
'g' long double
'c' ucs-1 (latin-1) encoding
'u' ucs-2
'w' ucs-4
'O' pointer to Python Object
'Z' complex (whatever the next specifier is)
'&' specific pointer (prefix before another character)
'T{}' structure (detailed layout inside {})
'(k1,k2,...,kn)' multi-dimensional array of whatever follows
':name:' optional name of the preceding element
'X{}' pointer to a function (optional function signature inside {} with any return value preceded by -> and placed at the end)

The struct module will be changed to understand these as well and return appropriate Python objects on unpacking. Unpacking a long-double will return a decimal object or a ctypes long-double. Unpacking 'u' or 'w' will return Python unicode. Unpacking a multi-dimensional array will return a list (of lists if >1d). Unpacking a pointer will return a ctypes pointer object. Unpacking a function pointer will return a ctypes call-object (perhaps). Unpacking a bit will return a Python bool. White-space in the struct-string syntax will be ignored if it isn't already. Unpacking a named-object will return some kind of named-tuple-like object that acts like a tuple but whose entries can also be accessed by name. Unpacking a nested structure will return a nested tuple.

Endian-specification ('!', '@', '=', '>', '<', '^') is also allowed inside the string so that it can change if needed. The previously-specified endian string remains in force until changed. The default endian is '@', which means native data-types and alignment. If unaligned native data-types are requested, the endian specification is '^'.

As in the struct module today, a number can precede a character code to specify how many of that type there are. The (k1,k2,...,kn) extension additionally allows specifying that the data should be viewed as a (C-style contiguous, last-dimension varies the fastest) multi-dimensional array of a particular format.
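The existing count-prefix behaviour that the (k1,...,kn) syntax generalizes can be seen with today's struct module:

```python
import struct

# A repeat count before a code packs that many items of the type:
# "4h" is four 2-byte shorts, i.e. one flat, contiguous block.
packed = struct.pack("=4h", 1, 2, 3, 4)
assert len(packed) == 8
assert struct.unpack("=4h", packed) == (1, 2, 3, 4)
# The proposed (k1,...,kn) extension would layer shape information on
# top of this flat layout; plain struct has no multi-dimensional view.
```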

Functions should be added to ctypes to create a ctypes object from a struct description; long-double and ucs-2 support should also be added to ctypes.

Examples of Data-Format Descriptions

Here are some examples of C-structures and how they would be represented using the struct-style syntax.

<named> is the constructor for a named-tuple (not-specified yet).

float
'd' <--> Python float
complex double
'Zd' <--> Python complex
RGB Pixel data
'BBB' <--> (int, int, int) 'B:r: B:g: B:b:' <--> <named>((int, int, int), ('r','g','b'))
Mixed endian (weird but possible)
'>i:big: <i:little:' <--> <named>((int, int), ('big', 'little'))
Nested structure
struct {
     int ival;
     struct {
         unsigned short sval;
         unsigned char bval;
         unsigned char cval;
     } sub;
}
"""i:ival:
   T{
      H:sval:
      B:bval:
      B:cval:
    }:sub:
"""
Nested array
struct {
     int ival;
     double data[16*4];
}
"""i:ival:
   (16,4)d:data:
"""

Note that in the last example, the C-structure compared against is intentionally a 1-d array and not a 2-d array data[16][4]. The reason for this is to avoid confusion between static multi-dimensional arrays in C (which are laid out contiguously) and dynamic multi-dimensional arrays, which use the same syntax to access elements, data[0][1], but whose memory is not necessarily contiguous. The struct-syntax always uses contiguous memory, and the multi-dimensional character is information about the memory to be communicated by the exporter.

In other words, the struct-syntax description does not have to match the C-syntax exactly as long as it describes the same memory layout. The fact that a C-compiler would think of the memory as a 1-d array of doubles is irrelevant to the fact that the exporter wanted to communicate to the consumer that this field of the memory should be thought of as a 2-d array where a new dimension is considered after every 4 elements.

Code to be affected

All objects and modules in Python that export or consume the old buffer interface will be modified. Here is a partial list.

  • buffer object
  • bytes object
  • string object
  • unicode object
  • array module
  • struct module
  • mmap module
  • ctypes module

Anything else using the buffer API.

Issues and Details

It is intended that this PEP will be back-ported to Python 2.6 by adding the C-API and the two functions to the existing buffer protocol.

Previous versions of this PEP proposed a read/write locking scheme, but it was later perceived as a) too complicated for common simple use cases that do not require any locking and b) too simple for use cases that required concurrent read/write access to a buffer with changing, short-lived locks. It is therefore left to users to implement their own specific locking scheme around buffer objects if they require consistent views across concurrent read/write access. A future PEP may be proposed which includes a separate locking API after some experience with these user-schemes is obtained.

The sharing of strided memory and suboffsets is new and can be seen as a modification of the multiple-segment interface. It is motivated by NumPy and the PIL. NumPy objects should be able to share their strided memory with code that understands how to manage strided memory because strided memory is very common when interfacing with compute libraries.

Also, with this approach it should be possible to write generic code that works with both kinds of memory without copying.

Memory management of the format string, the shape array, the strides array, and the suboffsets array in the bufferinfo structure is always the responsibility of the exporting object. The consumer should not set these pointers to any other memory or try to free them.

Several ideas were discussed and rejected:

Having a "releaser" object whose release-buffer was called. This was deemed unacceptable because it caused the protocol to be asymmetric (you called release on something different than you "got" the buffer from). It also complicated the protocol without providing a real benefit.

Passing all the struct variables separately into the function. This had the advantage that it allowed one to set NULL for variables that were not of interest, but it also made the function call more cumbersome. The flags variable preserves the same ability of consumers to be "simple" in how they call the protocol.

Code

The authors of the PEP promise to contribute and maintain the code for this proposal but will welcome any help.

Examples

Ex. 1

This example shows how an image object that uses contiguous lines might expose its buffer:

struct rgba {
    unsigned char r, g, b, a;
};

struct ImageObject {
    PyObject_HEAD;
    ...
    struct rgba** lines;
    Py_ssize_t height;
    Py_ssize_t width;
    Py_ssize_t shape_array[2];
    Py_ssize_t stride_array[2];
    Py_ssize_t view_count;
};

"lines" points to a malloc'ed 1-D array of (struct rgba*). Each pointer in THAT block points to a separately malloc'ed array of (struct rgba).

In order to access, say, the red value of the pixel at x=30, y=50, you'd use "lines[50][30].r".

So what does ImageObject's getbuffer do? Leaving error checking out:

int Image_getbuffer(PyObject *self, Py_buffer *view, int flags) {

    static Py_ssize_t suboffsets[2] = { 0, -1};

    view->buf = self->lines;
    view->len = self->height * self->width * sizeof(struct rgba);
    view->readonly = 0;
    view->ndim = 2;
    self->shape_array[0] = self->height;
    self->shape_array[1] = self->width;
    view->shape = self->shape_array;
    self->stride_array[0] = sizeof(struct rgba*);
    self->stride_array[1] = sizeof(struct rgba);
    view->strides = self->stride_array;
    view->suboffsets = suboffsets;

    self->view_count ++;

    return 0;
}


int Image_releasebuffer(PyObject *self, Py_buffer *view) {
    self->view_count--;
    return 0;
}

Ex. 2

This example shows how an object that wants to expose a contiguous chunk of memory (which will never be re-allocated while the object is alive) would do that.

int myobject_getbuffer(PyObject *self, Py_buffer *view, int flags) {

    void *buf;
    Py_ssize_t len;
    int readonly=0;

    buf = /* Point to buffer */
    len = /* Set to size of buffer */
    readonly = /* Set to 1 if readonly */

    return PyBuffer_FillInfo(view, buf, len, readonly, flags);
}

/* No releasebuffer is necessary because the memory will never
   be re-allocated
*/

Ex. 3

A consumer that wants only a simple contiguous chunk of bytes from a Python object obj would do the following:

Py_buffer view;
int ret;

if (PyObject_GetBuffer(obj, &view, PyBUF_SIMPLE) < 0) {
     /* error return */
}

/* Now, view.buf is the pointer to memory
        view.len is the length
        view.readonly is whether or not the memory is read-only.
 */


/* After using the information and you don't need it anymore */

if (PyBuffer_Release(obj, &view) < 0) {
        /* error return */
}

Ex. 4

A consumer that wants to be able to use any object's memory but is writing an algorithm that only handles contiguous memory could do the following:

void *buf;
Py_ssize_t len;
char *format;
int copy;

copy = PyObject_GetContiguous(obj, &buf, &len, &format, 0, 'A');
if (copy < 0) {
   /* error return */
}

/* process memory pointed to by buffer if format is correct */

/* Optional:

   if, after processing, we want to copy data from buffer back
   into the object

   we could do
   */

if (PyObject_CopyToObject(obj, buf, len, 'A') < 0) {
       /* error return */
}

/* Make sure that if a copy was made, the memory is freed */
if (copy == 1) PyMem_Free(buf);

pep-3119 Introducing Abstract Base Classes

PEP:3119
Title:Introducing Abstract Base Classes
Version:$Revision$
Last-Modified:$Date$
Author:Guido van Rossum <guido at python.org>, Talin <talin at acm.org>
Status:Final
Type:Standards Track
Content-Type:text/x-rst
Created:18-Apr-2007
Post-History:26-Apr-2007, 11-May-2007

Abstract

This is a proposal to add Abstract Base Class (ABC) support to Python 3000. It proposes:

  • A way to overload isinstance() and issubclass().
  • A new module abc which serves as an "ABC support framework". It defines a metaclass for use with ABCs and a decorator that can be used to define abstract methods.
  • Specific ABCs for containers and iterators, to be added to the collections module.

Much of the thinking that went into the proposal is not about the specific mechanism of ABCs, as contrasted with Interfaces or Generic Functions (GFs), but about clarifying philosophical issues like "what makes a set", "what makes a mapping" and "what makes a sequence".

There's also a companion PEP 3141, which defines ABCs for numeric types.

Acknowledgements

Talin wrote the Rationale below [1] as well as most of the section on ABCs vs. Interfaces. For that alone he deserves co-authorship. The rest of the PEP uses "I" to refer to the first author.

Rationale

In the domain of object-oriented programming, the usage patterns for interacting with an object can be divided into two basic categories, which are 'invocation' and 'inspection'.

Invocation means interacting with an object by invoking its methods. Usually this is combined with polymorphism, so that invoking a given method may run different code depending on the type of an object.

Inspection means the ability for external code (outside of the object's methods) to examine the type or properties of that object, and make decisions on how to treat that object based on that information.

Both usage patterns serve the same general end, which is to be able to support the processing of diverse and potentially novel objects in a uniform way, but at the same time allowing processing decisions to be customized for each different type of object.

In classical OOP theory, invocation is the preferred usage pattern, and inspection is actively discouraged, being considered a relic of an earlier, procedural programming style. However, in practice this view is simply too dogmatic and inflexible, and leads to a kind of design rigidity that is very much at odds with the dynamic nature of a language like Python.

In particular, there is often a need to process objects in a way that wasn't anticipated by the creator of the object class. It is not always the best solution to build in to every object methods that satisfy the needs of every possible user of that object. Moreover, there are many powerful dispatch philosophies that are in direct contrast to the classic OOP requirement of behavior being strictly encapsulated within an object, examples being rule or pattern-match driven logic.

On the other hand, one of the criticisms of inspection by classic OOP theorists is the lack of formalisms and the ad hoc nature of what is being inspected. In a language such as Python, in which almost any aspect of an object can be reflected and directly accessed by external code, there are many different ways to test whether an object conforms to a particular protocol or not. For example, if asking 'is this object a mutable sequence container?', one can look for a base class of 'list', or one can look for a method named '__getitem__'. But note that although these tests may seem obvious, neither of them are correct, as one generates false negatives, and the other false positives.

The generally agreed-upon remedy is to standardize the tests, and group them into a formal arrangement. This is most easily done by associating with each class a set of standard testable properties, either via the inheritance mechanism or some other means. Each test carries with it a set of promises: it contains a promise about the general behavior of the class, and a promise as to what other class methods will be available.

This PEP proposes a particular strategy for organizing these tests known as Abstract Base Classes, or ABC. ABCs are simply Python classes that are added into an object's inheritance tree to signal certain features of that object to an external inspector. Tests are done using isinstance(), and the presence of a particular ABC means that the test has passed.

In addition, the ABCs define a minimal set of methods that establish the characteristic behavior of the type. Code that discriminates objects based on their ABC type can trust that those methods will always be present. Each of these methods is accompanied by a generalized abstract semantic definition that is described in the documentation for the ABC. These standard semantic definitions are not enforced, but are strongly recommended.

Like all other things in Python, these promises are in the nature of a gentlemen's agreement, which in this case means that while the language does enforce some of the promises made in the ABC, it is up to the implementer of the concrete class to ensure that the remaining ones are kept.

Specification

The specification follows the categories listed in the abstract:

  • A way to overload isinstance() and issubclass().
  • A new module abc which serves as an "ABC support framework". It defines a metaclass for use with ABCs and a decorator that can be used to define abstract methods.
  • Specific ABCs for containers and iterators, to be added to the collections module.

Overloading isinstance() and issubclass()

During the development of this PEP and of its companion, PEP 3141, we repeatedly faced the choice between standardizing more, fine-grained ABCs or fewer, coarse-grained ones. For example, at one stage, PEP 3141 introduced the following stack of base classes used for complex numbers: MonoidUnderPlus, AdditiveGroup, Ring, Field, Complex (each derived from the previous). And the discussion mentioned several other algebraic categorizations that were left out: Algebraic, Transcendental, IntegralDomain, and PrincipalIdealDomain. In earlier versions of the current PEP, we considered the use cases for separate classes like Set, ComposableSet, MutableSet, HashableSet, MutableComposableSet, HashableComposableSet.

The dilemma here is that we'd rather have fewer ABCs, but then what should a user do who needs a less refined ABC? Consider e.g. the plight of a mathematician who wants to define his own kind of Transcendental numbers, but also wants float and int to be considered Transcendental. PEP 3141 originally proposed to patch float.__bases__ for that purpose, but there are some good reasons to keep the built-in types immutable (for one, they are shared between all Python interpreters running in the same address space, as is used by mod_python [16]).

Another example would be someone who wants to define a generic function (PEP 3124) for any sequence that has an append() method. The Sequence ABC (see below) doesn't promise the append() method, while MutableSequence requires not only append() but also various other mutating methods.

To solve these and similar dilemmas, the next section will propose a metaclass for use with ABCs that will allow us to add an ABC as a "virtual base class" (not the same concept as in C++) to any class, including to another ABC. This allows the standard library to define ABCs Sequence and MutableSequence and register these as virtual base classes for built-in types like basestring, tuple and list, so that for example the following conditions are all true:

isinstance([], Sequence)
issubclass(list, Sequence)
issubclass(list, MutableSequence)
isinstance((), Sequence)
not issubclass(tuple, MutableSequence)
isinstance("", Sequence)
issubclass(bytearray, MutableSequence)
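All of these conditions can be checked in a current Python 3 interpreter, where the ABCs ended up in collections.abc (basestring did not survive into Python 3, so str stands in for it):

```python
from collections.abc import MutableSequence, Sequence

# The registrations described above shipped in the standard library:
assert isinstance([], Sequence)
assert issubclass(list, Sequence)
assert issubclass(list, MutableSequence)
assert isinstance((), Sequence)
assert not issubclass(tuple, MutableSequence)
assert isinstance("", Sequence)
assert issubclass(bytearray, MutableSequence)
```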

The primary mechanism proposed here is to allow overloading the built-in functions isinstance() and issubclass(). The overloading works as follows: The call isinstance(x, C) first checks whether C.__instancecheck__ exists, and if so, calls C.__instancecheck__(x) instead of its normal implementation. Similarly, the call issubclass(D, C) first checks whether C.__subclasscheck__ exists, and if so, calls C.__subclasscheck__(D) instead of its normal implementation.

Note that the magic names are not __isinstance__ and __issubclass__; this is because the reversal of the arguments could cause confusion, especially for the issubclass() overloader.

A prototype implementation of this is given in [12].

Here is an example with (naively simple) implementations of __instancecheck__ and __subclasscheck__:

class ABCMeta(type):

    def __instancecheck__(cls, inst):
        """Implement isinstance(inst, cls)."""
        return any(cls.__subclasscheck__(c)
                   for c in {type(inst), inst.__class__})

    def __subclasscheck__(cls, sub):
        """Implement issubclass(sub, cls)."""
        candidates = cls.__dict__.get("__subclass__", set()) | {cls}
        return any(c in candidates for c in sub.mro())

class Sequence(metaclass=ABCMeta):
    __subclass__ = {list, tuple}

assert issubclass(list, Sequence)
assert issubclass(tuple, Sequence)

class AppendableSequence(Sequence):
    __subclass__ = {list}

assert issubclass(list, AppendableSequence)
assert isinstance([], AppendableSequence)

assert not issubclass(tuple, AppendableSequence)
assert not isinstance((), AppendableSequence)

The next section proposes a full-fledged implementation.

The abc Module: an ABC Support Framework

The new standard library module abc, written in pure Python, serves as an ABC support framework. It defines a metaclass ABCMeta and decorators @abstractmethod and @abstractproperty. A sample implementation is given by [13].

The ABCMeta class overrides __instancecheck__ and __subclasscheck__ and defines a register method. The register method takes one argument, which must be a class; after the call B.register(C), the call issubclass(C, B) will return True, by virtue of B.__subclasscheck__(C) returning True. Also, isinstance(x, B) is equivalent to issubclass(x.__class__, B) or issubclass(type(x), B). (It is possible type(x) and x.__class__ are not the same object, e.g. when x is a proxy object.)

These methods are intended to be called on classes whose metaclass is (derived from) ABCMeta; for example:

from abc import ABCMeta

class MyABC(metaclass=ABCMeta):
    pass

MyABC.register(tuple)

assert issubclass(tuple, MyABC)
assert isinstance((), MyABC)

The last two asserts are equivalent to the following two:

assert MyABC.__subclasscheck__(tuple)
assert MyABC.__instancecheck__(())

Of course, you can also directly subclass MyABC:

class MyClass(MyABC):
    pass

assert issubclass(MyClass, MyABC)
assert isinstance(MyClass(), MyABC)

Also, of course, a tuple is not a MyClass:

assert not issubclass(tuple, MyClass)
assert not isinstance((), MyClass)

You can register another class as a subclass of MyClass:

MyClass.register(list)

assert issubclass(list, MyClass)
assert issubclass(list, MyABC)

You can also register another ABC:

class AnotherClass(metaclass=ABCMeta):
    pass

AnotherClass.register(basestring)

MyClass.register(AnotherClass)

assert issubclass(str, MyABC)

That last assert requires tracing the following superclass-subclass relationships:

MyABC -> MyClass (using regular subclassing)
MyClass -> AnotherClass (using registration)
AnotherClass -> basestring (using registration)
basestring -> str (using regular subclassing)

The abc module also defines a new decorator, @abstractmethod, to be used to declare abstract methods. A class containing at least one method declared with this decorator that hasn't been overridden yet cannot be instantiated. Such methods may be called from the overriding method in the subclass (using super or direct invocation). For example:

from abc import ABCMeta, abstractmethod

class A(metaclass=ABCMeta):
    @abstractmethod
    def foo(self): pass

A()  # raises TypeError

class B(A):
    pass

B()  # raises TypeError

class C(A):
    def foo(self): print(42)

C()  # works

Note: The @abstractmethod decorator should only be used inside a class body, and only for classes whose metaclass is (derived from) ABCMeta. Dynamically adding abstract methods to a class, or attempting to modify the abstraction status of a method or class once it is created, are not supported. The @abstractmethod decorator only affects subclasses derived using regular inheritance; "virtual subclasses" registered with the register() method are not affected.

Implementation: The @abstractmethod decorator sets the function attribute __isabstractmethod__ to the value True. The ABCMeta.__new__ method computes the type attribute __abstractmethods__ as the set of all method names that have an __isabstractmethod__ attribute whose value is true. It does this by combining the __abstractmethods__ attributes of the base classes, adding the names of all methods in the new class dict that have a true __isabstractmethod__ attribute, and removing the names of all methods in the new class dict that don't have a true __isabstractmethod__ attribute. If the resulting __abstractmethods__ set is non-empty, the class is considered abstract, and attempts to instantiate it will raise TypeError. (If this were implemented in CPython, an internal flag Py_TPFLAGS_ABSTRACT could be used to speed up this check [6].)
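The set computation described above can be sketched in a few lines of Python (a simplified stand-in for what ABCMeta.__new__ does, not the real implementation):

```python
def compute_abstractmethods(bases, namespace):
    """Simplified version of the __abstractmethods__ computation
    performed by ABCMeta.__new__: combine the bases' abstract names,
    drop names the new class dict overrides concretely, and add names
    the new class dict declares abstract."""
    abstracts = set()
    # Inherited abstract names stay abstract unless overridden with a
    # method that lacks a true __isabstractmethod__ attribute.
    for base in bases:
        for name in getattr(base, "__abstractmethods__", set()):
            value = namespace.get(name)
            if value is None or getattr(value, "__isabstractmethod__", False):
                abstracts.add(name)
    # Names declared abstract in the new class dict itself.
    for name, value in namespace.items():
        if getattr(value, "__isabstractmethod__", False):
            abstracts.add(name)
    return frozenset(abstracts)

# An abstract 'foo' inherited and not overridden keeps the class abstract:
class FakeBase:
    __abstractmethods__ = frozenset({"foo"})

assert compute_abstractmethods((FakeBase,), {}) == frozenset({"foo"})
# A concrete override removes it:
assert compute_abstractmethods((FakeBase,), {"foo": lambda self: 42}) == frozenset()
```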

Discussion: Unlike Java's abstract methods or C++'s pure abstract methods, abstract methods as defined here may have an implementation. This implementation can be called via the super mechanism from the class that overrides it. This could be useful as an end-point for a super-call in a framework using cooperative multiple inheritance [7], [8].

A second decorator, @abstractproperty, is defined in order to define abstract data attributes. Its implementation is a subclass of the built-in property class that adds an __isabstractmethod__ attribute:

class abstractproperty(property):
    __isabstractmethod__ = True

It can be used in two ways:

class C(metaclass=ABCMeta):

    # A read-only property:

    @abstractproperty
    def readonly(self):
        return self.__x

    # A read-write property (cannot use decorator syntax):

    def getx(self):
        return self.__x
    def setx(self, value):
        self.__x = value
    x = abstractproperty(getx, setx)

Similar to abstract methods, a subclass inheriting an abstract property (declared using either the decorator syntax or the longer form) cannot be instantiated unless it overrides that abstract property with a concrete property.

ABCs for Containers and Iterators

The collections module will define ABCs necessary and sufficient to work with sets, mappings, sequences, and some helper types such as iterators and dictionary views. All ABCs have the above-mentioned ABCMeta as their metaclass.

The ABCs provide implementations of their abstract methods that are technically valid but fairly useless; e.g. __hash__ returns 0, and __iter__ returns an empty iterator. In general, the abstract methods represent the behavior of an empty container of the indicated type.

Some ABCs also provide concrete (i.e. non-abstract) methods; for example, the Iterator class has an __iter__ method returning itself, fulfilling an important invariant of iterators (which in Python 2 has to be implemented anew by each iterator class). These ABCs can be considered "mix-in" classes.

No ABCs defined in the PEP override __init__, __new__, __str__ or __repr__. Defining a standard constructor signature would unnecessarily constrain custom container types, for example Patricia trees or gdbm files. Defining a specific string representation for a collection is similarly left up to individual implementations.

Note: There are no ABCs for ordering operations (__lt__, __le__, __ge__, __gt__). Defining these in a base class (abstract or not) runs into problems with the accepted type for the second operand. For example, if class Ordering defined __lt__, one would assume that for any Ordering instances x and y, x < y would be defined (even if it just defines a partial ordering). But this cannot be the case: If both list and str derived from Ordering, this would imply that [1, 2] < (1, 2) should be defined (and presumably return False), while in fact (in Python 3000!) such "mixed-mode comparisons" are explicitly forbidden and raise TypeError. See PEP 3100 and [14] for more information. (This is a special case of a more general issue with operations that take another argument of the same type).
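The forbidden mixed-mode comparison is easy to demonstrate in a Python 3 interpreter:

```python
# Ordering a list against a tuple raises TypeError in Python 3,
# which is why no ordering ABC is defined.
try:
    [1, 2] < (1, 2)
    raised = False
except TypeError:
    raised = True
assert raised

# Equality across types remains legal; it simply returns False.
assert [1, 2] != (1, 2)
```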

One Trick Ponies

These abstract classes represent single methods like __iter__ or __len__.

Hashable

The base class for classes defining __hash__. The __hash__ method should return an integer. The abstract __hash__ method always returns 0, which is a valid (albeit inefficient) implementation. Invariant: If classes C1 and C2 both derive from Hashable, the condition o1 == o2 must imply hash(o1) == hash(o2) for all instances o1 of C1 and all instances o2 of C2. In other words, two objects should never compare equal if they have different hash values.

Another constraint is that hashable objects, once created, should never change their value (as compared by ==) or their hash value. If a class cannot guarantee this, it should not derive from Hashable; if it cannot guarantee this for certain instances, __hash__ for those instances should raise a TypeError exception.

Note: being an instance of this class does not imply that an object is immutable; e.g. a tuple containing a list as a member is not immutable; its __hash__ method raises TypeError. (This is because it recursively tries to compute the hash of each member; if a member is unhashable it raises TypeError.)
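Both the invariant and the note above can be checked directly, using the Hashable ABC as it eventually shipped in collections.abc:

```python
from collections.abc import Hashable

# The Hashable invariant: o1 == o2 must imply hash(o1) == hash(o2),
# even across types (int and float compare equal here).
assert 1 == 1.0 and hash(1) == hash(1.0)

# A tuple passes the Hashable test, yet hashing fails if a member is
# unhashable -- exactly the situation the note describes.
t = (1, [2])
assert isinstance(t, Hashable)
try:
    hash(t)
    raised = False
except TypeError:
    raised = True
assert raised
```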

Iterable
The base class for classes defining __iter__. The __iter__ method should always return an instance of Iterator (see below). The abstract __iter__ method returns an empty iterator.
Iterator
The base class for classes defining __next__. This derives from Iterable. The abstract __next__ method raises StopIteration. The concrete __iter__ method returns self. Note the distinction between Iterable and Iterator: an Iterable can be iterated over, i.e. supports the __iter__ methods; an Iterator is what the built-in function iter() returns, i.e. supports the __next__ method.
Sized
The base class for classes defining __len__. The __len__ method should return an Integer (see "Numbers" below) >= 0. The abstract __len__ method returns 0. Invariant: If a class C derives from Sized as well as from Iterable, the invariant sum(1 for x in c) == len(c) should hold for any instance c of C.
Container
The base class for classes defining __contains__. The __contains__ method should return a bool. The abstract __contains__ method returns False. Invariant: If a class C derives from Container as well as from Iterable, then (x in c for x in c) should be a generator yielding only True values for any instance c of C.

Open issues: Conceivably, instead of using the ABCMeta metaclass, these classes could override __instancecheck__ and __subclasscheck__ to check for the presence of the applicable special method; for example:

class Sized(metaclass=ABCMeta):
    @abstractmethod
    def __len__(self):
        return 0
    @classmethod
    def __instancecheck__(cls, x):
        return hasattr(x, "__len__")
    @classmethod
    def __subclasscheck__(cls, C):
        return hasattr(C, "__bases__") and hasattr(C, "__len__")

This has the advantage of not requiring explicit registration. However, the semantics are hard to get exactly right given the confusing semantics of instance attributes vs. class attributes, and that a class is an instance of its metaclass; the check for __bases__ is only an approximation of the desired semantics. Strawman: Let's do it, but let's arrange it in such a way that the registration API also works.

Sets

These abstract classes represent read-only sets and mutable sets. The most fundamental set operation is the membership test, written as x in s and implemented by s.__contains__(x). This operation is already defined by the Container class defined above. Therefore, we define a set as a sized, iterable container for which certain invariants from mathematical set theory hold.

The built-in type set derives from MutableSet. The built-in type frozenset derives from Set and Hashable.

Set

This is a sized, iterable container, i.e., a subclass of Sized, Iterable and Container. Not every subclass of those three classes is a set though! Sets have the additional invariant that each element occurs only once (as can be determined by iteration), and in addition sets define concrete operators that implement the inequality operations as subclass/superclass tests. In general, the invariants for finite sets in mathematics hold. [11]

Sets with different implementations can be compared safely, (usually) efficiently and correctly using the mathematical definitions of the subclass/superclass operations for finite sets. The ordering operations have concrete implementations; subclasses may override these for speed but should maintain the semantics. Because Set derives from Sized, __eq__ may take a shortcut and return False immediately if two sets of unequal length are compared. Similarly, __le__ may return False immediately if the first set has more members than the second set. Note that set inclusion implements only a partial ordering; e.g. {1, 2} and {1, 3} are not ordered (all three of <, == and > return False for these arguments). Sets cannot be ordered relative to mappings or sequences, but they can be compared to those for equality (and then they always compare unequal).
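
The built-in set type already behaves this way, so the partial ordering can be demonstrated directly:

```python
a, b = {1, 2}, {1, 3}
# Incomparable under the subset partial order: all three tests fail.
assert not a < b and not a == b and not a > b
# Comparable pairs order by inclusion:
assert {1} < {1, 2} and {1, 2, 3} >= {1, 2}
# Sets compare unequal (never ordered) against sequences and mappings:
assert {1, 2} != [1, 2] and {1} != {1: None}
```
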

This class also defines concrete operators to compute union, intersection, symmetric and asymmetric difference, respectively __or__, __and__, __xor__ and __sub__. These operators should return instances of Set. The default implementations call the overridable class method _from_iterable() with an iterable argument. This factory method's default implementation returns a frozenset instance; it may be overridden to return another appropriate Set subclass.
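
A sketch of overriding _from_iterable in a user-defined Set, using the collections.abc module this design led to (SortedFrozenSet is an invented name; note that in the stdlib as shipped, the default _from_iterable calls cls(iterable) rather than building a frozenset):

```python
from collections.abc import Set

class SortedFrozenSet(Set):
    """A toy read-only set stored as a sorted tuple (illustrative only)."""
    def __init__(self, iterable=()):
        self._items = tuple(sorted(set(iterable)))
    def __contains__(self, x):
        return x in self._items
    def __iter__(self):
        return iter(self._items)
    def __len__(self):
        return len(self._items)
    @classmethod
    def _from_iterable(cls, it):
        # The binary operators (__or__, __and__, __xor__, __sub__)
        # construct their results through this factory.
        return cls(it)

s = SortedFrozenSet([3, 1]) | SortedFrozenSet([2])
assert isinstance(s, SortedFrozenSet)   # the operator used our factory
assert list(s) == [1, 2, 3]
assert SortedFrozenSet([1]) <= s        # inherited subset test
```
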

Finally, this class defines a concrete method _hash which computes the hash value from the elements. Hashable subclasses of Set can implement __hash__ by calling _hash or they can reimplement the same algorithm more efficiently; but the algorithm implemented should be the same. Currently the algorithm is fully specified only by the source code [15].

Note: the issubset and issuperset methods found on the set type in Python 2 are not supported, as these are mostly just aliases for __le__ and __ge__.

MutableSet

This is a subclass of Set implementing additional operations to add and remove elements. The supported methods have the semantics known from the set type in Python 2 (except for discard, which is modeled after Java):

.add(x)
Abstract method returning a bool that adds the element x if it isn't already in the set. It should return True if x was added, False if it was already there. The abstract implementation raises NotImplementedError.
.discard(x)
Abstract method returning a bool that removes the element x if present. It should return True if the element was present and False if it wasn't. The abstract implementation raises NotImplementedError.
.pop()
Concrete method that removes and returns an arbitrary item. If the set is empty, it raises KeyError. The default implementation removes the first item returned by the set's iterator.
.toggle(x)
Concrete method returning a bool that adds x to the set if it wasn't there, but removes it if it was there. It should return True if x was added, False if it was removed.
.clear()
Concrete method that empties the set. The default implementation repeatedly calls self.pop() until KeyError is caught. (Note: this is likely much slower than simply creating a new set, even if an implementation overrides it with a faster approach; but in some cases object identity is important.)

This also supports the in-place mutating operations |=, &=, ^=, -=. These are concrete methods whose right operand can be an arbitrary Iterable, except for &=, whose right operand must be a Container. This ABC does not provide the named methods present on the built-in concrete set type that perform (almost) the same operations.
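
A sketch of a minimal MutableSet subclass under today's collections.abc (ListBackedSet is an invented name; note that in the stdlib as shipped, add and discard return None rather than the bool proposed above, and toggle was never adopted):

```python
from collections.abc import MutableSet

class ListBackedSet(MutableSet):
    """A toy mutable set stored in a list (illustrative, O(n) operations)."""
    def __init__(self, iterable=()):
        self._items = []
        for x in iterable:
            self.add(x)
    def __contains__(self, x):
        return x in self._items
    def __iter__(self):
        return iter(self._items)
    def __len__(self):
        return len(self._items)
    def add(self, x):
        if x not in self._items:
            self._items.append(x)
    def discard(self, x):
        if x in self._items:
            self._items.remove(x)

s = ListBackedSet([1, 2])
s |= [2, 3]                   # in-place union comes for free from the ABC
assert sorted(s) == [1, 2, 3]
s -= [1]                      # so does in-place difference
assert sorted(s) == [2, 3]
```
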

Mappings

These abstract classes represent read-only mappings and mutable mappings. The Mapping class represents the most common read-only mapping API.

The built-in type dict derives from MutableMapping.

Mapping

A subclass of Container, Iterable and Sized. The keys of a mapping naturally form a set. The (key, value) pairs (which must be tuples) are also referred to as items. The items also form a set. Methods:

.__getitem__(key)
Abstract method that returns the value corresponding to key, or raises KeyError. The abstract implementation always raises KeyError.
.get(key, default=None)
Concrete method returning self[key] if this does not raise KeyError, and the default value if it does.
.__contains__(key)
Concrete method returning True if self[key] does not raise KeyError, and False if it does.
.__len__()
Abstract method returning the number of distinct keys (i.e., the length of the key set).
.__iter__()
Abstract method returning each key in the key set exactly once.
.keys()
Concrete method returning the key set as a Set. The default concrete implementation returns a "view" on the key set (meaning if the underlying mapping is modified, the view's value changes correspondingly); subclasses are not required to return a view but they should return a Set.
.items()
Concrete method returning the items as a Set. The default concrete implementation returns a "view" on the item set; subclasses are not required to return a view but they should return a Set.
.values()
Concrete method returning the values as a sized, iterable container (not a set!). The default concrete implementation returns a "view" on the values of the mapping; subclasses are not required to return a view but they should return a sized, iterable container.
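
The built-in dict acquired exactly such views in Python 3, so the Set-ness of key and item views (and the deliberate non-Set-ness of the values view) can be observed directly:

```python
from collections.abc import Set

m = {"a": 1}
keys = m.keys()
m["b"] = 2                          # mutate the underlying mapping...
assert sorted(keys) == ["a", "b"]   # ...and the view reflects it
assert isinstance(m.keys(), Set) and isinstance(m.items(), Set)
assert not isinstance(m.values(), Set)   # merely a sized, iterable container
```
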

The following invariants should hold for any mapping m:

len(m.values()) == len(m.keys()) == len(m.items()) == len(m)
[value for value in m.values()] == [m[key] for key in m.keys()]
[item for item in m.items()] == [(key, m[key]) for key in m.keys()]

i.e. iterating over the items, keys and values should return results in the same order.

MutableMapping
A subclass of Mapping that also implements some standard mutating methods. Abstract methods include __setitem__, __delitem__. Concrete methods include pop, popitem, clear, update. Note: setdefault is not included. Open issues: Write out the specs for the methods.
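
A sketch of the minimal abstract surface under today's collections.abc, where the concrete methods then come for free (AttrDict is an invented name; in the stdlib as shipped, setdefault did end up included):

```python
from collections.abc import MutableMapping

class AttrDict(MutableMapping):
    """A toy mapping delegating to an internal dict (illustrative only)."""
    def __init__(self):
        self._data = {}
    def __getitem__(self, key):
        return self._data[key]
    def __setitem__(self, key, value):
        self._data[key] = value
    def __delitem__(self, key):
        del self._data[key]
    def __iter__(self):
        return iter(self._data)
    def __len__(self):
        return len(self._data)

d = AttrDict()
d.update(a=1, b=2)            # concrete method supplied by the ABC
assert d.pop("a") == 1        # so are pop, popitem, clear, get, ...
assert dict(d) == {"b": 2}
```
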

Sequences

These abstract classes represent read-only sequences and mutable sequences.

The built-in list and bytes types derive from MutableSequence. The built-in tuple and str types derive from Sequence and Hashable.

Sequence

A subclass of Iterable, Sized, Container. It defines a new abstract method __getitem__ that has a somewhat complicated signature: when called with an integer, it returns an element of the sequence or raises IndexError; when called with a slice object, it returns another Sequence. The concrete __iter__ method iterates over the elements using __getitem__ with integer arguments 0, 1, and so on, until IndexError is raised. The length should be equal to the number of values returned by the iterator.
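
A sketch of such a Sequence subclass under today's collections.abc (Squares is an invented name); only __len__ and the two-faced __getitem__ are supplied, and iteration falls out:

```python
from collections.abc import Sequence

class Squares(Sequence):
    """A toy sequence of the first n squares; __iter__, __contains__,
    index, count, etc. all derive from __getitem__ and __len__."""
    def __init__(self, n):
        self._n = n
    def __len__(self):
        return self._n
    def __getitem__(self, i):
        if isinstance(i, slice):
            # Slices return another sequence.
            return [self[j] for j in range(*i.indices(self._n))]
        if i < 0:
            i += self._n
        if not 0 <= i < self._n:
            raise IndexError(i)   # terminates the concrete __iter__
        return i * i

sq = Squares(4)
assert list(sq) == [0, 1, 4, 9]
assert sq.index(4) == 2
assert sq[1:3] == [1, 4]
```
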

Open issues: Other candidate methods, which can all have default concrete implementations that only depend on __len__ and __getitem__ with an integer argument: __reversed__, index, count, __add__, __mul__.

MutableSequence
A subclass of Sequence adding some standard mutating methods. Abstract mutating methods: __setitem__ (for integer indices as well as slices), __delitem__ (ditto), insert. Concrete mutating methods: append, reverse, extend, pop, remove. Concrete mutating operators: +=, *= (these mutate the object in place). Note: this does not define sort() -- that is only required to exist on genuine list instances.

Strings

Python 3000 will likely have at least two built-in string types: byte strings (bytes), deriving from MutableSequence, and (Unicode) character strings (str), deriving from Sequence and Hashable.

Open issues: define the base interfaces for these so alternative implementations and subclasses know what they are in for. This may be the subject of a new PEP or PEPs (PEP 358 should be co-opted for the bytes type).

ABCs vs. Alternatives

In this section I will attempt to compare and contrast ABCs to other approaches that have been proposed.

ABCs vs. Duck Typing

Does the introduction of ABCs mean the end of Duck Typing? I don't think so. Python will not require that a class derives from BasicMapping or Sequence when it defines a __getitem__ method, nor will the x[y] syntax require that x is an instance of either ABC. You will still be able to assign any "file-like" object to sys.stdout, as long as it has a write method.

Of course, there will be some carrots to encourage users to derive from the appropriate base classes; these vary from default implementations for certain functionality to an improved ability to distinguish between mappings and sequences. But there are no sticks. If hasattr(x, "__len__") works for you, great! ABCs are intended to solve problems that don't have a good solution at all in Python 2, such as distinguishing between mappings and sequences.

ABCs vs. Generic Functions

ABCs are compatible with Generic Functions (GFs). For example, my own Generic Functions implementation [4] uses the classes (types) of the arguments as the dispatch key, allowing derived classes to override base classes. Since (from Python's perspective) ABCs are quite ordinary classes, using an ABC in the default implementation for a GF can be quite appropriate. For example, if I have an overloaded prettyprint function, it would make total sense to define pretty-printing of sets like this:

@prettyprint.register(Set)
def pp_set(s):
    return "{" + ... + "}"  # Details left as an exercise

and implementations for specific subclasses of Set could be added easily.

I believe ABCs also won't present any problems for RuleDispatch, Phillip Eby's GF implementation in PEAK [5].

Of course, GF proponents might claim that GFs (and concrete, or implementation, classes) are all you need. But even they will not deny the usefulness of inheritance; and one can easily consider the ABCs proposed in this PEP as optional implementation base classes; there is no requirement that all user-defined mappings derive from BasicMapping.

ABCs vs. Interfaces

ABCs are not intrinsically incompatible with Interfaces, but there is considerable overlap. For now, I'll leave it to proponents of Interfaces to explain why Interfaces are better. I expect that much of the work that went into e.g. defining the various shades of "mapping-ness" and the nomenclature could easily be adapted for a proposal to use Interfaces instead of ABCs.

"Interfaces" in this context refers to a set of proposals for additional metadata elements attached to a class which are not part of the regular class hierarchy, but do allow for certain types of inheritance testing.

Such metadata would be designed, at least in some proposals, so as to be easily mutable by an application, allowing application writers to override the normal classification of an object.

The drawback to this idea of attaching mutable metadata to a class is that classes are shared state, and mutating them may lead to conflicts of intent. Additionally, the need to override the classification of an object can be done more cleanly using generic functions: In the simplest case, one can define a "category membership" generic function that simply returns False in the base implementation, and then provide overrides that return True for any classes of interest.

References

[1]An Introduction to ABC's, by Talin (http://mail.python.org/pipermail/python-3000/2007-April/006614.html)
[2]Incomplete implementation prototype, by GvR (http://svn.python.org/view/sandbox/trunk/abc/)
[3]Possible Python 3K Class Tree?, wiki page created by Bill Janssen (http://wiki.python.org/moin/AbstractBaseClasses)
[4]Generic Functions implementation, by GvR (http://svn.python.org/view/sandbox/trunk/overload/)
[5]Charming Python: Scaling a new PEAK, by David Mertz (http://www-128.ibm.com/developerworks/library/l-cppeak2/)
[6]Implementation of @abstractmethod (http://python.org/sf/1706989)
[7]Unifying types and classes in Python 2.2, by GvR (http://www.python.org/download/releases/2.2.3/descrintro/)
[8]Putting Metaclasses to Work: A New Dimension in Object-Oriented Programming, by Ira R. Forman and Scott H. Danforth (http://www.amazon.com/gp/product/0201433052)
[9]Partial order, in Wikipedia (http://en.wikipedia.org/wiki/Partial_order)
[10]Total order, in Wikipedia (http://en.wikipedia.org/wiki/Total_order)
[11]Finite set, in Wikipedia (http://en.wikipedia.org/wiki/Finite_set)
[12]Make isinstance/issubclass overloadable (http://python.org/sf/1708353)
[13]ABCMeta sample implementation (http://svn.python.org/view/sandbox/trunk/abc/xyz.py)
[14]python-dev email ("Comparing heterogeneous types") http://mail.python.org/pipermail/python-dev/2004-June/045111.html
[15]Function frozenset_hash() in Object/setobject.c (http://svn.python.org/view/python/trunk/Objects/setobject.c)
[16]Multiple interpreters in mod_python (http://www.modpython.org/live/current/doc-html/pyapi-interps.html)

pep-3120 Using UTF-8 as the default source encoding

PEP:3120
Title:Using UTF-8 as the default source encoding
Version:$Revision$
Last-Modified:$Date$
Author:Martin von Löwis <martin at v.loewis.de>
Status:Final
Type:Standards Track
Content-Type:text/x-rst
Created:15-Apr-2007
Python-Version:3.0
Post-History:

Specification

This PEP proposes to change the default source encoding from ASCII to UTF-8. Support for alternative source encodings [1] continues to exist; an explicit encoding declaration takes precedence over the default.

A Bit of History

In Python 1, the source encoding was unspecified, except that it had to be a superset of the system's basic execution character set (i.e. an ASCII superset, on most systems). The source encoding was only relevant for the lexis itself (bytes representing letters for keywords, identifiers, punctuation, line breaks, etc.). The contents of a string literal were copied literally from the source file.

In Python 2.0, the source encoding changed to Latin-1 as a side effect of introducing Unicode. For Unicode string literals, the characters were still copied literally from the source file, but widened on a character-by-character basis. As Unicode gives a fixed interpretation to code points, this algorithm effectively fixed a source encoding, at least for files containing non-ASCII characters in Unicode literals.

PEP 263 identified the problem that you can use only those Unicode characters in a Unicode literal which are also in Latin-1, and introduced a syntax for declaring the source encoding. If no source encoding was given, the default should be ASCII. For compatibility with Python 2.0 and 2.1, files were interpreted as Latin-1 for a transitional period. This transition ended with Python 2.5, which gives an error if non-ASCII characters are encountered and no source encoding is declared.

Rationale

With PEP 263, using arbitrary non-ASCII characters in a Python file is possible, but tedious. One has to explicitly add an encoding declaration. Even though some editors (like IDLE and Emacs) support the declarations of PEP 263, many editors still do not (and never will); users have to explicitly adjust the encoding which the editor assumes on a file-by-file basis.

When the default encoding is changed to UTF-8, adding non-ASCII text to Python files becomes easier and more portable: On some systems, editors will automatically choose UTF-8 when saving text (e.g. on Unix systems where the locale uses UTF-8). On other systems, editors will guess the encoding when reading the file, and UTF-8 is easy to guess. Yet other editors support associating a default encoding with a file extension, allowing users to associate .py with UTF-8.

For Python 2, an important reason for using non-UTF-8 encodings was that byte string literals would be in the source encoding at run-time, allowing one to output them to a file or render them to the user as-is. With Python 3, all strings will be Unicode strings, so the original encoding of the source will have no impact at run-time.

Implementation

The parser needs to be changed to accept bytes > 127 if no source encoding is specified; instead of giving an error, it needs to check that the bytes are well-formed UTF-8 (decoding is not necessary, as the parser converts all source code to UTF-8, anyway).
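
The check itself is small; a Python sketch of the well-formedness test described above (the real parser performs this in C, and is_wellformed_utf8 is an invented helper name):

```python
def is_wellformed_utf8(source_bytes):
    """Validate that a byte string is well-formed UTF-8.
    No decoded result is needed; only the validation matters."""
    try:
        source_bytes.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

assert is_wellformed_utf8("# café\n".encode("utf-8"))
assert not is_wellformed_utf8(b"# caf\xe9\n")   # Latin-1 bytes, not UTF-8
```
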

IDLE needs to be changed to use UTF-8 as the default encoding.

pep-3121 Extension Module Initialization and Finalization

PEP:3121
Title:Extension Module Initialization and Finalization
Version:$Revision$
Last-Modified:$Date$
Author:Martin von Löwis <martin at v.loewis.de>
Status:Accepted
Type:Standards Track
Content-Type:text/x-rst
Created:27-Apr-2007
Python-Version:3.0
Post-History:

Abstract

Extension module initialization currently has a few deficiencies. There is no cleanup for modules, the entry point name might give naming conflicts, the entry functions don't follow the usual calling convention, and multiple interpreters are not supported well. This PEP addresses these issues.

Problems

Module Finalization

Currently, extension modules are usually initialized once and then "live" forever. The only exception is when Py_Finalize() is called: then the initialization routine is invoked a second time. This is bad from a resource management point of view: memory and other resources might get allocated each time initialization is called, but there is no way to reclaim them. As a result, there is currently no way to completely release all resources Python has allocated.

Entry point name conflicts

The entry point is currently called init<module>. This might conflict with other symbols also called init<something>. In particular, initsocket is known to have conflicted in the past (this specific problem got resolved as a side effect of renaming the module to _socket).

Entry point signature

The entry point is currently a procedure (returning void). This deviates from the usual calling conventions; callers can find out whether there was an error during initialization only by checking PyErr_Occurred. The entry point should return a PyObject*, which will be the module created, or NULL in case of an exception.

Multiple Interpreters

Currently, extension modules share their state across all interpreters. This allows for undesirable information leakage across interpreters: one script could permanently corrupt objects in an extension module, possibly breaking all scripts in other interpreters.

Specification

The module initialization routines change their signature to:

PyObject *PyInit_<modulename>()

The initialization routine will be invoked once per interpreter, when the module is imported. It should return a new module object each time.

In order to store per-module state in C variables, each module object will contain a block of memory that is interpreted only by the module. The amount of memory used for the module is specified at the point of creation of the module.

In addition to the initialization function, a module may implement a number of additional callback functions, which are invoked when the module's tp_traverse, tp_clear, and tp_free functions are invoked, and when the module is reloaded.

The entire module definition is combined in a struct PyModuleDef:

struct PyModuleDef{
  PyModuleDef_Base m_base;  /* To be filled out by the interpreter */
  Py_ssize_t m_size; /* Size of per-module data */
  PyMethodDef *m_methods;
  inquiry m_reload;
  traverseproc m_traverse;
  inquiry m_clear;
  freefunc m_free;
};

Creation of a module is changed to expect an optional PyModuleDef*. The module state will be null-initialized.

Each module method will be passed the module object as the first parameter. To access the module data, a function:

void* PyModule_GetState(PyObject*);

will be provided. In addition, to look up a module more efficiently than going through sys.modules, a function:

PyObject* PyState_FindModule(struct PyModuleDef*);

will be provided. This lookup function will use an index located in the m_base field, to find the module by index, not by name.

As all Python objects should be controlled through the Python memory management, usage of "static" type objects is discouraged, unless the type object itself has no memory-managed state. To simplify definition of heap types, a new method:

PyTypeObject* PyType_Copy(PyTypeObject*);

is added.

Example

xxmodule.c would be changed to remove the initxx function, and add the following code instead:

struct xxstate{
  PyObject *ErrorObject;
  PyObject *Xxo_Type;
};

#define xxstate(o) ((struct xxstate*)PyModule_GetState(o))

static int xx_traverse(PyObject *m, visitproc v,
                       void *arg)
{
  Py_VISIT(xxstate(m)->ErrorObject);
  Py_VISIT(xxstate(m)->Xxo_Type);
  return 0;
}

static int xx_clear(PyObject *m)
{
  Py_CLEAR(xxstate(m)->ErrorObject);
  Py_CLEAR(xxstate(m)->Xxo_Type);
  return 0;
}

static struct PyModuleDef xxmodule = {
  {}, /* m_base */
  sizeof(struct xxstate),
  xx_methods,
  0,  /* m_reload */
  xx_traverse,
  xx_clear,
  0,  /* m_free - not needed, since all is done in m_clear */
};

PyObject*
PyInit_xx()
{
  PyObject *res = PyModule_New("xx", &xxmodule);
  if (!res) return NULL;
  xxstate(res)->ErrorObject = PyErr_NewException("xx.error", NULL, NULL);
  if (!xxstate(res)->ErrorObject) {
    Py_DECREF(res);
    return NULL;
  }
  xxstate(res)->Xxo_Type = PyType_Copy(&Xxo_Type);
  if (!xxstate(res)->Xxo_Type) {
    Py_DECREF(res);
    return NULL;
  }
  return res;
}

Discussion

Tim Peters reports in [1] that PythonLabs considered such a feature at one point, and lists the following additional hooks which aren't currently supported in this PEP:

  • when the module object is deleted from sys.modules
  • when Py_Finalize is called
  • when Python exits
  • when the Python DLL is unloaded (Windows only)

References

[1]Tim Peters, reporting earlier conversation about such a feature http://mail.python.org/pipermail/python-3000/2006-April/000726.html

pep-3122 Delineation of the main module

PEP:3122
Title:Delineation of the main module
Version:$Revision$
Last-Modified:$Date$
Author:Brett Cannon
Status:Rejected
Type:Standards Track
Content-Type:text/x-rst
Created:27-Apr-2007
Post-History:

Attention!

This PEP has been rejected. Guido views running scripts within a package as an anti-pattern [3].

Abstract

Because of how name resolution works for relative imports in a world where PEP 328 is implemented, executing modules within a package ceases to be possible. This failing stems from the fact that the module being executed as the "main" module replaces its __name__ attribute with "__main__" instead of leaving it as the absolute name of the module. This breaks import's ability to resolve relative imports from the main module into absolute names.

In order to resolve this issue, this PEP proposes to change how the main module is delineated. By leaving the __name__ attribute in a module alone and setting sys.main to the name of the main module this will allow at least some instances of executing a module within a package that uses relative imports.

This PEP does not address the idea of introducing a module-level function that is automatically executed like PEP 299 proposes.

The Problem

With the introduction of PEP 328, relative imports became dependent on the __name__ attribute of the module performing the import. This is because the dots in a relative import are used to strip away parts of the calling module's name to calculate where in the package hierarchy an import should fall (prior to PEP 328, relative imports could fail and would fall back on absolute imports, which had a chance of succeeding).

For instance, consider the import from .. import spam made from the bacon.ham.beans module (bacon.ham.beans is not a package itself, i.e., does not define __path__). Name resolution of the relative import takes the caller's name (bacon.ham.beans), splits on dots, and then slices off the last n parts based on the level (which is 2). In this example both ham and beans are dropped and spam is joined with what is left (bacon). This leads to the proper import of the module bacon.spam.
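
The arithmetic can be sketched as a small helper (resolve_relative is an invented name; the real logic lives in the import machinery):

```python
def resolve_relative(caller, name, level):
    """Strip `level` trailing parts from the caller's dotted name,
    then append the imported name (hypothetical helper)."""
    base = caller.rsplit(".", level)[0]
    return base + "." + name if name else base

# from .. import spam, made from bacon.ham.beans (level 2):
assert resolve_relative("bacon.ham.beans", "spam", 2) == "bacon.spam"
```
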

This reliance on the __name__ attribute of a module when handling relative imports becomes an issue when executing a script within a package. Because the executing script has its name set to '__main__', import cannot resolve any relative imports, leading to an ImportError.

For example, assume we have a package named bacon with an __init__.py file containing:

from . import spam

Also create a module named spam within the bacon package (it can be an empty file). Now if you try to execute the bacon package (either through python bacon/__init__.py or python -m bacon) you will get an ImportError about trying to do a relative import from within a non-package. Obviously the import is valid, but because of the setting of __name__ to '__main__' import thinks that bacon/__init__.py is not in a package since no dots exist in __name__. To see how the algorithm works in more detail, see importlib.Import._resolve_name() in the sandbox [2].

Currently a work-around is to remove all relative imports in the module being executed and make them absolute. This is unfortunate, though, as one should not be required to use a specific type of resource in order to make a module in a package be able to be executed.

The Solution

The solution to the problem is to not change the value of __name__ in modules. But there still needs to be a way to let executing code know it is being executed as a script. This is handled with a new attribute in the sys module named main.

When a module is being executed as a script, sys.main will be set to the name of the module. This changes the current idiom of:

if __name__ == '__main__':
    ...

to:

import sys
if __name__ == sys.main:
    ...

The newly proposed solution does introduce an added line of boilerplate which is a module import. But as the solution does not introduce a new built-in or module attribute (as discussed in Rejected Ideas) it has been deemed worth the extra line.

Another issue with the proposed solution (which also applies to all the rejected ideas as well) is that it does not directly solve the problem of discovering the name of a file. Consider python bacon/spam.py. From the file name alone it is not obvious whether bacon is a package. To properly find this out, the current directory must be on sys.path and bacon/__init__.py must exist.

But this is the simple example. Consider python ../spam.py. From the file name alone it is not at all clear whether spam.py is in a package. One possible solution is to resolve .. to an absolute path, check whether a file named __init__.py exists there, and then see whether the directory is on sys.path. If it is not, continue walking up the directory tree until no more __init__.py files are found or a directory on sys.path is reached.

This could potentially be an expensive process. If the package depth happens to be deep then it could require a large amount of disk access to discover where the package is anchored on sys.path, if at all. The stat calls alone can be expensive if the file system the executed script is on is something like NFS.

Because of these issues, only when the -m command-line argument (introduced by PEP 338) is used will __name__ be left set to the module's actual name. Otherwise the fallback semantics of setting __name__ to "__main__" will occur. sys.main will still be set to the proper value, regardless of what __name__ is set to.

Implementation

When the -m option is used, sys.main will be set to the argument passed in. sys.argv will be adjusted as it is currently. Then the equivalent of __import__(sys.main) will occur. This differs from current semantics, as the runpy module fetches the code object for the file specified by the module name in order to explicitly set __name__ and other attributes. This is no longer needed, as import can perform its normal operation in this situation.

If a file name is specified, then sys.main will be set to "__main__". The specified file will then be read and have a code object created and then be executed with __name__ set to "__main__". This mirrors current semantics.

Transition Plan

In order for Python 2.6 to be able to support both the current semantics and the proposed semantics, sys.main will always be set to "__main__". Otherwise no change will occur for Python 2.6. This unfortunately means that no benefit from this change will occur in Python 2.6, but it maximizes compatibility for code that is to work as much as possible with 2.6 and 3.0.

To help transition to the new idiom, 2to3 [1] will gain a rule to transform the current if __name__ == '__main__': ... idiom to the new one. This will not help with code that checks __name__ outside of the idiom, though.

Rejected Ideas

__main__ built-in

A counter-proposal to introduce a built-in named __main__. The value of the built-in would be the name of the module being executed (just like the proposed sys.main). This would lead to a new idiom of:

if __name__ == __main__:
    ...

A drawback is that the syntactic difference is subtle: only the quotes around "__main__" are dropped. Some believe that existing Python programmers will introduce bugs by adding the quotation marks out of habit. But one could argue that such a bug would be discovered quickly through testing, as it is a very shallow bug.

While the name of the built-in could obviously be different (e.g., main), the other drawback is that it introduces a new built-in. With a simple solution such as sys.main being possible without adding another built-in to Python, this proposal was rejected.

__main__ module attribute

Another proposal was to add a __main__ attribute to every module. For the one being executed as the main module, the attribute would have a true value, while all other modules had a false value. This has the nice consequence of simplifying the main module idiom to:

if __main__:
    ...

The drawback was the introduction of a new module attribute. It also required more integration with the import machinery than the proposed solution.

Use __file__ instead of __name__

Any of the proposals could be changed to use the __file__ attribute on modules instead of __name__, including the current semantics. The problem with this is that with the proposed solutions there is the issue of modules having no __file__ attribute defined or having the same value as other modules.

The problem that comes up with the current semantics is you still have to try to resolve the file path to a module name for the import to work.

Special string subclass for __name__ that overrides __eq__

One proposal was to define a subclass of str that overrode the __eq__ method so that it would compare equal to "__main__" as well as the actual name of the module. In all other respects the subclass would be the same as str.

This was rejected as it seemed like too much of a hack.

pep-3123 Making PyObject_HEAD conform to standard C

PEP:3123
Title:Making PyObject_HEAD conform to standard C
Version:$Revision$
Last-Modified:$Date$
Author:Martin von Löwis <martin at v.loewis.de>
Status:Final
Type:Standards Track
Content-Type:text/x-rst
Created:27-Apr-2007
Python-Version:3.0
Post-History:

Abstract

Python currently relies on undefined C behavior, with its usage of PyObject_HEAD. This PEP proposes to change that into standard C.

Rationale

Standard C defines that an object must be accessed only through a pointer of its type, and that all other accesses are undefined behavior, with a few exceptions. In particular, the following code has undefined behavior:

struct FooObject{
  PyObject_HEAD
  int data;
};

PyObject *foo(struct FooObject*f){
 return (PyObject*)f;
}

int bar(){
 struct FooObject *f = malloc(sizeof(struct FooObject));
 PyObject *o = foo(f);
 f->ob_refcnt = 0;
 o->ob_refcnt = 1;
 return f->ob_refcnt;
}

The problem here is that the storage is accessed both as if it were struct PyObject, and as struct FooObject.

Historically, compilers did not have any problems with this code. However, modern compilers use that rule as an optimization opportunity, finding that f->ob_refcnt and o->ob_refcnt cannot possibly refer to the same memory, and that therefore the function should return 0, without having to fetch the value of ob_refcnt at all in the return statement. For GCC, Python now uses -fno-strict-aliasing to work around that problem; with other compilers, the code may simply exhibit undefined behavior. Even with GCC, using -fno-strict-aliasing may pessimize the generated code unnecessarily.

Specification

Standard C has one specific exception to its aliasing rules precisely designed to support the case of Python: a value of a struct type may also be accessed through a pointer to the first field. E.g. if a struct starts with an int, the struct * may also be cast to an int *, allowing int values to be written into the first field.

For Python, PyObject_HEAD and PyObject_VAR_HEAD will be changed to not list all fields anymore, but list a single field of type PyObject/PyVarObject:

typedef struct _object {
  _PyObject_HEAD_EXTRA
  Py_ssize_t ob_refcnt;
  struct _typeobject *ob_type;
} PyObject;

typedef struct {
  PyObject ob_base;
  Py_ssize_t ob_size;
} PyVarObject;

#define PyObject_HEAD        PyObject ob_base;
#define PyObject_VAR_HEAD    PyVarObject ob_base;

Types defined as a fixed-size structure will then include PyObject as their first field, and variable-sized objects PyVarObject. E.g.:

typedef struct {
  PyObject ob_base;
  PyObject *start, *stop, *step;
} PySliceObject;

typedef struct {
  PyVarObject ob_base;
  PyObject **ob_item;
  Py_ssize_t allocated;
} PyListObject;

The above definitions of PyObject_HEAD are normative, so extension authors MAY either use the macro, or put the ob_base field explicitly into their structs.

As a convention, the base field SHOULD be called ob_base. However, all accesses to ob_refcnt and ob_type MUST cast the object pointer to PyObject* (unless the pointer is already known to have that type), and SHOULD use the respective accessor macros. To simplify access to ob_type, ob_refcnt, and ob_size, macros:

#define Py_TYPE(o)    (((PyObject*)(o))->ob_type)
#define Py_REFCNT(o)  (((PyObject*)(o))->ob_refcnt)
#define Py_SIZE(o)    (((PyVarObject*)(o))->ob_size)

are added. E.g. the code blocks

#define PyList_CheckExact(op) ((op)->ob_type == &PyList_Type)

return func->ob_type->tp_name;

need to be changed to:

#define PyList_CheckExact(op) (Py_TYPE(op) == &PyList_Type)

return Py_TYPE(func)->tp_name;

For initialization of type objects, the current sequence

PyObject_HEAD_INIT(NULL)
0, /* ob_size */

becomes incorrect, and must be replaced with

PyVarObject_HEAD_INIT(NULL, 0)

Compatibility with Python 2.6

To support modules that compile with both Python 2.6 and Python 3.0, the Py_* macros are added to Python 2.6. The macros Py_INCREF and Py_DECREF will be changed to cast their argument to PyObject *, so that module authors can also explicitly declare the ob_base field in modules designed for Python 2.6.

pep-3124 Overloading, Generic Functions, Interfaces, and Adaptation

PEP:3124
Title:Overloading, Generic Functions, Interfaces, and Adaptation
Version:$Revision$
Last-Modified:$Date$
Author:Phillip J. Eby <pje at telecommunity.com>
Discussions-To:Python 3000 List <python-3000 at python.org>
Status:Deferred
Type:Standards Track
Content-Type:text/x-rst
Requires:3107 3115 3119
Created:28-Apr-2007
Post-History:30-Apr-2007
Replaces:245 246

Abstract

This PEP proposes a new standard library module, overloading, to provide generic programming features including dynamic overloading (aka generic functions), interfaces, adaptation, method combining (ala CLOS and AspectJ), and simple forms of aspect-oriented programming (AOP).

The proposed API is also open to extension; that is, it will be possible for library developers to implement their own specialized interface types, generic function dispatchers, method combination algorithms, etc., and those extensions will be treated as first-class citizens by the proposed API.

The API will be implemented in pure Python with no C, but may have some dependency on CPython-specific features such as sys._getframe and the func_code attribute of functions. It is expected that e.g. Jython and IronPython will have other ways of implementing similar functionality (perhaps using Java or C#).

Rationale and Goals

Python has always provided a variety of built-in and standard-library generic functions, such as len(), iter(), pprint.pprint(), and most of the functions in the operator module. However, it currently:

  1. does not have a simple or straightforward way for developers to create new generic functions,
  2. does not have a standard way for methods to be added to existing generic functions (i.e., some are added using registration functions, others require defining __special__ methods, possibly by monkeypatching), and
  3. does not allow dispatching on multiple argument types (except in a limited form for arithmetic operators, where "right-hand" (__r*__) methods can be used to do two-argument dispatch).

In addition, it is currently a common anti-pattern for Python code to inspect the types of received arguments, in order to decide what to do with the objects. For example, code may wish to accept either an object of some type, or a sequence of objects of that type.

Currently, the "obvious way" to do this is by type inspection, but this is brittle and closed to extension. A developer using an already-written library may be unable to change how their objects are treated by such code, especially if the objects they are using were created by a third party.

Therefore, this PEP proposes a standard library module to address these and related issues, using decorators and argument annotations (PEP 3107). The primary features to be provided are:

  • a dynamic overloading facility, similar to the static overloading found in languages such as Java and C++, but including optional method combination features as found in CLOS and AspectJ.
  • a simple "interfaces and adaptation" library inspired by Haskell's typeclasses (but more dynamic, and without any static type-checking), with an extension API to allow registering user-defined interface types such as those found in PyProtocols and Zope.
  • a simple "aspect" implementation to make it easy to create stateful adapters and to do other stateful AOP.

These features are to be provided in such a way that extended implementations can be created and used. For example, it should be possible for libraries to define new dispatching criteria for generic functions, and new kinds of interfaces, and use them in place of the predefined features. For example, it should be possible to use a zope.interface interface object to specify the desired type of a function argument, as long as the zope.interface package registered itself correctly (or a third party did the registration).

In this way, the proposed API simply offers a uniform way of accessing the functionality within its scope, rather than prescribing a single implementation to be used for all libraries, frameworks, and applications.

User API

The overloading API will be implemented as a single module, named overloading, providing the following features:

Overloading/Generic Functions

The @overload decorator allows you to define alternate implementations of a function, specialized by argument type(s). A function with the same name must already exist in the local namespace. The existing function is modified in-place by the decorator to add the new implementation, and the modified function is returned by the decorator. Thus, the following code:

from overloading import overload
from collections import Iterable

def flatten(ob):
    """Flatten an object to its component iterables"""
    yield ob

@overload
def flatten(ob: Iterable):
    for o in ob:
        for ob in flatten(o):
            yield ob

@overload
def flatten(ob: basestring):
    yield ob

creates a single flatten() function whose implementation roughly equates to:

def flatten(ob):
    if isinstance(ob, basestring) or not isinstance(ob, Iterable):
        yield ob
    else:
        for o in ob:
            for ob in flatten(o):
                yield ob

except that the flatten() function defined by overloading remains open to extension by adding more overloads, while the hardcoded version cannot be extended.

For example, if someone wants to use flatten() with a string-like type that doesn't subclass basestring, they would be out of luck with the second implementation. With the overloaded implementation, however, they can either write this:

@overload
def flatten(ob: MyString):
    yield ob

or this (to avoid copying the implementation):

from overloading import RuleSet
RuleSet(flatten).copy_rules((basestring,), (MyString,))

(Note also that, although PEP 3119 proposes that it should be possible for abstract base classes like Iterable to allow classes like MyString to claim subclass-hood, such a claim is global, throughout the application. In contrast, adding a specific overload or copying a rule is specific to an individual function, and therefore less likely to have undesired side effects.)

@overload vs. @when

The @overload decorator is a common-case shorthand for the more general @when decorator. It allows you to leave out the name of the function you are overloading, at the expense of requiring the target function to be in the local namespace. It also doesn't support adding additional criteria besides the ones specified via argument annotations. The following function definitions have identical effects, except for name binding side-effects (which will be described below):

from overloading import when

@overload
def flatten(ob: basestring):
    yield ob

@when(flatten)
def flatten(ob: basestring):
    yield ob

@when(flatten)
def flatten_basestring(ob: basestring):
    yield ob

@when(flatten, (basestring,))
def flatten_basestring(ob):
    yield ob

The first definition above will bind flatten to whatever it was previously bound to. The second will do the same, if it was already bound to the when decorator's first argument. If flatten is unbound or bound to something else, it will be rebound to the function definition as given. The last two definitions above will always bind flatten_basestring to the function definition as given.

Using this approach allows you to both give a method a descriptive name (often useful in tracebacks!) and to reuse the method later.

Except as otherwise specified, all overloading decorators have the same signature and binding rules as @when. They accept a function and an optional "predicate" object.

The default predicate implementation is a tuple of types with positional matching to the overloaded function's arguments. However, an arbitrary number of other kinds of predicates can be created and registered using the Extension API, and will then be usable with @when and other decorators created by this module (like @before, @after, and @around).

Method Combination and Overriding

When an overloaded function is invoked, the implementation with the signature that most specifically matches the calling arguments is the one used. If no implementation matches, a NoApplicableMethods error is raised. If more than one implementation matches, but none of the signatures are more specific than the others, an AmbiguousMethods error is raised.

For example, the following pair of implementations are ambiguous, if the foo() function is ever called with two integer arguments, because both signatures would apply, but neither signature is more specific than the other (i.e., neither implies the other):

def foo(bar:int, baz:object):
    pass

@overload
def foo(bar:object, baz:int):
    pass

In contrast, the following pair of implementations can never be ambiguous, because one signature always implies the other; the int/int signature is more specific than the object/object signature:

def foo(bar:object, baz:object):
    pass

@overload
def foo(bar:int, baz:int):
    pass

A signature S1 implies another signature S2, if whenever S1 would apply, S2 would also. A signature S1 is "more specific" than another signature S2, if S1 implies S2, but S2 does not imply S1.
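
For the default tuple-of-types predicates, these relations reduce to subclass checks. A simplified model (real dispatch would also handle interfaces and other predicate kinds):

```python
def implies(s1, s2):
    """True if s1 applying always means s2 applies: every type in
    s1 is a subclass of the corresponding type in s2."""
    return len(s1) == len(s2) and all(
        issubclass(a, b) for a, b in zip(s1, s2))

def more_specific(s1, s2):
    """s1 implies s2, but s2 does not imply s1."""
    return implies(s1, s2) and not implies(s2, s1)
```

Under this model, (int, int) is more specific than (object, object), while (int, object) and (object, int) are mutually non-implying, matching the ambiguous foo() example above.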

Although the examples above have all used concrete or abstract types as argument annotations, there is no requirement that the annotations be such. They can also be "interface" objects (discussed in the Interfaces and Adaptation section), including user-defined interface types. (They can also be other objects whose types are appropriately registered via the Extension API.)

Proceeding to the "Next" Method

If the first parameter of an overloaded function is named __proceed__, it will be passed a callable representing the next most-specific method. For example, this code:

def foo(bar:object, baz:object):
    print "got objects!"

@overload
def foo(__proceed__, bar:int, baz:int):
    print "got integers!"
    return __proceed__(bar, baz)

will print "got integers!" followed by "got objects!".

If there is no next most-specific method, __proceed__ will be bound to a NoApplicableMethods instance. When called, a new NoApplicableMethods instance will be raised, with the arguments passed to the first instance.

Similarly, if the next most-specific methods have ambiguous precedence with respect to each other, __proceed__ will be bound to an AmbiguousMethods instance, and if called, it will raise a new instance.

Thus, a method can either check if __proceed__ is an error instance, or simply invoke it. The NoApplicableMethods and AmbiguousMethods error classes have a common DispatchError base class, so isinstance(__proceed__, overloading.DispatchError) is sufficient to identify whether __proceed__ can be safely called.

(Implementation note: using a magic argument name like __proceed__ could potentially be replaced by a magic function that would be called to obtain the next method. A magic function, however, would degrade performance and might be more difficult to implement on non-CPython platforms. Method chaining via magic argument names, however, can be efficiently implemented on any Python platform that supports creating bound methods from functions -- one simply recursively binds each function to be chained, using the following function or error as the im_self of the bound method.)
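
The chaining idea in the note above can be sketched in plain Python, with functools.partial standing in for bound-method creation (the error and function names here are stand-ins, not the proposed API):

```python
from functools import partial

def no_applicable(*args, **kw):
    # stand-in for raising the PEP's NoApplicableMethods error
    raise LookupError("no applicable methods")

def chain(methods):
    """Bind each implementation (most-specific first) to its
    successor, so each receives the next method as __proceed__."""
    nxt = no_applicable
    for func in reversed(methods):
        nxt = partial(func, nxt)
    return nxt

def got_objects(__proceed__, bar, baz):
    return ["got objects!"]

def got_integers(__proceed__, bar, baz):
    return ["got integers!"] + __proceed__(bar, baz)

foo = chain([got_integers, got_objects])
# foo(1, 2) -> ["got integers!", "got objects!"]
```

Each call pays only the cost of one extra leading argument per link, which is why the PEP notes this scheme can be implemented efficiently.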

"Before" and "After" Methods

In addition to the simple next-method chaining shown above, it is sometimes useful to have other ways of combining methods. For example, the "observer pattern" can sometimes be implemented by adding extra methods to a function that execute before or after the normal implementation.

To support these use cases, the overloading module will supply @before, @after, and @around decorators, that roughly correspond to the same types of methods in the Common Lisp Object System (CLOS), or the corresponding "advice" types in AspectJ.

Like @when, all of these decorators must be passed the function to be overloaded, and can optionally accept a predicate as well:

from overloading import before, after

def begin_transaction(db):
    print "Beginning the actual transaction"

@before(begin_transaction)
def check_single_access(db: SingletonDB):
    if db.inuse:
        raise TransactionError("Database already in use")

@after(begin_transaction)
def start_logging(db: LoggableDB):
    db.set_log_level(VERBOSE)

@before and @after methods are invoked either before or after the main function body, and are never considered ambiguous. That is, it will not cause any errors to have multiple "before" or "after" methods with identical or overlapping signatures. Ambiguities are resolved using the order in which the methods were added to the target function.

"Before" methods are invoked most-specific method first, with ambiguous methods being executed in the order they were added. All "before" methods are called before any of the function's "primary" methods (i.e. normal @overload methods) are executed.

"After" methods are invoked in the reverse order, after all of the function's "primary" methods are executed. That is, they are executed least-specific methods first, with ambiguous methods being executed in the reverse of the order in which they were added.

The return values of both "before" and "after" methods are ignored, and any uncaught exceptions raised by any methods (primary or other) immediately end the dispatching process. "Before" and "after" methods cannot have __proceed__ arguments, as they are not responsible for calling any other methods. They are simply called as a notification before or after the primary methods.

Thus, "before" and "after" methods can be used to check or establish preconditions (e.g. by raising an error if the conditions aren't met) or to ensure postconditions, without needing to duplicate any existing functionality.

"Around" Methods

The @around decorator declares a method as an "around" method. "Around" methods are much like primary methods, except that the least-specific "around" method has higher precedence than the most-specific "before" method.

Unlike "before" and "after" methods, however, "Around" methods are responsible for calling their __proceed__ argument, in order to continue the invocation process. "Around" methods are usually used to transform input arguments or return values, or to wrap specific cases with special error handling or try/finally conditions, e.g.:

from overloading import around

@around(commit_transaction)
def lock_while_committing(__proceed__, db: SingletonDB):
    with db.global_lock:
        return __proceed__(db)

They can also be used to replace the normal handling for a specific case, by not invoking the __proceed__ function.

The __proceed__ given to an "around" method will either be the next applicable "around" method, a DispatchError instance, or a synthetic method object that will call all the "before" methods, followed by the primary method chain, followed by all the "after" methods, and return the result from the primary method chain.

Thus, just as with normal methods, __proceed__ can be checked for DispatchError-ness, or simply invoked. The "around" method should return the value returned by __proceed__, unless of course it wishes to modify or replace it with a different return value for the function as a whole.

Custom Combinations

The decorators described above (@overload, @when, @before, @after, and @around) collectively implement what in CLOS is called the "standard method combination" -- the most common patterns used in combining methods.

Sometimes, however, an application or library may have use for a more sophisticated type of method combination. For example, if you would like to have "discount" methods that return a percentage off, to be subtracted from the value returned by the primary method(s), you might write something like this:

from overloading import always_overrides, merge_by_default
from overloading import Around, Before, After, Method, MethodList

class Discount(MethodList):
    """Apply return values as discounts"""

    def __call__(self, *args, **kw):
        retval = self.tail(*args, **kw)
        for sig, body in self.sorted():
            retval -= retval * body(*args, **kw)
        return retval

# merge discounts by priority
merge_by_default(Discount)

# discounts have precedence over before/after/primary methods
always_overrides(Discount, Before)
always_overrides(Discount, After)
always_overrides(Discount, Method)

# but not over "around" methods
always_overrides(Around, Discount)

# Make a decorator called "discount" that works just like the
# standard decorators...
discount = Discount.make_decorator('discount')

# and now let's use it...
def price(product):
    return product.list_price

@discount(price)
def ten_percent_off_shoes(product: Shoe):
    return Decimal('0.1')

Similar techniques can be used to implement a wide variety of CLOS-style method qualifiers and combination rules. The process of creating custom method combination objects and their corresponding decorators is described in more detail under the Extension API section.

Note, by the way, that the @discount decorator shown will work correctly with any new predicates defined by other code. For example, if zope.interface were to register its interface types to work correctly as argument annotations, you would be able to specify discounts on the basis of its interface types, not just classes or overloading-defined interface types.

Similarly, if a library like RuleDispatch or PEAK-Rules were to register an appropriate predicate implementation and dispatch engine, one would then be able to use those predicates for discounts as well, e.g.:

from somewhere import Pred  # some predicate implementation

@discount(
    price,
    Pred("isinstance(product,Shoe) and"
         " product.material.name=='Blue Suede'")
)
def forty_off_blue_suede_shoes(product):
    return Decimal('0.4')

The process of defining custom predicate types and dispatching engines is also described in more detail under the Extension API section.

Overloading Inside Classes

All of the decorators above have a special additional behavior when they are directly invoked within a class body: the first parameter (other than __proceed__, if present) of the decorated function will be treated as though it had an annotation equal to the class in which it was defined.

That is, this code:

class And(object):
    # ...
    @when(get_conjuncts)
    def __conjuncts(self):
        return self.conjuncts

produces the same effect as this (apart from the existence of a private method):

class And(object):
    # ...

@when(get_conjuncts)
def get_conjuncts_of_and(ob: And):
    return ob.conjuncts

This behavior is both a convenience enhancement when defining lots of methods, and a requirement for safely distinguishing multi-argument overloads in subclasses. Consider, for example, the following code:

class A(object):
    def foo(self, ob):
        print "got an object"

    @overload
    def foo(__proceed__, self, ob:Iterable):
        print "it's iterable!"
        return __proceed__(self, ob)


class B(A):
    foo = A.foo     # foo must be defined in local namespace

    @overload
    def foo(__proceed__, self, ob:Iterable):
        print "B got an iterable!"
        return __proceed__(self, ob)

Due to the implicit class rule, calling B().foo([]) will print "B got an iterable!" followed by "it's iterable!", and finally, "got an object", while A().foo([]) would print only the messages defined in A.

Conversely, without the implicit class rule, the two "Iterable" methods would have the exact same applicability conditions, so calling either A().foo([]) or B().foo([]) would result in an AmbiguousMethods error.

It is currently an open issue to determine the best way to implement this rule in Python 3.0. Under Python 2.x, a class' metaclass was not chosen until the end of the class body, which means that decorators could insert a custom metaclass to do processing of this sort. (This is how RuleDispatch, for example, implements the implicit class rule.)

PEP 3115, however, requires that a class' metaclass be determined before the class body has executed, making it impossible to use this technique for class decoration any more.

At this writing, discussion on this issue is ongoing.

Interfaces and Adaptation

The overloading module provides a simple implementation of interfaces and adaptation. The following example defines an IStack interface, and declares that list objects support it:

from overloading import abstract, Interface

class IStack(Interface):
    @abstract
    def push(self, ob):
        """Push 'ob' onto the stack"""

    @abstract
    def pop(self):
        """Pop a value and return it"""


when(IStack.push, (list, object))(list.append)
when(IStack.pop, (list,))(list.pop)

mylist = []
mystack = IStack(mylist)
mystack.push(42)
assert mystack.pop()==42

The Interface class is a kind of "universal adapter". It accepts a single argument: an object to adapt. It then binds all its methods to the target object, in place of itself. Thus, calling mystack.push(42) is the same as calling IStack.push(mylist, 42).

The @abstract decorator marks a function as being abstract: i.e., having no implementation. If an @abstract function is called, it raises NoApplicableMethods. To become executable, overloaded methods must be added using the techniques previously described. (That is, methods can be added using @when, @before, @after, @around, or any custom method combination decorators.)

In the example above, the list.append method is added as a method for IStack.push() when its arguments are a list and an arbitrary object. Thus, IStack.push(mylist, 42) is translated to list.append(mylist, 42), thereby implementing the desired operation.
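
The delegation mechanics can be modeled with an explicit per-type registry (an illustrative simplification; the real Interface would dispatch through generic functions rather than an exact-type dict):

```python
class StackAdapter:
    """Sketch of the 'universal adapter' idea: each method
    delegates to an implementation registered per concrete type.
    The class and attribute names are invented for illustration."""
    push_impls = {list: list.append}
    pop_impls = {list: list.pop}

    def __init__(self, ob):
        self.ob = ob

    def push(self, value):
        return self.push_impls[type(self.ob)](self.ob, value)

    def pop(self):
        return self.pop_impls[type(self.ob)](self.ob)

mylist = []
mystack = StackAdapter(mylist)
mystack.push(42)   # delegates to list.append(mylist, 42)
```

As in the IStack example, mystack.pop() then returns 42, having operated directly on the underlying list.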

Abstract and Concrete Methods

Note, by the way, that the @abstract decorator is not limited to use in interface definitions; it can be used anywhere that you wish to create an "empty" generic function that initially has no methods. In particular, it need not be used inside a class.

Also note that interface methods need not be abstract; one could, for example, write an interface like this:

class IWriteMapping(Interface):
    @abstract
    def __setitem__(self, key, value):
        """This has to be implemented"""

    def update(self, other:IReadMapping):
        for k, v in IReadMapping(other).items():
            self[k] = v

As long as __setitem__ is defined for some type, the above interface will provide a usable update() implementation. However, if some specific type (or pair of types) has a more efficient way of handling update() operations, an appropriate overload can still be registered for use in that case.

Subclassing and Re-assembly

Interfaces can be subclassed:

class ISizedStack(IStack):
    @abstract
    def __len__(self):
        """Return the number of items on the stack"""

# define __len__ support for ISizedStack
when(ISizedStack.__len__, (list,))(list.__len__)

Or assembled by combining functions from existing interfaces:

class Sizable(Interface):
    __len__ = ISizedStack.__len__

# list now implements Sizable as well as ISizedStack, without
# making any new declarations!

A class can be considered to "adapt to" an interface at a given point in time, if no method defined in the interface is guaranteed to raise a NoApplicableMethods error if invoked on an instance of that class at that point in time.

In normal usage, however, it is "easier to ask forgiveness than permission". That is, it is easier to simply use an interface on an object by adapting it to the interface (e.g. IStack(mylist)) or invoking interface methods directly (e.g. IStack.push(mylist, 42)), than to try to figure out whether the object is adaptable to (or directly implements) the interface.

Implementing an Interface in a Class

It is possible to declare that a class directly implements an interface, using the declare_implementation() function:

from overloading import declare_implementation

class Stack(object):
    def __init__(self):
        self.data = []
    def push(self, ob):
        self.data.append(ob)
    def pop(self):
        return self.data.pop()

declare_implementation(IStack, Stack)

The declare_implementation() call above is roughly equivalent to the following steps:

when(IStack.push, (Stack,object))(lambda self, ob: self.push(ob))
when(IStack.pop, (Stack,))(lambda self: self.pop())

That is, calling IStack.push() or IStack.pop() on an instance of any subclass of Stack will simply delegate to the actual push() or pop() methods thereof.

For the sake of efficiency, calling IStack(s) where s is an instance of Stack may return s rather than an IStack adapter. (Note that calling IStack(x) where x is already an IStack adapter will always return x unchanged; this is an additional optimization allowed in cases where the adaptee is known to directly implement the interface, without adaptation.)

For convenience, it may be useful to declare implementations in the class header, e.g.:

class Stack(metaclass=Implementer, implements=IStack):
    ...

instead of calling declare_implementation() after the end of the suite.

Interfaces as Type Specifiers

Interface subclasses can be used as argument annotations to indicate what type of objects are acceptable to an overload, e.g.:

@overload
def traverse(g: IGraph, s: IStack):
    g = IGraph(g)
    s = IStack(s)
    # etc....

Note, however, that the actual arguments are not changed or adapted in any way by the mere use of an interface as a type specifier. You must explicitly cast the objects to the appropriate interface, as shown above.

Note, however, that other patterns of interface use are possible. For example, other interface implementations might not support adaptation, or might require that function arguments already be adapted to the specified interface. So the exact semantics of using an interface as a type specifier are dependent on the interface objects you actually use.

For the interface objects defined by this PEP, however, the semantics are as described above. An interface I1 is considered "more specific" than another interface I2, if the set of descriptors in I1's inheritance hierarchy are a proper superset of the descriptors in I2's inheritance hierarchy.

So, for example, ISizedStack is more specific than both ISizable and IStack, irrespective of the inheritance relationships between these interfaces. It is purely a question of what operations are included within those interfaces -- and the names of the operations are unimportant.
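
That rule can be sketched as a proper-superset test on the operation names defined across an interface's inheritance hierarchy (simplified; real interface objects would compare the descriptors themselves, not just their names):

```python
def operations(iface):
    """Names defined anywhere in the interface's MRO, ignoring
    class-housekeeping attributes."""
    housekeeping = {"__module__", "__qualname__", "__dict__",
                    "__weakref__", "__doc__", "__firstlineno__",
                    "__static_attributes__"}
    return {name for cls in iface.__mro__ if cls is not object
            for name in vars(cls)} - housekeeping

def iface_more_specific(i1, i2):
    # proper superset of operations => more specific
    return operations(i1) > operations(i2)

class IStack:
    def push(self, ob): ...
    def pop(self): ...

class ISizedStack(IStack):
    def __len__(self): ...
```

Here ISizedStack's operation set {push, pop, __len__} strictly contains IStack's {push, pop}, so it is the more specific interface regardless of the inheritance link between them.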

Interfaces (at least the ones provided by overloading) are always considered less-specific than concrete classes. Other interface implementations can decide on their own specificity rules, both between interfaces and other interfaces, and between interfaces and classes.

Non-Method Attributes in Interfaces

The Interface implementation actually treats all attributes and methods (i.e. descriptors) in the same way: their __get__ (and __set__ and __delete__, if present) methods are called with the wrapped (adapted) object as "self". For functions, this has the effect of creating a bound method linking the generic function to the wrapped object.

For non-function attributes, it may be easiest to specify them using the property built-in, and the corresponding fget, fset, and fdel attributes:

class ILength(Interface):
    @property
    @abstract
    def length(self):
        """Read-only length attribute"""

# ILength(aList).length == list.__len__(aList)
when(ILength.length.fget, (list,))(list.__len__)

Alternatively, methods such as _get_foo() and _set_foo() may be defined as part of the interface, and the property defined in terms of those methods, but this is a bit more difficult for users to implement correctly when creating a class that directly implements the interface, as they would then need to match all the individual method names, not just the name of the property or attribute.

Aspects

The adaptation system described above assumes that adapters are "stateless", which is to say that adapters have no attributes or state apart from that of the adapted object. This follows the "typeclass/instance" model of Haskell, and the concept of "pure" (i.e., transitively composable) adapters.

However, there are occasionally cases where, to provide a complete implementation of some interface, some sort of additional state is required.

One possibility, of course, would be to attach monkeypatched "private" attributes to the adaptee. But this is subject to name collisions, and complicates the process of initialization (since any code using these attributes has to check for their existence and initialize them if necessary). It also doesn't work on objects that don't have a __dict__ attribute.
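The drawbacks of that approach are easy to demonstrate; in this standalone sketch (class names are illustrative), attaching a "private" attribute works only on objects that have a __dict__:

```python
class Plain:
    pass

class Slotted:
    __slots__ = ("value",)   # instances have no __dict__

p = Plain()
p._counter = 1                     # works, but pollutes p.__dict__
print("_counter" in p.__dict__)    # -> True

s = Slotted()
try:
    s._counter = 1                 # no __dict__: attachment fails
except AttributeError:
    print("cannot attach state to a __slots__ instance")
```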

So the Aspect class is provided to make it easy to attach extra information to objects that either:

  1. have a __dict__ attribute (so aspect instances can be stored in it, keyed by aspect class),
  2. support weak referencing (so aspect instances can be managed using a global but thread-safe weak-reference dictionary), or
  3. implement or can be adapted to the overloading.IAspectOwner interface (technically, #1 or #2 imply this).

Subclassing Aspect creates an adapter class whose state is tied to the life of the adapted object.

For example, suppose you would like to count all the times a certain method is called on instances of Target (a classic AOP example). You might do something like:

from overloading import Aspect

class Count(Aspect):
    count = 0

@after(Target.some_method)
def count_after_call(self:Target, *args, **kw):
    Count(self).count += 1

The above code will keep track of the number of times that Target.some_method() is successfully called on an instance of Target (i.e., it will not count errors unless they occur in a more-specific "after" method). Other code can then access the count using Count(someTarget).count.

Aspect instances can of course have __init__ methods, to initialize any data structures. They can use either __slots__ or dictionary-based attributes for storage.
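The sketch below shows one way such a class might work, using strategy #2 above (a weak-reference dictionary keyed by the adapted object). This is an illustration only, not the PEAK-Rules implementation:

```python
import weakref

class Aspect:
    """Illustrative Aspect: per-object state is kept in a
    WeakKeyDictionary keyed by the adapted object, so the aspect
    instance lives and dies with that object."""

    def __new__(cls, subject):
        # Give each Aspect subclass its own registry.
        if "_registry" not in cls.__dict__:
            cls._registry = weakref.WeakKeyDictionary()
        try:
            return cls._registry[subject]
        except KeyError:
            inst = super().__new__(cls)
            cls._registry[subject] = inst
            return inst

class Count(Aspect):
    count = 0

class Target:
    pass

t = Target()
Count(t).count += 1
Count(t).count += 1
print(Count(t).count)  # -> 2
```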

While this facility is rather primitive compared to a full-featured AOP tool like AspectJ, persons who wish to build pointcut libraries or other AspectJ-like features can certainly use Aspect objects and method-combination decorators as a base for building more expressive AOP tools.

XXX spec out full aspect API, including keys, N-to-1 aspects, manual
attach/detach/delete of aspect instances, and the IAspectOwner interface.

Extension API

TODO: explain how all of these work

implies(o1, o2)

declare_implementation(iface, class)

predicate_signatures(ob)

parse_rule(ruleset, body, predicate, actiontype, localdict, globaldict)

combine_actions(a1, a2)

rules_for(f)

Rule objects

ActionDef objects

RuleSet objects

Method objects

MethodList objects

IAspectOwner

Overloading Usage Patterns

In discussion on the Python-3000 list, the proposed feature of allowing arbitrary functions to be overloaded has been somewhat controversial, with some people expressing concern that this would make programs more difficult to understand.

The general thrust of this argument is that one cannot rely on what a function does, if it can be changed from anywhere in the program at any time. Even though in principle this can already happen through monkeypatching or code substitution, it is considered poor practice to do so.

However, providing support for overloading any function (or so the argument goes), is implicitly blessing such changes as being an acceptable practice.

This argument appears to make sense in theory, but it is almost entirely mooted in practice for two reasons.

First, people are generally not perverse, defining a function to do one thing in one place, and then summarily defining it to do the opposite somewhere else! The principal reasons to extend the behavior of a function that has not been specifically made generic are to:

  • Add special cases not contemplated by the original function's author, such as support for additional types.
  • Be notified of an action in order to cause some related operation to be performed, either before the original operation is performed, after it, or both. This can include general-purpose operations like adding logging, timing, or tracing, as well as application-specific behavior.

None of these reasons for adding overloads imply any change to the intended default or overall behavior of the existing function, however. Just as a base class method may be overridden by a subclass for these same two reasons, so too may a function be overloaded to provide for such enhancements.

In other words, universal overloading does not equal arbitrary overloading, in the sense that we need not expect people to randomly redefine the behavior of existing functions in illogical or unpredictable ways. If they did so, it would be no less of a bad practice than any other way of writing illogical or unpredictable code!

However, to distinguish bad practice from good, it is perhaps necessary to clarify further what good practice for defining overloads is. And that brings us to the second reason why generic functions do not necessarily make programs harder to understand: overloading patterns in actual programs tend to follow very predictable patterns. (Both in Python and in languages that have no non-generic functions.)

If a module is defining a new generic operation, it will usually also define any required overloads for existing types in the same place. Likewise, if a module is defining a new type, then it will usually define overloads there for any generic functions that it knows or cares about.
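The same locality can be seen today with the standard library's functools.singledispatch (added later, by PEP 443), a single-dispatch relative of the generic functions discussed here: the module defining the operation registers overloads for existing types right beside it:

```python
from functools import singledispatch

@singledispatch
def pretty(obj):
    """Generic operation with a default for unknown types."""
    return repr(obj)

# Overloads for existing types, defined next to the function itself.
@pretty.register(int)
def _(obj):
    return f"int:{obj}"

@pretty.register(list)
def _(obj):
    return "[" + ", ".join(pretty(x) for x in obj) + "]"

print(pretty([1, "a"]))  # -> [int:1, 'a']
```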

As a result, the vast majority of overloads can be found adjacent to either the function being overloaded, or to a newly-defined type for which the overload is adding support. Thus, overloads are highly discoverable in the common case, as you are either looking at the function or the type, or both.

It is only in rather infrequent cases that one will have overloads in a module that contains neither the function nor the type(s) for which the overload is added. This would be the case if, say, a third party created a bridge of support between one library's types and another library's generic function(s). In such a case, however, best practice suggests prominently advertising this, especially by way of the module name.

For example, PyProtocols defines such bridge support for working with Zope interfaces and legacy Twisted interfaces, using modules called protocols.twisted_support and protocols.zope_support. (These bridges are done with interface adapters, rather than generic functions, but the basic principle is the same.)

In short, understanding programs in the presence of universal overloading need not be any more difficult, given that the vast majority of overloads will either be adjacent to a function, or the definition of a type that is passed to that function.

And, in the absence of incompetence or deliberate intention to be obscure, the few overloads that are not adjacent to the relevant type(s) or function(s), will generally not need to be understood or known about outside the scope where those overloads are defined. (Except in the "support modules" case, where best practice suggests naming them accordingly.)

Implementation Notes

Most of the functionality described in this PEP is already implemented in the in-development version of the PEAK-Rules framework. In particular, the basic overloading and method combination framework (minus the @overload decorator) already exists there. The implementation of all of these features in peak.rules.core is 656 lines of Python at this writing.

peak.rules.core currently relies on the DecoratorTools and BytecodeAssembler modules, but both of these dependencies can be replaced, as DecoratorTools is used mainly for Python 2.3 compatibility and to implement structure types (which can be done with named tuples in later versions of Python). The use of BytecodeAssembler can be replaced using an "exec" or "compile" workaround, given a reasonable effort. (It would be easier to do this if the func_closure attribute of function objects were writable.)

The Interface class has been previously prototyped, but is not included in PEAK-Rules at the present time.

The "implicit class rule" has previously been implemented in the RuleDispatch library. However, it relies on the __metaclass__ hook, which is eliminated by PEP 3115.

I don't currently know how to make @overload play nicely with classmethod and staticmethod in class bodies. It's not really clear if it needs to, however.

pep-3125 Remove Backslash Continuation

PEP:3125
Title:Remove Backslash Continuation
Version:$Revision$
Last-Modified:$Date$
Author:Jim J. Jewett <JimJJewett at gmail.com>
Status:Rejected
Type:Standards Track
Content-Type:text/x-rst
Created:29-Apr-2007
Post-History:29-Apr-2007, 30-Apr-2007, 04-May-2007

Rejection Notice

This PEP is rejected. There wasn't enough support in favor, the feature to be removed isn't all that harmful, and there are some use cases that would become harder.

Abstract

Python initially inherited its parsing from C. While this has been generally useful, there are some remnants which have been less useful for Python, and should be eliminated.

This PEP proposes elimination of terminal \ as a marker for line continuation.

Motivation

One goal for Python 3000 should be to simplify the language by removing unnecessary or duplicated features. There are currently several ways to indicate that a logical line is continued on the following physical line.

The other continuation methods are easily explained as a logical consequence of the semantics they provide; \ is simply an escape character that needs to be memorized.

Existing Line Continuation Methods

Parenthetical Expression - ([{}])

Open a parenthetical expression. It doesn't matter whether people view the "line" as continuing; they do immediately recognize that the expression needs to be closed before the statement can end.

Examples using each of (), [], and {}:

def fn(long_argname1,
       long_argname2):
    settings = {"background": "random noise",
                "volume": "barely audible"}
    restrictions = ["Warrantee void if used",
                    "Notice must be received by yesterday",
                    "Not responsible for sales pitch"]

Note that it is always possible to parenthesize an expression, but it can seem odd to parenthesize an expression that needs parentheses only for the line break:

assert val>4, (
    "val is too small")

Triple-Quoted Strings

Open a triple-quoted string; again, people recognize that the string needs to finish before the next statement starts.

banner_message = """
    Satisfaction Guaranteed,
    or DOUBLE YOUR MONEY BACK!!!





                                    some minor restrictions apply"""

Terminal \ in the general case

A terminal \ indicates that the logical line is continued on the following physical line (after whitespace). There are no particular semantics associated with this. This form is never required, although it may look better (particularly for people with a C language background) in some cases:

>>> assert val>4, \
        "val is too small"

Also note that the \ must be the final character in the line. If your editor navigation can add whitespace to the end of a line, that invisible change will alter the semantics of the program. Fortunately, the typical result is only a syntax error, rather than a runtime bug:

>>> assert val>4, \
        "val is too small"

SyntaxError: unexpected character after line continuation character

This PEP proposes to eliminate this redundant and potentially confusing alternative.

Terminal \ within a string

A terminal \ may also appear within a single-quoted string, at the end of a physical line. This is arguably a special case of the terminal \, but it is a special case that may be worth keeping.

>>> "abd\
 def"
'abd def'
  • Pro: Many of the objections to removing \ termination were really just objections to removing it within literal strings; several people clarified that they want to keep this literal-string usage, but don't mind losing the general case.
  • Pro: The use of \ for an escape character within strings is well known.
  • Contra: But note that this particular usage is odd, because the escaped character (the newline) is invisible, and the special treatment is to delete the character. That said, the \ of \(newline) is still an escape which changes the meaning of the following character.

Alternate Proposals

Several people have suggested alternative ways of marking the line end. Most of these were rejected for not actually simplifying things.

The one exception was to let any unfinished expression signify a line continuation, possibly in conjunction with increased indentation.

This is attractive because it is a generalization of the rule for parentheses.

The initial objections to this were:

  • The amount of whitespace may be contentious; expression continuation should not be confused with opening a new suite.

  • The "expression continuation" markers are not as clearly marked in Python as the grouping punctuation "(), [], {}" marks are:

    # Plus needs another operand, so the line continues
    "abc" +
        "def"
    
    # String ends an expression, so the line does not
    # continue.  The next line is a syntax error because
    # unary plus does not apply to strings.
    "abc"
        + "def"
    
  • Guido objected for technical reasons. [1] The most obvious implementation would require allowing INDENT or DEDENT tokens anywhere, or at least in a widely expanded (and ill-defined) set of locations. While this is of concern only for the internal parsing mechanism (rather than for users), it would be a major new source of complexity.

Andrew Koenig then pointed out [2] a better implementation strategy, and said that it had worked quite well in other languages. [3] The improved suggestion boiled down to:

The whitespace that follows an (operator or) open bracket or parenthesis can include newline characters.

It would be implemented at a very low lexical level -- even before the decision is made to turn a newline followed by spaces into an INDENT or DEDENT token.

There is still some concern that it could mask bugs, as in this example [4]:

# Used to be y+1, the 1 got dropped.  Syntax Error (today)
# would become nonsense.
x = y+
f(x)

Requiring that the continuation be indented more than the initial line would add both safety and complexity.

Open Issues

  • Should \-continuation be removed even inside strings?
  • Should the continuation markers be expanded from just ([{}]) to include lines ending with an operator?
  • As a safety measure, should the continuation line be required to be more indented than the initial line?

References

[1](email subject) PEP 30XZ: Simplified Parsing, van Rossum http://mail.python.org/pipermail/python-3000/2007-April/007063.html
[2](email subject) PEP-3125 -- remove backslash continuation, Koenig http://mail.python.org/pipermail/python-3000/2007-May/007237.html
[3]The Snocone Programming Language, Koenig http://www.snobol4.com/report.htm
[4](email subject) PEP-3125 -- remove backslash continuation, van Rossum http://mail.python.org/pipermail/python-3000/2007-May/007244.html

pep-3126 Remove Implicit String Concatenation

PEP:3126
Title:Remove Implicit String Concatenation
Version:$Revision$
Last-Modified:$Date$
Author:Jim J. Jewett <JimJJewett at gmail.com>, Raymond Hettinger <python at rcn.com>
Status:Rejected
Type:Standards Track
Content-Type:text/x-rst
Created:29-Apr-2007
Post-History:29-Apr-2007, 30-Apr-2007, 07-May-2007

Rejection Notice

This PEP is rejected. There wasn't enough support in favor, the feature to be removed isn't all that harmful, and there are some use cases that would become harder.

Abstract

Python inherited many of its parsing rules from C. While this has been generally useful, there are some individual rules which are less useful for Python, and should be eliminated.

This PEP proposes to eliminate implicit string concatenation based only on the adjacency of literals.

Instead of:

"abc" "def" == "abcdef"

authors will need to be explicit, and either add the strings:

"abc" + "def" == "abcdef"

or join them:

"".join(["abc", "def"]) == "abcdef"

Motivation

One goal for Python 3000 should be to simplify the language by removing unnecessary features. Implicit string concatenation should be dropped in favor of existing techniques. This will simplify the grammar and simplify a user's mental picture of Python. The latter is important for letting the language "fit in your head". A large group of current users do not even know about implicit concatenation. Of those who do know about it, a large portion never use it or habitually avoid it. Of those who both know about it and use it, very few could state with confidence the implicit operator precedence and under what circumstances it is computed when the definition is compiled versus when it is run.

History or Future

Many Python parsing rules are intentionally compatible with C. This is a useful default, but Special Cases need to be justified based on their utility in Python. We should no longer assume that Python programmers will also be familiar with C, so compatibility between languages should be treated as a tie-breaker, rather than a justification.

In C, implicit concatenation is the only way to join strings without using a (run-time) function call to store into a variable. In Python, the strings can be joined (and still recognized as immutable) using more standard Python idioms, such as + or "".join.

Problem

Implicit string concatenation leads to tuples and lists which are shorter than they appear; this in turn can lead to confusing, or even silent, errors. For example, given a function which accepts several parameters, but offers a default value for some of them:

def f(fmt, *args):
    print fmt % args

This looks like a valid call, but isn't:

>>> f("User %s got a message %s",
      "Bob"
      "Time for dinner")

Traceback (most recent call last):
  File "<pyshell#8>", line 2, in <module>
    "Bob"
  File "<pyshell#3>", line 2, in f
    print fmt % args
TypeError: not enough arguments for format string

Calls to this function can silently do the wrong thing:

def g(arg1, arg2=None):
    ...

# silently transformed into the possibly very different
# g("arg1 on this linearg2 on this line", None)
g("arg1 on this line"
  "arg2 on this line")

To quote Jason Orendorff [1]:

Oh. I just realized this happens a lot out here. Where I work, we use scons, and each SConscript has a long list of filenames:

sourceFiles = [
    'foo.c'
    'bar.c',
    #...many lines omitted...
    'q1000x.c']

It's a common mistake to leave off a comma, and then scons complains that it can't find 'foo.cbar.c'. This is pretty bewildering behavior even if you are a Python programmer, and not everyone here is.

Solution

In Python, strings are objects and they support the __add__ operator, so it is possible to write:

"abc" + "def"

Because these are literals, this addition can still be optimized away by the compiler; the CPython compiler already does so. [2]
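The folding is visible by inspecting a compiled code object's constants (the exact behaviour is a CPython implementation detail, not a language guarantee):

```python
code = compile('"abc" + "def"', "<example>", "eval")

# CPython folds the addition of two string literals at compile time,
# so the joined result already appears among the constants.
print("abcdef" in code.co_consts)  # -> True
```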

Other existing alternatives include multiline (triple-quoted) strings, and the join method:

"""This string
   extends across
   multiple lines, but you may want to use something like
   textwrap.dedent
   to clear out the leading spaces
   and/or reformat.
"""


>>> "".join(["empty", "string", "joiner"]) == "emptystringjoiner"
True

>>> " ".join(["space", "string", "joiner"]) == "space string joiner"
True

>>> "\n".join(["multiple", "lines"]) == "multiple\nlines" == (
"""multiple
lines""")
True

Concerns

Operator Precedence

Guido indicated [2] that this change should be handled by PEP, because there were a few edge cases with other string operators, such as the %. (Assuming that str % stays -- it may be eliminated in favor of PEP 3101 -- Advanced String Formatting. [3] [4])

The resolution is to use parentheses to enforce precedence -- the same solution that can be used today:

# Clearest, works today, continues to work, optimization is
# already possible.
("abc %s def" + "ghi") % var

# Already works today; precedence makes the optimization more
# difficult to recognize, but does not change the semantics.
"abc" + "def %s ghi" % var

as opposed to:

# Already fails because modulus (%) is higher precedence than
# addition (+)
("abc %s def" + "ghi" % var)

# Works today only because adjacency is higher precedence than
# modulus.  This will no longer be available.
"abc %s" "def" % var

# So the 2-to-3 translator can automatically replace it with the
# (already valid):
("abc %s" + "def") % var

Long Commands

... build up (what I consider to be) readable SQL queries [5]:

rows = self.executesql("select cities.city, state, country"
                       "    from cities, venues, events, addresses"
                       "    where cities.city like %s"
                       "      and events.active = 1"
                       "      and venues.address = addresses.id"
                       "      and addresses.city = cities.id"
                       "      and events.venue = venues.id",
                       (city,))

Alternatives again include triple-quoted strings, +, and .join:

query="""select cities.city, state, country
             from cities, venues, events, addresses
             where cities.city like %s
               and events.active = 1
               and venues.address = addresses.id
               and addresses.city = cities.id
               and events.venue = venues.id"""

query=( "select cities.city, state, country"
      + "    from cities, venues, events, addresses"
      + "    where cities.city like %s"
      + "      and events.active = 1"
      + "      and venues.address = addresses.id"
      + "      and addresses.city = cities.id"
      + "      and events.venue = venues.id"
      )

query="\n".join(["select cities.city, state, country",
                 "    from cities, venues, events, addresses",
                 "    where cities.city like %s",
                 "      and events.active = 1",
                 "      and venues.address = addresses.id",
                 "      and addresses.city = cities.id",
                 "      and events.venue = venues.id"])

# And yes, you *could* inline any of the above querystrings
# the same way the original was inlined.
rows = self.executesql(query, (city,))

Regular Expressions

Complex regular expressions are sometimes stated in terms of several implicitly concatenated strings with each regex component on a different line and followed by a comment. The plus operator can be inserted here but it does make the regex harder to read. One alternative is to use the re.VERBOSE option. Another alternative is to build-up the regex with a series of += lines:

# Existing idiom which relies on implicit concatenation
r = ('a{20}'  # Twenty A's
     'b{5}'   # Followed by Five B's
     )

# Mechanical replacement
r = ('a{20}'  +# Twenty A's
     'b{5}'   # Followed by Five B's
     )

# already works today
r = '''a{20}  # Twenty A's
       b{5}   # Followed by Five B's
    '''                 # Compiled with the re.VERBOSE flag

# already works today
r = 'a{20}'   # Twenty A's
r += 'b{5}'   # Followed by Five B's
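The re.VERBOSE idiom above compiles and matches as expected; a quick standalone check:

```python
import re

pattern = re.compile(r'''a{20}  # Twenty A's
                         b{5}   # Followed by Five B's
                      ''', re.VERBOSE)

# Whitespace and comments inside the pattern are ignored under re.VERBOSE.
print(bool(pattern.match("a" * 20 + "b" * 5)))  # -> True
```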

Internationalization

Some internationalization tools -- notably xgettext -- have already been special-cased for implicit concatenation, but not for Python's explicit concatenation. [6]

These tools will fail to extract the (already legal):

_("some string" +
  " and more of it")

but often have a special case for:

_("some string"
  " and more of it")

It should also be possible to just use an overly long line (xgettext limits messages to 2048 characters [8], which is less than Python's enforced limit) or triple-quoted strings, but these solutions sacrifice some readability in the code:

# Lines over a certain length are unpleasant.
_("some string and more of it")

# Changing whitespace is not ideal.
_("""Some string
     and more of it""")
_("""Some string
and more of it""")
_("Some string \
and more of it")

I do not see a good short-term resolution for this.

Transition

The proposed new constructs are already legal in current Python, and can be used immediately.

The 2 to 3 translator can be made to mechanically change:

"str1" "str2"
("line1"  #comment
 "line2")

into:

("str1" + "str2")
("line1"   +#comments
 "line2")

If users want to use one of the other idioms, they can; as these idioms are all already legal in Python 2, the edits can be made to the original source, rather than patching up the translator.

Open Issues

Is there a better way to support external text extraction tools, or at least xgettext [7] in particular?

References

[1]Implicit String Concatenation, Orendorff http://mail.python.org/pipermail/python-ideas/2007-April/000397.html
[2](1, 2) Reminder: Py3k PEPs due by April, Hettinger, van Rossum http://mail.python.org/pipermail/python-3000/2007-April/006563.html
[3]PEP 3101, Advanced String Formatting, Talin http://www.python.org/dev/peps/pep-3101/
[4]ps to question Re: Need help completing ABC pep, van Rossum http://mail.python.org/pipermail/python-3000/2007-April/006737.html
[5](email Subject) PEP 30XZ: Simplified Parsing, Skip, http://mail.python.org/pipermail/python-3000/2007-May/007261.html
[6](email Subject) PEP 30XZ: Simplified Parsing http://mail.python.org/pipermail/python-3000/2007-May/007305.html
[7]GNU gettext manual http://www.gnu.org/software/gettext/
[8]Unix man page for xgettext -- Notes section http://www.scit.wlv.ac.uk/cgi-bin/mansec?1+xgettext

pep-3127 Integer Literal Support and Syntax

PEP:3127
Title:Integer Literal Support and Syntax
Version:$Revision$
Last-Modified:$Date$
Author:Patrick Maupin <pmaupin at gmail.com>
Discussions-To:Python-3000 at python.org
Status:Final
Type:Standards Track
Content-Type:text/x-rst
Created:14-Mar-2007
Python-Version:3.0
Post-History:18-Mar-2007

Abstract

This PEP proposes changes to the Python core to rationalize the treatment of string literal representations of integers in different radices (bases). These changes are targeted at Python 3.0, but the backward-compatible parts of the changes should be added to Python 2.6, so that all valid 3.0 integer literals will also be valid in 2.6.

The proposal is that:

  1. octal literals must now be specified with a leading "0o" or "0O" instead of "0";
  2. binary literals are now supported via a leading "0b" or "0B"; and
  3. provision will be made for binary numbers in string formatting.
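As implemented in Python 2.6 and 3.0, the first two proposals behave as follows:

```python
# Octal requires the 0o (or 0O) prefix; binary uses 0b (or 0B).
assert 0o13 == 11
assert 0b1010 == 10

# oct() and bin() mirror the literal forms in Python 3:
assert oct(11) == "0o13"
assert bin(10) == "0b1010"
```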

Motivation

This PEP was motivated by two different issues:

  • The default octal representation of integers is silently confusing to people unfamiliar with C-like languages. It is extremely easy to inadvertently create an integer object with the wrong value, because '013' means 'decimal 11', not 'decimal 13', to the Python language itself, which is not the meaning that most humans would assign to this literal.
  • Some Python users have a strong desire for binary support in the language.

Specification

Grammar specification

The grammar will be changed. For Python 2.6, the changed and new token definitions will be:

integer        ::=     decimalinteger | octinteger | hexinteger |
                       bininteger | oldoctinteger

octinteger     ::=     "0" ("o" | "O") octdigit+

bininteger     ::=     "0" ("b" | "B") bindigit+

oldoctinteger  ::=     "0" octdigit+

bindigit       ::=     "0" | "1"

For Python 3.0, "oldoctinteger" will not be supported, and an exception will be raised if a literal has a leading "0" and a second character which is a digit.

For both versions, this will require changes to PyLong_FromString as well as the grammar.

The documentation will have to be changed as well: grammar.txt, as well as the integer literal section of the reference manual.

PEP 306 should be checked for other issues, and that PEP should be updated if the procedure described therein is insufficient.

int() specification

int(s, 0) will also match the new grammar definition.

This should happen automatically with the changes to PyLong_FromString required for the grammar change.

Also the documentation for int() should be changed to explain that int(s) operates identically to int(s, 10), and the word "guess" should be removed from the description of int(s, 0).
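A few checks of this behaviour as it works in Python 3:

```python
# int(s, 0) follows the literal grammar:
assert int("0o13", 0) == 11
assert int("0b1010", 0) == 10
assert int("13", 0) == 13        # no prefix means decimal

# int(s) is identical to int(s, 10), which tolerates leading zeros:
assert int("013") == int("013", 10) == 13

# Under base 0, a leading zero followed by more digits is rejected:
try:
    int("013", 0)
except ValueError:
    print("013 rejected under base 0")
```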

long() specification

For Python 2.6, the long() implementation and documentation should be changed to reflect the new grammar.

Tokenizer exception handling

If an invalid token contains a leading "0", the exception error message should be more informative than the current "SyntaxError: invalid token". It should explain that decimal numbers may not have a leading zero, and that octal numbers require an "o" after the leading zero.

int() exception handling

The ValueError raised for any call to int() with a string should at least explicitly contain the base in the error message, e.g.:

ValueError: invalid literal for base 8 int(): 09

oct() function

oct() should be updated to output '0o' in front of the octal digits (for 3.0, and 2.6 compatibility mode).

Output formatting

In 3.0, the string % operator alternate syntax for the 'o' option will need to be updated to add '0o' in front, instead of '0'. In 2.6, alternate octal formatting will continue to add only '0'. In neither 2.6 nor 3.0 will the % operator support binary output. This is because binary output is already supported by PEP 3101 (str.format), which is the preferred string formatting method.
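In Python 3 this settled out as follows: the plain 'o' conversion adds no prefix, the '#' alternate form uses '0o', and binary output comes from str.format rather than %:

```python
# Python 3 octal output formatting:
assert "%o" % 8 == "10"        # plain conversion: no prefix
assert "%#o" % 8 == "0o10"     # alternate form: 0o prefix
assert format(8, "#o") == "0o10"

# The % operator has no binary conversion; str.format (PEP 3101) does:
assert format(10, "b") == "1010"
try:
    "%b" % 10
except ValueError:
    print("%b is not a valid conversion")
```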

Transition from 2.6 to 3.0

The 2to3 translator will have to insert 'o' into any octal string literal.

The Py3K compatible option to Python 2.6 should cause attempts to use oldoctinteger literals to raise an exception.

Rationale

Most of the discussion on these issues occurred on the Python-3000 mailing list starting 14-Mar-2007, prompted by an observation that the average human being would be completely mystified upon finding that prepending a "0" to a string of digits changes the meaning of that digit string entirely.

It was pointed out during this discussion that a similar, but shorter, discussion on the subject occurred in January of 2006, prompted by a discovery of the same issue.

Background

For historical reasons, Python's string representation of integers in different bases (radices), for string formatting and token literals, borrows heavily from C. [1] [2] Usage has shown that the historical method of specifying an octal number is confusing, and also that it would be nice to have additional support for binary literals.

Throughout this document, unless otherwise noted, discussions about the string representation of integers relate to these features:

  • Literal integer tokens, as used by normal module compilation, by eval(), and by int(token, 0). (int(token) and int(token, 2-36) are not modified by this proposal.)

    • Under 2.6, long() is treated the same as int()
  • Formatting of integers into strings, either via the % string operator or the new PEP 3101 advanced string formatting method.

It is presumed that:

  • All of these features should have an identical set of supported radices, for consistency.
  • Python source code syntax and int(mystring, 0) should continue to share identical behavior.

Removal of old octal syntax

This PEP proposes that the ability to specify an octal number by using a leading zero will be removed from the language in Python 3.0 (and the Python 3.0 preview mode of 2.6), and that a SyntaxError will be raised whenever a leading "0" is immediately followed by another digit.

During the present discussion, it was almost universally agreed that:

eval('010') == 8

should no longer be true, because that is confusing to new users. It was also proposed that:

eval('0010') == 10

should become true, but that is much more contentious, because it is so inconsistent with usage in other computer languages that mistakes are likely to be made.

Almost all currently popular computer languages, including C/C++, Java, Perl, and JavaScript, treat a sequence of digits with a leading zero as an octal number. Proponents of treating these numbers as decimal instead have a very valid point -- as discussed in Supported radices, below, the entire non-computer world uses decimal numbers almost exclusively. There is ample anecdotal evidence that many people are dismayed and confused if they are confronted with non-decimal radices.

However, in most situations, most people do not write gratuitous zeros in front of their decimal numbers. The primary exception is when an attempt is being made to line up columns of numbers. But since PEP 8 specifically discourages the use of spaces to try to align Python code, one would suspect the same argument should apply to the use of leading zeros for the same purpose.

Finally, although the email discussion often focused on whether anybody actually uses octal any more, and whether we should cater to those old-timers in any case, that is almost entirely beside the point.

Assume the rare complete newcomer to computing who does, either occasionally or as a matter of habit, use leading zeros for decimal numbers. Python could either:

  a. silently do the wrong thing with his numbers, as it does now;
  b. immediately disabuse him of the notion that this is viable syntax (and yes, the SyntaxWarning should be more gentle than it currently is, but that is a subject for a different PEP); or
  c. let him continue to think that computers are happy with multi-digit decimal integers which start with "0".

Some people passionately believe that (c) is the correct answer, and they would be absolutely right if we could be sure that new users will never blossom and grow and start writing AJAX applications.

So while a new Python user may (currently) be mystified at the delayed discovery that his numbers don't work properly, we can fix it by explaining to him immediately that Python doesn't like leading zeros (hopefully with a reasonable message!), or we can delegate this teaching experience to the JavaScript interpreter in the Internet Explorer browser, and let him try to debug his issue there.

Supported radices

This PEP proposes that the supported radices for the Python language will be 2, 8, 10, and 16.

Once it is agreed that the old syntax for octal (radix 8) representation of integers must be removed from the language, the next obvious question is "Do we actually need a way to specify (and display) numbers in octal?"

This question is quickly followed by "What radices does the language need to support?" Because computers are so adept at doing what you tell them to, a tempting answer in the discussion was "all of them." This answer has obviously been given before -- the int() constructor will accept an explicit radix with a value between 2 and 36, inclusive, with the latter number bearing a suspicious arithmetic similarity to the sum of the number of numeric digits and the number of same-case letters in the ASCII alphabet.

But the best argument for inclusion will have a use-case to back it up, so the idea of supporting all radices was quickly rejected, and the only radices left with any real support were decimal, hexadecimal, octal, and binary.

Just because a particular radix has a vocal supporter on the mailing list does not mean that it really should be in the language, so the rest of this section is a treatise on the utility of these particular radices, vs. other possible choices.

Humans use other numeric bases constantly. If I tell you that it is 12:30 PM, I have communicated quantitative information arguably composed of three separate bases (12, 60, and 2), only one of which is in the "agreed" list above. But the communication of that information used two decimal digits each for the base 12 and base 60 information, and, perversely, two letters for information which could have fit in a single decimal digit.

So, in general, humans communicate "normal" (non-computer) numerical information either via names (AM, PM, January, ...) or via use of decimal notation. Obviously, names are seldom used for large sets of items, so decimal is used for everything else. There are studies which attempt to explain why this is so, typically reaching the expected conclusion that the Arabic numeral system is well-suited to human cognition. [3]

There is even support in the history of the design of computers to indicate that decimal notation is the correct way for computers to communicate with humans. One of the first modern computers, ENIAC [4] computed in decimal, even though there were already existing computers which operated in binary.

Decimal computer operation was important enough that many computers, including the ubiquitous PC, have instructions designed to operate on "binary coded decimal" (BCD) [5], a representation which devotes 4 bits to each decimal digit. These instructions date from a time when the most strenuous calculations ever performed on many numbers were the calculations actually required to perform textual I/O with them. It is possible to display BCD without having to perform a divide/remainder operation on every displayed digit, and this was a huge computational win when most hardware didn't have fast divide capability. Another factor contributing to the use of BCD is that, with BCD calculations, rounding will happen exactly the same way that a human would do it, so BCD is still sometimes used in fields like finance, despite the computational and storage superiority of binary.
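To make the representation concrete, here is a minimal sketch of 4-bits-per-digit BCD packing (the `to_bcd` helper is hypothetical, written for illustration):

```python
def to_bcd(n):
    """Pack a non-negative decimal integer into BCD: one nibble per digit."""
    result, shift = 0, 0
    while True:
        result |= (n % 10) << shift   # low decimal digit into the next nibble
        n //= 10
        shift += 4
        if n == 0:
            return result

# Each decimal digit lands in one hex nibble, so display needs no division:
assert to_bcd(59) == 0x59
assert to_bcd(1234) == 0x1234
```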

So, if it weren't for the fact that computers themselves normally use binary for efficient computation and data storage, string representations of integers would probably always be in decimal.

Unfortunately, computer hardware doesn't think like humans, so programmers and hardware engineers must often resort to thinking like the computer, which means that it is important for Python to have the ability to communicate binary data in a form that is understandable to humans.

The requirement that the binary data notation must be cognitively easy for humans to process means that it should contain an integral number of binary digits (bits) per symbol, while otherwise conforming quite closely to the standard tried-and-true decimal notation (position indicates power, larger magnitude on the left, not too many symbols in the alphabet, etc.).

The obvious "sweet spot" for this binary data notation is thus octal, which packs the largest integral number of bits possible into a single symbol chosen from the Arabic numeral alphabet.

In fact, some computer architectures, such as the PDP8 and the 8080/Z80, were defined in terms of octal, in the sense of arranging the bitfields of instructions in groups of three, and using octal representations to describe the instruction set.

Even today, octal is important because of bit-packed structures which consist of 3 bits per field, such as Unix file permission masks.
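For example, a Unix permission mask decodes digit-by-digit in octal, one 3-bit field per symbol:

```python
import stat

mode = 0o754                         # rwxr-xr-- : one octal digit per field
assert (mode >> 6) & 0o7 == 0o7      # owner: rwx
assert (mode >> 3) & 0o7 == 0o5      # group: r-x
assert mode & 0o7 == 0o4             # other: r--
assert stat.S_IMODE(mode) == 0o754   # the permission bits round-trip
```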

But octal has a drawback when used for larger numbers. The number of bits per symbol, while integral, is not itself a power of two. This limitation (given that the word size of most computers these days is a power of two) has resulted in hexadecimal, which is more popular than octal despite the fact that it requires a 60% larger alphabet than decimal, because each symbol contains 4 bits.

Some numbers, such as Unix file permission masks, are easily decoded by humans when represented in octal, but difficult to decode in hexadecimal, while other numbers are much easier for humans to handle in hexadecimal.

Unfortunately, there are also binary numbers used in computers which are not very well communicated in either hexadecimal or octal. Thankfully, fewer people have to deal with these on a regular basis, but on the other hand, this means that several people on the discussion list questioned the wisdom of adding a straight binary representation to Python.

One example of where these numbers are very useful is in reading and writing hardware registers. Sometimes hardware designers will eschew human readability and opt for address space efficiency, by packing multiple bit fields into a single hardware register at unaligned bit locations, and it is tedious and error-prone for a human to reconstruct a 5 bit field which consists of the upper 3 bits of one hex digit, and the lower 2 bits of the next hex digit.
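A small illustration of the problem: with binary notation, a 5-bit field that straddles a hex-digit boundary can be read off directly, whereas in hexadecimal it must be reassembled from pieces of two digits:

```python
reg = 0b11010110                 # 0xD6: the field below spans both hex digits
field = (reg >> 2) & 0b11111     # 5-bit field occupying bits 2..6
assert field == 0b10101
```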

Even if the ability of Python to communicate binary information to humans is only useful for a small technical subset of the population, it is exactly that population subset which contains most, if not all, members of the Python core team, so even straight binary, the least useful of these notations, has several enthusiastic supporters and few, if any, staunch opponents, among the Python community.

Syntax for supported radices

This proposal is to use a "0o" prefix with either uppercase or lowercase "o" for octal, and a "0b" prefix with either uppercase or lowercase "b" for binary.

There was strong support for not supporting uppercase, but this is a separate subject for a different PEP, as 'j' for complex numbers, 'e' for exponent, and 'r' for raw string (to name a few) already support uppercase.

The syntax for delimiting the different radices received a lot of attention in the discussion on Python-3000. There are several (sometimes conflicting) requirements and "nice-to-haves" for this syntax:

  • It should be as compatible with other languages and previous versions of Python as is reasonable, both for the input syntax and for the output (e.g. string % operator) syntax.
  • It should be as obvious to the casual observer as possible.
  • It should be easy to visually distinguish integers formatted in the different bases.

Proposed syntaxes included things like arbitrary radix prefixes, such as 16r100 (256 in hexadecimal), and radix suffixes, similar to the 100h assembler-style suffix. The debate on whether the letter "O" could be used for octal was intense -- an uppercase "O" looks suspiciously similar to a zero in some fonts. Suggestions were made to use a "c" (the second letter of "oCtal"), or even to use a "t" for "ocTal" and an "n" for "biNary" to go along with the "x" for "heXadecimal".

For the string % operator, "o" was already being used to denote octal. Binary formatting is not being added to the % operator because PEP 3101 (Advanced String Formatting) already supports binary, and % formatting will be deprecated in the future.
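The resulting division of labor can be seen in Python 3: the % operator keeps "o" for octal, while binary output is provided by PEP 3101 formatting:

```python
assert "%o" % 64 == "100"               # octal via the % operator
assert format(64, "o") == "100"         # PEP 3101 equivalent
assert format(64, "b") == "1000000"     # binary: format() only
assert format(64, "#b") == "0b1000000"  # alternate form adds the 0b prefix
```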

At the end of the day, since uppercase "O" can look like a zero and uppercase "B" can look like an 8, it was decided that these prefixes should be lowercase only, but, like 'r' for raw string, that can be a preference or style-guide issue.

Open Issues

It was suggested in the discussion that lowercase should be used for all numeric and string special modifiers, such as 'x' for hexadecimal, 'r' for raw strings, 'e' for exponentiation, and 'j' for complex numbers. This is an issue for a separate PEP.

This PEP takes no position on uppercase or lowercase for input, just noting that, for consistency, if uppercase is not to be removed from input parsing for other letters, it should be added for octal and binary, and documenting the changes under this assumption, as there is not yet a PEP about the case issue.

Output formatting may be a different story -- there is already ample precedence for case sensitivity in the output format string, and there would need to be a consensus that there is a valid use-case for the "alternate form" of the string % operator to support uppercase 'B' or 'O' characters for binary or octal output. Currently, PEP 3101 does not even support this alternate capability, and the hex() function does not allow the programmer to specify the case of the 'x' character.

There are still some strong feelings that '0123' should be allowed as a literal decimal in Python 3.0. If this is the right thing to do, this can easily be covered in an additional PEP. This proposal only takes the first step of making '0123' not be a valid octal number, for reasons covered in the rationale.

Is there (or should there be) an option for the 2to3 translator which only makes the 2.6 compatible changes? Should this be run on 2.6 library code before the 2.6 release?

Should a bin() function which matches hex() and oct() be added?

Is hex() really that useful once we have advanced string formatting?
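For the record, a bin() built-in matching hex() and oct() was indeed added (in Python 2.6 and 3.0):

```python
assert bin(10) == "0b1010"
assert hex(255) == "0xff"
assert oct(8) == "0o10"    # oct() output uses the new 0o prefix in Python 3
```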

pep-3128 BList: A Faster List-like Type

PEP:3128
Title:BList: A Faster List-like Type
Version:$Revision$
Last-Modified:$Date$
Author:Daniel Stutzbach <daniel at stutzbachenterprises.com>
Discussions-To:Python 3000 List <python-3000 at python.org>
Status:Rejected
Type:Standards Track
Content-Type:text/x-rst
Created:30-Apr-2007
Python-Version:2.6 and/or 3.0
Post-History:30-Apr-2007

Rejection Notice

Rejected based on Raymond Hettinger's sage advice [4]:

After looking at the source, I think this has almost zero chance for replacing list(). There is too much value in a simple C API, low space overhead for small lists, good performance in common use cases, and having performance that is easily understood. The BList implementation lacks these virtues and it trades off a little performance in common cases for much better performance in uncommon cases. As a Py3.0 PEP, I think it can be rejected.

Depending on its success as a third-party module, it still has a chance for inclusion in the collections module. The essential criteria for that is whether it is a superior choice for some real-world use cases. I've scanned my own code and found no instances where BList would have been preferable to a regular list. However, that scan has a selection bias because it doesn't reflect what I would have written had BList been available. So, after a few months, I intend to poll comp.lang.python for BList success stories. If they exist, then I have no problem with inclusion in the collections module. After all, its learning curve is near zero -- the only cost is the clutter factor stemming from indecision about the most appropriate data structure for a given task.

Abstract

The common case for list operations is on small lists. The current array-based list implementation excels at small lists due to the strong locality of reference and infrequency of memory allocation operations. However, an array takes O(n) time to insert and delete elements, which can become problematic as the list gets large.

This PEP introduces a new data type, the BList, that has array-like and tree-like aspects. It enjoys the same good performance on small lists as the existing array-based implementation, but offers superior asymptotic performance for most operations. This PEP makes two mutually exclusive proposals for including the BList type in Python:

  1. Add it to the collections module, or
  2. Replace the existing list type

Motivation

The BList grew out of the frustration of needing to rewrite intuitive algorithms that worked fine for small inputs but took O(n**2) time for large inputs due to the underlying O(n) behavior of array-based lists. The deque type, introduced in Python 2.4, solved the most common problem of needing a fast FIFO queue. However, the deque type doesn't help if we need to repeatedly insert or delete elements from the middle of a long list.
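The contrast can be shown with the standard types alone (no BList required): a deque handles both ends cheaply, but mid-list insertion still shifts the tail of the underlying array:

```python
from collections import deque

q = deque([1])
q.appendleft(0)                  # O(1) at either end of a deque
assert list(q) == [0, 1]

lst = list(range(6))
lst.insert(3, "x")               # O(n) for an array-based list: shifts the tail
assert lst == [0, 1, 2, "x", 3, 4, 5]
```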

A wide variety of data structures provide good asymptotic performance for insertions and deletions, but they either have O(n) performance for other operations (e.g., linked lists) or have inferior performance for small lists (e.g., binary trees and skip lists).

The BList type proposed in this PEP is based on the principles of B+Trees, which have array-like and tree-like aspects. The BList offers array-like performance on small lists, while offering O(log n) asymptotic performance for all insert and delete operations. Additionally, the BList implements copy-on-write under-the-hood, so even operations like getslice take O(log n) time. The table below compares the asymptotic performance of the current array-based list implementation with the asymptotic performance of the BList.

Operation Array-based list BList
Copy O(n) O(1)
Append O(1) O(log n)
Insert O(n) O(log n)
Get Item O(1) O(log n)
Set Item O(1) O(log n)
Del Item O(n) O(log n)
Iteration O(n) O(n)
Get Slice O(k) O(log n)
Del Slice O(n) O(log n)
Set Slice O(n+k) O(log k + log n)
Extend O(k) O(log k + log n)
Sort O(n log n) O(n log n)
Multiply O(nk) O(log k)

An extensive empirical comparison of Python's array-based list and the BList is available at [2].

Use Case Trade-offs

The BList offers superior performance for many, but not all, operations. Choosing the correct data type for a particular use case depends on which operations are used. Choosing the correct data type as a built-in depends on balancing the importance of different use cases and the magnitude of the performance differences.

For the common use cases of small lists, the array-based list and the BList have similar performance characteristics.

For the slightly less common case of large lists, there are two common use cases where the existing array-based list outperforms the existing BList reference implementation. These are:

  1. A large LIFO stack, where there are many .append() and .pop(-1) operations. Each operation is O(1) for an array-based list, but O(log n) for the BList.
  2. A large list that does not change size. The getitem and setitem calls are O(1) for an array-based list, but O(log n) for the BList.

In performance tests on a 10,000 element list, BLists exhibited a 50% and 5% increase in execution time for these two use cases, respectively.

The performance for the LIFO use case could be improved to O(n) total time for a sequence of n operations (amortized O(1) each), by caching a pointer to the right-most leaf within the root node. For lists that do not change size, the common case of sequential access could likewise be improved to O(n) total time via caching in the root node. However, the performance of these approaches has not been empirically tested.

Many operations exhibit a tremendous speed-up (O(n) to O(log n)) when switching from the array-based list to BLists. In performance tests on a 10,000 element list, operations such as getslice, setslice, and FIFO-style insert and deletes on a BList take only 1% of the time needed on array-based lists.

In light of the large performance speed-ups for many operations, the small performance costs for some operations will be worthwhile for many (but not all) applications.

Implementation

The BList is based on the B+Tree data structure. The BList is a wide, bushy tree where each node contains an array of up to 128 pointers to its children. If the node is a leaf, its children are the user-visible objects that the user has placed in the list. If a node is not a leaf, its children are other BList nodes that are not user-visible. If the list contains only a few elements, they will all be children of a single node that is both the root and a leaf. Since a node is little more than an array of pointers, small lists operate in effectively the same way as an array-based data type and share the same good performance characteristics.

The BList maintains a few invariants to ensure good (O(log n)) asymptotic performance regardless of the sequence of insert and delete operations. The principal invariants are as follows:

  1. Each node has at most 128 children.
  2. Each non-root node has at least 64 children.
  3. The root node has at least 2 children, unless the list contains fewer than 2 elements.
  4. The tree is of uniform depth.

If an insert would cause a node to exceed 128 children, the node spawns a sibling and transfers half of its children to the sibling. The sibling is inserted into the node's parent. If the node is the root node (and thus has no parent), a new parent is created and the depth of the tree increases by one.
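A minimal sketch of the split rule just described, using a hypothetical small fan-out of 4 rather than the real limit of 128 so the example stays readable:

```python
LIMIT = 4  # stand-in for the BList's real 128-child limit

def split(children):
    """If a node exceeds LIMIT children, move half of them to a new sibling."""
    if len(children) <= LIMIT:
        return children, None
    half = len(children) // 2
    return children[:half], children[half:]

node = [1, 2, 3, 4, 5]           # one child too many after an insert
left, sibling = split(node)
assert left == [1, 2] and sibling == [3, 4, 5]
assert split([1, 2, 3]) == ([1, 2, 3], None)   # small nodes are left alone
```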

If a deletion would cause a node to have fewer than 64 children, the node moves elements from one of its siblings if possible. If both of its siblings also only have 64 children, then two of the nodes merge and the empty one is removed from its parent. If the root node is reduced to only one child, its single child becomes the new root (i.e., the depth of the tree is reduced by one).

In addition to tree-like asymptotic performance and array-like performance on small-lists, BLists support transparent copy-on-write. If a non-root node needs to be copied (as part of a getslice, copy, setslice, etc.), the node is shared between multiple parents instead of being copied. If it needs to be modified later, it will be copied at that time. This is completely behind-the-scenes; from the user's point of view, the BList works just like a regular Python list.

Memory Usage

In the worst case, the leaf nodes of a BList have only 64 children each, rather than a full 128, meaning that memory usage is around twice that of a best-case array implementation. Non-leaf nodes use up a negligible amount of additional memory, since there are at least 63 times as many leaf nodes as non-leaf nodes.

The existing array-based list implementation must grow and shrink as items are added and removed. To be efficient, it grows and shrinks only when the list has grown or shrunk exponentially. In the worst case, it, too, uses twice as much memory as the best case.

In summary, the BList's memory footprint is not significantly different from the existing array-based implementation.

Backwards Compatibility

If the BList is added to the collections module, backwards compatibility is not an issue. This section focuses on the option of replacing the existing array-based list with the BList. For users of the Python interpreter, a BList has an identical interface to the current list-implementation. For virtually all operations, the behavior is identical, aside from execution speed.

For the C API, BList has a different interface than the existing list-implementation. Due to its more complex structure, the BList does not lend itself well to poking and prodding by external sources. Thankfully, the existing list-implementation defines an API of functions and macros for accessing data from list objects. Google Code Search suggests that the majority of third-party modules use the well-defined API rather than relying on the list's structure directly. The table below summarizes the search queries and results:

Search String Number of Results
PyList_GetItem 2,000
PySequence_GetItem 800
PySequence_Fast_GET_ITEM 100
PyList_GET_ITEM 400
[^a-zA-Z_]ob_item 100

Replacing the array-based list can be achieved in one of two ways:

  1. Redefine the various accessor functions and macros in listobject.h to access a BList instead. The interface would be unchanged. The functions can easily be redefined. The macros need a bit more care and would have to resort to function calls for large lists.

    The macros would need to evaluate their arguments more than once, which could be a problem if the arguments have side effects. A Google Code Search for "PyList_GET_ITEM([^)]+(" found only a handful of cases where this occurs, so the impact appears to be low.

    The few extension modules that use list's undocumented structure directly, instead of using the API, would break. The core code itself uses the accessor macros fairly consistently and should be easy to port.

  2. Deprecate the existing list type, but continue to include it. Extension modules wishing to use the new BList type must do so explicitly. The BList C interface can be changed to match the existing PyList interface so that a simple search-replace will be sufficient for 99% of module writers.

    Existing modules would continue to compile and work without change, but they would need to make a deliberate (but small) effort to migrate to the BList.

    The downside of this approach is that mixing modules that use BLists and array-based lists might lead to slow down if conversions are frequently necessary.

Reference Implementation

A reference implementation of the BList is available for CPython at [1].

The source package also includes a pure Python implementation, originally developed as a prototype for the CPython version. Naturally, the pure Python version is rather slow and the asymptotic improvements don't win out until the list is quite large.

When compiled with Py_DEBUG, the C implementation checks the BList invariants when entering and exiting most functions.

An extensive set of test cases is also included in the source package. The test cases include the existing Python sequence and list test cases as a subset. When the interpreter is built with Py_DEBUG, the test cases also check for reference leaks.

Porting to Other Python Variants

If the BList is added to the collections module, other Python variants can support it in one of three ways:

  1. Make blist an alias for list. The asymptotic performance won't be as good, but it'll work.
  2. Use the pure Python reference implementation. The performance for small lists won't be as good, but it'll work.
  3. Port the reference implementation.

Discussion

This proposal has been discussed briefly on the Python-3000 mailing list [3]. Although a number of people favored the proposal, there were also some objections. The lists below summarize the pros and cons as observed by posters to the thread.

General comments:

  • Pro: Will outperform the array-based list in most cases
  • Pro: "I've implemented variants of this ... a few different times"
  • Con: Desirability and performance in actual applications is unproven

Comments on adding BList to the collections module:

  • Pro: Matching the list-API reduces the learning curve to near-zero
  • Pro: Useful for intermediate-level users; won't get in the way of beginners
  • Con: Proliferation of data types makes the choices for developers harder.

Comments on replacing the array-based list with the BList:

  • Con: Impact on extension modules (addressed in Backwards Compatibility)
  • Con: The use cases where BLists are slower are important (see Use Case Trade-Offs for how these might be addressed).
  • Con: The array-based list code is simple and easy to maintain

To assess the desirability and performance in actual applications, Raymond Hettinger suggested releasing the BList as an extension module (now available at [1]). If it proves useful, he felt it would be a strong candidate for inclusion in 2.6 as part of the collections module. If widely popular, then it could be considered for replacing the array-based list, but not otherwise.

Guido van Rossum commented that he opposed the proliferation of data types, but favored replacing the array-based list if backwards compatibility could be addressed and the BList's performance was uniformly better.

On-going Tasks

  • Reduce the memory footprint of small lists
  • Implement TimSort for BLists, so that best-case sorting is O(n) instead of O(n log n).
  • Implement __reversed__
  • Cache a pointer in the root to the rightmost leaf, to make a sequence of n LIFO operations take O(n) total time.

References

[1](1, 2) Reference Implementations for C and Python: http://www.python.org/pypi/blist/
[2]Empirical performance comparison between Python's array-based list and the blist: http://stutzbachenterprises.com/blist/
[3]Discussion on python-3000 starting at post: http://mail.python.org/pipermail/python-3000/2007-April/006757.html
[4]Raymond Hettinger's feedback on python-3000: http://mail.python.org/pipermail/python-3000/2007-May/007491.html

pep-3129 Class Decorators

PEP:3129
Title:Class Decorators
Version:$Revision$
Last-Modified:$Date$
Author:Collin Winter <collinwinter at google.com>
Status:Final
Type:Standards Track
Content-Type:text/x-rst
Created:1-May-2007
Python-Version:3.0
Post-History:7-May-2007

Abstract

This PEP proposes class decorators, an extension to the function and method decorators introduced in PEP 318.

Rationale

When function decorators were originally debated for inclusion in Python 2.4, class decorators were seen as obscure and unnecessary [1] thanks to metaclasses. After several years' experience with the Python 2.4.x series of releases and an increasing familiarity with function decorators and their uses, the BDFL and the community re-evaluated class decorators and recommended their inclusion in Python 3.0 [2].

The motivating use-case was to make certain constructs more easily expressed and less reliant on implementation details of the CPython interpreter. While it is possible to express class decorator-like functionality using metaclasses, the results are generally unpleasant and the implementation highly fragile [3]. In addition, metaclasses are inherited, whereas class decorators are not, making metaclasses unsuitable for some, single class-specific uses of class decorators. The fact that large-scale Python projects like Zope were going through these wild contortions to achieve something like class decorators won over the BDFL.

Semantics

The semantics and design goals of class decorators are the same as for function decorators ([4], [5]); the only difference is that you're decorating a class instead of a function. The following two snippets are semantically identical:

class A:
  pass
A = foo(bar(A))


@foo
@bar
class A:
  pass

For a detailed examination of decorators, please refer to PEP 318.
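As a concrete (hypothetical) illustration, a `registered` decorator that records class names shows the mechanism at work:

```python
registry = []

def registered(cls):
    """Record the class name, then return the class unchanged."""
    registry.append(cls.__name__)
    return cls

@registered
class Widget:
    pass

assert registry == ["Widget"]
assert Widget.__name__ == "Widget"   # decoration returned the class itself
```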

Implementation

Adapting Python's grammar to support class decorators requires modifying two rules and adding a new rule:

funcdef: [decorators] 'def' NAME parameters ['->' test] ':' suite

compound_stmt: if_stmt | while_stmt | for_stmt | try_stmt |
               with_stmt | funcdef | classdef

need to be changed to

decorated: decorators (classdef | funcdef)

funcdef: 'def' NAME parameters ['->' test] ':' suite

compound_stmt: if_stmt | while_stmt | for_stmt | try_stmt |
               with_stmt | funcdef | classdef | decorated

Adding decorated is necessary to avoid an ambiguity in the grammar.

The Python AST and bytecode must be modified accordingly.

A reference implementation [6] has been provided by Jack Diederich.

Acceptance

There was virtually no discussion following the posting of this PEP, meaning that everyone agreed it should be accepted.

The patch was committed to Subversion as revision 55430.

pep-3130 Access to Current Module/Class/Function

PEP: 3130
Title: Access to Current Module/Class/Function
Version: $Revision$
Last-Modified: $Date$
Author: Jim J. Jewett <jimjjewett at gmail.com>
Status: Rejected
Type: Standards Track
Content-Type: text/plain
Created: 22-Apr-2007
Python-Version: 3.0
Post-History: 22-Apr-2007

Rejection Notice

    This PEP is rejected.  It is not clear how it should be
    implemented or what the precise semantics should be in edge cases,
    and there aren't enough important use cases given.  Response has
    been lukewarm at best.


Abstract

    It is common to need a reference to the current module, class,
    or function, but there is currently no entirely correct way to
    do this.  This PEP proposes adding the keywords __module__,
    __class__, and __function__.


Rationale for __module__

    Many modules export various functions, classes, and other objects,
    but will perform additional activities (such as running unit
    tests) when run as a script.  The current idiom is to test whether
    the module's name has been set to a magic value.

        if __name__ == "__main__": ...

    More complicated introspection requires a module to (attempt to)
    import itself.  If importing the expected name actually produces
    a different module, there is no good workaround.

        # __import__ lets you use a variable, but... it gets more
        # complicated if the module is in a package.
        __import__(__name__)

        # So just go to sys modules... and hope that the module wasn't
        # hidden/removed (perhaps for security), that __name__ wasn't
        # changed, and definitely hope that no other module with the
        # same name is now available.
        class X(object):
            pass

        import sys
        mod = sys.modules[__name__]
        mod = sys.modules[X.__class__.__module__]

    Proposal:  Add a __module__ keyword which refers to the module
    currently being defined (executed).  (But see open issues.)

        # XXX sys.main is still changing as draft progresses.  May
        # really need sys.modules[sys.main]
        if __module__ is sys.main:    # assumes PEP (3122), Cannon
            ...


Rationale for __class__

    Class methods are passed the current instance; from this they can
    determine self.__class__ (or cls, for class methods).
    Unfortunately, this reference is to the object's actual class,
    which may be a subclass of the defining class.  The current
    workaround is to repeat the name of the class, and assume that the
    name will not be rebound.

        class C(B):

            def meth(self):
                super(C, self).meth() # Hope C is never rebound.

        class D(C):

            def meth(self):
                # ?!? issubclass(D,C), so it "works":
                super(C, self).meth() 

    Proposal: Add a __class__ keyword which refers to the class
    currently being defined (executed).  (But see open issues.)

        class C(B):
            def meth(self):
                super(__class__, self).meth()

    Note that super calls may be further simplified by the "New Super"
    PEP (Spealman).  The __class__ (or __this_class__) attribute came
    up in attempts to simplify the explanation and/or implementation
    of that PEP, but was separated out as an independent decision.

    Note that __class__ (or __this_class__) is not quite the same as
    the __thisclass__ property on bound super objects.  The existing
    super.__thisclass__ property refers to the class from which the
    Method Resolution Order search begins.  In the above class D, it
    would refer to (the current reference of name) C.
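    The distinction can be demonstrated with the existing
    super.__thisclass__ attribute (a minimal sketch; B, C, and D are
    stand-ins for the classes above):

```python
class B:
    def meth(self):
        return "B.meth"

class C(B):
    def meth(self):
        # The workaround above: repeat the class name and hope it
        # is never rebound.
        return super(C, self).meth()

class D(C):
    pass

# The bound super object's __thisclass__ names the class where the
# MRO search begins (C here), not the proposed "class currently
# being defined".
print(super(C, D()).__thisclass__.__name__)  # C
```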


Rationale for __function__

    Functions (including methods) often want access to themselves,
    usually for a private storage location or true recursion.  While
    there are several workarounds, all have their drawbacks.

        def counter(_total=[0]):
            # _total shouldn't really appear in the
            # signature at all; the list wrapping and
            # [0] unwrapping obscure the code
            _total[0] += 1
            return _total[0]

        @annotate(total=0)
        def counter():
            # Assume name counter is never rebound:
            counter.total += 1
            return counter.total

        # class exists only to provide storage:
        class _wrap(object):

            __total = 0

            def f(self):
                self.__total += 1
                return self.__total

        # set module attribute to a bound method:
        accum = _wrap().f

        # This function calls "factorial", which should be itself --
        # but the same programming styles that use heavy recursion
        # often have a greater willingness to rebind function names.
        def factorial(n):
            return (n * factorial(n-1) if n else 1)

    Proposal: Add a __function__ keyword which refers to the function
    (or method) currently being defined (executed).  (But see open
    issues.)

        @annotate(total=0)
        def counter():
            # Always refers to this function obj:
            __function__.total += 1
            return __function__.total

        def factorial(n):
            return (n * __function__(n-1) if n else 1)


Backwards Compatibility

    While a user could be using these names already, double-underscore
    names ( __anything__ ) are explicitly reserved to the interpreter.
    It is therefore acceptable to introduce special meaning to these
    names within a single feature release.


Implementation

    Ideally, these names would be keywords treated specially by the
    bytecode compiler.

    Guido has suggested [1] using a cell variable filled in by the
    metaclass.

    Michele Simionato has provided a prototype using bytecode hacks
    [2].  This does not require any new bytecode operators; it just
    modifies which specific sequence of existing operators gets
    run.
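    For comparison, a rough approximation of the proposed behaviour is
    possible today with a decorator; the self_aware name and the
    keyword-only __function__ parameter are inventions for this sketch,
    not part of any proposal:

```python
import functools

# self_aware passes the wrapper itself to the decorated function, so
# the body has a stable self-reference even if the name is rebound.
def self_aware(func):
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        return func(*args, __function__=wrapper, **kwargs)
    return wrapper

@self_aware
def factorial(n, *, __function__=None):
    # __function__ is always this (wrapped) function, even if the
    # name factorial is later rebound.
    return n * __function__(n - 1) if n else 1

print(factorial(5))  # 120
```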


Open Issues

    - Are __module__, __class__, and __function__ the right names?  In
      particular, should the names include the word "this", either as
      __this_module__, __this_class__, and __this_function__, (format
      discussed on the python-3000 and python-ideas lists) or as
      __thismodule__, __thisclass__, and __thisfunction__ (inspired
      by, but conflicting with, current usage of super.__thisclass__).

    - Are all three keywords needed, or should this enhancement be
      limited to a subset of the objects?  Should methods be treated
      separately from other functions?


References

    [1] Fixing super anyone?  Guido van Rossum
        http://mail.python.org/pipermail/python-3000/2007-April/006671.html

    [2] Descriptor/Decorator challenge,  Michele Simionato
        http://groups.google.com/group/comp.lang.python/browse_frm/thread/a6010c7494871bb1/62a2da68961caeb6?lnk=gst&q=simionato+challenge&rnum=1&hl=en#62a2da68961caeb6


Copyright

    This document has been placed in the public domain.



pep-3131 Supporting Non-ASCII Identifiers

PEP:3131
Title:Supporting Non-ASCII Identifiers
Version:$Revision$
Last-Modified:$Date$
Author:Martin von Löwis <martin at v.loewis.de>
Status:Final
Type:Standards Track
Content-Type:text/x-rst
Created:1-May-2007
Python-Version:3.0
Post-History:

Abstract

This PEP proposes support for non-ASCII letters (such as accented characters, Cyrillic, Greek, Kanji, etc.) in Python identifiers.

Rationale

Python code is written by many people in the world who are not familiar with the English language, or even well-acquainted with the Latin writing system. Such developers often desire to define classes and functions with names in their native languages, rather than having to come up with an (often incorrect) English translation of the concept they want to name. Using identifiers in their native language improves the clarity and maintainability of the code among speakers of that language.

For some languages, common transliteration systems exist (in particular, for the Latin-based writing systems). For other languages, users have greater difficulty using Latin to write their native words.

Common Objections

Some objections are often raised against proposals similar to this one.

People claim that they will not be able to use a library if to do so they have to use characters they cannot type on their keyboards. However, it is the choice of the designer of the library to decide on various constraints for using the library: people may not be able to use the library because they cannot get physical access to the source code (because it is not published), or because licensing prohibits usage, or because the documentation is in a language they cannot understand. A developer wishing to make a library widely available needs to make a number of explicit choices (such as publication, licensing, language of documentation, and language of identifiers). It should always be the choice of the author to make these decisions - not the choice of the language designers.

In particular, projects wishing to have wide usage may want to establish a policy that all identifiers, comments, and documentation are written in English (see the GNU coding style guide for an example of such a policy). Restricting the language to ASCII-only identifiers does not force comments and documentation to be in English, or the identifiers to actually be English words, so an additional policy is necessary anyway.

Specification of Language Changes

The syntax of identifiers in Python will be based on the Unicode standard annex UAX-31 [1], with elaboration and changes as defined below.

Within the ASCII range (U+0001..U+007F), the valid characters for identifiers are the same as in Python 2.5. This specification only introduces additional characters from outside the ASCII range. For other characters, the classification uses the version of the Unicode Character Database as included in the unicodedata module.

The identifier syntax is <XID_Start> <XID_Continue>*.

The exact specification of what characters have the XID_Start or XID_Continue properties can be found in the DerivedCoreProperties file of the Unicode data in use by Python (4.1 at the time this PEP was written), see [6]. For reference, the construction rules for these sets are given below. The XID_* properties are derived from ID_Start/ID_Continue, which are derived themselves.

ID_Start is defined as all characters having one of the general categories uppercase letters (Lu), lowercase letters (Ll), titlecase letters (Lt), modifier letters (Lm), other letters (Lo), letter numbers (Nl), the underscore, and characters carrying the Other_ID_Start property. XID_Start then closes this set under normalization, by removing all characters whose NFKC normalization is not of the form ID_Start ID_Continue* anymore.

ID_Continue is defined as all characters in ID_Start, plus nonspacing marks (Mn), spacing combining marks (Mc), decimal number (Nd), connector punctuations (Pc), and characters carrying the Other_ID_Continue property. Again, XID_Continue closes this set under NFKC-normalization; it also adds U+00B7 to support Catalan.

All identifiers are converted into the normal form NFKC while parsing; comparison of identifiers is based on NFKC.
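The effect of NFKC normalization can be observed in any Python 3 interpreter; here the ligature character U+FB01 normalizes to the two letters "fi", so both spellings name the same identifier (a small demonstration, not part of the PEP itself):

```python
import unicodedata

# U+FB01 (LATIN SMALL LIGATURE FI) NFKC-normalizes to "fi".
assert unicodedata.normalize("NFKC", "\ufb01le") == "file"

# The parser applies the same normalization to identifiers, so the
# ligature spelling and the plain spelling are the same name.
ns = {}
exec("\ufb01le = 42", ns)
print(ns["file"])  # 42
```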

A non-normative HTML file listing all valid identifier characters for Unicode 4.1 can be found at http://www.dcl.hpi.uni-potsdam.de/home/loewis/table-3131.html.

Policy Specification

As an addition to the Python coding style, the following policy is prescribed: All identifiers in the Python standard library MUST use ASCII-only identifiers, and SHOULD use English words wherever feasible (in many cases, abbreviations and technical terms are used which aren't English). In addition, string literals and comments must also be in ASCII. The only exceptions are (a) test cases testing the non-ASCII features, and (b) names of authors. Authors whose names are not based on the Latin alphabet MUST provide a Latin transliteration of their names.

As an option, this specification can be applied to Python 2.x. In that case, ASCII-only identifiers would continue to be represented as byte string objects in namespace dictionaries; identifiers with non-ASCII characters would be represented as Unicode strings.

Implementation

The following changes will need to be made to the parser:

  1. If a non-ASCII character is found in the UTF-8 representation of the source code, a forward scan is made to find the first ASCII non-identifier character (e.g. a space or punctuation character).
  2. The entire UTF-8 string is passed to a function to normalize the string to NFKC, and then verify that it follows the identifier syntax. No such callout is made for pure-ASCII identifiers, which continue to be parsed the way they are today. The Unicode database must start including the Other_ID_{Start|Continue} property.
  3. If this specification is implemented for 2.x, reflective libraries (such as pydoc) must be verified to continue to work when Unicode strings appear in __dict__ slots as keys.
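The normalize-then-verify callout of step 2 can be sketched in pure Python; str.isidentifier() applies the XID_Start/XID_Continue rules described above (a sketch, not the actual C implementation):

```python
import unicodedata

# A sketch of the step-2 callout: NFKC-normalize the candidate
# token, then verify it against the identifier syntax.
def verify_identifier(token):
    normalized = unicodedata.normalize("NFKC", token)
    if not normalized.isidentifier():
        raise SyntaxError("invalid identifier: %r" % token)
    return normalized

print(verify_identifier("naïve"))  # naïve
```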

Open Issues

John Nagle suggested consideration of Unicode Technical Standard #39, [2], which discusses security mechanisms for Unicode identifiers. It's not clear how that can precisely apply to this PEP; possible consequences are

  • warn about characters listed as "restricted" in xidmodifications.txt
  • warn about identifiers using mixed scripts
  • somehow perform Confusable Detection

In the latter two approaches, it's not clear how precisely the algorithm should work. For mixed scripts, certain kinds of mixing should probably be allowed - are these the "Common" and "Inherited" scripts mentioned in section 5? For Confusable Detection, it seems one needs two identifiers to compare them for confusion - is it possible to somehow apply it to a single identifier only, and warn?

In follow-up discussion, it turns out that John Nagle actually meant to suggest UTR#36, level "Highly Restrictive", [3].

Several people suggested to allow and ignore formatting control characters (general category Cf), as is done in Java, JavaScript, and C#. It's not clear whether this would improve things (it might for RTL languages); if there is a need, these can be added later.

Some people would like to see an option on selecting support for this PEP at run-time; opinions vary on what precisely that option should be, and what precisely its default value should be. Guido van Rossum commented in [5] that a global flag passed to the interpreter is not acceptable, as it would apply to all modules.

Discussion

Ka-Ping Yee summarizes discussion and further objections in [4] as follows:

  1. Should identifiers be allowed to contain any Unicode letter?

    Drawbacks of allowing non-ASCII identifiers wholesale:

    1. Python will lose the ability to make a reliable round trip to a human-readable display on screen or on paper.
    2. Python will become vulnerable to a new class of security exploits; code and submitted patches will be much harder to inspect.
    3. Humans will no longer be able to validate Python syntax.
    4. Unicode is young; its problems are not yet well understood and solved; tool support is weak.
    5. Languages with non-ASCII identifiers use different character sets and normalization schemes; PEP 3131's choices are non-obvious.
    6. The Unicode bidi algorithm yields an extremely confusing display order for RTL text when digits or operators are nearby.
  2. Should the default behaviour accept only ASCII identifiers, or should it accept identifiers containing non-ASCII characters?

    Arguments for ASCII only by default:

    1. Non-ASCII identifiers by default makes common practice/assumptions subtly/unknowingly wrong; rarely wrong is worse than obviously wrong.
    2. Better to raise a warning than to fail silently when encountering a probably unexpected situation.
    3. All of current usage is ASCII-only; the vast majority of future usage will be ASCII-only.
    4. It is the pockets of Unicode adoption that are parochial, not the ASCII advocates.
    5. Python should audit for ASCII-only identifiers for the same reasons that it audits for tab-space consistency.
    6. Incremental change is safer.
    7. An ASCII-only default favors open-source development and sharing of source code.
    8. Existing projects won't have to waste any brainpower worrying about the implications of Unicode identifiers.
  3. Should non-ASCII identifiers be optional?

    Various voices in support of a flag (although there's been debate over which should be the default, no one seems to be saying that there shouldn't be an off switch)

  4. Should the identifier character set be configurable?

    Various voices proposing and supporting a selectable character set, so that users can get all the benefits of using their own language without the drawbacks of confusable/unfamiliar characters

  5. Which identifier characters should be allowed?

    1. What to do about bidi format control characters?
    2. What about other ID_Continue characters? What about characters that look like punctuation? What about other recommendations in UTS #39? What about mixed-script identifiers?
  6. Which normalization form should be used, NFC or NFKC?

  7. Should source code be required to be in normalized form?

pep-3132 Extended Iterable Unpacking

PEP:3132
Title:Extended Iterable Unpacking
Version:$Revision$
Last-Modified:$Date$
Author:Georg Brandl <georg at python.org>
Status:Final
Type:Standards Track
Content-Type:text/x-rst
Created:30-Apr-2007
Python-Version:3.0
Post-History:

Abstract

This PEP proposes a change to iterable unpacking syntax, allowing a "catch-all" name to be specified which will be assigned a list of all items not assigned to a "regular" name.

An example says more than a thousand words:

>>> a, *b, c = range(5)
>>> a
0
>>> c
4
>>> b
[1, 2, 3]

Rationale

Many algorithms require splitting a sequence in a "first, rest" pair. With the new syntax,

first, rest = seq[0], seq[1:]

is replaced by the cleaner and probably more efficient:

first, *rest = seq

For more complex unpacking patterns, the new syntax looks even cleaner, and the clumsy index handling is not necessary anymore.

Also, if the right-hand value is not a list, but an iterable, it has to be converted to a list before it can be sliced; to avoid creating this temporary list, one has to resort to

it = iter(seq)
first = next(it)
rest = list(it)

Specification

A tuple (or list) on the left side of a simple assignment (unpacking is not defined for augmented assignment) may contain at most one expression prepended with a single asterisk (which is henceforth called a "starred" expression, while the other expressions in the list are called "mandatory"). This designates a subexpression that will be assigned a list of all items from the iterable being unpacked that are not assigned to any of the mandatory expressions, or an empty list if there are no such items.

For example, if seq is a sliceable sequence, all the following assignments are equivalent if seq has at least three elements:

a, b, c = seq[0], list(seq[1:-1]), seq[-1]
a, *b, c = seq
[a, *b, c] = seq

It is an error (as it is currently) if the iterable doesn't contain enough items to assign to all the mandatory expressions.

It is also an error to use the starred expression as a lone assignment target, as in

*a = range(5)

This, however, is valid syntax:

*a, = range(5)
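The trailing comma makes the left side a (one-element) target list rather than a lone starred target, so the starred name simply receives every item:

```python
# Legal under the proposal: a one-element target list whose only
# target is starred, so it absorbs the entire iterable.
*a, = range(5)
print(a)  # [0, 1, 2, 3, 4]
```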

Note that this proposal also applies to tuples in implicit assignment context, such as in a for statement:

for a, *b in [(1, 2, 3), (4, 5, 6, 7)]:
    print(b)

would print out

[2, 3]
[5, 6, 7]

Starred expressions are only allowed as assignment targets, using them anywhere else (except for star-args in function calls, of course) is an error.

Implementation

Grammar change

This feature requires a new grammar rule:

star_expr: ['*'] expr

In these two rules, expr is changed to star_expr:

comparison: star_expr (comp_op star_expr)*
exprlist: star_expr (',' star_expr)* [',']

Changes to the Compiler

A new ASDL expression type Starred is added which represents a starred expression. Note that the starred expression element introduced here is universal and could later be used for other purposes in non-assignment context, such as the yield *iterable proposal.

The compiler is changed to recognize all cases where a starred expression is invalid and flag them with syntax errors.

A new bytecode instruction, UNPACK_EX, is added, whose argument has the number of mandatory targets before the starred target in the lower 8 bits and the number of mandatory targets after the starred target in the upper 8 bits. For unpacking sequences without starred expressions, the old UNPACK_SEQUENCE opcode is kept.

Changes to the Bytecode Interpreter

The function unpack_iterable() in ceval.c is changed to handle the extended unpacking, via an argcntafter parameter. In the UNPACK_EX case, the function will do the following:

  • collect all items for mandatory targets before the starred one
  • collect all remaining items from the iterable in a list
  • pop items for mandatory targets after the starred one from the list
  • push the single items and the resized list on the stack

Shortcuts for unpacking iterables of known types, such as lists or tuples, can be added.
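The four steps above can be sketched in pure Python (a sketch of the semantics only; the real work happens in C inside unpack_iterable()):

```python
# A pure-Python sketch of UNPACK_EX semantics for e.g. `a, *b, c = seq`
# (one mandatory target before the star, one after).
def unpack_ex(iterable, before, after):
    it = iter(iterable)
    front = [next(it) for _ in range(before)]   # mandatory targets before *
    rest = list(it)                             # collect the remainder
    if len(rest) < after:
        raise ValueError("not enough values to unpack")
    starred = rest[:len(rest) - after]          # the resized list
    back = rest[len(rest) - after:]             # mandatory targets after *
    return front + [starred] + back

print(unpack_ex(range(5), 1, 1))  # [0, [1, 2, 3], 4]
```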

The current implementation can be found at the SourceForge Patch tracker [SFPATCH]. It now includes a minimal test case.

Acceptance

After a short discussion on the python-3000 list [1], the PEP was accepted by Guido in its current form. Possible changes discussed were:

  • Only allow a starred expression as the last item in the exprlist. This would simplify the unpacking code a bit and allow for the starred expression to be assigned an iterator. This behavior was rejected because it would be too surprising.
  • Try to give the starred target the same type as the source iterable, for example, b in a, *b = 'hello' would be assigned the string 'ello'. This may seem nice, but is impossible to get right consistently with all iterables.
  • Make the starred target a tuple instead of a list. This would be consistent with a function's *args, but make further processing of the result harder.

pep-3133 Introducing Roles

PEP:3133
Title:Introducing Roles
Version:$Revision$
Last-Modified:$Date$
Author:Collin Winter <collinwinter at google.com>
Status:Rejected
Type:Standards Track
Content-Type:text/x-rst
Requires:3115 3129
Created:1-May-2007
Python-Version:3.0
Post-History:13-May-2007

Rejection Notice

This PEP has helped push PEP 3119 towards a saner, more minimalistic approach. But given the latest version of PEP 3119 I much prefer that. GvR.

Abstract

Python's existing object model organizes objects according to their implementation. It is often desirable -- especially in a duck typing-based language like Python -- to organize objects by the part they play in a larger system (their intent), rather than by how they fulfill that part (their implementation). This PEP introduces the concept of roles, a mechanism for organizing objects according to their intent rather than their implementation.

Rationale

In the beginning were objects. They allowed programmers to marry function and state, and to increase code reusability through concepts like polymorphism and inheritance, and lo, it was good. There came a time, however, when inheritance and polymorphism weren't enough. With the invention of both dogs and trees, we were no longer able to be content with knowing merely, "Does it understand 'bark'?" We now needed to know what a given object thought that "bark" meant.

One solution, the one detailed here, is that of roles, a mechanism orthogonal and complementary to the traditional class/instance system. Whereas classes concern themselves with state and implementation, the roles mechanism deals exclusively with the behaviours embodied in a given class.

This system was originally called "traits" and implemented for Squeak Smalltalk [4]. It has since been adapted for use in Perl 6 [3] where it is called "roles", and it is primarily from there that the concept is now being interpreted for Python 3. Python 3 will preserve the name "roles".

In a nutshell: roles tell you what an object does, classes tell you how an object does it.

In this PEP, I will outline a system for Python 3 that will make it possible to easily determine whether a given object's understanding of "bark" is tree-like or dog-like. (There might also be more serious examples.)

A Note on Syntax

All syntax proposals in this PEP are tentative and should be considered strawmen. The necessary bits that this PEP depends on -- namely PEP 3115's class definition syntax and PEP 3129's class decorators -- are still being formalized and may change. Function names will, of course, be subject to lengthy bikeshedding debates.

Performing Your Role

Static Role Assignment

Let's start out by defining Tree and Dog classes:

class Tree(Vegetable):

  def bark(self):
    return self.is_rough()


class Dog(Animal):

  def bark(self):
    return self.goes_ruff()

While both implement a bark() method with the same signature, they do wildly different things. We need some way of differentiating what we're expecting. Relying on inheritance and a simple isinstance() test will limit code reuse and/or force any dog-like classes to inherit from Dog, whether or not that makes sense. Let's see if roles can help.

@perform_role(Doglike)
class Dog(Animal):
  ...

@perform_role(Treelike)
class Tree(Vegetable):
  ...

@perform_role(SitThere)
class Rock(Mineral):
  ...

We use class decorators from PEP 3129 to associate a particular role or roles with a class. Client code can now verify that an incoming object performs the Doglike role, allowing it to handle Wolf, LaughingHyena and Aibo [1] instances, too.

Roles can be composed via normal inheritance:

@perform_role(Guard, MummysLittleDarling)
class GermanShepherd(Dog):

  def guard(self, the_precious):
    while True:
      if intruder_near(the_precious):
        self.growl()

  def get_petted(self):
    self.swallow_pride()

Here, GermanShepherd instances perform three roles: Guard and MummysLittleDarling are applied directly, whereas Doglike is inherited from Dog.

Assigning Roles at Runtime

Roles can be assigned at runtime, too, by unpacking the syntactic sugar provided by decorators.

Say we import a Robot class from another module, and since we know that Robot already implements our Guard interface, we'd like it to play nicely with guard-related code, too.

>>> perform(Guard)(Robot)

This takes effect immediately and impacts all instances of Robot.

Asking Questions About Roles

Having told our robot army that they're guards, we'd like to check in on them occasionally and make sure they're still at their task.

>>> performs(our_robot, Guard)
True

What about that one robot over there?

>>> performs(that_robot_over_there, Guard)
True

The performs() function is used to ask if a given object fulfills a given role. It cannot be used, however, to ask a class if its instances fulfill a role:

>>> performs(Robot, Guard)
False

This is because the Robot class is not interchangeable with a Robot instance.

Defining New Roles

Empty Roles

Roles are defined like a normal class, but use the Role metaclass.

class Doglike(metaclass=Role):
  ...

Metaclasses are used to indicate that Doglike is a Role in the same way 5 is an int and tuple is a type.

Composing Roles via Inheritance

Roles may inherit from other roles; this has the effect of composing them. Here, instances of Dog will perform both the Doglike and FourLegs roles.

class FourLegs(metaclass=Role):
  pass

class Doglike(FourLegs, Carnivore):
  pass

@perform_role(Doglike)
class Dog(Mammal):
  pass

Requiring Concrete Methods

So far we've only defined empty roles -- not very useful things. Let's now require that all classes that claim to fulfill the Doglike role define a bark() method:

class Doglike(FourLegs):

  def bark(self):
    pass

No decorators are required to flag the method as "abstract", and the method will never be called, meaning whatever code it contains (if any) is irrelevant. Roles provide only abstract methods; concrete default implementations are left to other, better-suited mechanisms like mixins.

Once you have defined a role, and a class has claimed to perform that role, it is essential that that claim be verified. Here, the programmer has misspelled one of the methods required by the role.

@perform_role(FourLegs)
class Horse(Mammal):

  def run_like_teh_wind(self):
    ...

This will cause the role system to raise an exception, complaining that you're missing a run_like_the_wind() method. The role system carries out these checks as soon as a class is flagged as performing a given role.

Concrete methods are required to match exactly the signature demanded by the role. Here, we've attempted to fulfill our role by defining a concrete version of bark(), but we've missed the mark a bit.

@perform_role(Doglike)
class Coyote(Mammal):

  def bark(self, target=moon):
    pass

This method's signature doesn't match exactly with what the Doglike role was expecting, so the role system will throw a bit of a tantrum.
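The exact-signature check could be sketched with the inspect module; signatures_match is a hypothetical helper, not part of the proposal:

```python
import inspect

# A hypothetical sketch of the exact-signature check: the role system
# would compare the role's abstract method signature against the
# performing class's concrete method.
def signatures_match(role_method, impl_method):
    return inspect.signature(role_method) == inspect.signature(impl_method)

class Doglike:                      # stand-in for the role's view
    def bark(self):
        pass

class Coyote:                       # the errant performer
    def bark(self, target="moon"):
        pass

print(signatures_match(Doglike.bark, Coyote.bark))  # False
```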

Mechanism

The following are strawman proposals for how roles might be expressed in Python. The examples here are phrased in a way that the roles mechanism may be implemented without changing the Python interpreter. (Examples adapted from an article on Perl 6 roles by Curtis Poe [2].)

  1. Static class role assignment

    @perform_role(Thieving)
    class Elf(Character):
      ...
    

    perform_role() accepts multiple arguments, such that this is also legal:

    @perform_role(Thieving, Spying, Archer)
    class Elf(Character):
      ...
    

    The Elf class now performs all three of the Thieving, Spying, and Archer roles.

  2. Querying instances

    if performs(my_elf, Thieving):
      ...
    

    The second argument to performs() may also be anything with a __contains__() method, meaning the following is legal:

    if performs(my_elf, set([Thieving, Spying, BoyScout])):
      ...
    

    Like isinstance(), the object needs only to perform a single role out of the set in order for the expression to be true.
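The strawman API above can be prototyped without interpreter changes. Everything in this sketch (the __roles__ attribute, the isinstance(role, Role) test) is one possible implementation choice, not part of the proposal:

```python
class Role(type):
    pass

def perform_role(*roles):
    # Class decorator: record the roles a class claims to perform.
    def decorator(cls):
        cls.__roles__ = frozenset(getattr(cls, "__roles__", ())) | set(roles)
        return cls
    return decorator

def performs(obj, role):
    # Gather roles recorded anywhere in the object's class hierarchy.
    performed = set()
    for klass in type(obj).__mro__:
        performed |= klass.__dict__.get("__roles__", frozenset())
    if isinstance(role, Role):
        # Composition: performing a role implies its base roles, too.
        return any(issubclass(r, role) for r in performed)
    return any(r in role for r in performed)   # container of roles

class FourLegs(metaclass=Role):
    pass

class Doglike(FourLegs):
    pass

@perform_role(Doglike)
class Dog:
    pass

print(performs(Dog(), Doglike), performs(Dog(), FourLegs))  # True True
```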

Relationship to Abstract Base Classes

Early drafts of this PEP [5] envisioned roles as competing with the abstract base classes proposed in PEP 3119. After further discussion and deliberation, a compromise and a delegation of responsibilities and use-cases has been worked out as follows:

  • Roles provide a way of indicating an object's semantics and abstract capabilities. A role may define abstract methods, but only as a way of delineating an interface through which a particular set of semantics are accessed. An Ordering role might require that some set of ordering operators be defined.

    class Ordering(metaclass=Role):
      def __ge__(self, other):
        pass
    
      def __le__(self, other):
        pass
    
      def __ne__(self, other):
        pass
    
      # ...and so on
    

    In this way, we're able to indicate an object's role or function within a larger system without constraining or concerning ourselves with a particular implementation.

  • Abstract base classes, by contrast, are a way of reusing common, discrete units of implementation. For example, one might define an OrderingMixin that implements several ordering operators in terms of other operators.

    class OrderingMixin:
      def __ge__(self, other):
        return self > other or self == other
    
      def __le__(self, other):
        return self < other or self == other
    
      def __ne__(self, other):
        return not self == other
    
      # ...and so on
    

    Using this abstract base class - more properly, a concrete mixin - allows a programmer to define a limited set of operators and let the mixin in effect "derive" the others.

By combining these two orthogonal systems, we're able to both a) provide functionality, and b) alert consumer systems to the presence and availability of this functionality. For example, since the OrderingMixin class above satisfies the interface and semantics expressed in the Ordering role, we say the mixin performs the role:

@perform_role(Ordering)
class OrderingMixin:
  def __ge__(self, other):
    return self > other or self == other

  def __le__(self, other):
    return self < other or self == other

  def __ne__(self, other):
    return not self == other

  # ...and so on

Now, any class that uses the mixin will automatically -- that is, without further programmer effort -- be tagged as performing the Ordering role.

The separation of concerns into two distinct, orthogonal systems is desirable because it allows us to use each one separately. Take, for example, a third-party package providing a RecursiveHash role that indicates a container takes its contents into account when determining its hash value. Since Python's built-in tuple and frozenset classes follow this semantic, the RecursiveHash role can be applied to them.

>>> perform_role(RecursiveHash)(tuple)
>>> perform_role(RecursiveHash)(frozenset)

Now, any code that consumes RecursiveHash objects will be able to consume tuples and frozensets.

Open Issues

Allowing Instances to Perform Different Roles Than Their Class

Perl 6 allows instances to perform different roles than their class. These changes are local to the single instance and do not affect other instances of the class. For example:

my_elf = Elf()
my_elf.goes_on_quest()
my_elf.becomes_evil()
now_performs(my_elf, Thieving) # Only this one elf is a thief
my_elf.steals(["purses", "candy", "kisses"])

In Perl 6, this is done by creating an anonymous class that inherits from the instance's original parent and performs the additional role(s). This is possible in Python 3, though whether it is desirable is another matter.

Inclusion of this feature would, of course, make it much easier to express the works of Charles Dickens in Python:

>>> from literature import role, BildungsRoman
>>> from dickens import Urchin, Gentleman
>>>
>>> with BildungsRoman() as OliverTwist:
...   mr_brownlow = Gentleman()
...   oliver, artful_dodger = Urchin(), Urchin()
...   now_performs(artful_dodger, [role.Thief, role.Scoundrel])
...
...   oliver.has_adventures_with(artful_dodger)
...   mr_brownlow.adopt_orphan(oliver)
...   now_performs(oliver, role.RichWard)

Requiring Attributes

Neal Norwitz has requested the ability to make assertions about the presence of attributes using the same mechanism used to require methods. Since roles take effect at class definition-time, and since the vast majority of attributes are defined at runtime by a class's __init__() method, there doesn't seem to be a good way to check for attributes at the same time as methods.

It may still be desirable to include non-enforced attributes in the role definition, if only for documentation purposes.

Roles of Roles

Under the proposed semantics, it is possible for roles to have roles of their own.

@perform_role(Y)
class X(metaclass=Role):
  ...

While this is possible, it is meaningless, since roles are generally not instantiated. There has been some off-line discussion about giving meaning to this expression, but so far no good ideas have emerged.

class_performs()

It is currently not possible to ask a class if its instances perform a given role. It may be desirable to provide an analogue to performs() such that

>>> isinstance(my_dwarf, Dwarf)
True
>>> performs(my_dwarf, Surly)
True
>>> performs(Dwarf, Surly)
False
>>> class_performs(Dwarf, Surly)
True

Prettier Dynamic Role Assignment

An early draft of this PEP included a separate mechanism for dynamically assigning a role to a class. This was spelled

>>> now_perform(Dwarf, GoldMiner)

This same functionality already exists by unpacking the syntactic sugar provided by decorators:

>>> perform_role(GoldMiner)(Dwarf)

At issue is whether dynamic role assignment is sufficiently important to warrant a dedicated spelling.
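The unpacking relied on above is just the general decorator sugar: writing @deco above a class statement is equivalent to calling deco with the class and rebinding the name. A generic illustration (the tag() decorator here is hypothetical, standing in for perform_role()):

```python
def tag(label):
    """A generic class decorator factory, standing in for perform_role."""
    def decorator(cls):
        cls._tag = label
        return cls
    return decorator

@tag('miner')
class Dwarf:
    pass

class Elf:
    pass

# The decorator line on Dwarf above is sugar for exactly this call:
Elf = tag('archer')(Elf)

assert Dwarf._tag == 'miner'
assert Elf._tag == 'archer'
```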

Syntax Support

Though the phrasings laid out in this PEP are designed so that the roles system could be shipped as a stand-alone package, it may be desirable to add special syntax for defining, assigning and querying roles. One example might be a role keyword, which would translate

class MyRole(metaclass=Role):
  ...

into

role MyRole:
  ...

Assigning a role could take advantage of the class definition arguments proposed in PEP 3115:

class MyClass(performs=MyRole):
  ...

Implementation

A reference implementation is forthcoming.

Acknowledgements

Thanks to Jeffery Yasskin, Talin and Guido van Rossum for several hours of in-person discussion to iron out the differences, overlap and finer points of roles and abstract base classes.

pep-3134 Exception Chaining and Embedded Tracebacks

PEP: 3134
Title: Exception Chaining and Embedded Tracebacks
Version: $Revision$
Last-Modified: $Date$
Author: Ka-Ping Yee
Status: Final
Type: Standards Track
Content-Type: text/plain
Created: 12-May-2005
Python-Version: 3.0
Post-History: 

Numbering Note

    This PEP started its life as PEP 344.  Since it is now targeted
    for Python 3000, it has been moved into the 3xxx space.


Abstract

    This PEP proposes three standard attributes on exception instances:
    the '__context__' attribute for implicitly chained exceptions, the
    '__cause__' attribute for explicitly chained exceptions, and the
    '__traceback__' attribute for the traceback.  A new "raise ... from"
    statement sets the '__cause__' attribute.


Motivation

    During the handling of one exception (exception A), it is possible
    that another exception (exception B) may occur.  In today's Python
    (version 2.4), if this happens, exception B is propagated outward
    and exception A is lost.  In order to debug the problem, it is
    useful to know about both exceptions.  The '__context__' attribute
    retains this information automatically.

    Sometimes it can be useful for an exception handler to intentionally
    re-raise an exception, either to provide extra information or to
    translate an exception to another type.  The '__cause__' attribute
    provides an explicit way to record the direct cause of an exception.

    In today's Python implementation, exceptions are composed of three
    parts: the type, the value, and the traceback.  The 'sys' module
    exposes the current exception in three parallel variables, exc_type,
    exc_value, and exc_traceback, the sys.exc_info() function returns a
    tuple of these three parts, and the 'raise' statement has a
    three-argument form accepting these three parts.  Manipulating
    exceptions often requires passing these three things in parallel,
    which can be tedious and error-prone.  Additionally, the 'except'
    statement can only provide access to the value, not the traceback.
    Adding the '__traceback__' attribute to exception values makes all
    the exception information accessible from a single place.


History

    Raymond Hettinger [1] raised the issue of masked exceptions on
    Python-Dev in January 2003 and proposed a PyErr_FormatAppend()
    function that C modules could use to augment the currently active
    exception with more information.  Brett Cannon [2] brought up
    chained exceptions again in June 2003, prompting a long discussion.

    Greg Ewing [3] identified the case of an exception occurring in a
    'finally' block during unwinding triggered by an original exception,
    as distinct from the case of an exception occurring in an 'except'
    block that is handling the original exception.

    Greg Ewing [4] and Guido van Rossum [5], and probably others, have
    previously mentioned adding a traceback attribute to Exception
    instances.  This is noted in PEP 3000.

    This PEP was motivated by yet another recent Python-Dev reposting
    of the same ideas [6] [7].


Rationale

    The Python-Dev discussions revealed interest in exception chaining
    for two quite different purposes.  To handle the unexpected raising
    of a secondary exception, the exception must be retained implicitly.
    To support intentional translation of an exception, there must be a
    way to chain exceptions explicitly.  This PEP addresses both.

    Several attribute names for chained exceptions have been suggested
    on Python-Dev [2], including 'cause', 'antecedent', 'reason',
    'original', 'chain', 'chainedexc', 'exc_chain', 'excprev',
    'previous', and 'precursor'.  For an explicitly chained exception,
    this PEP suggests '__cause__' because of its specific meaning.  For
    an implicitly chained exception, this PEP proposes the name
    '__context__' because the intended meaning is more specific than
    temporal precedence but less specific than causation: an exception
    occurs in the context of handling another exception.
    
    This PEP suggests names with leading and trailing double-underscores
    for these three attributes because they are set by the Python VM.
    Only in very special cases should they be set by normal assignment.

    This PEP handles exceptions that occur during 'except' blocks and
    'finally' blocks in the same way.  Reading the traceback makes it
    clear where the exceptions occurred, so additional mechanisms for
    distinguishing the two cases would only add unnecessary complexity.

    This PEP proposes that the outermost exception object (the one
    exposed for matching by 'except' clauses) be the most recently
    raised exception for compatibility with current behaviour.

    This PEP proposes that tracebacks display the outermost exception
    last, because this would be consistent with the chronological order
    of tracebacks (from oldest to most recent frame) and because the
    actual thrown exception is easier to find on the last line.

    To keep things simpler, the C API calls for setting an exception
    will not automatically set the exception's '__context__'.  Guido
    van Rossum has expressed concerns with making such changes [8].

    As for other languages, Java and Ruby both discard the original
    exception when another exception occurs in a 'catch'/'rescue' or
    'finally'/'ensure' clause.  Perl 5 lacks built-in structured
    exception handling.  For Perl 6, RFC number 88 [9] proposes an exception
    mechanism that implicitly retains chained exceptions in an array
    named @@.  In that RFC, the most recently raised exception is
    exposed for matching, as in this PEP; also, arbitrary expressions
    (possibly involving @@) can be evaluated for exception matching.

    Exceptions in C# contain a read-only 'InnerException' property that
    may point to another exception.  Its documentation [10] says that
    "When an exception X is thrown as a direct result of a previous
    exception Y, the InnerException property of X should contain a
    reference to Y."  This property is not set by the VM automatically;
    rather, all exception constructors take an optional 'innerException'
    argument to set it explicitly.  The '__cause__' attribute fulfills
    the same purpose as InnerException, but this PEP proposes a new form
    of 'raise' rather than extending the constructors of all exceptions.
    C# also provides a GetBaseException method that jumps directly to
    the end of the InnerException chain; this PEP proposes no analog.

    The reason all three of these attributes are presented together in
    one proposal is that the '__traceback__' attribute provides
    convenient access to the traceback on chained exceptions.


Implicit Exception Chaining

    Here is an example to illustrate the '__context__' attribute.

        def compute(a, b):
            try:
                a/b
            except Exception, exc:
                log(exc)

        def log(exc):
            file = open('logfile.txt')  # oops, forgot the 'w'
            print >>file, exc
            file.close()

    Calling compute(0, 0) causes a ZeroDivisionError.  The compute()
    function catches this exception and calls log(exc), but the log()
    function also raises an exception when it tries to write to a
    file that wasn't opened for writing.

    In today's Python, the caller of compute() gets thrown an IOError.
    The ZeroDivisionError is lost.  With the proposed change, the
    instance of IOError has an additional '__context__' attribute that
    retains the ZeroDivisionError.

    The following more elaborate example demonstrates the handling of a
    mixture of 'finally' and 'except' clauses:

        def main(filename):
            file = open(filename)       # oops, forgot the 'w'
            try:
                try:
                    compute()
                except Exception, exc:
                    log(file, exc)
            finally:
                file.clos()             # oops, misspelled 'close'
        
        def compute():
            1/0
        
        def log(file, exc):
            try:
                print >>file, exc       # oops, file is not writable
            except:
                display(exc)
        
        def display(exc):
            print ex                    # oops, misspelled 'exc'

    Calling main() with the name of an existing file will trigger four
    exceptions.  The ultimate result will be an AttributeError due to
    the misspelling of 'clos', whose __context__ points to a NameError
    due to the misspelling of 'ex', whose __context__ points to an
    IOError due to the file being read-only, whose __context__ points to
    a ZeroDivisionError, whose __context__ attribute is None.

    The proposed semantics are as follows:

    1.  Each thread has an exception context initially set to None.
    
    2.  Whenever an exception is raised, if the exception instance does
        not already have a '__context__' attribute, the interpreter sets
        it equal to the thread's exception context.

    3.  Immediately after an exception is raised, the thread's exception
        context is set to the exception.

    4.  Whenever the interpreter exits an 'except' block by reaching the
        end or executing a 'return', 'yield', 'continue', or 'break'
        statement, the thread's exception context is set to None.
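    These semantics match what Python 3 eventually shipped, so they can be
    observed directly (shown here in modern except-as syntax rather than
    the Python 2 syntax used above):

```python
def check_context():
    try:
        try:
            1 / 0                       # exception A: ZeroDivisionError
        except ZeroDivisionError:
            raise KeyError('oops')      # exception B, raised during handling
    except KeyError as exc:
        return exc                      # capture before the name is unbound

exc = check_context()
assert isinstance(exc.__context__, ZeroDivisionError)   # set implicitly
assert exc.__context__.__context__ is None              # end of the chain
```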


Explicit Exception Chaining

    The '__cause__' attribute on exception objects is always initialized
    to None.  It is set by a new form of the 'raise' statement:

        raise EXCEPTION from CAUSE

    which is equivalent to:

        exc = EXCEPTION
        exc.__cause__ = CAUSE
        raise exc
    
    In the following example, a database provides implementations for a
    few different kinds of storage, with file storage as one kind.  The
    database designer wants errors to propagate as DatabaseError objects
    so that the client doesn't have to be aware of the storage-specific
    details, but doesn't want to lose the underlying error information.

        class DatabaseError(Exception):
            pass

        class FileDatabase(Database):
            def __init__(self, filename):
                try:
                    self.file = open(filename)
                except IOError, exc:
                    raise DatabaseError('failed to open') from exc

    If the call to open() raises an exception, the problem will be
    reported as a DatabaseError, with a __cause__ attribute that reveals
    the IOError as the original cause.
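    In released Python 3, where IOError is an alias of OSError, the
    database example behaves as described; a runnable sketch (the Database
    base class is omitted here for brevity):

```python
class DatabaseError(Exception):
    pass

def open_db(filename):
    try:
        return open(filename)
    except OSError as exc:              # IOError is OSError in Python 3
        raise DatabaseError('failed to open') from exc

try:
    open_db('/no/such/file')
except DatabaseError as exc:
    caught = exc                        # keep a reference past the handler

assert isinstance(caught.__cause__, OSError)    # explicit chain via 'from'
assert caught.__context__ is caught.__cause__   # context is also recorded
```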


Traceback Attribute

    The following example illustrates the '__traceback__' attribute.

        def do_logged(file, work):
            try:
                work()
            except Exception, exc:
                write_exception(file, exc)
                raise exc

        from traceback import format_tb

        def write_exception(file, exc):
            ...
            type = exc.__class__
            message = str(exc)
            lines = format_tb(exc.__traceback__)
            file.write(... type ... message ... lines ...)
            ...

    In today's Python, the do_logged() function would have to extract
    the traceback from sys.exc_traceback or sys.exc_info()[2] and pass
    both the value and the traceback to write_exception().  With the
    proposed change, write_exception() simply gets one argument and
    obtains the exception using the '__traceback__' attribute.

    The proposed semantics are as follows:

    1.  Whenever an exception is caught, if the exception instance does
        not already have a '__traceback__' attribute, the interpreter
        sets it to the newly caught traceback.
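    In Python 3 as released, the attribute is set as described and can be
    fed straight to the 'traceback' module:

```python
import traceback

def work():
    1 / 0

try:
    work()
except ZeroDivisionError as exc:
    lines = traceback.format_tb(exc.__traceback__)

# One formatted entry for the calling frame, one for work() itself.
assert len(lines) == 2
assert 'work' in lines[-1]
```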


Enhanced Reporting

    The default exception handler will be modified to report chained
    exceptions.  The chain of exceptions is traversed by following the
    '__cause__' and '__context__' attributes, with '__cause__' taking
    priority.  In keeping with the chronological order of tracebacks,
    the most recently raised exception is displayed last; that is, the
    display begins with the description of the innermost exception and
    backs up the chain to the outermost exception.  The tracebacks are
    formatted as usual, with one of the lines:

        The above exception was the direct cause of the following exception:

    or

        During handling of the above exception, another exception occurred:

    between tracebacks, depending whether they are linked by __cause__
    or __context__ respectively.  Here is a sketch of the procedure:
    
        def print_chain(exc):
            if exc.__cause__:
                print_chain(exc.__cause__)
                print '\nThe above exception was the direct cause...'
            elif exc.__context__:
                print_chain(exc.__context__)
                print '\nDuring handling of the above exception, ...'
            print_exc(exc)

    In the 'traceback' module, the format_exception, print_exception,
    print_exc, and print_last functions will be updated to accept an
    optional 'chain' argument, True by default.  When this argument is
    True, these functions will format or display the entire chain of
    exceptions as just described.  When it is False, these functions
    will format or display only the outermost exception.

    The 'cgitb' module should also be updated to display the entire
    chain of exceptions.
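    The 'chain' argument was indeed added to the 'traceback' module in
    Python 3; the difference can be seen by formatting the same exception
    both ways:

```python
import traceback

try:
    try:
        1 / 0
    except ZeroDivisionError:
        raise KeyError('lookup failed')
except KeyError as exc:
    chained = ''.join(traceback.format_exception(
        type(exc), exc, exc.__traceback__, chain=True))
    single = ''.join(traceback.format_exception(
        type(exc), exc, exc.__traceback__, chain=False))

# With chain=True the inner exception and the linking line appear.
assert 'During handling of the above exception' in chained
assert 'ZeroDivisionError' in chained
# With chain=False only the outermost exception is formatted.
assert 'ZeroDivisionError' not in single
```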


C API

    The PyErr_Set* calls for setting exceptions will not set the
    '__context__' attribute on exceptions.  PyErr_NormalizeException
    will always set the '__traceback__' attribute to its 'tb' argument and
    the '__context__' and '__cause__' attributes to None.

    A new API function, PyErr_SetContext(context), will help C
    programmers provide chained exception information.  This function
    will first normalize the current exception so it is an instance,
    then set its '__context__' attribute.  A similar API function,
    PyErr_SetCause(cause), will set the '__cause__' attribute.


Compatibility

    Chained exceptions expose the type of the most recent exception, so
    they will still match the same 'except' clauses as they do now.

    The proposed changes should not break any code unless it sets or
    uses attributes named '__context__', '__cause__', or '__traceback__'
    on exception instances.  As of 2005-05-12, the Python standard
    library contains no mention of such attributes.


Open Issue: Extra Information

    Walter Dörwald [11] expressed a desire to attach extra information
    to an exception during its upward propagation without changing its
    type.  This could be a useful feature, but it is not addressed by
    this PEP.  It could conceivably be addressed by a separate PEP
    establishing conventions for other informational attributes on
    exceptions.


Open Issue: Suppressing Context

    As written, this PEP makes it impossible to suppress '__context__',
    since setting exc.__context__ to None in an 'except' or 'finally'
    clause will only result in it being set again when exc is raised.
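    (This issue was later resolved outside this PEP: PEP 409 and PEP 415
    added 'raise ... from None' and a '__suppress_context__' flag in
    Python 3.3.  A sketch in modern Python:)

```python
def convert(value):
    try:
        return int(value)
    except ValueError:
        # 'from None' suppresses the implicit context in the display.
        raise KeyError(value) from None

try:
    convert('not-a-number')
except KeyError as exc:
    caught = exc

assert caught.__suppress_context__                  # display omits context
assert isinstance(caught.__context__, ValueError)   # but it is still recorded
assert caught.__cause__ is None
```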


Open Issue: Limiting Exception Types

    To improve encapsulation, library implementors may want to wrap all
    implementation-level exceptions with an application-level exception.
    One could try to wrap exceptions by writing this:

        try:
            ... implementation may raise an exception ...
        except:
            import sys
            raise ApplicationError from sys.exc_value

    or this:

        try:
            ... implementation may raise an exception ...
        except Exception, exc:
            raise ApplicationError from exc

    but both are somewhat flawed.  It would be nice to be able to name
    the current exception in a catch-all 'except' clause, but that isn't
    addressed here.  Such a feature would allow something like this:

        try:
            ... implementation may raise an exception ...
        except *, exc:
            raise ApplicationError from exc


Open Issue: yield

    The exception context is lost when a 'yield' statement is executed;
    resuming the frame after the 'yield' does not restore the context.
    Addressing this problem is out of the scope of this PEP; it is not a
    new problem, as demonstrated by the following example:

        >>> def gen():
        ...     try:
        ...         1/0
        ...     except:
        ...         yield 3
        ...         raise
        ...
        >>> g = gen()
        >>> g.next()
        3
        >>> g.next()
        TypeError: exceptions must be classes, instances, or strings
        (deprecated), not NoneType


Open Issue: Garbage Collection

    The strongest objection to this proposal has been that it creates
    cycles between exceptions and stack frames [12].  Collection of
    cyclic garbage (and therefore resource release) can be greatly
    delayed.

        >>> try:
        >>>   1/0
        >>> except Exception, err:
        >>>   pass

    will introduce a cycle from err -> traceback -> stack frame -> err,
    keeping all locals in the same scope alive until the next GC happens.

    Today, these locals would go out of scope.  There is lots of code
    which assumes that "local" resources -- particularly open files -- will
    be closed quickly.  If closure has to wait for the next GC, a program
    (which runs fine today) may run out of file handles.

    Making the __traceback__ attribute a weak reference would avoid the
    problems with cyclic garbage.  Unfortunately, it would make saving
    the Exception for later (as unittest does) more awkward, and it would
    not allow as much cleanup of the sys module.

    A possible alternate solution, suggested by Adam Olsen, would be to
    instead turn the reference from the stack frame to the 'err' variable
    into a weak reference when the variable goes out of scope [13].
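    (Python 3 as released reduces, though does not eliminate, this hazard:
    per PEP 3110, the 'except' target name is deleted when the handler
    exits, dropping the frame's direct reference to the exception:)

```python
try:
    1 / 0
except ZeroDivisionError as err:
    caught = True

# The compiler inserts an implicit 'del err' at the end of the handler,
# so the frame no longer keeps the exception (and its traceback) alive.
try:
    err
except NameError:
    err_is_gone = True

assert caught and err_is_gone
```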

  

Possible Future Compatible Changes

    These changes are consistent with the appearance of exceptions as
    a single object rather than a triple at the interpreter level.

    - If PEP 340 or PEP 343 is accepted, replace the three (type, value,
      traceback) arguments to __exit__ with a single exception argument.

    - Deprecate sys.exc_type, sys.exc_value, sys.exc_traceback, and
      sys.exc_info() in favour of a single member, sys.exception.

    - Deprecate sys.last_type, sys.last_value, and sys.last_traceback
      in favour of a single member, sys.last_exception.

    - Deprecate the three-argument form of the 'raise' statement in
      favour of the one-argument form.

    - Upgrade cgitb.html() to accept a single value as its first
      argument as an alternative to a (type, value, traceback) tuple.


Possible Future Incompatible Changes

    These changes might be worth considering for Python 3000.

    - Remove sys.exc_type, sys.exc_value, sys.exc_traceback, and
      sys.exc_info().

    - Remove sys.last_type, sys.last_value, and sys.last_traceback.

    - Replace the three-argument sys.excepthook with a one-argument
      API, and changing the 'cgitb' module to match.

    - Remove the three-argument form of the 'raise' statement.

    - Upgrade traceback.print_exception to accept an 'exception'
      argument instead of the type, value, and traceback arguments.


Implementation

    The __traceback__ and __cause__ attributes and the new raise syntax were
    implemented in revision 57783 [14].


Acknowledgements

    Brett Cannon, Greg Ewing, Guido van Rossum, Jeremy Hylton, Phillip
    J. Eby, Raymond Hettinger, Walter Dörwald, and others.


References

    [1] Raymond Hettinger, "Idea for avoiding exception masking"
        http://mail.python.org/pipermail/python-dev/2003-January/032492.html

    [2] Brett Cannon explains chained exceptions
        http://mail.python.org/pipermail/python-dev/2003-June/036063.html

    [3] Greg Ewing points out masking caused by exceptions during finally
        http://mail.python.org/pipermail/python-dev/2003-June/036290.html

    [4] Greg Ewing suggests storing the traceback in the exception object
        http://mail.python.org/pipermail/python-dev/2003-June/036092.html

    [5] Guido van Rossum mentions exceptions having a traceback attribute
        http://mail.python.org/pipermail/python-dev/2005-April/053060.html

    [6] Ka-Ping Yee, "Tidier Exceptions"
        http://mail.python.org/pipermail/python-dev/2005-May/053671.html

    [7] Ka-Ping Yee, "Chained Exceptions"
        http://mail.python.org/pipermail/python-dev/2005-May/053672.html

    [8] Guido van Rossum discusses automatic chaining in PyErr_Set*
        http://mail.python.org/pipermail/python-dev/2003-June/036180.html

    [9] Tony Olensky, "Omnibus Structured Exception/Error Handling Mechanism"
        http://dev.perl.org/perl6/rfc/88.html
     
   [10] MSDN .NET Framework Library, "Exception.InnerException Property"
        http://msdn.microsoft.com/library/en-us/cpref/html/frlrfsystemexceptionclassinnerexceptiontopic.asp

   [11] Walter Dörwald suggests wrapping exceptions to add details
        http://mail.python.org/pipermail/python-dev/2003-June/036148.html

   [12] Guido van Rossum restates the objection to cyclic trash
        http://mail.python.org/pipermail/python-3000/2007-January/005322.html

   [13] Adam Olsen suggests using a weakref from stack frame to exception
        http://mail.python.org/pipermail/python-3000/2007-January/005363.html

   [14] Patch to implement the bulk of the PEP
        http://svn.python.org/view/python/branches/py3k/Include/?rev=57783&view=rev


Copyright

    This document has been placed in the public domain.


pep-3135 New Super

PEP:3135
Title:New Super
Version:$Revision$
Last-Modified:$Date$
Author:Calvin Spealman <ironfroggy at gmail.com>, Tim Delaney <timothy.c.delaney at gmail.com>, Lie Ryan <lie.1296 at gmail.com>
Status:Final
Type:Standards Track
Content-Type:text/x-rst
Created:28-Apr-2007
Python-Version:3.0
Post-History:28-Apr-2007, 29-Apr-2007 (1), 29-Apr-2007 (2), 14-May-2007, 12-Mar-2009

Numbering Note

This PEP started its life as PEP 367. Since it is now targeted for Python 3000, it has been moved into the 3xxx space.

Abstract

This PEP proposes syntactic sugar for using the super type: an instance of super is constructed automatically, bound to the class in which the method was defined and to the instance (or class object, for classmethods) that the method is currently acting upon.

The premise of the new super usage suggested is as follows:

super().foo(1, 2)

to replace the old:

super(Foo, self).foo(1, 2)

Rationale

The current usage of super requires explicitly passing both the class and the instance it must operate from, breaking the DRY (Don't Repeat Yourself) rule. This makes renaming a class error-prone and is widely considered a wart.

Specification

Within the specification section, some special terminology will be used to distinguish similar and closely related concepts. "super class" will refer to the actual builtin class named "super". A "super instance" is simply an instance of the super class, which is associated with another class and possibly with an instance of that class.

The new super semantics are only available in Python 3.0.

Replacing the old usage of super, calls to the next class in the MRO (method resolution order) can be made without explicitly passing the class object (although doing so will still be supported). Every function will have a cell named __class__ that contains the class object that the function is defined in.

The new syntax:

super()

is equivalent to:

super(__class__, <firstarg>)

where __class__ is the class that the method was defined in, and <firstarg> is the first parameter of the method (normally self for instance methods, and cls for class methods). For functions defined outside a class body, __class__ is not defined, and using super() there will result in a SystemError at runtime.

While super is not a reserved word, the parser recognizes the use of super in a method definition and only passes in the __class__ cell when this is found. Thus, calling a global alias of super without arguments will not necessarily work.
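The zero-argument form works as described in released Python 3; a minimal demonstration, including the compiler-supplied __class__ cell:

```python
class Base:
    def greet(self):
        return 'Base'

class Child(Base):
    def greet(self):
        # Zero-argument form: equivalent to super(__class__, self).greet()
        return 'Child/' + super().greet()

assert Child().greet() == 'Child/Base'
# The parser noticed the use of super and added a '__class__' closure
# cell to the method, as described above.
assert '__class__' in Child.greet.__code__.co_freevars
```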

Closed Issues

Determining the class object to use

The class object is taken from a cell named __class__.

Should super actually become a keyword?

No. It is not necessary for super to become a keyword.

super used with __call__ attributes

It was considered that instantiating super instances the classic way might be a problem, because calling one would look up the __call__ attribute and thus try to perform an automatic super lookup on the next class in the MRO. However, this concern turned out to be unfounded, because calling an object only looks up the __call__ method directly on the object's type. The following example shows this in action.

class A(object):
    def __call__(self):
        return '__call__'
    def __getattribute__(self, attr):
        if attr == '__call__':
            return lambda: '__getattribute__'
a = A()
assert a() == '__call__'
assert a.__call__() == '__getattribute__'

In any case, this issue goes away entirely because classic calls to super(<class>, <instance>) are still supported with the same meaning.

Alternative Proposals

No Changes

Although it is always attractive to keep things as they are, people have sought a change in the way super is called for some time, and for the good reasons mentioned previously:

  • Decoupling from the class name (which might not even be bound to the right class anymore!)
  • Simpler looking, cleaner super calls would be better

Dynamic attribute on super type

The proposal adds a dynamic attribute lookup to the super type, which will automatically determine the proper class and instance parameters. Each super attribute lookup identifies these parameters and performs the super lookup on the instance, as the current super implementation does with the explicit invocation of a super instance upon a class and instance.

This proposal relies on sys._getframe(), which is not appropriate for anything except a prototype implementation.

self.__super__.foo(*args)

The __super__ attribute is mentioned in this PEP in several places, and could be a candidate for the complete solution, actually using it explicitly instead of any super usage directly. However, double-underscore names are usually an internal detail, and are best kept out of everyday code.

super(self, *args) or __super__(self, *args)

This solution only solves the problem of the type indication, does not handle differently named super methods, and is explicit about the name of the instance. It is less flexible, as it cannot be applied to other method names in cases where that is needed. One use case it fails is where a base class has a factory classmethod and a subclass has two factory classmethods, both of which need to make proper super calls to the one in the base class.

super.foo(self, *args)

This variation actually eliminates the problems with locating the proper instance, and if any of the alternatives were pushed into the spotlight, I would want it to be this one.

super(*p, **kw)

There has been the proposal that directly calling super(*p, **kw) would be equivalent to calling the method on the super object with the same name as the method currently being executed, i.e. the following two methods would be equivalent:

def f(self, *p, **kw):
    super.f(*p, **kw)
def f(self, *p, **kw):
    super(*p, **kw)

There is strong sentiment for and against this, but implementation and style concerns are obvious. Guido has suggested that this should be excluded from this PEP on the principle of KISS (Keep It Simple Stupid).

History

12-Mar-2009 - Updated to reflect the current state of implementation.

29-Apr-2007 - Changed title from "Super As A Keyword" to "New Super"
  • Updated much of the language and added a terminology section for clarification in confusing places.
  • Added reference implementation and history sections.
06-May-2007 - Updated by Tim Delaney to reflect discussions on the python-3000
and python-dev mailing lists.

pep-3136 Labeled break and continue

PEP:3136
Title:Labeled break and continue
Version:$Revision$
Last-Modified:$Date$
Author:Matt Chisholm <matt-python at theory.org>
Status:Rejected
Type:Standards Track
Content-Type:text/x-rst
Created:30-Jun-2007
Python-Version:3.1
Post-History:

Abstract

This PEP proposes support for labels in Python's break and continue statements. It is inspired by labeled break and continue in other languages, and the author's own infrequent but persistent need for such a feature.

Introduction

The break statement allows the programmer to terminate a loop early, and the continue statement allows the programmer to move to the next iteration of a loop early. In Python currently, break and continue can apply only to the innermost enclosing loop.

Adding support for labels to the break and continue statements is a logical extension to the existing behavior of the break and continue statements. Labeled break and continue can improve the readability and flexibility of complex code which uses nested loops.

For brevity's sake, the examples and discussion in this PEP usually refer to the break statement. However, all of the examples and motivations apply equally to labeled continue.

Motivation

If the programmer wishes to move to the next iteration of an outer enclosing loop, or to terminate multiple loops at once, he or she has a few less-than-elegant options.

Here's one common way of imitating labeled break in Python (For this and future examples, ... denotes an arbitrary number of intervening lines of code):

for a in a_list:
    time_to_break_out_of_a = False
    ...
    for b in b_list:
        ...
        if condition_one(a, b):
            break
        ...
        if condition_two(a, b):
            time_to_break_out_of_a = True
            break
        ...
    if time_to_break_out_of_a:
        break
    ...

This requires five lines and an extra variable, time_to_break_out_of_a, to keep track of when to break out of the outer (a) loop. And those five lines are spread across many lines of code, making the control flow difficult to understand.

This technique is also error-prone. A programmer modifying this code might inadvertently put new code after the end of the inner (b) loop but before the test for time_to_break_out_of_a, instead of after the test. This means that code which should have been skipped by breaking out of the outer loop gets executed incorrectly.

This could also be written with an exception. The programmer would declare a special exception, wrap the inner loop in a try, and catch the exception and break when you see it:

class BreakOutOfALoop(Exception): pass

for a in a_list:
    ...
    try:
        for b in b_list:
            ...
            if condition_one(a, b):
                break
            ...
            if condition_two(a, b):
                raise BreakOutOfALoop
            ...
    except BreakOutOfALoop:
        break
    ...

Again, though, this requires five lines and a new, single-purpose exception class (instead of a new variable), and spreads basic control flow out over many lines. And it breaks out of the inner loop with break and out of the outer loop with an exception, which is inelegant. [1]

This next strategy might be the most elegant solution, assuming condition_two() is inexpensive to compute:

for a in a_list:
    ...
    for b in b_list:
        ...
        if condition_one(a, b):
            break
        ...
        if condition_two(a, b):
            break
        ...
    if condition_two(a, b):
        break
    ...

Breaking twice is still inelegant. This implementation also relies on the fact that the inner (b) loop bleeds b into the outer for loop, which (although explicitly supported) is both surprising to novices, and in my opinion counter-intuitive and poor practice.

The programmer must also still remember to put in both breaks on condition two and not insert code before the second break. A single conceptual action, breaking out of both loops on condition_two(), requires four lines of code at two indentation levels, possibly separated by many intervening lines at the end of the inner (b) loop.

Other languages

Now, put aside whatever dislike you may have for other programming languages, and consider the syntax of labeled break and continue. In Perl:

ALOOP: foreach $a (@a_array){
    ...
    BLOOP: foreach $b (@b_array){
        ...
        if (condition_one($a,$b)){
            last BLOOP; # same as plain old last;
        }
        ...
        if (condition_two($a,$b)){
            last ALOOP;
        }
        ...
    }
    ...
}

(Notes: Perl uses last instead of break and next instead of continue. The BLOOP label could be omitted; last and next apply to the innermost loop by default.)

PHP uses a number denoting the number of loops to break out of, rather than a label:

foreach ($a_array as $a){
    ....
    foreach ($b_array as $b){
        ....
        if (condition_one($a, $b)){
            break 1;  # same as plain old break
        }
        ....
        if (condition_two($a, $b)){
            break 2;
        }
        ....
    }
    ...
}

C/C++, Java, and Ruby all have similar constructions.

The control flow regarding when to break out of the outer (a) loop is fully encapsulated in the break statement which gets executed when the break condition is satisfied. The depth of the break statement does not matter. Control flow is not spread out. No extra variables, exceptions, or re-checking or storing of control conditions is required. There is no danger that code will get inadvertently inserted after the end of the inner (b) loop and before the break condition is re-checked inside the outer (a) loop. These are the benefits that labeled break and continue would bring to Python.

What this PEP is not

This PEP is not a proposal to add GOTO to Python. GOTO allows a programmer to jump to an arbitrary block or line of code, and generally makes control flow more difficult to follow. Although break and continue (with or without support for labels) can be considered a type of GOTO, it is much more restricted. Another Python construct, yield, could also be considered a form of GOTO -- an even less restrictive one. The goal of this PEP is to propose an extension to the existing control flow tools break and continue, to make control flow easier to understand, not more difficult.

Labeled break and continue cannot transfer control to another function or method. They cannot even transfer control to an arbitrary line of code in the current scope. Currently, they can only affect the behavior of a loop, and are quite different from and much more restricted than GOTO. This extension allows them to affect any enclosing loop in the current namespace, but it does not change their behavior to that of GOTO.

Specification

Under all of these proposals, break and continue by themselves will continue to behave as they currently do, applying to the innermost loop by default.

Proposal A - Explicit labels

The for and while loop syntax will be followed by an optional as or label (contextual) keyword [2] and then an identifier, which may be used to identify the loop out of which to break (or which should be continued).

The break (and continue) statements will be followed by an optional identifier that refers to the loop out of which to break (or which should be continued). Here is an example using the as keyword:

for a in a_list as a_loop:
    ...
    for b in b_list as b_loop:
        ...
        if condition_one(a, b):
            break b_loop  # same as plain old break
        ...
        if condition_two(a, b):
            break a_loop
        ...
    ...

Or, with label instead of as:

for a in a_list label a_loop:
    ...
    for b in b_list label b_loop:
        ...
        if condition_one(a, b):
            break b_loop  # same as plain old break
        ...
        if condition_two(a, b):
            break a_loop
        ...
    ...

This has all the benefits outlined above. It requires modifications to the language syntax: the syntax of the break and continue statements, and of the for and while statements. It requires either a new contextual keyword label or an extension to the contextual keyword as. [3] It is unlikely to require any changes to existing Python programs. Passing an identifier not defined in the local scope to break or continue would raise a NameError.

Proposal B - Numeric break & continue

Rather than altering the syntax of for and while loops, break and continue would take a numeric argument denoting the enclosing loop which is being controlled, similar to PHP.

It seems more Pythonic to me for break and continue to refer to loops indexing from zero, as opposed to indexing from one as PHP does.

for a in a_list:
    ...
    for b in b_list:
        ...
        if condition_one(a,b):
            break 0  # same as plain old break
        ...
        if condition_two(a,b):
            break 1
        ...
    ...

Passing a number that is too large, less than zero, or not an integer to break or continue would (probably) raise an IndexError.

This proposal would not require any changes to existing Python programs.

Proposal C - The reduplicative method

The syntax of break and continue would be altered to allow multiple break and continue statements on the same line. Thus, break break would break out of the first and second enclosing loops.

for a in a_list:
    ...
    for b in b_list:
        ...
        if condition_one(a,b):
            break  # plain old break
        ...
        if condition_two(a,b):
            break break
        ...
    ...

This would also allow the programmer to break out of the inner loop and continue the next outermost simply by writing break continue, [4] and so on. I'm not sure what exception would be raised if the programmer used more break or continue statements than existing loops (perhaps a SyntaxError?).

I expect this proposal to get rejected because it will be judged too difficult to understand.

This proposal would not require any changes to existing Python programs.

Proposal D - Explicit iterators

Rather than embellishing for and while loop syntax with labels, the programmer wishing to use labeled breaks would be required to create the iterator explicitly and assign it to an identifier if he or she wanted to break out of or continue that loop from within a deeper loop.

a_iter = iter(a_list)
for a in a_iter:
    ...
    b_iter = iter(b_list)
    for b in b_iter:
        ...
        if condition_one(a,b):
            break b_iter  # same as plain old break
        ...
        if condition_two(a,b):
            break a_iter
        ...
    ...

Passing a non-iterator object to break or continue would raise a TypeError; and a nonexistent identifier would raise a NameError. This proposal requires only one extra line to create a labeled loop, and no extra lines to break out of a containing loop, and no changes to existing Python programs.

Proposal E - Explicit iterators and iterator methods

This is a variant of Proposal D. Iterators would need to be created explicitly if anything other than the most basic use of break and continue was required. Instead of modifying the syntax of break and continue, .break() and .continue() methods could be added to the Iterator type.

a_iter = iter(a_list)
for a in a_iter:
    ...
    b_iter = iter(b_list)
    for b in b_iter:
        ...
        if condition_one(a,b):
            b_iter.break()  # same as plain old break
        ...
        if condition_two(a,b):
            a_iter.break()
        ...
    ...

I expect that this proposal will get rejected on the grounds of sheer ugliness. However, it requires no changes to the language syntax whatsoever, nor does it require any changes to existing Python programs.

Implementation

I have never looked at the Python language implementation itself, so I have no idea how difficult this would be to implement. If this PEP is accepted, but no one is available to write the feature, I will try to implement it myself.

Footnotes

[1]Breaking some loops with exceptions is inelegant because it's a violation of There's Only One Way To Do It.
[2]Or really any new contextual keyword that the community likes: as, label, labeled, loop, name, named, walrus, whatever.
[3]The use of as in a similar context has been proposed here, http://sourceforge.net/tracker/index.php?func=detail&aid=1714448&group_id=5470&atid=355470 but to my knowledge this idea has not been written up as a PEP.
[4]To continue the Nth outer loop, you would write break N-1 times and then continue. Only one continue would be allowed, and only at the end of a sequence of breaks. continue break or continue continue makes no sense.

Resources

This issue has come up before, although it has never been resolved, to my knowledge.

pep-3137 Immutable Bytes and Mutable Buffer

PEP:3137
Title:Immutable Bytes and Mutable Buffer
Version:$Revision$
Last-Modified:$Date$
Author:Guido van Rossum <guido at python.org>
Status:Final
Type:Standards Track
Content-Type:text/x-rst
Created:26-Sep-2007
Python-Version:3.0
Post-History:26-Sep-2007, 30-Sep-2007

Introduction

After releasing Python 3.0a1 with a mutable bytes type, pressure mounted to add a way to represent immutable bytes. Gregory P. Smith proposed a patch that would allow making a bytes object temporarily immutable by requesting that the data be locked using the new buffer API from PEP 3118. This did not seem the right approach to me.

Jeffrey Yasskin, with the help of Adam Hupp, then prepared a patch to make the bytes type immutable (by crudely removing all mutating APIs) and fix the fall-out in the test suite. This showed that there aren't all that many places that depend on the mutability of bytes, with the exception of code that builds up a return value from small pieces.

Thinking through the consequences, and noticing that using the array module as an ersatz mutable bytes type is far from ideal, and recalling a proposal put forward earlier by Talin, I floated the suggestion to have both a mutable and an immutable bytes type. (This had been brought up before, but until seeing the evidence of Jeffrey's patch I wasn't open to the suggestion.)

Moreover, a possible implementation strategy became clear: use the old PyString implementation, stripped down to remove locale support and implicit conversions to/from Unicode, for the immutable bytes type, and keep the new PyBytes implementation as the mutable bytes type.

The ensuing discussion made it clear that the idea is welcome but needs to be specified more precisely. Hence this PEP.

Advantages

One advantage of having an immutable bytes type is that code objects can use these. It also makes it possible to efficiently create hash tables using bytes for keys; this may be useful when parsing protocols like HTTP or SMTP which are based on bytes representing text.

Porting code that manipulates binary data (or encoded text) in Python 2.x will be easier using the new design than using the original 3.0 design with mutable bytes; simply replace str with bytes and change '...' literals into b'...' literals.

Naming

I propose the following type names at the Python level:

  • bytes is an immutable array of bytes (PyString)
  • bytearray is a mutable array of bytes (PyBytes)
  • memoryview is a bytes view on another object (PyMemory)

The old type named buffer is so similar to the new type memoryview, introduced by PEP 3118, that it is redundant. The rest of this PEP doesn't discuss the functionality of memoryview; it is just mentioned here to justify getting rid of the old buffer type. (An earlier version of this PEP proposed buffer as the new name for PyBytes; in the end this name was deemed too confusing, given the many other uses of the word buffer.)

While eventually it makes sense to change the C API names, this PEP maintains the old C API names, which should be familiar to all.

Summary

Here's a simple ASCII-art table summarizing the type names in various Python versions:

+--------------+-------------+------------+--------------------------+
| C name       | 2.x    repr | 3.0a1 repr | 3.0a2               repr |
+--------------+-------------+------------+--------------------------+
| PyUnicode    | unicode u'' | str     '' | str                   '' |
| PyString     | str      '' | str8   s'' | bytes                b'' |
| PyBytes      | N/A         | bytes  b'' | bytearray bytearray(b'') |
| PyBuffer     | buffer      | buffer     | N/A                      |
| PyMemoryView | N/A         | memoryview | memoryview         <...> |
+--------------+-------------+------------+--------------------------+

Literal Notations

The b'...' notation introduced in Python 3.0a1 returns an immutable bytes object, whatever variation is used. To create a mutable array of bytes, use bytearray(b'...') or bytearray([...]). The latter form takes a list of integers in range(256).
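A quick sketch of the literal and constructor forms described above, as they behave in modern Python 3 (where this PEP's design shipped):

```python
frozen = b'abc'                       # immutable bytes, whatever literal variation is used
mutable = bytearray(b'abc')           # mutable copy of a bytes literal
from_ints = bytearray([97, 98, 99])   # from a list of integers in range(256)

assert mutable == from_ints
mutable[0] = 65                       # mutation is allowed on a bytearray...
assert mutable == bytearray(b'Abc')
try:
    frozen[0] = 65                    # ...but a bytes object rejects item assignment
except TypeError:
    pass
```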

Functionality

PEP 3118 Buffer API

Both bytes and bytearray implement the PEP 3118 buffer API. The bytes type only implements read-only requests; the bytearray type allows writable and data-locked requests as well. The element data type is always 'B' (i.e. unsigned byte).
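The read-only versus writable distinction can be observed through memoryview, the standard library's PEP 3118 consumer (shown with Python 3 behavior):

```python
ba = bytearray(b'hello')
view = memoryview(ba)          # a writable view into the bytearray's buffer
view[0] = ord('H')             # writing through the view mutates the underlying object
assert ba == bytearray(b'Hello')

ro = memoryview(b'hello')      # bytes grants only read-only requests
assert ro.readonly
assert ro.format == 'B'        # the element data type is unsigned byte
```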

Constructors

There are five forms of constructors, applicable to both bytes and bytearray:

  • bytes(<bytes>), bytes(<bytearray>), bytearray(<bytes>), bytearray(<bytearray>): simple copying constructors, with the note that bytes(<bytes>) might return its (immutable) argument, but bytearray(<bytearray>) always makes a copy.
  • bytes(<str>, <encoding>[, <errors>]), bytearray(<str>, <encoding>[, <errors>]): encode a text string. Note that the str.encode() method returns an immutable bytes object. The <encoding> argument is mandatory; <errors> is optional. <encoding> and <errors>, if given, must be str instances.
  • bytes(<memory view>), bytearray(<memory view>): construct a bytes or bytearray object from anything that implements the PEP 3118 buffer API.
  • bytes(<iterable of ints>), bytearray(<iterable of ints>): construct a bytes or bytearray object from a stream of integers in range(256).
  • bytes(<int>), bytearray(<int>): construct a zero-initialized bytes or bytearray object of a given length.
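The constructor forms above can be sketched concretely (Python 3 behavior):

```python
b = b'abc'
assert bytes(b) == b                                  # copying constructor
assert bytes('é', 'utf-8') == b'\xc3\xa9'             # encode a text string; encoding is mandatory
assert bytearray(memoryview(b)) == bytearray(b'abc')  # anything implementing the buffer API
assert bytes([104, 105]) == b'hi'                     # iterable of ints in range(256)
assert bytes(3) == b'\x00\x00\x00'                    # zero-initialized, of a given length
```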

Comparisons

The bytes and bytearray types are comparable with each other and orderable, so that e.g. b'abc' == bytearray(b'abc') < b'abd'.

Comparing either type to a str object for equality returns False regardless of the contents of either operand. Ordering comparisons with str raise TypeError. This is all conformant to the standard rules for comparison and ordering between objects of incompatible types.

(Note: in Python 3.0a1, comparing a bytes instance with a str instance would raise TypeError, on the premise that this would catch the occasional mistake quicker, especially in code ported from Python 2.x. However, a long discussion on the python-3000 list pointed out so many problems with this that it is clearly a bad idea, to be rolled back in 3.0a2 regardless of the fate of the rest of this PEP.)
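The comparison rules as finally specified can be checked directly (Python 3):

```python
assert b'abc' == bytearray(b'abc')        # cross-type equality, compared by content
assert b'abc' < b'abd'                    # ordering works within the byte types
assert (b'abc' == 'abc') is False         # equality with str is always False
try:
    b'abc' < 'abd'                        # ordering against str raises TypeError
except TypeError:
    pass
```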

Slicing

Slicing a bytes object returns a bytes object. Slicing a bytearray object returns a bytearray object.

Slice assignment to a bytearray object accepts anything that implements the PEP 3118 buffer API, or an iterable of integers in range(256).
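For example (Python 3 behavior):

```python
assert type(b'hello'[1:3]) is bytes           # slicing preserves the type
ba = bytearray(b'hello')
assert type(ba[1:3]) is bytearray
ba[0:2] = b'HE'                               # slice assignment from a buffer object
assert ba == bytearray(b'HEllo')
ba[0:2] = [104, 101]                          # or from an iterable of ints in range(256)
assert ba == bytearray(b'hello')
```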

Indexing

Indexing bytes and bytearray returns small ints (like the bytes type in 3.0a1, and like lists or array.array('B')).

Assignment to an item of a bytearray object accepts an int in range(256). (To assign from a bytes sequence, use a slice assignment.)
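In code (Python 3 behavior):

```python
assert b'abc'[0] == 97                 # indexing returns a small int, not a length-1 bytes
ba = bytearray(b'abc')
ba[0] = 65                             # item assignment takes an int in range(256)
assert ba == bytearray(b'Abc')
ba[0:1] = b'a'                         # assigning from a bytes sequence needs a slice
assert ba == bytearray(b'abc')
```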

Str() and Repr()

The str() and repr() functions return the same thing for these objects. The repr() of a bytes object returns a b'...' style literal. The repr() of a bytearray returns a string of the form "bytearray(b'...')".
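Concretely (Python 3 behavior):

```python
assert repr(b'ab') == "b'ab'"                      # bytes repr is a b'...' literal
assert repr(bytearray(b'ab')) == "bytearray(b'ab')"
assert str(b'ab') == repr(b'ab')                   # str() and repr() agree for these objects
```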

Operators

The following operators are implemented by the bytes and bytearray types, except where mentioned:

  • b1 + b2: concatenation. With mixed bytes/bytearray operands, the return type is that of the first argument (this seems arbitrary until you consider how += works).
  • b1 += b2: mutates b1 if it is a bytearray object.
  • b * n, n * b: repetition; n must be an integer.
  • b *= n: mutates b if it is a bytearray object.
  • b1 in b2, b1 not in b2: substring test; b1 can be any object implementing the PEP 3118 buffer API.
  • i in b, i not in b: single-byte membership test; i must be an integer (if it is a length-1 bytes array, it is considered to be a substring test, with the same outcome).
  • len(b): the number of bytes.
  • hash(b): the hash value; only implemented by the bytes type.

Note that the % operator is not implemented. It does not appear worth the complexity.
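A sketch of the operator rules above, including the first-operand rule for mixed concatenation and the hashability difference (Python 3 behavior):

```python
assert type(b'ab' + bytearray(b'cd')) is bytes        # result type follows the first operand
assert type(bytearray(b'ab') + b'cd') is bytearray
ba = bytearray(b'ab')
ba += b'cd'                                           # += mutates the bytearray in place
assert ba == bytearray(b'abcd')
assert b'bc' in b'abcd'                               # substring test
assert 98 in b'abc'                                   # single-byte membership, by int value
assert hash(b'ab') == hash(b'ab')                     # bytes is hashable...
try:
    hash(bytearray(b'ab'))                            # ...bytearray is not
except TypeError:
    pass
```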

Methods

The following methods are implemented by bytes as well as bytearray, with similar semantics. They accept anything that implements the PEP 3118 buffer API for bytes arguments, and return the same type as the object whose method is called ("self"):

.capitalize(), .center(), .count(), .decode(), .endswith(),
.expandtabs(), .find(), .index(), .isalnum(), .isalpha(), .isdigit(),
.islower(), .isspace(), .istitle(), .isupper(), .join(), .ljust(),
.lower(), .lstrip(), .partition(), .replace(), .rfind(), .rindex(),
.rjust(), .rpartition(), .rsplit(), .rstrip(), .split(),
.splitlines(), .startswith(), .strip(), .swapcase(), .title(),
.translate(), .upper(), .zfill()

This is exactly the set of methods present on the str type in Python 2.x, with the exclusion of .encode(). The signatures and semantics are the same too. However, whenever character classes like letter, whitespace, lower case are used, the ASCII definitions of these classes are used. (The Python 2.x str type uses the definitions from the current locale, settable through the locale module.) The .encode() method is left out because of the more strict definitions of encoding and decoding in Python 3000: encoding always takes a Unicode string and returns a bytes sequence, and decoding always takes a bytes sequence and returns a Unicode string.

In addition, both types implement the class method .fromhex(), which constructs an object from a string containing hexadecimal values (with or without spaces between the bytes).
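For example (Python 3 behavior):

```python
assert bytes.fromhex('de ad be ef') == b'\xde\xad\xbe\xef'   # spaces between bytes allowed
assert bytearray.fromhex('dead') == bytearray(b'\xde\xad')   # or omitted
```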

The bytearray type implements these additional methods from the MutableSequence ABC (see PEP 3119):

.extend(), .insert(), .append(), .reverse(), .pop(), .remove().

Bytes and the Str Type

Like the bytes type in Python 3.0a1, and unlike the relationship between str and unicode in Python 2.x, attempts to mix bytes (or bytearray) objects and str objects without specifying an encoding will raise a TypeError exception. (However, comparing bytes/bytearray and str objects for equality will simply return False; see the section on Comparisons above.)

Conversions between bytes or bytearray objects and str objects must always be explicit, using an encoding. There are two equivalent APIs: str(b, <encoding>[, <errors>]) is equivalent to b.decode(<encoding>[, <errors>]), and bytes(s, <encoding>[, <errors>]) is equivalent to s.encode(<encoding>[, <errors>]).

There is one exception: we can convert from bytes (or bytearray) to str without specifying an encoding by writing str(b). This produces the same result as repr(b). This exception is necessary because of the general promise that any object can be printed, and printing is just a special case of conversion to str. There is however no promise that printing a bytes object interprets the individual bytes as characters (unlike in Python 2.x).
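The two equivalent APIs, and the repr() fallback, in code (Python 3 behavior):

```python
b = b'caf\xc3\xa9'
assert str(b, 'utf-8') == b.decode('utf-8') == 'café'        # explicit, via an encoding
assert bytes('café', 'utf-8') == 'café'.encode('utf-8') == b
assert str(b) == repr(b)                                     # no encoding: you get the repr, not the text
```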

The str type currently implements the PEP 3118 buffer API. While this is perhaps occasionally convenient, it is also potentially confusing, because the bytes accessed via the buffer API represent a platform-dependent encoding: depending on the platform byte order and a compile-time configuration option, the encoding could be UTF-16-BE, UTF-16-LE, UTF-32-BE, or UTF-32-LE. Worse, a different implementation of the str type might completely change the bytes representation, e.g. to UTF-8, or even make it impossible to access the data as a contiguous array of bytes at all. Therefore, the PEP 3118 buffer API will be removed from the str type.

The basestring Type

The basestring type will be removed from the language. Code that used to say isinstance(x, basestring) should be changed to use isinstance(x, str) instead.

Pickling

Left as an exercise for the reader.

pep-3138 String representation in Python 3000

PEP:3138
Title:String representation in Python 3000
Version:$Revision$
Last-Modified:$Date$
Author:Atsuo Ishimoto <ishimoto--at--gembook.org>
Status:Final
Type:Standards Track
Content-Type:text/x-rst
Created:05-May-2008
Post-History:05-May-2008, 05-Jun-2008

Abstract

This PEP proposes a new string representation form for Python 3000. In Python prior to Python 3000, the repr() built-in function converted arbitrary objects to printable ASCII strings for debugging and logging. For Python 3000, a wider range of characters, based on the Unicode standard, should be considered 'printable'.

Motivation

The current repr() converts 8-bit strings to ASCII using the following algorithm.

  • Convert CR, LF, TAB and '\' to '\r', '\n', '\t', '\\'.
  • Convert other non-printable characters (0x00-0x1f, 0x7f) and non-ASCII characters (>= 0x80) to '\xXX'.
  • Backslash-escape quote characters (apostrophe, ') and add the quote character at the beginning and the end.

For Unicode strings, the following additional conversions are done.

  • Convert leading surrogate pair characters without trailing character (0xd800-0xdbff, but not followed by 0xdc00-0xdfff) to '\uXXXX'.
  • Convert 16-bit characters (>= 0x100) to '\uXXXX'.
  • Convert 21-bit characters (>= 0x10000) and surrogate pair characters to '\U00xxxxxx'.

This algorithm converts any string to printable ASCII, and repr() is used as a handy and safe way to print strings for debugging or for logging. Although all non-ASCII characters are escaped, this does not matter when most of the string's characters are ASCII. But for other languages, such as Japanese where most characters in a string are not ASCII, this is very inconvenient.

We can use print(aJapaneseString) to get a readable string, but we don't have a similar workaround for printing strings from collections such as lists or tuples. print(listOfJapaneseStrings) uses repr() to build the string to be printed, so the resulting strings are always hex-escaped. Or when open(japaneseFilename) raises an exception, the error message is something like IOError: [Errno 2] No such file or directory: '\u65e5\u672c\u8a9e', which isn't helpful.

Python 3000 has a lot of nice features for non-Latin users such as non-ASCII identifiers, so it would be helpful if Python could also progress in a similar way for printable output.

Some users might be concerned that such output will mess up their console if they print binary data like images. But this is unlikely to happen in practice because bytes and strings are different types in Python 3000, so printing an image to the console won't mess it up.

This issue was once discussed by Hye-Shik Chang [1], but was rejected.

Specification

  • Add a new function to the Python C API int Py_UNICODE_ISPRINTABLE (Py_UNICODE ch). This function returns 0 if repr() should escape the Unicode character ch; otherwise it returns 1. Characters that should be escaped are defined in the Unicode character database as:
    • Cc (Other, Control)
    • Cf (Other, Format)
    • Cs (Other, Surrogate)
    • Co (Other, Private Use)
    • Cn (Other, Not Assigned)
    • Zl (Separator, Line), refers to LINE SEPARATOR ('\u2028').
    • Zp (Separator, Paragraph), refers to PARAGRAPH SEPARATOR ('\u2029').
    • Zs (Separator, Space) other than ASCII space ('\x20'). Characters in this category should be escaped to avoid ambiguity.
  • The algorithm to build repr() strings should be changed to:
    • Convert CR, LF, TAB and '\' to '\r', '\n', '\t', '\\'.
    • Convert non-printable ASCII characters (0x00-0x1f, 0x7f) to '\xXX'.
    • Convert leading surrogate pair characters without trailing character (0xd800-0xdbff, but not followed by 0xdc00-0xdfff) to '\uXXXX'.
    • Convert non-printable characters (Py_UNICODE_ISPRINTABLE() returns 0) to '\xXX', '\uXXXX' or '\U00xxxxxx'.
    • Backslash-escape quote characters (apostrophe, 0x27) and add a quote character at the beginning and the end.
  • Set the Unicode error-handler for sys.stderr to 'backslashreplace' by default.
  • Add a new function to the Python C API, PyObject *PyObject_ASCII (PyObject *o). This function converts any Python object to a string using PyObject_Repr() and then hex-escapes all non-ASCII characters. PyObject_ASCII() generates the same string as PyObject_Repr() in Python 2.
  • Add a new built-in function, ascii(). This function converts any Python object to a string using repr() and then hex-escapes all non-ASCII characters. ascii() generates the same string as repr() in Python 2.
  • Add a '%a' string format operator. '%a' converts any Python object to a string using repr() and then hex-escapes all non-ASCII characters. The '%a' format operator generates the same string as '%r' in Python 2. Also, add an '!a' conversion flag to the string.format() method and a '%A' operator to PyUnicode_FromFormat(); they convert any object to an ASCII string, as the '%a' string format operator does.
  • Add an isprintable() method to the string type. str.isprintable() returns False if repr() would escape any character in the string; otherwise returns True. The isprintable() method calls the Py_UNICODE_ISPRINTABLE() function internally.
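The ascii() built-in and str.isprintable() specified above behave as follows in Python 3, where this PEP was implemented:

```python
assert 'abc'.isprintable()
assert not 'ab\n'.isprintable()                      # control characters would be escaped by repr()
assert ascii('日本語') == "'\\u65e5\\u672c\\u8a9e'"  # hex-escapes non-ASCII, like Python 2's repr()
assert repr('日本語') == "'日本語'"                  # the new repr() leaves printable text alone
```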

Rationale

The repr() in Python 3000 should be Unicode, not ASCII based, just like Python 3000 strings. Also, the conversion should not be affected by the locale setting, because the locale is not necessarily the same as the output device's locale. For example, it is common for a daemon process to be invoked in an ASCII setting but to write UTF-8 to its log files. Also, web applications might want to report error information in a more readable form based on the HTML page's encoding.

Characters not supported by the user's console could be hex-escaped on printing, by the Unicode encoder's error-handler. If the error-handler of the output file is 'backslashreplace', such characters are hex-escaped without raising UnicodeEncodeError. For example, if the default encoding is ASCII, print('Hello ¢') will print 'Hello \xa2'. If the encoding is ISO-8859-1, 'Hello ¢' will be printed.
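The 'backslashreplace' behavior described here can be demonstrated directly on the encoder (Python 3 behavior):

```python
s = 'Hello ¢'                                          # U+00A2, not representable in ASCII
assert s.encode('ascii', 'backslashreplace') == b'Hello \\xa2'   # hex-escaped, no exception
assert s.encode('latin-1') == b'Hello \xa2'            # ISO-8859-1 represents it directly
```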

The default error-handler for sys.stdout is 'strict'. Other applications reading the output might not understand hex-escaped characters, so unsupported characters should be trapped when writing. If unsupported characters must be escaped, the error-handler should be changed explicitly. Unlike sys.stdout, sys.stderr doesn't raise UnicodeEncodeError by default, because the default error-handler is 'backslashreplace'. So printing error messages containing non-ASCII characters to sys.stderr will not raise an exception. Also, information about uncaught exceptions (exception object, traceback) is printed by the interpreter without raising exceptions.

Alternate Solutions

To help debugging in non-Latin languages without changing repr(), other suggestions were made.

  • Supply a tool to print lists or dicts.

    Strings to be printed for debugging are not only contained in lists or dicts, but also in many other types of object. File objects contain a file name in Unicode, exception objects contain a message in Unicode, etc. These strings should be printed in readable form when repr()ed. It is unlikely to be possible to implement a tool to print all possible object types.

  • Use sys.displayhook and sys.excepthook.

    For interactive sessions, we can write hooks to restore hex escaped characters to the original characters. But these hooks are called only when printing the result of evaluating an expression entered in an interactive Python session, and don't work for the print() function, for non-interactive sessions or for logging.debug("%r", ...), etc.

  • Subclass sys.stdout and sys.stderr.

    It is difficult to implement a subclass to restore hex-escaped characters since there isn't enough information left by the time it's a string to undo the escaping correctly in all cases. For example, print("\\"+"u0041") should be printed as '\u0041', not 'A'. But there is no chance to tell file objects apart.

  • Make the encoding used by unicode_repr() adjustable, and make the existing repr() the default.

    With adjustable repr(), the result of using repr() is unpredictable and would make it impossible to write correct code involving repr(). And if current repr() is the default, then the old convention remains intact and users may expect ASCII strings as the result of repr(). Third party applications or libraries could be confused when a custom repr() function is used.

Backwards Compatibility

Changing repr() may break some existing code, especially testing code. Five of Python's regression tests fail with this modification. If you need repr() strings without non-ASCII characters, as in Python 2, you can use the following function.

def repr_ascii(obj):
    return str(repr(obj).encode("ASCII", "backslashreplace"), "ASCII")

For logging or for debugging, the following code can raise UnicodeEncodeError.

log = open("logfile", "w")
log.write(repr(data))     # UnicodeEncodeError will be raised
                          # if data contains unsupported characters.

To avoid exceptions being raised, you can explicitly specify the error-handler.

log = open("logfile", "w", errors="backslashreplace")
log.write(repr(data))  # Unsupported characters will be escaped.

For a console that uses a Unicode-based encoding, for example, en_US.utf8 or de_DE.utf8, the backslashreplace trick doesn't work and no printable characters are escaped. This creates a problem with similar-looking characters in the Western, Greek and Cyrillic scripts. These languages use similar (but different) alphabets (descended from a common ancestor) and contain letters that look alike but have different character codes. For example, it is hard to distinguish Latin 'a', 'e' and 'o' from Cyrillic 'а', 'е' and 'о'. (The visual representation, of course, depends very much on the fonts used, but usually these letters are almost indistinguishable.) To avoid the problem, the user can adjust the terminal encoding to get a result suitable for their environment.
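As a sketch of the look-alike problem: the builtin ascii() keeps the old hex-escaping behaviour, so it makes such confusable characters distinguishable regardless of the terminal encoding.

```python
latin_a = "a"       # U+0061 LATIN SMALL LETTER A
cyrillic_a = "а"    # U+0430 CYRILLIC SMALL LETTER A

# The two characters render almost identically but are different:
assert latin_a != cyrillic_a
assert (ord(latin_a), ord(cyrillic_a)) == (0x0061, 0x0430)

# ascii() escapes the non-ASCII one, so the difference stays visible
# on any terminal:
assert ascii(latin_a) == "'a'"
assert ascii(cyrillic_a) == "'\\u0430'"
```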

Rejected Proposals

  • Add encoding and errors arguments to the builtin print() function, with defaults of sys.getfilesystemencoding() and 'backslashreplace'.

    Complicated to implement, and in general, this is not seen as a good idea. [2]

  • Use character names to escape characters, instead of hex character codes. For example, repr('\u03b1') can be converted to "\N{GREEK SMALL LETTER ALPHA}".

    Using character names can be very verbose compared to hex-escape. e.g., repr("\ufbf9") is converted to "\N{ARABIC LIGATURE UIGHUR KIRGHIZ YEH WITH HAMZA ABOVE WITH ALEF MAKSURA ISOLATED FORM}".

  • Default error-handler of sys.stdout should be 'backslashreplace'.

    Stuff written to stdout might be consumed by another program that might misinterpret the \ escapes. For interactive sessions, it is possible to make the 'backslashreplace' error-handler the default, but this may add confusion of the kind "it works in interactive mode but not when redirecting to a file".
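The verbosity argument against character names above is easy to check with the unicodedata module and \N escapes:

```python
import unicodedata

# The \N escape round-trips through the character's Unicode name:
assert "\N{GREEK SMALL LETTER ALPHA}" == "\u03b1"
assert unicodedata.name("\u03b1") == "GREEK SMALL LETTER ALPHA"

# But names can be far longer than the four to six hex digits of an
# ordinary escape:
name = unicodedata.name("\ufbf9")
assert name.startswith("ARABIC LIGATURE")
assert len(name) > 50
```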

Implementation

The author wrote a patch in http://bugs.python.org/issue2630; this was committed to the Python 3.0 branch in revision 64138 on 06-11-2008.

References

[1]Multibyte string on string::string_print (http://bugs.python.org/issue479898)
[2][Python-3000] Displaying strings containing unicode escapes (http://mail.python.org/pipermail/python-3000/2008-April/013366.html)

pep-3139 Cleaning out sys and the "interpreter" module

PEP:3139
Title:Cleaning out sys and the "interpreter" module
Version:$Revision$
Last-Modified:$Date$
Author:Benjamin Peterson <benjamin at python.org>
Status:Rejected
Type:Standards Track
Content-Type:text/x-rst
Created:4-April-2008
Python-Version:3.0

Abstract

This PEP proposes a new low-level module for CPython-specific interpreter functions in order to clean out the sys module and separate general Python functionality from implementation details.

Rationale

The sys module currently contains functions and data that can be put into two major groups:

  1. Data and functions that are available in all Python implementations and deal with the general running of a Python virtual machine.
    • argv
    • byteorder
    • path, path_hooks, meta_path, path_importer_cache, and modules
    • copyright, hexversion, version, and version_info
    • displayhook, __displayhook__
    • excepthook, __excepthook__, exc_info, and exc_clear
    • exec_prefix and prefix
    • executable
    • exit
    • flags, py3kwarning, dont_write_bytecode, and warn_options
    • getfilesystemencoding
    • get/setprofile
    • get/settrace, call_tracing
    • getwindowsversion
    • maxint and maxunicode
    • platform
    • ps1 and ps2
    • stdin, stderr, stdout, __stdin__, __stderr__, __stdout__
    • tracebacklimit
  2. Data and functions that affect the CPython interpreter.
    • get/setrecursionlimit
    • get/setcheckinterval
    • _getframe and _current_frame
    • getrefcount
    • get/setdlopenflags
    • settscdumps
    • api_version
    • winver
    • dllhandle
    • float_info
    • _compact_freelists
    • _clear_type_cache
    • subversion
    • builtin_module_names
    • callstats
    • intern

The second collection of items has been steadily increasing over the years, causing clutter in sys. Guido has even said he doesn't recognize some of the things in it [1]!

Moving these items off to another module would send a clear message to other Python implementations about which functions need and need not be implemented.

It has also been proposed that the contents of the types module be distributed across the standard library [2]; the interpreter module would provide an excellent resting place for internal types like frames and code objects.

Specification

A new builtin module named "interpreter" (see Naming) will be added.

The second list of items above will be split into the stdlib as follows:

The interpreter module
  • get/setrecursionlimit
  • get/setcheckinterval
  • _getframe and _current_frame
  • get/setdlopenflags
  • settscdumps
  • api_version
  • winver
  • dllhandle
  • float_info
  • _clear_type_cache
  • subversion
  • builtin_module_names
  • callstats
  • intern
The gc module:
  • getrefcount
  • _compact_freelists

Transition Plan

Once implemented in 3.x, the interpreter module will be back-ported to 2.6. Py3k warnings will be added to the sys functions it replaces.

Open Issues

What should move?

dont_write_bytecode

Some believe that the writing of bytecode is an implementation detail and should be moved [3]. The counterargument is that all current, complete Python implementations do write some sort of bytecode, so it is valuable to be able to disable it. Also, if it is moved, some wish to put it in the imp module.

Move some to imp?

It was noted that dont_write_bytecode or maybe builtin_module_names might fit nicely in the imp module.

Naming

The author proposes the name "interpreter" for the new module. "pyvm" has also been suggested [4]. The name "cpython" was well liked [5].

pep-3140 str(container) should call str(item), not repr(item)

PEP: 3140
Title: str(container) should call str(item), not repr(item)
Version: $Revision$
Last-Modified: $Date$
Author: Oleg Broytmann <phd at phd.pp.ru>, Jim J. Jewett <jimjjewett at gmail.com>
Discussions-To:  <python-3000 at python.org>
Status: Rejected
Type: Standards Track
Content-Type: text/plain
Created: 27-May-2008
Post-History: 28-May-2008

Rejection

   Guido said this would cause too much disturbance too close to beta. See
   http://mail.python.org/pipermail/python-3000/2008-May/013876.html.


Abstract

   This document discusses the advantages and disadvantages of the
   current implementation of str(container).  It also discusses the
   pros and cons of a different approach - to call str(item) instead
   of repr(item).


Motivation

   Currently str(container) calls repr on items.  Arguments for it:
   -- containers refuse to guess what the user wants to see on
      str(container) - surroundings, delimiters, and so on;
   -- repr(item) usually displays type information - apostrophes
      around strings, class names, etc.

   Arguments against:
   -- it's illogical; str() is expected to call __str__ if it exists,
      not __repr__;
   -- there is no standard way to print a container's content calling
      items' __str__, that's inconvenient in cases where __str__ and
      __repr__ return different results;
   -- repr(item) sometimes does the wrong thing (e.g. it hex-escapes
      non-ASCII strings)

   This PEP proposes to change how str(container) works.  It is
   proposed to mimic how repr(container) works except one detail
   - call str on items instead of repr.  This allows a user to choose
   what results she wants to get - from item.__repr__ or item.__str__.


Current situation

   Most container types (tuples, lists, dicts, sets, etc.) do not
   implement __str__ method, so str(container) calls
   container.__repr__, and container.__repr__, once called, forgets
   it is called from str and always calls repr on the container's
   items.

   This behaviour has advantages and disadvantages.  One advantage is
   that most items are represented with type information - strings
   are surrounded by apostrophes, instances may have both class name
   and instance data:

       >>> print([42, '42'])
       [42, '42']
       >>> print([Decimal('42'), datetime.now()])
       [Decimal("42"), datetime.datetime(2008, 5, 27, 19, 57, 43, 485028)]

   The disadvantage is that __repr__ often returns technical data
   (like '<object at address>') or an unreadable string (a hex-encoded
   string if the input is a non-ASCII string):

       >>> print(['тест'])
       ['\xd4\xc5\xd3\xd4']

   One of the motivations for PEP 3138 is that neither repr nor str
   will allow the sensible printing of dicts whose keys are non-ASCII
   text strings.  Now that Unicode identifiers are allowed, it
   includes Python's own attribute dicts.  This also includes JSON
   serialization (and caused some hoops for the json lib).

   PEP 3138 proposes to fix this by breaking the "repr is safe ASCII"
   invariant, and changing the way repr (which is used for
   persistence) outputs some objects, with system-dependent failures.

   Changing how str(container) works would allow easy debugging in
   the normal case, and retain the safety of ASCII-only for the
   machine-readable  case.  The only downside is that str(x) and
   repr(x) would more often be different -- but only in those cases
   where the current almost-the-same version is insufficient.

   It also seems illogical that str(container) calls repr on items
   instead of str.  It's only logical to expect the following code

       class Test:
           def __str__(self):
               return "STR"

           def __repr__(self):
               return "REPR"


       test = Test()
       print(test)
       print(repr(test))
       print([test])
       print(str([test]))

   to print

       STR
       REPR
       [STR]
       [STR]

   where it actually prints

       STR
       REPR
       [REPR]
       [REPR]
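   Since this PEP was rejected, the behaviour shown above still holds
   in today's Python and can be verified directly:

```python
class Test:
    def __str__(self):
        return "STR"

    def __repr__(self):
        return "REPR"

test = Test()
assert str(test) == "STR"
assert repr(test) == "REPR"

# The container calls repr() on its items even when str() is requested:
assert str([test]) == "[REPR]"
assert repr([test]) == "[REPR]"
```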

   It is especially illogical that print in Python 2 uses str
   when called on what looks like a tuple:

       >>> print Decimal('42'), datetime.now()
       42 2008-05-27 20:16:22.534285

   where on an actual tuple it prints

       >>> print((Decimal('42'), datetime.now()))
       (Decimal("42"), datetime.datetime(2008, 5, 27, 20, 16, 27, 937911))


A different approach - call str(item)

   For example, with numbers it is often only the value that people
   care about.

       >>> print Decimal('3')
       3

   But putting the value in a list forces users to read the type
   information, exactly as if repr had been called for the benefit of
   a machine:

       >>> print [Decimal('3')]
       [Decimal("3")]

   After this change, the type information would not clutter the str
   output:

        >>> print "%s" % [Decimal('3')]
       [3]
       >>> str([Decimal('3')])  # ==
       [3]

   But it would still be available if desired:

        >>> print "%r" % [Decimal('3')]
       [Decimal('3')]
       >>> repr([Decimal('3')])  # ==
       [Decimal('3')]

   There are a number of strategies to fix the problem.  The most
   radical is to change __repr__ so it accepts a new parameter (flag)
   "called from str, so call str on items, not repr".  The
   drawback of the proposal is that every __repr__ implementation
   must be changed.  Introspection could help a bit (inspect __repr__
   before calling it to see whether it accepts 2 or 3 parameters), but
   introspection doesn't work on classes written in C, like all
   built-in containers.

   A less radical proposal is to implement __str__ methods for
   built-in container types.  The obvious drawback is duplication of
   effort - all those __str__ and __repr__ implementations would
   differ in only one small detail - whether they call str or repr
   on items.

   The most conservative proposal is not to change str at all but
   to allow developers to implement their own application- or
   library-specific pretty-printers.  The drawback is again
   a multiplication of effort and proliferation of many small
   specific container-traversal algorithms.
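   The most conservative option can be sketched in a few lines
   (str_list is a hypothetical helper name, not an API from any PEP):

```python
from decimal import Decimal

def str_list(items):
    """Render a list like repr() does, but call str() on each item."""
    return "[" + ", ".join(str(item) for item in items) + "]"

values = [Decimal("3"), Decimal("42")]
assert str(values) == "[Decimal('3'), Decimal('42')]"  # current behaviour
assert str_list(values) == "[3, 42]"                   # proposed look
```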


Backward compatibility

   In those cases where type information is more important than
   usual, it will still be possible to get the current results by
   calling repr explicitly.


Copyright

   This document has been placed in the public domain.



pep-3141 A Type Hierarchy for Numbers

PEP:3141
Title:A Type Hierarchy for Numbers
Version:$Revision$
Last-Modified:$Date$
Author:Jeffrey Yasskin <jyasskin at google.com>
Status:Final
Type:Standards Track
Content-Type:text/x-rst
Created:23-Apr-2007
Post-History:25-Apr-2007, 16-May-2007, 02-Aug-2007

Abstract

This proposal defines a hierarchy of Abstract Base Classes (ABCs) (PEP 3119) to represent number-like classes. It proposes a hierarchy of Number :> Complex :> Real :> Rational :> Integral where A :> B means "A is a supertype of B". The hierarchy is inspired by Scheme's numeric tower [4].

Rationale

Functions that take numbers as arguments should be able to determine the properties of those numbers, and if and when overloading based on types is added to the language, should be overloadable based on the types of the arguments. For example, slicing requires its arguments to be Integrals, and the functions in the math module require their arguments to be Real.

Specification

This PEP specifies a set of Abstract Base Classes, and suggests a general strategy for implementing some of the methods. It uses terminology from PEP 3119, but the hierarchy is intended to be meaningful for any systematic method of defining sets of classes.

The type checks in the standard library should use these classes instead of the concrete built-ins.

Numeric Classes

We begin with a Number class to make it easy for people to be fuzzy about what kind of number they expect. This class only helps with overloading; it doesn't provide any operations.

class Number(metaclass=ABCMeta): pass

Most implementations of complex numbers will be hashable, but if you need to rely on that, you'll have to check it explicitly: mutable numbers are supported by this hierarchy.

class Complex(Number):
    """Complex defines the operations that work on the builtin complex type.

    In short, those are: conversion to complex, bool(), .real, .imag,
    +, -, *, /, **, abs(), .conjugate(), ==, and !=.

    If it is given heterogeneous arguments, and doesn't have special
    knowledge about them, it should fall back to the builtin complex
    type as described below.
    """

    @abstractmethod
    def __complex__(self):
        """Return a builtin complex instance."""

    def __bool__(self):
        """True if self != 0."""
        return self != 0

    @abstractproperty
    def real(self):
        """Retrieve the real component of this number.

        This should subclass Real.
        """
        raise NotImplementedError

    @abstractproperty
    def imag(self):
        """Retrieve the imaginary component of this number.

        This should subclass Real.
        """
        raise NotImplementedError

    @abstractmethod
    def __add__(self, other):
        raise NotImplementedError

    @abstractmethod
    def __radd__(self, other):
        raise NotImplementedError

    @abstractmethod
    def __neg__(self):
        raise NotImplementedError

    def __pos__(self):
        """Coerces self to whatever class defines the method."""
        raise NotImplementedError

    def __sub__(self, other):
        return self + -other

    def __rsub__(self, other):
        return -self + other

    @abstractmethod
    def __mul__(self, other):
        raise NotImplementedError

    @abstractmethod
    def __rmul__(self, other):
        raise NotImplementedError

    @abstractmethod
    def __div__(self, other):
        """a/b; should promote to float or complex when necessary."""
        raise NotImplementedError

    @abstractmethod
    def __rdiv__(self, other):
        raise NotImplementedError

    @abstractmethod
    def __pow__(self, exponent):
        """a**b; should promote to float or complex when necessary."""
        raise NotImplementedError

    @abstractmethod
    def __rpow__(self, base):
        raise NotImplementedError

    @abstractmethod
    def __abs__(self):
        """Returns the Real distance from 0."""
        raise NotImplementedError

    @abstractmethod
    def conjugate(self):
        """(x+y*i).conjugate() returns (x-y*i)."""
        raise NotImplementedError

    @abstractmethod
    def __eq__(self, other):
        raise NotImplementedError

    # __ne__ is inherited from object and negates whatever __eq__ does.

The Real ABC indicates that the value is on the real line, and supports the operations of the float builtin. Real numbers are totally ordered except for NaNs (which this PEP basically ignores).

class Real(Complex):
    """To Complex, Real adds the operations that work on real numbers.

    In short, those are: conversion to float, trunc(), math.floor(),
    math.ceil(), round(), divmod(), //, %, <, <=, >, and >=.

    Real also provides defaults for some of the derived operations.
    """

    # XXX What to do about the __int__ implementation that's
    # currently present on float?  Get rid of it?

    @abstractmethod
    def __float__(self):
        """Any Real can be converted to a native float object."""
        raise NotImplementedError

    @abstractmethod
    def __trunc__(self):
        """Truncates self to an Integral.

        Returns an Integral i such that:
          * i>0 iff self>0;
          * abs(i) <= abs(self);
          * for any Integral j satisfying the first two conditions,
            abs(i) >= abs(j) [i.e. i has "maximal" abs among those].
        i.e. "truncate towards 0".
        """
        raise NotImplementedError

    @abstractmethod
    def __floor__(self):
        """Finds the greatest Integral <= self."""
        raise NotImplementedError

    @abstractmethod
    def __ceil__(self):
        """Finds the least Integral >= self."""
        raise NotImplementedError

    @abstractmethod
    def __round__(self, ndigits:Integral=None):
        """Rounds self to ndigits decimal places, defaulting to 0.

        If ndigits is omitted or None, returns an Integral,
        otherwise returns a Real, preferably of the same type as
        self. Types may choose which direction to round half. For
        example, float rounds half toward even.

        """
        raise NotImplementedError

    def __divmod__(self, other):
        """The pair (self // other, self % other).

        Sometimes this can be computed faster than the pair of
        operations.
        """
        return (self // other, self % other)

    def __rdivmod__(self, other):
        """The pair (self // other, self % other).

        Sometimes this can be computed faster than the pair of
        operations.
        """
        return (other // self, other % self)

    @abstractmethod
    def __floordiv__(self, other):
        """The floor() of self/other. Integral."""
        raise NotImplementedError

    @abstractmethod
    def __rfloordiv__(self, other):
        """The floor() of other/self."""
        raise NotImplementedError

    @abstractmethod
    def __mod__(self, other):
        """self % other

        See
        http://mail.python.org/pipermail/python-3000/2006-May/001735.html
        and consider using "self/other - trunc(self/other)"
        instead if you're worried about round-off errors.
        """
        raise NotImplementedError

    @abstractmethod
    def __rmod__(self, other):
        """other % self"""
        raise NotImplementedError

    @abstractmethod
    def __lt__(self, other):
        """< on Reals defines a total ordering, except perhaps for NaN."""
        raise NotImplementedError

    @abstractmethod
    def __le__(self, other):
        raise NotImplementedError

    # __gt__ and __ge__ are automatically done by reversing the arguments.
    # (But __le__ is not computed as the opposite of __gt__!)

    # Concrete implementations of Complex abstract methods.
    # Subclasses may override these, but don't have to.

    def __complex__(self):
        return complex(float(self))

    @property
    def real(self):
        return +self

    @property
    def imag(self):
        return 0

    def conjugate(self):
        """Conjugate is a no-op for Reals."""
        return +self

We should clean up Demo/classes/Rat.py and promote it into rational.py in the standard library. Then it will implement the Rational ABC.

class Rational(Real, Exact):
    """.numerator and .denominator should be in lowest terms."""

    @abstractproperty
    def numerator(self):
        raise NotImplementedError

    @abstractproperty
    def denominator(self):
        raise NotImplementedError

    # Concrete implementation of Real's conversion to float.
    # (This invokes Integer.__div__().)

    def __float__(self):
        return self.numerator / self.denominator

And finally integers:

class Integral(Rational):
    """Integral adds a conversion to int and the bit-string operations."""

    @abstractmethod
    def __int__(self):
        raise NotImplementedError

    def __index__(self):
        """__index__() exists because float has __int__()."""
        return int(self)

    def __lshift__(self, other):
        return int(self) << int(other)

    def __rlshift__(self, other):
        return int(other) << int(self)

    def __rshift__(self, other):
        return int(self) >> int(other)

    def __rrshift__(self, other):
        return int(other) >> int(self)

    def __and__(self, other):
        return int(self) & int(other)

    def __rand__(self, other):
        return int(other) & int(self)

    def __xor__(self, other):
        return int(self) ^ int(other)

    def __rxor__(self, other):
        return int(other) ^ int(self)

    def __or__(self, other):
        return int(self) | int(other)

    def __ror__(self, other):
        return int(other) | int(self)

    def __invert__(self):
        return ~int(self)

    # Concrete implementations of Rational and Real abstract methods.
    def __float__(self):
        """float(self) == float(int(self))"""
        return float(int(self))

    @property
    def numerator(self):
        """Integers are their own numerators."""
        return +self

    @property
    def denominator(self):
        """Integers have a denominator of 1."""
        return 1

Changes to operations and __magic__ methods

To support more precise narrowing from float to int (and more generally, from Real to Integral), we propose the following new __magic__ methods, to be called from the corresponding library functions. All of these return Integrals rather than Reals.

  1. __trunc__(self), called from a new builtin trunc(x), which returns the Integral closest to x between 0 and x.
  2. __floor__(self), called from math.floor(x), which returns the greatest Integral <= x.
  3. __ceil__(self), called from math.ceil(x), which returns the least Integral >= x.
  4. __round__(self), called from round(x), which returns the Integral closest to x, rounding half as the type chooses. float will change in 3.0 to round half toward even. There is also a 2-argument version, __round__(self, ndigits), called from round(x, ndigits), which should return a Real.
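These functions behave as specified in modern Python 3, which can be checked directly:

```python
import math

x = -2.7
assert math.trunc(x) == -2   # the Integral closest to x between 0 and x
assert math.floor(x) == -3   # the greatest Integral <= x
assert math.ceil(x) == -2    # the least Integral >= x

# float rounds half toward even:
assert round(0.5) == 0
assert round(1.5) == 2

# The 2-argument form returns a Real rather than an Integral:
assert isinstance(round(2.675, 2), float)
```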

In 2.6, math.floor, math.ceil, and round will continue to return floats.

The int() conversion implemented by float is equivalent to trunc(). In general, the int() conversion should try __int__() first and if it is not found, try __trunc__().

complex.__{divmod,mod,floordiv,int,float}__ also go away. It would be nice to provide a nice error message to help confused porters, but not appearing in help(complex) is more important.

Notes for type implementors

Implementors should be careful to make equal numbers equal and hash them to the same values. This may be subtle if there are two different extensions of the real numbers. For example, a complex type could reasonably implement hash() as follows:

def __hash__(self):
    return hash(complex(self))

but should be careful of any values that fall outside of the built in complex's range or precision.
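Python's built-in numeric types (and fractions.Fraction) follow this rule, which can be checked directly:

```python
from fractions import Fraction

# Equal numbers compare equal across types...
assert 3 == 3.0 == Fraction(3) == complex(3, 0)

# ...and must therefore hash to the same value:
assert hash(3) == hash(3.0) == hash(Fraction(3)) == hash(complex(3, 0))
```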

Adding More Numeric ABCs

There are, of course, more possible ABCs for numbers, and this would be a poor hierarchy if it precluded the possibility of adding those. You can add MyFoo between Complex and Real with:

class MyFoo(Complex): ...
MyFoo.register(Real)
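With the hierarchy as published in the numbers module, that pattern works as follows (MyFoo is the hypothetical class from the text):

```python
from numbers import Complex, Real

class MyFoo(Complex):
    pass  # abstract methods left undefined; MyFoo itself stays abstract

MyFoo.register(Real)

# Real is now a virtual subclass of MyFoo, so every Real (e.g. float)
# is considered a MyFoo instance:
assert issubclass(Real, MyFoo)
assert isinstance(3.0, MyFoo)
```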

Implementing the arithmetic operations

We want to implement the arithmetic operations so that mixed-mode operations either call an implementation whose author knew about the types of both arguments, or convert both to the nearest built in type and do the operation there. For subtypes of Integral, this means that __add__ and __radd__ should be defined as:

class MyIntegral(Integral):

    def __add__(self, other):
        if isinstance(other, MyIntegral):
            return do_my_adding_stuff(self, other)
        elif isinstance(other, OtherTypeIKnowAbout):
            return do_my_other_adding_stuff(self, other)
        else:
            return NotImplemented

    def __radd__(self, other):
        if isinstance(other, MyIntegral):
            return do_my_adding_stuff(other, self)
        elif isinstance(other, OtherTypeIKnowAbout):
            return do_my_other_adding_stuff(other, self)
        elif isinstance(other, Integral):
            return int(other) + int(self)
        elif isinstance(other, Real):
            return float(other) + float(self)
        elif isinstance(other, Complex):
            return complex(other) + complex(self)
        else:
            return NotImplemented

There are 5 different cases for a mixed-type operation on subclasses of Complex. I'll refer to all of the above code that doesn't refer to MyIntegral and OtherTypeIKnowAbout as "boilerplate". a will be an instance of A, which is a subtype of Complex (a : A <: Complex), and b : B <: Complex. I'll consider a + b:

  1. If A defines an __add__ which accepts b, all is well.
  2. If A falls back to the boilerplate code, and it were to return a value from __add__, we'd miss the possibility that B defines a more intelligent __radd__, so the boilerplate should return NotImplemented from __add__. (Or A may not implement __add__ at all.)
  3. Then B's __radd__ gets a chance. If it accepts a, all is well.
  4. If it falls back to the boilerplate, there are no more possible methods to try, so this is where the default implementation should live.
  5. If B <: A, Python tries B.__radd__ before A.__add__. This is ok, because it was implemented with knowledge of A, so it can handle those instances before delegating to Complex.

If A<:Complex and B<:Real without sharing any other knowledge, then the appropriate shared operation is the one involving the built in complex, and both __radd__s land there, so a+b == b+a.
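The fallback behaviour can be seen with the stdlib types that implement this hierarchy: a Fraction (a Rational) plus a float (a Real) lands on the shared builtin type, float.

```python
from fractions import Fraction

# Neither type knows the other specially, so the operation falls back
# to the nearest common builtin type, float:
result = Fraction(1, 2) + 0.25
assert result == 0.75
assert isinstance(result, float)

# Within one branch of the tower, the more specific type is kept:
assert isinstance(Fraction(1, 2) + 1, Fraction)
```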

Rejected Alternatives

The initial version of this PEP defined an algebraic hierarchy inspired by a Haskell Numeric Prelude [3] including MonoidUnderPlus, AdditiveGroup, Ring, and Field, and mentioned several other possible algebraic types before getting to the numbers. We had expected this to be useful to people using vectors and matrices, but the NumPy community really wasn't interested, and we ran into the issue that even if x is an instance of X <: MonoidUnderPlus and y is an instance of Y <: MonoidUnderPlus, x + y may still not make sense.

Then we gave the numbers a much more branching structure to include things like the Gaussian Integers and Z/nZ, which could be Complex but wouldn't necessarily support things like division. The community decided that this was too much complication for Python, so I've now scaled back the proposal to resemble the Scheme numeric tower much more closely.

The Decimal Type

After consultation with its authors it has been decided that the Decimal type should not at this time be made part of the numeric tower.

References

[1]Introducing Abstract Base Classes (http://www.python.org/dev/peps/pep-3119/)
[2]Possible Python 3K Class Tree?, wiki page by Bill Janssen (http://wiki.python.org/moin/AbstractBaseClasses)
[3]NumericPrelude: An experimental alternative hierarchy of numeric type classes (http://darcs.haskell.org/numericprelude/docs/html/index.html)
[4]The Scheme numerical tower (http://www.swiss.ai.mit.edu/ftpdir/scheme-reports/r5rs-html/r5rs_8.html#SEC50)

Acknowledgements

Thanks to Neal Norwitz for encouraging me to write this PEP in the first place, to Travis Oliphant for pointing out that the numpy people didn't really care about the algebraic concepts, to Alan Isaac for reminding me that Scheme had already done this, and to Guido van Rossum and lots of other people on the mailing list for refining the concept.

pep-3142 Add a "while" clause to generator expressions

PEP: 3142
Title: Add a "while" clause to generator expressions
Version: $Revision$
Last-Modified: $Date$
Author: Gerald Britton <gerald.britton at gmail.com>
Status: Rejected
Type: Standards Track
Content-Type: text/plain
Created: 12-Jan-2009
Python-Version: 3.0
Post-History: 
Resolution: http://mail.python.org/pipermail/python-dev/2013-May/126136.html

Abstract

   This PEP proposes an enhancement to generator expressions, adding a
   "while" clause to complement the existing "if" clause.


Rationale

   A generator expression (PEP 289 [1]) is a concise method to serve
   dynamically-generated objects to list comprehensions (PEP 202 [2]).
   Current generator expressions allow for an "if" clause to filter
   the objects that are returned to those meeting some set of
   criteria.  However, since the "if" clause is evaluated for every
   object that may be returned, in some cases it is possible that all
   objects would be rejected after a certain point.  For example:

       g = (n for n in range(100) if n*n < 50)

   which is equivalent to using a generator function
   (PEP 255 [3]):

       def __gen(exp):
           for n in exp:
               if n*n < 50:
                   yield n
        g = __gen(iter(range(100)))

   would yield 0, 1, 2, 3, 4, 5, 6 and 7, but would also consider
   the numbers from 8 to 99 and reject them all since n*n >= 50 for
   numbers in that range.  Allowing for a "while" clause would allow
   the redundant tests to be short-circuited:

       g = (n for n in range(100) while n*n < 50)

   would also yield 0, 1, 2, 3, 4, 5, 6 and 7, but would stop at 8
   since the condition (n*n < 50) is no longer true.  This would be
   equivalent to the generator function:

       def __gen(exp):
           for n in exp:
               if n*n < 50:
                   yield n
               else:
                   break
       g = __gen(iter(range(100)))

   Currently, in order to achieve the same result, one would need to
   either write a generator function such as the one above or use the
   takewhile function from itertools:

       from itertools import takewhile
       g = takewhile(lambda n: n*n < 50, range(100))

   The takewhile code achieves the same result as the proposed syntax,
   albeit in a longer (some would say "less-elegant") fashion.  Also,
   the takewhile version requires an extra function call (the lambda
   in the example above) with the associated performance penalty.
   A simple test shows that:

       for n in (n for n in range(100) if 1): pass

   performs about 10% better than:

       for n in takewhile(lambda n: 1, range(100)): pass

   though they achieve similar results.  (The first example uses a
   generator; takewhile is an iterator).  If similarly implemented,
   a "while" clause should perform about the same as the "if" clause
   does today.

   The reader may ask if the "if" and "while" clauses should be
   mutually exclusive.  There are good examples that show that there
   are times when both may be used to good advantage. For example:

       p = (p for p in primes() if p > 100 while p < 1000)

   should return prime numbers found between 100 and 1000, assuming
   I have a primes() generator that yields prime numbers.
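    For comparison, the same result is obtainable today by combining
    takewhile with an "if" filter, though less compactly.  The primes()
    generator below is a throwaway trial-division stand-in, since the
    example above only assumes one exists:

```python
from itertools import takewhile

# Throwaway trial-division primes() generator, purely so the
# equivalence can be run; any primes() generator would do.
def primes():
    n = 2
    while True:
        if all(n % d for d in range(2, int(n ** 0.5) + 1)):
            yield n
        n += 1

# Today's spelling of the proposed
#     (p for p in primes() if p > 100 while p < 1000)
p = [q for q in takewhile(lambda q: q < 1000, primes()) if q > 100]
print(p[0], p[-1])  # 101 997
```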

   Adding a "while" clause to generator expressions maintains the
   compact form while adding a useful facility for short-circuiting
   the expression.


Acknowledgements

   Raymond Hettinger first proposed the concept of generator
   expressions in January 2002.


References

   [1] PEP 289: Generator Expressions
       http://www.python.org/dev/peps/pep-0289/

   [2] PEP 202: List Comprehensions
       http://www.python.org/dev/peps/pep-0202/

   [3] PEP 255: Simple Generators
       http://www.python.org/dev/peps/pep-0255/


Copyright

   This document has been placed in the public domain.


pep-3143 Standard daemon process library

PEP:3143
Title:Standard daemon process library
Version:$Revision$
Last-Modified:$Date$
Author:Ben Finney <ben+python at benfinney.id.au>
Status:Deferred
Type:Standards Track
Content-Type:text/x-rst
Created:2009-01-26
Python-Version:3.x
Post-History:

Abstract

Writing a program to become a well-behaved Unix daemon is somewhat complex and tricky to get right, yet the steps are largely similar for any daemon regardless of what else the program may need to do.

This PEP introduces a package to the Python standard library that provides a simple interface to the task of becoming a daemon process.

PEP Deferral

Further exploration of the concepts covered in this PEP has been deferred for lack of a current champion interested in promoting the goals of the PEP and collecting and incorporating feedback, and with sufficient available time to do so effectively.

Specification

Example usage

Simple example of direct DaemonContext usage:

import daemon

from spam import do_main_program

with daemon.DaemonContext():
    do_main_program()

More complex example usage:

import os
import grp
import signal
import daemon
import lockfile

from spam import (
    initial_program_setup,
    do_main_program,
    program_cleanup,
    reload_program_config,
    )

context = daemon.DaemonContext(
    working_directory='/var/lib/foo',
    umask=0o002,
    pidfile=lockfile.FileLock('/var/run/spam.pid'),
    )

context.signal_map = {
    signal.SIGTERM: program_cleanup,
    signal.SIGHUP: 'terminate',
    signal.SIGUSR1: reload_program_config,
    }

mail_gid = grp.getgrnam('mail').gr_gid
context.gid = mail_gid

important_file = open('spam.data', 'w')
interesting_file = open('eggs.data', 'w')
context.files_preserve = [important_file, interesting_file]

initial_program_setup()

with context:
    do_main_program()

Interface

A new package, daemon, is added to the standard library.

A class, DaemonContext, is defined to represent the settings and process context for the program running as a daemon process.

DaemonContext objects

A DaemonContext instance represents the behaviour settings and process context for the program when it becomes a daemon. The behaviour and environment is customised by setting options on the instance, before calling the open method.

Each option can be passed as a keyword argument to the DaemonContext constructor, or subsequently altered by assigning to an attribute on the instance at any time prior to calling open. That is, for options named wibble and wubble, the following invocation:

foo = daemon.DaemonContext(wibble=bar, wubble=baz)
foo.open()

is equivalent to:

foo = daemon.DaemonContext()
foo.wibble = bar
foo.wubble = baz
foo.open()

The following options are defined.

files_preserve
Default:None

List of files that should not be closed when starting the daemon. If None, all open file descriptors will be closed.

Elements of the list are file descriptors (as returned by a file object's fileno() method) or Python file objects. Each specifies a file that is not to be closed during daemon start.

chroot_directory
Default:None

Full path to a directory to set as the effective root directory of the process. If None, specifies that the root directory is not to be changed.

working_directory
Default:'/'

Full path of the working directory to which the process should change on daemon start.

Since a filesystem cannot be unmounted if a process has its current working directory on that filesystem, this should either be left at default or set to a directory that is a sensible “home directory” for the daemon while it is running.

umask
Default:0

File access creation mask (“umask”) to set for the process on daemon start.

Since a process inherits its umask from its parent process, starting the daemon will reset the umask to this value so that files are created by the daemon with access modes as it expects.

pidfile
Default:None

Context manager for a PID lock file. When the daemon context opens and closes, it enters and exits the pidfile context manager.

detach_process
Default:None

If True, detach the process context when opening the daemon context; if False, do not detach.

If unspecified (None) during initialisation of the instance, this will be set to True by default, and False only if detaching the process is determined to be redundant; for example, in the case when the process was started by init, by initd, or by inetd.

signal_map
Default:system-dependent

Mapping from operating system signals to callback actions.

The mapping is used when the daemon context opens, and determines the action for each signal's signal handler:

  • A value of None will ignore the signal (by setting the signal action to signal.SIG_IGN).
  • A string value will be used as the name of an attribute on the DaemonContext instance. The attribute's value will be used as the action for the signal handler.
  • Any other value will be used as the action for the signal handler.

The default value depends on which signals are defined on the running system. Each item from the list below whose signal is actually defined in the signal module will appear in the default map:

  • signal.SIGTTIN: None
  • signal.SIGTTOU: None
  • signal.SIGTSTP: None
  • signal.SIGTERM: 'terminate'

Depending on how the program will interact with its child processes, it may need to specify a signal map that includes the signal.SIGCHLD signal (received when a child process exits). See the specific operating system's documentation for more detail on how to determine what circumstances dictate the need for signal handlers.

uid
Default:os.getuid()
gid
Default:os.getgid()

The user ID (“UID”) value and group ID (“GID”) value to switch the process to on daemon start.

The default values, the real UID and GID of the process, will relinquish any effective privilege elevation inherited by the process.

prevent_core
Default:True

If true, prevents the generation of core files, in order to avoid leaking sensitive information from daemons run as root.

stdin
Default:None
stdout
Default:None
stderr
Default:None

Each of stdin, stdout, and stderr is a file-like object which will be used as the new file for the standard I/O stream sys.stdin, sys.stdout, and sys.stderr respectively. The file should therefore be open, with a minimum of mode 'r' in the case of stdin, and mode 'w+' in the case of stdout and stderr.

If the object has a fileno() method that returns a file descriptor, the corresponding file will be excluded from being closed during daemon start (that is, it will be treated as though it were listed in files_preserve).

If None, the corresponding system stream is re-bound to the file named by os.devnull.
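The re-binding described here is conventionally done by duplicating the target file's descriptor over the stream's descriptor. A minimal sketch (illustrative only, with a hypothetical helper name; this is not the daemon package's API):

```python
import os

# Hypothetical helper sketching the re-binding described above: duplicate
# the target file's descriptor over the stream's descriptor, so that
# subsequent writes to the stream land in the target file.
def rebind_fd(stream_fd, target_path=os.devnull):
    target_fd = os.open(target_path, os.O_RDWR | os.O_CREAT, 0o644)
    os.dup2(target_fd, stream_fd)   # stream_fd now refers to target_path
    os.close(target_fd)             # the duplicate keeps the file open
```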

The following methods are defined.

open()
Return:None

Open the daemon context, turning the current program into a daemon process. This performs the following steps:

  • If this instance's is_open property is true, return immediately. This makes it safe to call open multiple times on an instance.

  • If the prevent_core attribute is true, set the resource limits for the process to prevent any core dump from the process.

  • If the chroot_directory attribute is not None, set the effective root directory of the process to that directory (via os.chroot).

    This allows running the daemon process inside a “chroot gaol” as a means of limiting the system's exposure to rogue behaviour by the process. Note that the specified directory needs to already be set up for this purpose.

  • Set the process UID and GID to the uid and gid attribute values.

  • Close all open file descriptors. This excludes those listed in the files_preserve attribute, and those that correspond to the stdin, stdout, or stderr attributes.

  • Change current working directory to the path specified by the working_directory attribute.

  • Reset the file access creation mask to the value specified by the umask attribute.

  • If the detach_process option is true, detach the current process into its own process group, and disassociate from any controlling terminal.

  • Set signal handlers as specified by the signal_map attribute.

  • If any of the attributes stdin, stdout, stderr are not None, bind the system streams sys.stdin, sys.stdout, and/or sys.stderr to the files represented by the corresponding attributes. Where the attribute has a file descriptor, the descriptor is duplicated (instead of re-binding the name).

  • If the pidfile attribute is not None, enter its context manager.

  • Mark this instance as open (for the purpose of future open and close calls).

  • Register the close method to be called during Python's exit processing.

When the function returns, the running program is a daemon process.

close()
Return:None

Close the daemon context. This performs the following steps:

  • If this instance's is_open property is false, return immediately. This makes it safe to call close multiple times on an instance.
  • If the pidfile attribute is not None, exit its context manager.
  • Mark this instance as closed (for the purpose of future open and close calls).

is_open
Return:True if the instance is open, False otherwise.

This property exposes the state indicating whether the instance is currently open. It is True if the instance's open method has been called and the close method has not subsequently been called.

terminate(signal_number, stack_frame)
Return:None

Signal handler for the signal.SIGTERM signal. Performs the following step:

  • Raise a SystemExit exception explaining the signal.

The class also implements the context manager protocol via __enter__ and __exit__ methods.

__enter__()
Return:The DaemonContext instance

Call the instance's open() method, then return the instance.

__exit__(exc_type, exc_value, exc_traceback)
Return:True or False as defined by the context manager protocol

Call the instance's close() method, then return True if the exception was handled or False if it was not.

Motivation

The majority of programs written to be Unix daemons either implement behaviour very similar to that in the specification, or are poorly-behaved daemons by the standard of correct daemon behaviour.

Since these steps should be much the same in most implementations but are very particular and easy to omit or implement incorrectly, they are a prime target for a standard well-tested implementation in the standard library.

Rationale

Correct daemon behaviour

According to Stevens in [stevens] §2.6, a program should perform the following steps to become a Unix daemon process.

  • Close all open file descriptors.
  • Change current working directory.
  • Reset the file access creation mask.
  • Run in the background.
  • Disassociate from process group.
  • Ignore terminal I/O signals.
  • Disassociate from control terminal.
  • Don't reacquire a control terminal.
  • Correctly handle the following circumstances:
    • Started by System V init process.
    • Daemon termination by SIGTERM signal.
    • Children generate SIGCLD signal.

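The core of Stevens' procedure is traditionally implemented as the classic double fork plus setsid. A deliberately incomplete sketch of those steps (illustration only; the daemon package specified above handles many more of them):

```python
import os

# Sketch of the classic Stevens double-fork procedure (illustration only;
# a real implementation must also close inherited file descriptors,
# re-bind stdin/stdout/stderr, and install signal handlers).
def daemonize():
    if os.fork() > 0:
        os._exit(0)          # first fork: run in the background
    os.setsid()              # new session: disassociate from control terminal
    if os.fork() > 0:
        os._exit(0)          # second fork: never reacquire a control terminal
    os.chdir('/')            # don't keep any filesystem mount busy
    os.umask(0)              # reset the file access creation mask
```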
The daemon tool [slack-daemon] lists (in its summary of features) behaviour that should be performed when turning a program into a well-behaved Unix daemon process. It differs from this PEP's intent in that it invokes a separate program as a daemon process. The following features are appropriate for a daemon that starts itself once the program is already running:

  • Sets up the correct process context for a daemon.
  • Behaves sensibly when started by initd(8) or inetd(8).
  • Revokes any suid or sgid privileges to reduce security risks in case daemon is incorrectly installed with special privileges.
  • Prevents the generation of core files to prevent leaking sensitive information from daemons run as root (optional).
  • Names the daemon by creating and locking a PID file to guarantee that only one daemon with the given name can execute at any given time (optional).
  • Sets the user and group under which to run the daemon (optional, root only).
  • Creates a chroot gaol (optional, root only).
  • Captures the daemon's stdout and stderr and directs them to syslog (optional).

A daemon is not a service

This PEP addresses only Unix-style daemons, for which the above correct behaviour is relevant, as opposed to comparable behaviours on other operating systems.

There is a related concept in many systems, called a “service”. A service differs from the model in this PEP, in that rather than having the current program continue to run as a daemon process, a service starts an additional process to run in the background, and the current process communicates with that additional process via some defined channels.

The Unix-style daemon model in this PEP can be used, among other things, to implement the background-process part of a service; but this PEP does not address the other aspects of setting up and managing a service.

Reference Implementation

The python-daemon package [python-daemon].

Other daemon implementations

Prior to this PEP, several existing third-party Python libraries or tools implemented some of this PEP's correct daemon behaviour.

The reference implementation is a fairly direct successor to the following implementations:

Other Python daemon implementations that differ from this PEP:

  • The zdaemon tool [zdaemon] was written for the Zope project. Like [slack-daemon], it differs from this specification because it is used to run another program as a daemon process.
  • The Python library daemon [clapper-daemon] is (according to its homepage) no longer maintained. As of version 1.0.1, it implements the basic steps from [stevens].
  • The daemonize library [seutter-daemonize] also implements the basic steps from [stevens].
  • Ray Burr's daemon.py module [burr-daemon] provides the [stevens] procedure as well as PID file handling and redirection of output to syslog.
  • Twisted [twisted] includes, perhaps unsurprisingly, an implementation of a process daemonisation API that is integrated with the rest of the Twisted framework; it differs significantly from the API in this PEP.
  • The Python initd library [dagitses-initd], which uses [clapper-daemon], implements an equivalent of Unix initd(8) for controlling a daemon process.

References

[stevens] Unix Network Programming, W. Richard Stevens, Prentice Hall, 1994.
[slack-daemon] The (non-Python) “libslack” implementation of a daemon tool http://www.libslack.org/daemon/ by “raf” <raf@raf.org>.
[python-daemon] The python-daemon library http://pypi.python.org/pypi/python-daemon/ by Ben Finney et al.
[cookbook-66012] Python Cookbook recipe 66012, “Fork a daemon process on Unix” http://code.activestate.com/recipes/66012/.
[cookbook-278731] Python Cookbook recipe 278731, “Creating a daemon the Python way” http://code.activestate.com/recipes/278731/.
[bda.daemon] The bda.daemon library http://pypi.python.org/pypi/bda.daemon/ by Robert Niederreiter et al.
[zdaemon] The zdaemon tool http://pypi.python.org/pypi/zdaemon/ by Guido van Rossum et al.
[clapper-daemon] The daemon library http://pypi.python.org/pypi/daemon/ by Brian Clapper.
[seutter-daemonize] The daemonize library http://daemonize.sourceforge.net/ by Jerry Seutter.
[burr-daemon] The daemon.py module http://www.nightmare.com/~ryb/code/daemon.py by Ray Burr.
[twisted] The Twisted application framework http://pypi.python.org/pypi/Twisted/ by Glyph Lefkowitz et al.
[dagitses-initd] The Python initd library http://pypi.python.org/pypi/initd/ by Michael Andreas Dagitses.

pep-3144 IP Address Manipulation Library for the Python Standard Library

PEP: 3144
Title: IP Address Manipulation Library for the Python Standard Library
Version: $Revision$
Last-Modified: $Date$
Author: Peter Moody <pmoody at google.com>
BDFL-Delegate: Nick Coghlan
Discussions-To:  <ipaddr-py-dev at googlegroups.com>
Status: Final
Type: Standards Track
Content-Type: text/plain
Created: 6-Feb-2012
Python-Version: 3.3
Resolution: http://mail.python.org/pipermail/python-dev/2012-May/119474.html

Abstract:

    This PEP proposes a design for an IP address manipulation module for
    Python.


PEP Acceptance:

    This PEP was accepted by Nick Coghlan on the 15th of May, 2012.


Motivation:

    Several very good IP address modules for python already exist.
    The truth is that all of them struggle with the balance between
    adherence to Pythonic principles and the shorthand upon which
    network engineers and administrators rely.  ipaddress aims to
    strike the right balance.


Rationale:

    The existence of several Python IP address manipulation modules is
    evidence of an outstanding need for the functionality this module
    seeks to provide.


Background:

    PEP 3144 and ipaddr have been up for inclusion before.  The
    version of the library specified here is backwards incompatible
    with the version on PyPI and the one which was discussed before.
    In order to avoid confusing users of the current ipaddr, I've
    renamed this version of the library "ipaddress".

    The main differences between ipaddr and ipaddress are:

    * ipaddress *Network classes are equivalent to the ipaddr *Network
      class counterparts with the strict flag set to True.

    * ipaddress *Interface classes are equivalent to the ipaddr
      *Network class counterparts with the strict flag set to False.

    * The factory functions in ipaddress were renamed to disambiguate
      them from classes.

    * A few attributes were renamed to disambiguate their purpose as
      well. (eg. network, network_address)

    * A number of methods and functions which returned containers in ipaddr now
      return iterators. This includes, subnets, address_exclude,
      summarize_address_range and collapse_address_list.
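    Using the module as it later shipped in Python 3.3, the
    Network/Interface split described in the first two points looks
    like this:

```python
import ipaddress

# A Network with host bits set is rejected (ipaddr's strict=True behaviour)...
try:
    ipaddress.ip_network('192.0.2.1/24')
except ValueError as exc:
    print(exc)                      # 192.0.2.1/24 has host bits set

# ...unless strict=False, which masks the host bits away.
net = ipaddress.ip_network('192.0.2.1/24', strict=False)
print(net)                          # 192.0.2.0/24

# An Interface keeps the host address (ipaddr's strict=False behaviour).
iface = ipaddress.ip_interface('192.0.2.1/24')
print(iface, iface.network)         # 192.0.2.1/24 192.0.2.0/24
```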


    Due to the backwards incompatible API changes between ipaddress and ipaddr,
    the proposal is to add the module using the new provisional API status:

    * http://docs.python.org/dev/glossary.html#term-provisional-package


    Relevant messages on python-dev:

    * http://mail.python.org/pipermail/python-dev/2012-January/116016.html
    * http://mail.python.org/pipermail/python-dev/2012-February/116656.html
    * http://mail.python.org/pipermail/python-dev/2012-February/116688.html


Specification:

    The ipaddress module defines a total of 6 new public classes, 3 for
    manipulating IPv4 objects and 3 for manipulating IPv6 objects.
    The classes are as follows:

    IPv4Address/IPv6Address - These define individual addresses, for
    example the IPv4 address returned by an A record query for
    www.google.com (74.125.224.84) or the IPv6 address returned by a
    AAAA record query for ipv6.google.com (2001:4860:4001:801::1011).

    IPv4Network/IPv6Network - These define networks or groups of
    addresses, for example the IPv4 network reserved for multicast use
    (224.0.0.0/4) or the IPv6 network reserved for multicast
    (ff00::/8, wow, that's big).

    IPv4Interface/IPv6Interface - These hybrid classes refer to an
    individual address on a given network.  For example, the IPV4
    address 192.0.2.1 on the network 192.0.2.0/24 could be referred to
    as 192.0.2.1/24.  Likewise, the IPv6 address 2001:DB8::1 on the
    network 2001:DB8::/96 could be referred to as 2001:DB8::1/96.
    It's very common to refer to addresses assigned to computer
    network interfaces like this, hence the Interface name.
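    With the module as released in Python 3.3, the three kinds of
    object relate as follows:

```python
import ipaddress

addr = ipaddress.IPv4Address('74.125.224.84')    # an individual address
net = ipaddress.IPv4Network('224.0.0.0/4')       # the IPv4 multicast block
iface = ipaddress.IPv4Interface('192.0.2.1/24')  # an address on a network

print(ipaddress.IPv4Address('224.0.0.1') in net)  # True: membership test
print(iface.ip, iface.network)                    # 192.0.2.1 192.0.2.0/24
print(net.num_addresses)                          # 268435456 (2**28)
```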

    All IPv4 classes share certain characteristics and methods; the
    number of bits needed to represent them, whether or not they
    belong to certain special IPv4 network ranges, etc.  Similarly,
    all IPv6 classes share characteristics and methods.

    ipaddress makes extensive use of inheritance to avoid code
    duplication as much as possible.  The parent classes are private,
    but they are outlined here:

    _IPAddrBase - Provides methods common to all ipaddr objects.

    _BaseAddress - Provides methods common to IPv4Address and
    IPv6Address.

    _BaseInterface - Provides methods common to IPv4Interface and
    IPv6Interface, as well as IPv4Network and IPv6Network (ipaddress
    treats the Network classes as a special case of Interface).

    _BaseV4 - Provides methods and variables (eg, _max_prefixlen)
    common to all IPv4 classes.

    _BaseV6 - Provides methods and variables common to all IPv6 classes.

    Comparisons between objects of differing IP versions result in a
    TypeError [1].  Additionally, comparisons of objects with
    different _Base parent classes result in a TypeError.  The effect
    of the _Base parent class limitation is that IPv4Interface objects
    can be compared to IPv4Network objects, and IPv6Interface objects
    can be compared to IPv6Network objects.
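    In the module as released, mixed-version equality simply yields
    False, while ordered comparisons across versions raise TypeError:

```python
import ipaddress

v4 = ipaddress.IPv4Address('192.0.2.1')
v6 = ipaddress.IPv6Address('2001:db8::1')

print(v4 == v6)          # False: mixed-version addresses are just unequal
try:
    v4 < v6              # ordered comparison across versions is an error
except TypeError as exc:
    print(exc)
```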


Reference Implementation:

    The current reference implementation can be found at:
    http://code.google.com/p/ipaddress-py/source/browse/ipaddress.py

    Or see the tarball, which includes the README and unit tests:
    http://code.google.com/p/ipaddress-py/downloads/detail?name=ipaddress-1.0.tar.gz

    More information about using the reference implementation can be
    found at: http://code.google.com/p/ipaddr-py/wiki/Using3144


References:

    [1] Appealing to authority is a logical fallacy, but Vint Cerf is
        an authority who can't be ignored.  Full text of the email
        follows:

        """
        I have seen a substantial amount of traffic about IPv4 and
        IPv6 comparisons and the general consensus is that these are
        not comparable.

        If we were to take a very simple minded view, we might treat
        these as pure integers in which case there is an ordering but
        not a useful one.

        In the IPv4 world, "length" is important because we take
        longest (most specific) address first for routing.  Length is
        determined by the mask, as you know.

        Assuming that the same style of argument works in IPv6, we
        would have to conclude that treating an IPv6 value purely as
        an integer for comparison with IPv4 would lead to some really
        strange results.

        All of IPv4 space would lie in the host space of 0::0/96
        prefix of IPv6. For any useful interpretation of IPv4, this is
        a non-starter.

        I think the only sensible conclusion is that IPv4 values and
        IPv6 values should be treated as non-comparable.

        Vint
        """


Copyright:

    This document has been placed in the public domain.



pep-3145 Asynchronous I/O For subprocess.Popen

PEP: 3145
Title: Asynchronous I/O For subprocess.Popen
Version: $Revision$
Last-Modified: $Date$
Author: (James) Eric Pruitt, Charles R. McCreary, Josiah Carlson
Status: Withdrawn
Type: Standards Track
Content-Type: text/plain
Created: 04-Aug-2009
Python-Version: 3.2
Post-History: 

Abstract:

    In its present form, the subprocess.Popen implementation is prone to
    deadlocks and to blocking the parent Python script while waiting on
    data from the child process.  This PEP proposes to make
    subprocess.Popen more asynchronous to help alleviate these
    problems.


PEP Deferral:

    Further exploration of the concepts covered in this PEP has been deferred
    at least until after PEP 3156 has been resolved.


PEP Withdrawal:

    This can be dealt with in the bug tracker.  A specific proposal is
    attached to http://bugs.python.org/issue18823.


Motivation:

    A search for "python asynchronous subprocess" will turn up numerous
    accounts of people wanting to execute a child process and communicate with
    it from time to time reading only the data that is available instead of
    blocking to wait for the program to produce data [1] [2] [3].  The current
    behavior of the subprocess module is that when a user sends or receives
    data via the stdin, stderr and stdout file objects, deadlocks are common
    and documented [4] [5].  While communicate can be used to alleviate some of
    the buffering issues, it will still cause the parent process to block while
    attempting to read data when none is available to be read from the child
    process.  

Rationale:

    There is a documented need for asynchronous, non-blocking functionality in
    subprocess.Popen [6] [7] [2] [3].  Inclusion of the code would improve the
    utility of the Python standard library that can be used on Unix based and
    Windows builds of Python.  Practically every I/O object in Python has a
    file-like wrapper of some sort.  Sockets already act as such and for
    strings there is StringIO.  Popen can be made to act like a file by simply
    using the methods attached to the subprocess.Popen.stderr, stdout and
    stdin file-like objects.  But when using the read and write methods of
    those objects, you do not have the benefit of asynchronous I/O.  In the
    proposed solution the wrapper wraps the asynchronous methods to mimic a
    file object.

Reference Implementation:

    I have been maintaining a Google Code repository that contains all of my
    changes including tests and documentation [9] as well as blog detailing
    the problems I have come across in the development process [10].  

    I have been working on implementing non-blocking asynchronous I/O in the
    subprocess.Popen module as well as a wrapper class for subprocess.Popen
    that makes it so that an executed process can take the place of a file by
    duplicating all of the methods and attributes that file objects have.  

    There are two base functions that have been added to the subprocess.Popen
    class: Popen.send and Popen._recv, each with two separate implementations,
    one for Windows and one for Unix based systems.  The Windows
    implementation uses ctypes to access the functions needed to control pipes
    in the kernel 32 DLL in an asynchronous manner.  On Unix based systems,
    the Python interface for file control serves the same purpose.  The
    different implementations of Popen.send and Popen._recv have identical
    arguments to make code that uses these functions work across multiple
    platforms.  

    When calling the Popen._recv function, the pipe name must be passed
    as an argument, so the Popen.recv function exists to select stdout
    as the pipe for Popen._recv by default, and Popen.recv_err selects
    stderr by default.  Popen.recv and Popen.recv_err are much easier to
    read and understand than Popen._recv('stdout' ...) and
    Popen._recv('stderr' ...) respectively.

    Since the Popen._recv function does not wait on data to be produced
    before returning a value, it may return empty bytes. Popen.asyncread
    handles this issue by returning all data read over a given time
    interval.  

    The ProcessIOWrapper class uses the asyncread and asyncwrite functions to
    allow a process to act like a file so that there are no blocking issues
    that can arise from using the stdout and stdin file objects produced from
    a subprocess.Popen call.
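    Popen.send, Popen._recv, and ProcessIOWrapper never landed in the
    standard library; on Unix, a comparable non-blocking read can be
    sketched with the selectors module (illustrative only, not the API
    proposed here):

```python
import os
import selectors
import subprocess
import sys

# Read whatever a child writes without blocking indefinitely (Unix pipes).
proc = subprocess.Popen([sys.executable, '-c', "print('hello')"],
                        stdout=subprocess.PIPE)
sel = selectors.DefaultSelector()
sel.register(proc.stdout, selectors.EVENT_READ)

received = b''
while True:
    if not sel.select(timeout=2.0):
        break                    # nothing ready: a caller could do other work
    chunk = os.read(proc.stdout.fileno(), 4096)
    if not chunk:
        break                    # EOF: the child closed its end of the pipe
    received += chunk

sel.close()
proc.wait()
print(received.decode().strip())  # hello
```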
    

References:

    [1] [ python-Feature Requests-1191964 ] asynchronous Subprocess
        http://mail.python.org/pipermail/python-bugs-list/2006-December/
          036524.html

    [2] Daily Life in an Ivory Basement : /feb-07/problems-with-subprocess
        http://ivory.idyll.org/blog/feb-07/problems-with-subprocess

    [3] How can I run an external command asynchronously from Python? - Stack 
        Overflow
        http://stackoverflow.com/questions/636561/how-can-i-run-an-external-
          command-asynchronously-from-python

    [4] 18.1. subprocess - Subprocess management - Python v2.6.2 documentation
        http://docs.python.org/library/subprocess.html#subprocess.Popen.wait

    [5] 18.1. subprocess - Subprocess management - Python v2.6.2 documentation
        http://docs.python.org/library/subprocess.html#subprocess.Popen.kill

    [6] Issue 1191964: asynchronous Subprocess - Python tracker
        http://bugs.python.org/issue1191964

    [7] Module to allow Asynchronous subprocess use on Windows and Posix
        platforms - ActiveState Code
        http://code.activestate.com/recipes/440554/

    [8] subprocess.rst - subprocdev - Project Hosting on Google Code
        http://code.google.com/p/subprocdev/source/browse/doc/subprocess.rst?spec=svn2c925e935cad0166d5da85e37c742d8e7f609de5&r=2c925e935cad0166d5da85e37c742d8e7f609de5#437

    [9] subprocdev - Project Hosting on Google Code
        http://code.google.com/p/subprocdev

    [10] Python Subprocess Dev
         http://subdev.blogspot.com/

Copyright:

    This PEP is licensed under the Open Publication License;
    http://www.opencontent.org/openpub/.

pep-3146 Merging Unladen Swallow into CPython

PEP:3146
Title:Merging Unladen Swallow into CPython
Version:$Revision$
Last-Modified:$Date$
Author:Collin Winter <collinwinter at google.com>, Jeffrey Yasskin <jyasskin at google.com>, Reid Kleckner <rnk at mit.edu>
Status:Withdrawn
Type:Standards Track
Content-Type:text/x-rst
Created:1-Jan-2010
Python-Version:3.3
Post-History:

PEP Withdrawal

With Unladen Swallow going the way of the Norwegian Blue [1] [2], this PEP has been withdrawn.

Abstract

This PEP proposes the merger of the Unladen Swallow project [3] into CPython's source tree. Unladen Swallow is an open-source branch of CPython focused on performance. Unladen Swallow is source-compatible with valid Python 2.6.4 applications and C extension modules.

Unladen Swallow adds a just-in-time (JIT) compiler to CPython, allowing for the compilation of selected Python code to optimized machine code. Beyond classical static compiler optimizations, Unladen Swallow's JIT compiler takes advantage of data collected at runtime to make checked assumptions about code behaviour, allowing the production of faster machine code.

This PEP proposes to integrate Unladen Swallow into CPython's development tree in a separate py3k-jit branch, targeted for eventual merger with the main py3k branch. While Unladen Swallow is by no means finished or perfect, we feel that Unladen Swallow has reached sufficient maturity to warrant incorporation into CPython's roadmap. We have sought to create a stable platform that the wider CPython development team can build upon, a platform that will yield increasing performance for years to come.

This PEP will detail Unladen Swallow's implementation and how it differs from CPython 2.6.4; the benchmarks used to measure performance; the tools used to ensure correctness and compatibility; the impact on CPython's current platform support; and the impact on the CPython core development process. The PEP concludes with a proposed merger plan and brief notes on possible directions for future work.

We seek the following from the BDFL:

  • Approval for the overall concept of adding a just-in-time compiler to CPython, following the design laid out below.
  • Permission to continue working on the just-in-time compiler in the CPython source tree.
  • Permission to eventually merge the just-in-time compiler into the py3k branch once all blocking issues [32] have been addressed.
  • A pony.

Rationale, Implementation

Many companies and individuals would like Python to be faster, to enable its use in more projects. Google is one such company.

Unladen Swallow is a Google-sponsored branch of CPython, initiated to improve the performance of Google's numerous Python libraries, tools and applications. To make the adoption of Unladen Swallow as easy as possible, the project initially aimed at four goals:

  • A performance improvement of 5x over the baseline of CPython 2.6.4 for single-threaded code.
  • 100% source compatibility with valid CPython 2.6 applications.
  • 100% source compatibility with valid CPython 2.6 C extension modules.
  • Design for eventual merger back into CPython.

We chose 2.6.4 as our baseline because Google uses CPython 2.4 internally, and jumping directly from CPython 2.4 to CPython 3.x was considered infeasible.

To achieve the desired performance, Unladen Swallow has implemented a just-in-time (JIT) compiler [52] in the tradition of Urs Hoelzle's work on Self [53], gathering feedback at runtime and using that to inform compile-time optimizations. This is similar to the approach taken by the current breed of JavaScript engines [60], [61]; most Java virtual machines [65]; Rubinius [62], MacRuby [64], and other Ruby implementations; Psyco [66]; and others.

We explicitly reject any suggestion that our ideas are original. We have sought to reuse the published work of other researchers wherever possible. If we have done any original work, it is by accident. We have tried, as much as possible, to take good ideas from all corners of the academic and industrial community. A partial list of the research papers that have informed Unladen Swallow is available on the Unladen Swallow wiki [55].

The key observation about optimizing dynamic languages is that they are only dynamic in theory; in practice, each individual function or snippet of code is relatively static, using a stable set of types and child functions. The current CPython bytecode interpreter assumes the worst about the code it is running, that at any moment the user might override the len() function or pass a never-before-seen type into a function. In practice this never happens, but user code pays for that support. Unladen Swallow takes advantage of the relatively static nature of user code to improve performance.

At a high level, the Unladen Swallow JIT compiler works by translating a function's CPython bytecode to platform-specific machine code, using data collected at runtime, as well as classical compiler optimizations, to improve the quality of the generated machine code. Because we only want to spend resources compiling Python code that will actually benefit the runtime of the program, an online heuristic is used to assess how hot a given function is. Once the hotness value for a function crosses a given threshold, it is selected for compilation and optimization. Until a function is judged hot, however, it runs in the standard CPython eval loop, which in Unladen Swallow has been instrumented to record interesting data about each bytecode executed. This runtime data is used to reduce the flexibility of the generated machine code, allowing us to optimize for the common case. For example, we collect data on

  • Whether a branch was taken/not taken. If a branch is never taken, we will not compile it to machine code.
  • Types used by operators. If we find that a + b is only ever adding integers, the generated machine code for that snippet will not support adding floats.
  • Functions called at each callsite. If we find that a particular foo() callsite is always calling the same foo function, we can optimize the call or inline it away.

Refer to [56] for a complete list of data points gathered and how they are used.
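
The hotness heuristic can be illustrated with a simplified sketch. The threshold, point values, and class below are purely illustrative; the actual model is implemented in Unladen Swallow's instrumented eval loop and documented in [56]:

```python
# Simplified sketch of a JIT "hotness" heuristic: every function starts
# out interpreted, and once its hotness counter crosses a threshold it
# is handed to the compiler.  All numeric values here are illustrative.
HOTNESS_THRESHOLD = 10000

class FunctionState:
    def __init__(self, name):
        self.name = name
        self.hotness = 0
        self.compiled = False

def note_call(state, loop_iterations=0):
    """Record one call (plus any loop iterations) and compile if hot."""
    # Calls and loop back-edges both contribute, so a long-running loop
    # can make a function hot even if it is called only once.
    state.hotness += 10 + loop_iterations
    if not state.compiled and state.hotness > HOTNESS_THRESHOLD:
        state.compiled = True   # stand-in for "emit machine code"
    return state.compiled

f = FunctionState("parse_row")
for _ in range(999):
    note_call(f)                      # 999 calls: still interpreted
assert not f.compiled
note_call(f, loop_iterations=100)     # crosses the threshold
assert f.compiled
```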

However, if by chance the historically-untaken branch is now taken, or some integer-optimized a + b snippet receives two strings, we must support this. We cannot change Python semantics. Each of these sections of optimized machine code is preceded by a guard, which checks whether the simplifying assumptions we made when optimizing still hold. If the assumptions are still valid, we run the optimized machine code; if they are not, we revert back to the interpreter and pick up where we left off.
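
The guard-and-bail-out behaviour can be sketched in pure Python. This is a hypothetical int-specialized a + b site; the real guards are emitted inline in the generated machine code:

```python
# Sketch of a guarded, type-specialized "a + b" site.  Runtime feedback
# observed only ints here, so the fast path assumes ints; the guard
# checks that assumption and reverts to the generic path otherwise.
def generic_add(a, b):
    """Stand-in for falling back to the interpreter."""
    return a + b

def specialized_add(a, b):
    # Guard: do the simplifying assumptions made at compile time hold?
    if type(a) is int and type(b) is int:
        # Fast path: stands in for machine code specialized to int + int.
        return a + b
    # Guard failed: Python semantics must be preserved, so revert to
    # the generic path and pick up where we left off.
    return generic_add(a, b)

assert specialized_add(2, 3) == 5          # fast path
assert specialized_add("a", "b") == "ab"   # guard fails, generic path
```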

We have chosen to reuse a set of existing compiler libraries called LLVM [4] for code generation and code optimization. This has saved our small team from needing to understand and debug code generation on multiple machine instruction sets and from needing to implement a large set of classical compiler optimizations. The project would not have been possible without such code reuse. We have found LLVM easy to modify and its community receptive to our suggestions and modifications.

In somewhat more depth, Unladen Swallow's JIT works by compiling CPython bytecode to LLVM's own intermediate representation (IR) [96], taking into account any runtime data from the CPython eval loop. We then run a set of LLVM's built-in optimization passes, producing a smaller, optimized version of the original LLVM IR. LLVM then lowers the IR to platform-specific machine code, performing register allocation, instruction scheduling, and any necessary relocations. This arrangement of the compilation pipeline allows the LLVM-based JIT to be easily omitted from a compiled python binary by passing --without-llvm to ./configure; various use cases for this flag are discussed later.

For a complete detailing of how Unladen Swallow works, consult the Unladen Swallow documentation [54], [56].

Unladen Swallow has focused on improving the performance of single-threaded, pure-Python code. We have not made an effort to remove CPython's global interpreter lock (GIL); we feel this is separate from our work, and due to its sensitivity, is best done in a mainline development branch. We considered making GIL-removal a part of Unladen Swallow, but were concerned by the possibility of introducing subtle bugs when porting our work from CPython 2.6 to 3.x.

A JIT compiler is an extremely versatile tool, and we have by no means exhausted its full potential. We have tried to create a sufficiently flexible framework that the wider CPython development community can build upon it for years to come, extracting increased performance in each subsequent release.

Alternatives

There are a number of alternative strategies for improving Python performance which we considered, but found unsatisfactory.

  • Cython, Shedskin: Cython [103] and Shedskin [104] are both static compilers for Python. We view these as useful-but-limited workarounds for CPython's historically-poor performance. Shedskin does not support the full Python standard library [105], while Cython requires manual Cython-specific annotations for optimum performance.

    Static compilers like these are useful for writing extension modules without worrying about reference counting, but because they are static, ahead-of-time compilers, they cannot optimize the full range of code under consideration by a just-in-time compiler informed by runtime data.

  • IronPython: IronPython [108] is Python on Microsoft's .Net platform. It is not actively tested on Mono [109], meaning that it is essentially Windows-only, making it unsuitable as a general CPython replacement.

  • Jython: Jython [110] is a complete implementation of Python 2.5, but is significantly slower than Unladen Swallow (3-5x on measured benchmarks) and has no support for CPython extension modules [111], which would make migration of large applications prohibitively expensive.

  • Psyco: Psyco [66] is a specializing JIT compiler for CPython, implemented as an extension module. It primarily improves performance for numerical code. Pros: exists; makes some code faster. Cons: 32-bit only, with no plans for 64-bit support; supports x86 only; very difficult to maintain; incompatible with SSE2 optimized code due to alignment issues.

  • PyPy: PyPy [67] has good performance on numerical code, but is slower than Unladen Swallow on some workloads. Migration of large applications from CPython to PyPy would be prohibitively expensive: PyPy's JIT compiler supports only 32-bit x86 code generation; important modules, such as MySQLdb and pycrypto, do not build against PyPy; PyPy does not offer an embedding API, much less the same API as CPython.

  • PyV8: PyV8 [112] is an alpha-stage experimental Python-to-JavaScript compiler that runs on top of V8. PyV8 does not implement the whole Python language, and has no support for CPython extension modules.

  • WPython: WPython [106] is a wordcode-based reimplementation of CPython's interpreter loop. While it provides a modest improvement to interpreter performance [107], it is not a substitute for a just-in-time compiler. An interpreter will never be as fast as optimized machine code. We view WPython and similar interpreter enhancements as complementary to our work, rather than as competitors.

Performance

Benchmarks

Unladen Swallow has developed a fairly large suite of benchmarks, ranging from synthetic microbenchmarks designed to test a single feature up through whole-application macrobenchmarks. The inspiration for these benchmarks has come variously from third-party contributors (in the case of the html5lib benchmark), from Google's own internal workloads (slowspitfire, pickle, unpickle), and from tools and libraries in heavy use throughout the wider Python community (django, 2to3, spambayes). These benchmarks are run through a single interface called perf.py that takes care of collecting memory usage information, graphing performance, and running statistics on the benchmark results to ensure significance.

The full list of available benchmarks is available on the Unladen Swallow wiki [44], including instructions on downloading and running the benchmarks for yourself. All our benchmarks are open-source; none are Google-proprietary. We believe this collection of benchmarks serves as a useful tool to benchmark any complete Python implementation, and indeed, PyPy is already using these benchmarks for their own performance testing [82], [97]. We welcome this, and we seek additional workloads for the benchmark suite from the Python community.

We have focused our efforts on collecting macrobenchmarks and benchmarks that simulate real applications as well as possible, when running a whole application is not feasible. Along a different axis, our benchmark collection originally focused on the kinds of workloads seen by Google's Python code (webapps, text processing), though we have since expanded the collection to include workloads Google cares nothing about. We have so far shied away from heavily-numerical workloads, since NumPy [81] already does an excellent job on such code and so improving numerical performance was not an initial high priority for the team; we have begun to incorporate such benchmarks into the collection [98] and have started work on optimizing numerical Python code.

Beyond these benchmarks, there are also a variety of workloads we are explicitly not interested in benchmarking. Unladen Swallow is focused on improving the performance of pure-Python code, so the performance of extension modules like NumPy is uninteresting since NumPy's core routines are implemented in C. Similarly, workloads that involve a lot of I/O, like GUIs, databases or socket-heavy applications, would, we feel, fail to accurately measure interpreter or code generation optimizations. That said, there's certainly room to improve the performance of C-language extension modules in the standard library, and as such, we have added benchmarks for the cPickle and re modules.

Performance vs CPython 2.6.4

The charts below compare the arithmetic mean of multiple benchmark iterations for CPython 2.6.4 and Unladen Swallow. perf.py gathers more data than this, and indeed, arithmetic mean is not the whole story; we reproduce only the mean for the sake of conciseness. We include the t score from the Student's two-tailed T-test [45] at the 95% confidence interval to indicate the significance of the result. Most benchmarks are run for 100 iterations, though some longer-running whole-application benchmarks are run for fewer iterations.
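
The two-sample t score perf.py reports can be computed with the standard library alone. A minimal sketch (not perf.py's actual implementation), using the pooled-variance form and assuming equal sample sizes:

```python
import math
import statistics

def t_score(sample_a, sample_b):
    """Two-sample Student's t statistic (pooled variance, equal n)."""
    n = len(sample_a)
    assert len(sample_b) == n
    mean_a, mean_b = statistics.mean(sample_a), statistics.mean(sample_b)
    var_a = statistics.variance(sample_a)   # sample variance (n - 1)
    var_b = statistics.variance(sample_b)
    # Standard error of the difference of means for equal sample sizes.
    stderr = math.sqrt((var_a + var_b) / n)
    return (mean_a - mean_b) / stderr

# Toy data in the style of the django row: ~1.08 s vs ~0.80 s timings.
old = [1.08, 1.09, 1.07, 1.08, 1.10]
new = [0.80, 0.81, 0.79, 0.80, 0.80]
print("t = %.2f" % t_score(old, new))
```

A large |t| (relative to the critical value at the 95% confidence level) is what lets perf.py mark a result significant rather than noise.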

A description of each of these benchmarks is available on the Unladen Swallow wiki [44].

Command:

./perf.py -r -b default,apps ../a/python ../b/python

32-bit; gcc 4.0.3; Ubuntu Dapper; Intel Core2 Duo 6600 @ 2.4GHz; 2 cores; 4MB L2 cache; 4GB RAM

Benchmark CPython 2.6.4 Unladen Swallow r988 Change Significance Timeline
2to3 25.13 s 24.87 s 1.01x faster t=8.94 http://tinyurl.com/yamhrpg
django 1.08 s 0.80 s 1.35x faster t=315.59 http://tinyurl.com/y9mrn8s
html5lib 14.29 s 13.20 s 1.08x faster t=2.17 http://tinyurl.com/y8tyslu
nbody 0.51 s 0.28 s 1.84x faster t=78.007 http://tinyurl.com/y989qhg
rietveld 0.75 s 0.55 s 1.37x faster Insignificant http://tinyurl.com/ye7mqd3
slowpickle 0.75 s 0.55 s 1.37x faster t=20.78 http://tinyurl.com/ybrsfnd
slowspitfire 0.83 s 0.61 s 1.36x faster t=2124.66 http://tinyurl.com/yfknhaw
slowunpickle 0.33 s 0.26 s 1.26x faster t=15.12 http://tinyurl.com/yzlakoo
spambayes 0.31 s 0.34 s 1.10x slower Insignificant http://tinyurl.com/yem62ub

64-bit; gcc 4.2.4; Ubuntu Hardy; AMD Opteron 8214 HE @ 2.2 GHz; 4 cores; 1MB L2 cache; 8GB RAM

Benchmark CPython 2.6.4 Unladen Swallow r988 Change Significance Timeline
2to3 31.98 s 30.41 s 1.05x faster t=8.35 http://tinyurl.com/ybcrl3b
django 1.22 s 0.94 s 1.30x faster t=106.68 http://tinyurl.com/ybwqll6
html5lib 18.97 s 17.79 s 1.06x faster t=2.78 http://tinyurl.com/yzlyqvk
nbody 0.77 s 0.27 s 2.86x faster t=133.49 http://tinyurl.com/yeyqhbg
rietveld 0.74 s 0.80 s 1.08x slower t=-2.45 http://tinyurl.com/yzjc6ff
slowpickle 0.91 s 0.62 s 1.48x faster t=28.04 http://tinyurl.com/yf7en6k
slowspitfire 1.01 s 0.72 s 1.40x faster t=98.70 http://tinyurl.com/yc8pe2o
slowunpickle 0.51 s 0.34 s 1.51x faster t=32.65 http://tinyurl.com/yjufu4j
spambayes 0.43 s 0.45 s 1.06x slower Insignificant http://tinyurl.com/yztbjfp

Many of these benchmarks take a hit under Unladen Swallow because the current version blocks execution to compile Python functions down to machine code. This leads to the behaviour seen in the timeline graphs for the html5lib and rietveld benchmarks, for example, and slows down the overall performance of 2to3. We have an active development branch to fix this problem ([47], [48]), but working within the strictures of CPython's current threading system has complicated the process and required far more care and time than originally anticipated. We view this issue as critical to final merger into the py3k branch.

We have obviously not met our initial goal of a 5x performance improvement. A performance retrospective follows, which addresses why we failed to meet our initial performance goal. We maintain a list of yet-to-be-implemented performance work [51].

Memory Usage

The following table shows maximum memory usage (in kilobytes) for each of Unladen Swallow's default benchmarks for both CPython 2.6.4 and Unladen Swallow r988, as well as a timeline of memory usage across the lifetime of the benchmark. We include tables for both 32- and 64-bit binaries. Memory usage was measured on Linux 2.6 systems by summing the Private_ sections from the kernel's /proc/$pid/smaps pseudo-files [46].
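
The smaps measurement amounts to summing every Private_Clean and Private_Dirty field. A minimal sketch of that summation (run here against a captured smaps excerpt, since /proc is Linux-only):

```python
def private_kb(smaps_text):
    """Sum the Private_Clean/Private_Dirty sizes (in kB) from smaps data."""
    total = 0
    for line in smaps_text.splitlines():
        if line.startswith("Private_"):
            # Lines look like: "Private_Dirty:        96 kB"
            total += int(line.split()[1])
    return total

# Excerpt in the format of /proc/$pid/smaps; on a live Linux system one
# would read open("/proc/%d/smaps" % pid) instead of this sample.
sample = """\
00400000-00939000 r-xp 00000000 08:01 1050   /usr/bin/python
Size:               5348 kB
Rss:                3132 kB
Private_Clean:      3112 kB
Private_Dirty:         0 kB
7fff8e0f0000-7fff8e106000 rw-p 00000000 00:00 0   [stack]
Private_Clean:         0 kB
Private_Dirty:        96 kB
"""
print(private_kb(sample))   # 3112 + 0 + 0 + 96 = 3208
```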

Command:

./perf.py -r --track_memory -b default,apps ../a/python ../b/python

32-bit

Benchmark CPython 2.6.4 Unladen Swallow r988 Change Timeline
2to3 26396 kb 46896 kb 1.77x http://tinyurl.com/yhr2h4z
django 10028 kb 27740 kb 2.76x http://tinyurl.com/yhan8vs
html5lib 150028 kb 173924 kb 1.15x http://tinyurl.com/ybt44en
nbody 3020 kb 16036 kb 5.31x http://tinyurl.com/ya8hltw
rietveld 15008 kb 46400 kb 3.09x http://tinyurl.com/yhd5dra
slowpickle 4608 kb 16656 kb 3.61x http://tinyurl.com/ybukyvo
slowspitfire 85776 kb 97620 kb 1.13x http://tinyurl.com/y9vj35z
slowunpickle 3448 kb 13744 kb 3.98x http://tinyurl.com/yexh4d5
spambayes 7352 kb 46480 kb 6.32x http://tinyurl.com/yem62ub

64-bit

Benchmark CPython 2.6.4 Unladen Swallow r988 Change Timeline
2to3 51596 kb 82340 kb 1.59x http://tinyurl.com/yljg6rs
django 16020 kb 38908 kb 2.43x http://tinyurl.com/ylqsebh
html5lib 259232 kb 324968 kb 1.25x http://tinyurl.com/yha6oee
nbody 4296 kb 23012 kb 5.35x http://tinyurl.com/yztozza
rietveld 24140 kb 73960 kb 3.06x http://tinyurl.com/ybg2nq7
slowpickle 4928 kb 23300 kb 4.73x http://tinyurl.com/yk5tpbr
slowspitfire 133276 kb 148676 kb 1.11x http://tinyurl.com/y8bz2xe
slowunpickle 4896 kb 16948 kb 3.46x http://tinyurl.com/ygywwoc
spambayes 10728 kb 84992 kb 7.92x http://tinyurl.com/yhjban5

The increased memory usage comes from a) LLVM code generation, analysis and optimization libraries; b) native code; c) memory usage issues or leaks in LLVM; d) data structures needed to optimize and generate machine code; e) as-yet uncategorized other sources.

While we have made significant progress in reducing memory usage since the initial naive JIT implementation [43], there is obviously more to do. We believe that there are still memory savings to be made without sacrificing performance. We have tended to focus on raw performance, and we have not yet made a concerted push to reduce memory usage. We view reducing memory usage as a blocking issue for final merger into the py3k branch. We seek guidance from the community on an acceptable level of increased memory usage.

Start-up Time

Statically linking LLVM's code generation, analysis and optimization libraries increases the time needed to start the Python binary. C++ static initializers used by LLVM also increase start-up time, as does importing the collection of pre-compiled C runtime routines we want to inline to Python code.

Results from Unladen Swallow's startup benchmarks:

$ ./perf.py -r -b startup /tmp/cpy-26/bin/python /tmp/unladen/bin/python

### normal_startup ###
Min: 0.219186 -> 0.352075: 1.6063x slower
Avg: 0.227228 -> 0.364384: 1.6036x slower
Significant (t=-51.879098, a=0.95)
Stddev: 0.00762 -> 0.02532: 3.3227x larger
Timeline: http://tinyurl.com/yfe8z3r

### startup_nosite ###
Min: 0.105949 -> 0.264912: 2.5004x slower
Avg: 0.107574 -> 0.267505: 2.4867x slower
Significant (t=-703.557403, a=0.95)
Stddev: 0.00214 -> 0.00240: 1.1209x larger
Timeline: http://tinyurl.com/yajn8fa

### bzr_startup ###
Min: 0.067990 -> 0.097985: 1.4412x slower
Avg: 0.084322 -> 0.111348: 1.3205x slower
Significant (t=-37.432534, a=0.95)
Stddev: 0.00793 -> 0.00643: 1.2330x smaller
Timeline: http://tinyurl.com/ybdm537

### hg_startup ###
Min: 0.016997 -> 0.024997: 1.4707x slower
Avg: 0.026990 -> 0.036772: 1.3625x slower
Significant (t=-53.104502, a=0.95)
Stddev: 0.00406 -> 0.00417: 1.0273x larger
Timeline: http://tinyurl.com/ycout8m

bzr_startup and hg_startup measure how long it takes Bazaar and Mercurial, respectively, to display their help screens. startup_nosite runs python -S many times; usage of the -S option is rare, but we feel this gives a good indication of where increased startup time is coming from.
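
The normal_startup and startup_nosite measurements boil down to repeatedly timing a no-op interpreter launch. A rough stdlib sketch of the idea (not perf.py's actual implementation):

```python
import subprocess
import sys
import time

def startup_time(extra_args=(), runs=5):
    """Return the minimum wall-clock time to start python and exit."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run([sys.executable, *extra_args, "-c", "pass"],
                       check=True)
        timings.append(time.perf_counter() - start)
    # The minimum is the least noisy statistic for start-up costs.
    return min(timings)

normal = startup_time()
nosite = startup_time(["-S"])   # skip site.py, as startup_nosite does
print("normal: %.3fs  nosite: %.3fs" % (normal, nosite))
```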

Unladen Swallow has made headway toward optimizing startup time, but there is still more work to do and further optimizations to implement. Improving start-up time is a high-priority item [34] in Unladen Swallow's merger punchlist.

Binary Size

Statically linking LLVM's code generation, analysis and optimization libraries significantly increases the size of the python binary. The tables below report stripped on-disk binary sizes; the binaries are stripped to better correspond with the configurations used by system package managers. We feel this is the most realistic measure of any change in binary size.

Binary size CPython 2.6.4 CPython 3.1.1 Unladen Swallow r1041
32-bit 1.3M 1.4M 12M
64-bit 1.6M 1.6M 12M

The increased binary size is caused by statically linking LLVM's code generation, analysis and optimization libraries into the python binary. This can be straightforwardly addressed by modifying LLVM to better support shared linking and then using that, instead of the current static linking. For the moment, though, static linking provides an accurate look at the cost of linking against LLVM.

Even when statically linking, we believe there is still headroom to improve on-disk binary size by narrowing Unladen Swallow's dependencies on LLVM. This issue is actively being addressed [33].

Performance Retrospective

Our initial goal for Unladen Swallow was a 5x performance improvement over CPython 2.6. We did not hit that goal, nor, to put it bluntly, did we even come close. Why did the project not hit that goal, and can an LLVM-based JIT ever hit that goal?

Why did Unladen Swallow not achieve its 5x goal? The primary reason was that LLVM required more work than we had initially anticipated. Based on the fact that Apple was shipping products based on LLVM [83], and other high-level languages had successfully implemented LLVM-based JITs ([62], [64], [84]), we had assumed that LLVM's JIT was relatively free of show-stopper bugs.

That turned out to be incorrect. We had to turn our attention away from performance to fix a number of critical bugs in LLVM's JIT infrastructure (for example, [85], [86]) as well as a number of nice-to-have enhancements that would enable further optimizations along various axes (for example, [88], [87], [89]). LLVM's static code generation facilities, tools and optimization passes are stable and stress-tested, but the just-in-time infrastructure was relatively untested and buggy. We have fixed this.

(Our hypothesis is that we hit these problems -- problems other projects had avoided -- because of the complexity and thoroughness of CPython's standard library test suite.)

We also diverted engineering effort away from performance and into support tools such as gdb and oProfile. gdb did not work well with JIT compilers at all, and LLVM previously had no integration with oProfile. Having JIT-aware debuggers and profilers has been very valuable to the project, and we do not regret channeling our time in these directions. See the Debugging and Profiling sections for more information.

Can an LLVM-based CPython JIT ever hit the 5x performance target? The benchmark results for JIT-based JavaScript implementations suggest that 5x is indeed possible, as do the results PyPy's JIT has delivered for numeric workloads. The experience of Self-92 [53] is also instructive.

Can LLVM deliver this? We believe that we have only begun to scratch the surface of what our LLVM-based JIT can deliver. The optimizations we have incorporated into this system thus far have borne significant fruit (for example, [90], [91], [92]). Our experience to date is that the limiting factor on Unladen Swallow's performance is the engineering cycles needed to implement the literature. We have found LLVM easy to work with and to modify, and its built-in optimizations have greatly simplified the task of implementing Python-level optimizations.

An overview of further performance opportunities is discussed in the Future Work section.

Correctness and Compatibility

Unladen Swallow's correctness test suite includes CPython's test suite (under Lib/test/), as well as a number of important third-party applications and libraries [6]. A full list of these applications and libraries is reproduced below. Any dependencies needed by these packages, such as zope.interface [35], are also tested indirectly as a part of testing the primary package, thus widening the corpus of tested third-party Python code.

  • 2to3
  • Cheetah
  • cvs2svn
  • Django
  • Nose
  • NumPy
  • PyCrypto
  • pyOpenSSL
  • PyXML
  • Setuptools
  • SQLAlchemy
  • SWIG
  • SymPy
  • Twisted
  • ZODB

These applications pass all relevant tests when run under Unladen Swallow. Note that some tests that failed against our baseline of CPython 2.6.4 were disabled, as were tests that made assumptions about CPython internals such as exact bytecode numbers or bytecode format. Any package with disabled tests includes a README.unladen file that details the changes (for example, [38]).

In addition, Unladen Swallow is tested automatically against an array of internal Google Python libraries and applications. These include Google's internal Python bindings for BigTable [36], the Mondrian code review application [37], and Google's Python standard library, among others. The changes needed to run these projects under Unladen Swallow have consistently fallen into one of three camps:

  • Adding CPython 2.6 C API compatibility. Since Google still primarily uses CPython 2.4 internally, we have needed to convert uses of int to Py_ssize_t and similar API changes.
  • Fixing or disabling explicit, incorrect tests of the CPython version number.
  • Conditionally disabling code that worked around or depended on bugs in CPython 2.4 that have since been fixed.

Testing against this wide range of public and proprietary applications and libraries has been instrumental in ensuring the correctness of Unladen Swallow. Testing has exposed bugs that we have duly corrected. Our automated regression testing regime has given us high confidence in our changes as we have moved forward.

In addition to third-party testing, we have added further tests to CPython's test suite for corner cases of the language or implementation that we felt were untested or underspecified (for example, [49], [50]). These have been especially important when implementing optimizations, helping make sure we have not accidentally broken the darker corners of Python.

We have also constructed a test suite focused solely on the LLVM-based JIT compiler and the optimizations implemented for it [39]. Because of the complexity and subtlety inherent in writing an optimizing compiler, we have attempted to exhaustively enumerate the constructs, scenarios and corner cases we are compiling and optimizing. The JIT tests also include tests for things like the JIT hotness model, making it easier for future CPython developers to maintain and improve.

We have recently begun using fuzz testing [40] to stress-test the compiler. We have used both pyfuzz [41] and Fusil [42] in the past, and we recommend they be introduced as an automated part of the CPython testing process.
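
The idea behind fuzz testing the compiler can be sketched in a few lines: generate random but well-formed code, then check that compiling and running it never crashes the interpreter. This toy generator is far simpler than pyfuzz or Fusil, which mutate a much wider range of language constructs:

```python
import random

def random_expr(depth=0):
    """Generate a random, always-syntactically-valid arithmetic expression."""
    if depth > 3 or random.random() < 0.3:
        return str(random.randint(0, 100))
    op = random.choice(["+", "-", "*"])
    return "(%s %s %s)" % (random_expr(depth + 1), op, random_expr(depth + 1))

random.seed(0)   # deterministic for demonstration
for _ in range(1000):
    src = random_expr()
    code = compile(src, "<fuzz>", "eval")   # must never raise SyntaxError
    eval(code)                              # must never crash the VM
print("1000 random expressions compiled and evaluated")
```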

Known Incompatibilities

The only application or library we know to not work with Unladen Swallow that does work with CPython 2.6.4 is Psyco [66]. We are aware of some libraries such as PyGame [80] that work well with CPython 2.6.4, but suffer some degradation due to changes made in Unladen Swallow. We are tracking this issue [48] and are working to resolve these instances of degradation.

While Unladen Swallow is source-compatible with CPython 2.6.4, it is not binary compatible. C extension modules compiled against one will need to be recompiled to work with the other.

The merger of Unladen Swallow should have minimal impact on long-lived CPython optimization branches like WPython. WPython [106] and Unladen Swallow are largely orthogonal, and there is no technical reason why both could not be merged into CPython. The changes needed to make WPython compatible with a JIT-enhanced version of CPython should be minimal [115]. The same should be true for other CPython optimization projects (for example, [116]).

Invasive forks of CPython such as Stackless Python [117] are more challenging to support. Since Stackless is highly unlikely to be merged into CPython [118] and an increased maintenance burden is part and parcel of any fork, we consider compatibility with Stackless to be relatively low-priority. JIT-compiled stack frames use the C stack, so Stackless should be able to treat them the same as it treats calls through extension modules. If that turns out to be unacceptable, Stackless could either remove the JIT compiler or improve JIT code generation to better support heap-based stack frames [119], [120].

Platform Support

Unladen Swallow is inherently limited by the platform support provided by LLVM, especially LLVM's JIT compilation system [7]. LLVM's JIT has the best support on x86 and x86-64 systems, and these are the platforms where Unladen Swallow has received the most testing. We are confident in LLVM/Unladen Swallow's support for x86 and x86-64 hardware. PPC and ARM support exists, but is not widely used and may be buggy (for example, [101], [85], [102]).

Unladen Swallow is known to work on the following operating systems: Linux, Darwin, Windows. Unladen Swallow has received the most testing on Linux and Darwin, though it still builds and passes its tests on Windows.

In order to support hardware and software platforms where LLVM's JIT does not work, Unladen Swallow provides a ./configure --without-llvm option. This flag carves out any part of Unladen Swallow that depends on LLVM, yielding a Python binary that works and passes its tests, but has no performance advantages. This configuration is recommended for hardware unsupported by LLVM, or systems that care more about memory usage than performance.

Impact on CPython Development

Experimenting with Changes to Python or CPython Bytecode

Unladen Swallow's JIT compiler operates on CPython bytecode, and as such, it is immune to Python language changes that affect only the parser.
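
CPython's dis module shows the bytecode stream that is the JIT compiler's input; for example (the exact opcodes vary by CPython version, so no output is reproduced here):

```python
import dis

def add_one(x):
    return x + 1

# The JIT compiler consumes exactly this kind of bytecode (LOAD_FAST,
# LOAD_CONST, an add opcode, RETURN_VALUE on most versions) and
# translates it, opcode by opcode, into LLVM IR.
dis.dis(add_one)
```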

We recommend that changes to the CPython bytecode compiler or the semantics of individual bytecodes be prototyped in the interpreter loop first, then be ported to the JIT compiler once the semantics are clear. To make this easier, Unladen Swallow includes a --without-llvm configure-time option that strips out the JIT compiler and all associated infrastructure. This leaves the current burden of experimentation unchanged so that developers can prototype in the current low-barrier-to-entry interpreter loop.

Unladen Swallow began implementing its JIT compiler by doing straightforward, naive translations from bytecode implementations into LLVM API calls. We found this process to be easily understood, and we recommend the same approach for CPython. We include several sample changes from the Unladen Swallow repository here as examples of this style of development: [26], [27], [28], [29].

Debugging

The Unladen Swallow team implemented changes to gdb to make it easier to use gdb to debug JIT-compiled Python code. These changes were released in gdb 7.0 [17]. They make it possible for gdb to identify and unwind past JIT-generated call stack frames. This allows gdb to continue to function as before for CPython development if one is changing, for example, the list type or builtin functions.

Example backtrace after our changes, where baz, bar and foo are JIT-compiled:

Program received signal SIGSEGV, Segmentation fault.
0x00002aaaabe7d1a8 in baz ()
(gdb) bt
#0 0x00002aaaabe7d1a8 in baz ()
#1 0x00002aaaabe7d12c in bar ()
#2 0x00002aaaabe7d0aa in foo ()
#3 0x00002aaaabe7d02c in main ()
#4 0x0000000000b870a2 in llvm::JIT::runFunction (this=0x1405b70, F=0x14024e0, ArgValues=...)
at /home/rnk/llvm-gdb/lib/ExecutionEngine/JIT/JIT.cpp:395
#5 0x0000000000baa4c5 in llvm::ExecutionEngine::runFunctionAsMain
(this=0x1405b70, Fn=0x14024e0, argv=..., envp=0x7fffffffe3c0)
at /home/rnk/llvm-gdb/lib/ExecutionEngine/ExecutionEngine.cpp:377
#6 0x00000000007ebd52 in main (argc=2, argv=0x7fffffffe3a8,
envp=0x7fffffffe3c0) at /home/rnk/llvm-gdb/tools/lli/lli.cpp:208

Previously, the JIT-compiled frames would have caused gdb to unwind incorrectly, generating lots of obviously-incorrect #6 0x00002aaaabe7d0aa in ?? ()-style stack frames.

Highlights:

  • gdb 7.0 is able to correctly parse JIT-compiled stack frames, allowing full use of gdb on non-JIT-compiled functions, that is, the vast majority of the CPython codebase.
  • Disassembling inside a JIT-compiled stack frame automatically prints the full list of instructions making up that function. This is an advance over the state of gdb before our work: developers needed to guess the starting address of the function and manually disassemble the assembly code.
  • Flexible underlying mechanism allows CPython to add more and more information, and eventually reach parity with C/C++ support in gdb for JIT-compiled machine code.

Lowlights:

  • gdb cannot print local variables or tell you what line you're currently executing inside a JIT-compiled function. Nor can it step through JIT-compiled code, except for one instruction at a time.
  • Not yet integrated with Apple's gdb or Microsoft's Visual Studio debuggers.

The Unladen Swallow team is working with Apple to get these changes incorporated into their future gdb releases.

Profiling

Unladen Swallow integrates with oProfile 0.9.4 and newer [18] to support assembly-level profiling on Linux systems. This means that oProfile will correctly symbolize JIT-compiled functions in its reports.

Example report, where the #u#-prefixed symbol names are JIT-compiled Python functions:

$ opreport -l ./python | less
CPU: Core 2, speed 1600 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000
samples % image name symbol name
79589 4.2329 python PyString_FromFormatV
62971 3.3491 python PyEval_EvalCodeEx
62713 3.3354 python tupledealloc
57071 3.0353 python _PyEval_CallFunction
50009 2.6597 24532.jo #u#force_unicode
47468 2.5246 python PyUnicodeUCS2_Decode
45829 2.4374 python PyFrame_New
45173 2.4025 python lookdict_string
43082 2.2913 python PyType_IsSubtype
39763 2.1148 24532.jo #u#render5
38145 2.0287 python _PyType_Lookup
37643 2.0020 python PyObject_GC_UnTrack
37105 1.9734 python frame_dealloc
36849 1.9598 python PyEval_EvalFrame
35630 1.8950 24532.jo #u#resolve
33313 1.7717 python PyObject_IsInstance
33208 1.7662 python PyDict_GetItem
33168 1.7640 python PyTuple_New
30458 1.6199 python PyCFunction_NewEx

This support is functional, but as-yet unpolished. Unladen Swallow maintains a punchlist of items we feel are important to improve in our oProfile integration to make it more useful to core CPython developers [19].

Highlights:

  • Symbolization of JITted frames working in oProfile on Linux.

Lowlights:

  • No work yet invested in improving symbolization of JIT-compiled frames for Apple's Shark [20] or Microsoft's Visual Studio profiling tools.
  • Some polishing still desired for oProfile output.

We recommend using oProfile 0.9.5 or newer, which fixes a bug in oProfile 0.9.4 affecting x86-64 platforms; oProfile 0.9.4 works fine on 32-bit platforms.

Given the ease of integrating oProfile with LLVM [21] and Unladen Swallow [22], other profiling tools should be easy as well, provided they support a similar JIT interface [23].

We have documented the process for using oProfile to profile Unladen Swallow [24]. This document will be merged into CPython's Doc/ tree in the merge.

Addition of C++ to CPython

In order to use LLVM, Unladen Swallow has introduced C++ into the core CPython tree and build process. This is an unavoidable part of depending on LLVM; though LLVM offers a C API [8], it is limited and does not expose the functionality needed by CPython. Because of this, we have implemented the internal details of the Unladen Swallow JIT and its supporting infrastructure in C++. We do not propose converting the entire CPython codebase to C++.

Highlights:

  • Easy use of LLVM's full, powerful code generation and related APIs.
  • Convenient, abstract data structures simplify code.
  • C++ is limited to relatively small corners of the CPython codebase.
  • C++ can be disabled via ./configure --without-llvm, which even omits the dependency on libstdc++.

Lowlights:

  • Developers must know two related languages, C and C++, to work on the full range of CPython's internals.
  • A C++ style guide will need to be developed and enforced. PEP 7 will be extended [121] to encompass C++ by taking the relevant parts of the C++ style guides from Unladen Swallow [71], LLVM [72] and Google [73].
  • Different C++ compilers emit different ABIs; this can cause problems if CPython is compiled with one C++ compiler and extension modules are compiled with a different one.

Managing LLVM Releases, C++ API Changes

LLVM is released on a regular six-month schedule. This means that LLVM may be released two or three times during the course of development of a CPython 3.x release. Each LLVM release brings newer and more powerful optimizations, improved platform support and more sophisticated code generation.

LLVM releases usually include incompatible changes to the LLVM C++ API; the release notes for LLVM 2.6 [9] include a list of intentionally-introduced incompatibilities. Unladen Swallow has tracked LLVM trunk closely over the course of development. Our experience has been that LLVM API changes are obvious and easily or mechanically remedied. We include two such changes from the Unladen Swallow tree as references here: [10], [11].

Due to API incompatibilities, we recommend that an LLVM-based CPython target compatibility with a single version of LLVM at a time. This will lower the overhead on the core development team. Pegging to an LLVM version should not be a problem from a packaging perspective, because pre-built LLVM packages generally become available via standard system package managers fairly quickly following an LLVM release, and failing that, llvm.org itself includes binary releases.

Unladen Swallow has historically included a copy of the LLVM and Clang source trees in the Unladen Swallow tree; this was done to allow us to closely track LLVM trunk as we made patches to it. We do not recommend this model of development for CPython. CPython releases should be based on official LLVM releases. Pre-built LLVM packages are available from MacPorts [12] for Darwin, and from most major Linux distributions ([13], [14], [16]). LLVM itself provides additional binaries, such as for MinGW [25].

LLVM is currently intended to be statically linked; this means that binary releases of CPython will include the relevant parts (not all!) of LLVM. This will increase the binary size, as noted above. To simplify downstream package management, we will modify LLVM to better support shared linking. This issue will block final merger [99].

Unladen Swallow has tasked a full-time engineer with fixing any remaining critical issues in LLVM before LLVM's 2.7 release. We consider it essential that CPython 3.x be able to depend on a released version of LLVM, rather than closely tracking LLVM trunk as Unladen Swallow has done. We believe we will finish this work [100] before the release of LLVM 2.7, expected in May 2010.

Building CPython

In addition to a runtime dependency on LLVM, Unladen Swallow includes a build-time dependency on Clang [5], an LLVM-based C/C++ compiler. We use this to compile parts of the C-language Python runtime to LLVM's intermediate representation; this allows us to perform cross-language inlining, yielding increased performance. Clang is not required to run Unladen Swallow. Clang binary packages are available from most major Linux distributions (for example, [15]).

We examined the impact of Unladen Swallow on the time needed to build Python, including configure, full builds and incremental builds after touching a single C source file.

./configure     CPython 2.6.4   CPython 3.1.1   Unladen Swallow r988
Run 1           0m20.795s       0m16.558s       0m15.477s
Run 2           0m15.255s       0m16.349s       0m15.391s
Run 3           0m15.228s       0m16.299s       0m15.528s

Full make       CPython 2.6.4   CPython 3.1.1   Unladen Swallow r988
Run 1           1m30.776s       1m22.367s       1m54.053s
Run 2           1m21.374s       1m22.064s       1m49.448s
Run 3           1m22.047s       1m23.645s       1m49.305s

Full builds take a hit due to a) additional .cc files needed for LLVM interaction, b) statically linking LLVM into libpython, c) compiling parts of the Python runtime to LLVM IR to enable cross-language inlining.

Incremental builds are also somewhat slower than mainline CPython. The table below shows incremental rebuild times after touching Objects/listobject.c.

Incr make       CPython 2.6.4   CPython 3.1.1   Unladen Swallow r1024
Run 1           0m1.854s        0m1.456s        0m6.680s
Run 2           0m1.437s        0m1.442s        0m5.310s
Run 3           0m1.440s        0m1.425s        0m7.639s

As with full builds, this extra time comes from statically linking LLVM into libpython. If libpython were linked shared against LLVM, this overhead would go down.

Proposed Merge Plan

We propose focusing our efforts on eventual merger with CPython's 3.x line of development. The BDFL has indicated that 2.7 is to be the final release of CPython's 2.x line of development [30], and since 2.7 alpha 1 has already been released [31], we have missed the window. Python 3 is the future, and that is where we will target our performance efforts.

We recommend the following plan for merger of Unladen Swallow into the CPython source tree:

  • Creation of a branch in the CPython SVN repository to work in, call it py3k-jit as a strawman. This will be a branch of the CPython py3k branch.
  • We will keep this branch closely integrated to py3k. The further we deviate, the harder our work will be.
  • Any JIT-related patches will go into the py3k-jit branch.
  • Non-JIT-related patches will go into the py3k branch (once reviewed and approved) and be merged back into the py3k-jit branch.
  • Potentially-contentious issues, such as the introduction of new command line flags or environment variables, will be discussed on python-dev.

Because Google uses CPython 2.x internally, Unladen Swallow is based on CPython 2.6. We would need to port our compiler to Python 3; this would be done as patches are applied to the py3k-jit branch, so that the branch remains a consistent implementation of Python 3 at all times.

We believe this approach will be minimally disruptive to the 3.2 or 3.3 release process while we iron out any remaining issues blocking final merger into py3k. Unladen Swallow maintains a punchlist of issues that must be resolved before final merger [32], which includes all problems mentioned in this PEP; we trust the CPython community will have its own concerns. This punchlist is not static; other issues may emerge in the future that will block final merger into the py3k branch.

Changes will be committed directly to the py3k-jit branch, with only large, tricky or controversial changes sent for pre-commit code review.

Contingency Plans

There is a chance that we will not be able to reduce memory usage or startup time to a level satisfactory to the CPython community. Our primary contingency plan for this situation is to shift from an online just-in-time compilation strategy to an offline ahead-of-time strategy using an instrumented CPython interpreter loop to obtain feedback. This is the same model used by gcc's feedback-directed optimizations (-fprofile-generate) [113] and Microsoft Visual Studio's profile-guided optimizations [114]; we will refer to this as "feedback-directed optimization" here, or FDO.

We believe that an FDO compiler for Python would be inferior to a JIT compiler. FDO requires a high-quality, representative benchmark suite, which is a relative rarity in both open- and closed-source development. A JIT compiler can dynamically find and optimize the hot spots in any application -- benchmark suite or no -- allowing it to adapt to changes in application bottlenecks without human intervention.

If an ahead-of-time FDO compiler is required, it should be able to leverage a large percentage of the code and infrastructure already developed for Unladen Swallow's JIT compiler. Indeed, these two compilation strategies could exist side-by-side.
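The FDO shape described above can be sketched in plain Python, with ordinary functions standing in for compiled code: an instrumented "training" run gathers a profile, and a separate offline phase then optimizes (here: memoizes) only the functions the profile marked hot. The function names and the threshold are hypothetical, chosen purely for illustration.

```python
import functools
from collections import Counter

profile = Counter()  # call counts gathered during the training run

def instrumented(fn):
    # Wrap a function so the "interpreter" records how often it runs.
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        profile[fn.__name__] += 1
        return fn(*args, **kwargs)
    return wrapper

@instrumented
def fib(n):
    return n if n < 2 else fib(n - 1) + fib(n - 2)

fib(10)  # training run on a (hopefully representative) workload

# Offline phase: optimize only what the profile says is hot.
HOT_THRESHOLD = 50
hot = {name for name, count in profile.items() if count >= HOT_THRESHOLD}
if 'fib' in hot:
    fib = functools.lru_cache(maxsize=None)(fib)
```

A real FDO build would persist the profile to disk and feed it into an ahead-of-time compile, as gcc's -fprofile-generate/-fprofile-use pair does; the quality of the result depends entirely on how representative the training workload is, which is the weakness noted below.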

Future Work

A JIT compiler is an extremely flexible tool, and we have by no means exhausted its full potential. Unladen Swallow maintains a list of promising performance optimizations [51] that the team has not yet had time to implement. Examples:

  • Python/Python inlining [68]. Our compiler currently performs no inlining between pure-Python functions. Work on this is on-going [70].
  • Unboxing [69]. Unboxing is critical for numerical performance. PyPy in particular has demonstrated the value of unboxing to heavily-numeric workloads.
  • Recompilation, adaptation. Unladen Swallow currently only compiles a Python function once, based on its usage pattern up to that point. If the usage pattern changes, limitations in LLVM [74] prevent us from recompiling the function to better serve the new usage pattern.
  • JIT-compile regular expressions. Modern JavaScript engines reuse their JIT compilation infrastructure to boost regex performance [75]. Unladen Swallow has developed benchmarks for Python regular expression performance ([76], [77], [78]), but work on regex performance is still at an early stage [79].
  • Trace compilation [93], [94]. Based on the results of PyPy and TraceMonkey [95], we believe that a CPython JIT should incorporate trace compilation to some degree. We initially avoided a purely-tracing JIT compiler in favor of a simpler, function-at-a-time compiler. However, this function-at-a-time compiler has laid the groundwork for a future tracing compiler implemented in the same terms.
  • Profile generation/reuse. The runtime data gathered by the JIT could be persisted to disk and reused by subsequent JIT compilations, or by external tools such as Cython [103] or a feedback-enhanced code coverage tool.

This list is by no means exhaustive. There is a vast literature on optimizations for dynamic languages that could and should be implemented in terms of Unladen Swallow's LLVM-based JIT compiler [55].

Unladen Swallow Community

We would like to thank the community of developers who have contributed to Unladen Swallow, in particular: James Abbatiello, Joerg Blank, Eric Christopher, Alex Gaynor, Chris Lattner, Nick Lewycky, Evan Phoenix and Thomas Wouters.

Licensing

All work on Unladen Swallow is licensed to the Python Software Foundation (PSF) under the terms of the Python Software Foundation License v2 [57] under the umbrella of Google's blanket Contributor License Agreement with the PSF.

LLVM is licensed [58] under the University of Illinois/NCSA Open Source License [59], a liberal, OSI-approved license. The University of Illinois Urbana-Champaign is the sole copyright holder for LLVM.

References

[1]http://qinsb.blogspot.com/2011/03/unladen-swallow-retrospective.html
[2]http://en.wikipedia.org/wiki/Dead_Parrot_sketch
[3]http://code.google.com/p/unladen-swallow/
[4]http://llvm.org/
[5]http://clang.llvm.org/
[6]http://code.google.com/p/unladen-swallow/wiki/Testing
[7]http://llvm.org/docs/GettingStarted.html#hardware
[8]http://llvm.org/viewvc/llvm-project/llvm/trunk/include/llvm-c/
[9]http://llvm.org/releases/2.6/docs/ReleaseNotes.html#whatsnew
[10]http://code.google.com/p/unladen-swallow/source/detail?r=820
[11]http://code.google.com/p/unladen-swallow/source/detail?r=532
[12]http://trac.macports.org/browser/trunk/dports/lang/llvm/Portfile
[13]http://packages.ubuntu.com/karmic/llvm
[14]http://packages.debian.org/unstable/devel/llvm
[15]http://packages.debian.org/sid/clang
[16]http://koji.fedoraproject.org/koji/buildinfo?buildID=134384
[17]http://www.gnu.org/software/gdb/download/ANNOUNCEMENT
[18]http://oprofile.sourceforge.net/news/
[19]http://code.google.com/p/unladen-swallow/issues/detail?id=63
[20]http://developer.apple.com/tools/sharkoptimize.html
[21]http://llvm.org/viewvc/llvm-project?view=rev&revision=75279
[22]http://code.google.com/p/unladen-swallow/source/detail?r=986
[23]http://oprofile.sourceforge.net/doc/devel/jit-interface.html
[24]http://code.google.com/p/unladen-swallow/wiki/UsingOProfile
[25]http://llvm.org/releases/download.html
[26]http://code.google.com/p/unladen-swallow/source/detail?r=359
[27]http://code.google.com/p/unladen-swallow/source/detail?r=376
[28]http://code.google.com/p/unladen-swallow/source/detail?r=417
[29]http://code.google.com/p/unladen-swallow/source/detail?r=517
[30]http://mail.python.org/pipermail/python-dev/2010-January/095682.html
[31]http://www.python.org/dev/peps/pep-0373/
[32](1, 2) http://code.google.com/p/unladen-swallow/issues/list?q=label:Merger
[33]http://code.google.com/p/unladen-swallow/issues/detail?id=118
[34]http://code.google.com/p/unladen-swallow/issues/detail?id=64
[35]http://www.zope.org/Products/ZopeInterface
[36]http://en.wikipedia.org/wiki/BigTable
[37]http://www.niallkennedy.com/blog/2006/11/google-mondrian.html
[38]http://code.google.com/p/unladen-swallow/source/browse/tests/lib/sqlalchemy/README.unladen
[39]http://code.google.com/p/unladen-swallow/source/browse/trunk/Lib/test/test_llvm.py
[40]http://en.wikipedia.org/wiki/Fuzz_testing
[41]http://bitbucket.org/ebo/pyfuzz/overview/
[42]http://lwn.net/Articles/322826/
[43]http://code.google.com/p/unladen-swallow/issues/detail?id=68
[44](1, 2) http://code.google.com/p/unladen-swallow/wiki/Benchmarks
[45]http://en.wikipedia.org/wiki/Student's_t-test
[46]http://bmaurer.blogspot.com/2006/03/memory-usage-with-smaps.html
[47]http://code.google.com/p/unladen-swallow/source/browse/branches/background-thread
[48](1, 2) http://code.google.com/p/unladen-swallow/issues/detail?id=40
[49]http://code.google.com/p/unladen-swallow/source/detail?r=888
[50]http://code.google.com/p/unladen-swallow/source/diff?spec=svn576&r=576&format=side&path=/trunk/Lib/test/test_trace.py
[51](1, 2) http://code.google.com/p/unladen-swallow/issues/list?q=label:Performance
[52]http://en.wikipedia.org/wiki/Just-in-time_compilation
[53](1, 2) http://research.sun.com/self/papers/urs-thesis.html
[54]http://code.google.com/p/unladen-swallow/wiki/ProjectPlan
[55](1, 2) http://code.google.com/p/unladen-swallow/wiki/RelevantPapers
[56](1, 2) http://code.google.com/p/unladen-swallow/source/browse/trunk/Python/llvm_notes.txt
[57]http://www.python.org/psf/license/
[58]http://llvm.org/docs/DeveloperPolicy.html#clp
[59]http://www.opensource.org/licenses/UoI-NCSA.php
[60]http://code.google.com/p/v8/
[61]http://webkit.org/blog/214/introducing-squirrelfish-extreme/
[62](1, 2) http://rubini.us/
[63]http://lists.parrot.org/pipermail/parrot-dev/2009-September/002811.html
[64](1, 2) http://www.macruby.org/
[65]http://en.wikipedia.org/wiki/HotSpot
[66](1, 2, 3) http://psyco.sourceforge.net/
[67]http://codespeak.net/pypy/dist/pypy/doc/
[68]http://en.wikipedia.org/wiki/Inline_expansion
[69]http://en.wikipedia.org/wiki/Object_type_(object-oriented_programming%29
[70]http://code.google.com/p/unladen-swallow/issues/detail?id=86
[71]http://code.google.com/p/unladen-swallow/wiki/StyleGuide
[72]http://llvm.org/docs/CodingStandards.html
[73]http://google-styleguide.googlecode.com/svn/trunk/cppguide.xml
[74]http://code.google.com/p/unladen-swallow/issues/detail?id=41
[75]http://code.google.com/p/unladen-swallow/wiki/ProjectPlan#Regular_Expressions
[76]http://code.google.com/p/unladen-swallow/source/browse/tests/performance/bm_regex_compile.py
[77]http://code.google.com/p/unladen-swallow/source/browse/tests/performance/bm_regex_v8.py
[78]http://code.google.com/p/unladen-swallow/source/browse/tests/performance/bm_regex_effbot.py
[79]http://code.google.com/p/unladen-swallow/issues/detail?id=13
[80]http://www.pygame.org/
[81]http://numpy.scipy.org/
[82]http://codespeak.net:8099/plotsummary.html
[83]http://llvm.org/Users.html
[84]http://www.ffconsultancy.com/ocaml/hlvm/
[85](1, 2) http://llvm.org/PR5201
[86]http://llvm.org/viewvc/llvm-project?view=rev&revision=76828
[87]http://llvm.org/viewvc/llvm-project?rev=91611&view=rev
[88]http://llvm.org/viewvc/llvm-project?rev=85182&view=rev
[89]http://llvm.org/PR5735
[90]http://code.google.com/p/unladen-swallow/issues/detail?id=73
[91]http://code.google.com/p/unladen-swallow/issues/detail?id=88
[92]http://code.google.com/p/unladen-swallow/issues/detail?id=67
[93]http://www.ics.uci.edu/~franz/Site/pubs-pdf/C44Prepub.pdf
[94]http://www.ics.uci.edu/~franz/Site/pubs-pdf/ICS-TR-07-12.pdf
[95]https://wiki.mozilla.org/JavaScript:TraceMonkey
[96]http://llvm.org/docs/LangRef.html
[97]http://code.google.com/p/unladen-swallow/issues/detail?id=120
[98]http://code.google.com/p/unladen-swallow/source/browse/tests/performance/bm_nbody.py
[100]http://code.google.com/p/unladen-swallow/issues/detail?id=131
[101]http://llvm.org/PR4816
[102]http://llvm.org/PR6065
[103](1, 2) http://www.cython.org/
[104]http://shed-skin.blogspot.com/
[105]http://shedskin.googlecode.com/files/shedskin-tutorial-0.3.html
[106](1, 2) http://code.google.com/p/wpython/
[107]http://www.mail-archive.com/python-dev@python.org/msg45143.html
[108]http://ironpython.net/
[109]http://www.mono-project.com/
[110]http://www.jython.org/
[111]http://wiki.python.org/jython/JythonFaq/GeneralInfo
[112]http://code.google.com/p/pyv8/
[113]http://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html
[114]http://msdn.microsoft.com/en-us/library/e7k32f4k.aspx
[115]http://www.mail-archive.com/python-dev@python.org/msg44962.html
[116]http://portal.acm.org/citation.cfm?id=1534530.1534550
[117]http://www.stackless.com/
[118]http://mail.python.org/pipermail/python-dev/2004-June/045165.html
[119]http://www.nondot.org/sabre/LLVMNotes/ExplicitlyManagedStackFrames.txt
[120]http://old.nabble.com/LLVM-and-coroutines-microthreads-td23080883.html
[121]http://www.mail-archive.com/python-dev@python.org/msg45544.html

pep-3147 PYC Repository Directories

PEP:3147
Title:PYC Repository Directories
Version:$Revision$
Last-Modified:$Date$
Author:Barry Warsaw <barry at python.org>
Status:Final
Type:Standards Track
Content-Type:text/x-rst
Created:2009-12-16
Python-Version:3.2
Post-History:2010-01-30, 2010-02-25, 2010-03-03, 2010-04-12
Resolution:http://mail.python.org/pipermail/python-dev/2010-April/099414.html

Abstract

This PEP describes an extension to Python's import mechanism which improves sharing of Python source code files among multiple installed versions of the Python interpreter. It does this by allowing more than one byte compilation file (.pyc files) to be co-located with the Python source file (.py file). The extension described here can also be used to support different Python compilation caches, such as JIT output that may be produced by an Unladen Swallow [1] enabled CPython.

Background

CPython compiles its source code into "byte code", and for performance reasons, it caches this byte code on the file system whenever the source file changes. This makes loading of Python modules much faster because the compilation phase can be bypassed. When your source file is foo.py, CPython caches the byte code in a foo.pyc file right next to the source.

Byte code files contain two 32-bit little-endian numbers followed by the marshaled [2] code object. The 32-bit numbers represent a magic number and a timestamp. The magic number changes whenever Python changes the byte code format, e.g. by adding new byte codes to its virtual machine. This ensures that pyc files built for previous versions of the VM won't cause problems. The timestamp is used to make sure that the pyc file matches the py file that was used to create it. When either the magic number or timestamp do not match, the py file is recompiled and a new pyc file is written.
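The 8-byte header layout described here (a 4-byte magic number followed by a 4-byte timestamp, both written little-endian by CPython) can be modeled directly with the struct module. The magic value below is invented for illustration; note also that CPython 3.7 and later grew the header to 16 bytes.

```python
import struct

def make_header(magic: bytes, mtime: int) -> bytes:
    # 4-byte magic number + 4-byte little-endian source timestamp.
    assert len(magic) == 4
    return magic + struct.pack('<I', mtime)

def parse_header(header: bytes):
    # Split an 8-byte pyc header back into its two fields.
    magic = header[:4]
    (mtime,) = struct.unpack('<I', header[4:8])
    return magic, mtime

# Round-trip a synthetic header (the magic bytes here are made up).
hdr = make_header(b'\x0c\r\r\n', 1270000000)
magic, mtime = parse_header(hdr)
```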

In practice, it is well known that pyc files are not compatible across Python major releases. A reading of import.c [3] in the Python source code proves that within recent memory, every new CPython major release has bumped the pyc magic number.

Rationale

Linux distributions such as Ubuntu [4] and Debian [5] provide more than one Python version at the same time to their users. For example, Ubuntu 9.10 Karmic Koala users can install Python 2.5, 2.6, and 3.1, with Python 2.6 being the default.

This causes a conflict for third party Python source files installed by the system, because you cannot compile a single Python source file for more than one Python version at a time. When Python finds a pyc file with a non-matching magic number, it falls back to the slower process of recompiling the source. Thus if your system installed a /usr/share/python/foo.py, two different versions of Python would fight over the pyc file and rewrite it each time the source is compiled. (The standard library is unaffected by this, since multiple versions of the stdlib are installed on such distributions.)

Furthermore, in order to ease the burden on operating system packagers for these distributions, the distribution packages do not contain Python version numbers [6]; they are shared across all Python versions installed on the system. Putting Python version numbers in the packages would be a maintenance nightmare, since all the packages - and their dependencies - would have to be updated every time a new Python release was added or removed from the distribution. Because of the sheer number of packages available, this amount of work is infeasible.

(PEP 384 [7] has been proposed to address binary compatibility issues of third party extension modules across different versions of Python.)

Because these distributions cannot share pyc files, elaborate mechanisms have been developed to put the resulting pyc files in non-shared locations while the source code is still shared. Examples include the symlink-based Debian regimes python-support [8] and python-central [9]. These approaches make for much more complicated, fragile, inscrutable, and fragmented policies for delivering Python applications to a wide range of users. Arguably more users get Python from their operating system vendor than from upstream tarballs. Thus, solving this pyc sharing problem for CPython is a high priority for such vendors.

This PEP proposes a solution to this problem.

Proposal

Python's import machinery is extended to write and search for byte code cache files in a single directory inside every Python package directory. This directory will be called __pycache__.

Further, pyc file names will contain a magic string (called a "tag") that differentiates the Python version they were compiled for. This allows multiple byte compiled cache files to co-exist for a single Python source file.

The magic tag is implementation defined, but should contain the implementation name and a version number shorthand, e.g. cpython-32. It must be unique among all versions of Python, and whenever the magic number is bumped, a new magic tag must be defined. An example pyc file for Python 3.2 is thus foo.cpython-32.pyc.

The magic tag is available in the imp module via the get_tag() function. This is parallel to the imp.get_magic() function.
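The naming rule can be written as a small helper. In later CPython releases the tag became available as sys.implementation.cache_tag and the full rule as importlib.util.cache_from_source(); this toy version takes the tag explicitly so it works for any implementation's tag:

```python
import os

def cache_path(py_path: str, tag: str) -> str:
    # PEP 3147 layout: <dir>/foo.py -> <dir>/__pycache__/foo.<tag>.pyc
    dirname, base = os.path.split(py_path)
    stem, _ = os.path.splitext(base)
    return os.path.join(dirname, '__pycache__', stem + '.' + tag + '.pyc')

p = cache_path(os.path.join('alpha', 'one.py'), 'cpython-32')
```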

This scheme has the added benefit of reducing the clutter in a Python package directory.

When a Python source file is imported for the first time, a __pycache__ directory will be created in the package directory, if one does not already exist. The pyc file for the imported source will be written to the __pycache__ directory, using the magic-tag formatted name. If either the creation of the __pycache__ directory or the writing of the pyc file inside it fails, the import will still succeed, just as it does in a pre-PEP-3147 world.

If the py source file is missing, the pyc file inside __pycache__ will be ignored. This eliminates the problem of accidental stale pyc file imports.

For backward compatibility, Python will still support pyc-only distributions; however, it will only do so when the pyc file lives in the directory where the py file would have been, i.e. not in the __pycache__ directory. A pyc file outside of __pycache__ will only be imported if the py source file is missing.

Tools such as py_compile [15] and compileall [16] will be extended to create PEP 3147 formatted layouts automatically, but will have an option to create pyc-only distribution layouts.
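On any interpreter that implements this PEP (Python 3.2 and later), the py_compile behavior can be observed directly; this small demonstration compiles a throwaway module in a temporary directory and checks where the output landed:

```python
import os
import py_compile
import sys
import tempfile

with tempfile.TemporaryDirectory() as d:
    src = os.path.join(d, 'foo.py')
    with open(src, 'w') as f:
        f.write('x = 1\n')
    # py_compile.compile returns the path of the pyc file it wrote.
    pyc = py_compile.compile(src)
    in_cache = os.path.basename(os.path.dirname(pyc)) == '__pycache__'
    tagged = sys.implementation.cache_tag in os.path.basename(pyc)
```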

Examples

What would this look like in practice?

Let's say we have a Python package named alpha which contains a sub-package named beta. The source directory layout before byte compilation might look like this:

alpha/
    __init__.py
    one.py
    two.py
    beta/
        __init__.py
        three.py
        four.py

After byte compiling this package with Python 3.2, you would see the following layout:

alpha/
    __pycache__/
        __init__.cpython-32.pyc
        one.cpython-32.pyc
        two.cpython-32.pyc
    __init__.py
    one.py
    two.py
    beta/
        __pycache__/
            __init__.cpython-32.pyc
            three.cpython-32.pyc
            four.cpython-32.pyc
        __init__.py
        three.py
        four.py

Note: listing order may differ depending on the platform.

Let's say that two new versions of Python are installed, one is Python 3.3 and another is Unladen Swallow. After byte compilation, the file system would look like this:

alpha/
    __pycache__/
        __init__.cpython-32.pyc
        __init__.cpython-33.pyc
        __init__.unladen-10.pyc
        one.cpython-32.pyc
        one.cpython-33.pyc
        one.unladen-10.pyc
        two.cpython-32.pyc
        two.cpython-33.pyc
        two.unladen-10.pyc
    __init__.py
    one.py
    two.py
    beta/
        __pycache__/
            __init__.cpython-32.pyc
            __init__.cpython-33.pyc
            __init__.unladen-10.pyc
            three.cpython-32.pyc
            three.cpython-33.pyc
            three.unladen-10.pyc
            four.cpython-32.pyc
            four.cpython-33.pyc
            four.unladen-10.pyc
        __init__.py
        three.py
        four.py

As you can see, as long as the Python version identifier string is unique, any number of pyc files can co-exist. These identifier strings are described in more detail below.

A nice property of this layout is that the __pycache__ directories can generally be ignored, such that a normal directory listing would show something like this:

alpha/
    __pycache__/
    __init__.py
    one.py
    two.py
    beta/
        __pycache__/
        __init__.py
        three.py
        four.py

This is much less cluttered than even today's Python.

Python behavior

When Python searches for a module to import (say foo), it may find one of several situations. As per current Python rules, the term "matching pyc" means that the magic number matches the current interpreter's magic number, and the source file's timestamp matches the timestamp in the pyc file exactly.

Case 0: The steady state

When Python is asked to import module foo, it searches for a foo.py file (or foo package, but that's not important for this discussion) along its sys.path. If found, Python looks to see if there is a matching __pycache__/foo.<magic>.pyc file, and if so, that pyc file is loaded.

Case 1: The first import

When Python locates the foo.py, if the __pycache__/foo.<magic>.pyc file is missing, Python will create it, also creating the __pycache__ directory if necessary. Python will parse and byte compile the foo.py file and save the byte code in __pycache__/foo.<magic>.pyc.

Case 2: The second import

When Python is asked to import module foo a second time (in a different process of course), it will again search for the foo.py file along its sys.path. When Python locates the foo.py file, it looks for a matching __pycache__/foo.<magic>.pyc and finding this, it reads the byte code and continues as usual.

Case 3: __pycache__/foo.<magic>.pyc with no source

It's possible that the foo.py file somehow got removed, while leaving the cached pyc file still on the file system. If the __pycache__/foo.<magic>.pyc file exists, but the foo.py file used to create it does not, Python will raise an ImportError when asked to import foo. In other words, Python will not import a pyc file from the cache directory unless the source file exists.

Case 4: legacy pyc files and source-less imports

Python will ignore all legacy pyc files when a source file exists next to them. In other words, if a foo.pyc file exists next to the foo.py file, the pyc file will be ignored in all cases.

In order to continue to support source-less distributions though, if the source file is missing, Python will import a lone pyc file if it lives where the source file would have been.

Case 5: read-only file systems

When the source lives on a read-only file system, or the __pycache__ directory or pyc file cannot otherwise be written, all the same rules apply. This is also the case when the __pycache__ directory exists but its permissions do not allow the pyc files inside it to be written.
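The cases above can be condensed into one decision function. This is a hypothetical sketch of the lookup logic only, ignoring magic-number and timestamp staleness checks:

```python
def import_action(source_exists: bool, cached_pyc_exists: bool,
                  legacy_pyc_exists: bool) -> str:
    # Decide what the import system does for module foo, given which
    # of foo.py, __pycache__/foo.<magic>.pyc and foo.pyc exist.
    if source_exists:
        # A legacy foo.pyc next to foo.py is ignored outright (Case 4).
        if cached_pyc_exists:
            return 'load __pycache__ pyc'            # Cases 0 and 2
        return 'compile and write __pycache__ pyc'   # Case 1
    if legacy_pyc_exists:
        return 'load legacy pyc'   # source-less distribution (Case 4)
    return 'ImportError'           # Case 3, or module not found at all
```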

Flow chart

Here is a flow chart describing how modules are loaded:

[flow chart image: pep-3147-1.png]

Alternative Python implementations

Alternative Python implementations such as Jython [11], IronPython [12], PyPy [13], Pynie [14], and Unladen Swallow can also use the __pycache__ directory to store whatever compilation artifacts make sense for their platforms. For example, Jython could store the class file for the module in __pycache__/foo.jython-32.class.

Implementation strategy

This feature is targeted for Python 3.2, solving the problem for those and all future versions. It may be back-ported to Python 2.7. Vendors are free to backport the changes to earlier distributions as they see fit. For backports of this feature to Python 2, when the -U flag is used, a file such as foo.cpython-27u.pyc can be written.

Effects on existing code

Adoption of this PEP will affect existing code and idioms, both inside Python and outside. This section enumerates some of these effects.

Detecting PEP 3147 availability

The easiest way to detect whether your version of Python provides PEP 3147 functionality is to do the following check:

>>> import imp
>>> has3147 = hasattr(imp, 'get_tag')

__file__

In Python 3, when you import a module, its __file__ attribute points to its source py file (in Python 2, it points to the pyc file). A package's __file__ points to the py file for its __init__.py. E.g.:

>>> import foo
>>> foo.__file__
'foo.py'
# baz is a package
>>> import baz
>>> baz.__file__
'baz/__init__.py'

Nothing in this PEP would change the semantics of __file__.

This PEP proposes the addition of a __cached__ attribute to modules, which will always point to the actual pyc file that was read or written. When the environment variable $PYTHONDONTWRITEBYTECODE is set, or the -B option is given, or if the source lives on a read-only file system, then the __cached__ attribute will point to the location that the pyc file would have been written to had writing been possible. This location of course includes the __pycache__ subdirectory in its path.

For alternative Python implementations which do not support pyc files, the __cached__ attribute may point to whatever information makes sense. E.g. on Jython, this might be the .class file for the module: __pycache__/foo.jython-32.class. Some implementations may use multiple compiled files to create the module, in which case __cached__ may be a tuple. The exact contents of __cached__ are Python implementation specific.

It is recommended that when nothing sensible can be calculated, implementations should set the __cached__ attribute to None.

py_compile and compileall

Python comes with two modules, py_compile [15] and compileall [16] which support compiling Python modules external to the built-in import machinery. py_compile in particular has intimate knowledge of byte compilation, so these will be updated to understand the new layout. The -b flag is added to compileall for writing legacy .pyc byte-compiled file path names.

bdist_wininst and the Windows installer

These tools also compile modules explicitly on installation. If they do not use py_compile and compileall, then they would also have to be modified to understand the new layout.

File extension checks

There exists some code which checks for files ending in .pyc and simply chops off the last character to find the matching .py file. This code will obviously fail once this PEP is implemented.

To support this use case, we'll add two new methods to the imp package [17]:

  • imp.cache_from_source(py_path) -> pyc_path
  • imp.source_from_cache(pyc_path) -> py_path

Alternative implementations are free to override these functions to return reasonable values based on their own support for this PEP. These methods are allowed to return None when the implementation (or PEP 302 loader [18] in effect) for whatever reason cannot calculate the appropriate file name. They should not raise exceptions.
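On interpreters that implement this PEP, the two functions form a symmetric round trip. In current CPython these helpers live in importlib.util (the imp module was later deprecated), so a runnable sketch looks like:

```python
from importlib import util

# cache_from_source maps a source path to its __pycache__ pyc path;
# source_from_cache maps it back.  The magic tag in the pyc file name
# varies with the running interpreter.
pyc = util.cache_from_source('foo.py')
src = util.source_from_cache(pyc)
```

Tools that previously chopped the trailing "c" off a .pyc name can use this pair instead of string surgery.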

Backports

For versions of Python earlier than 3.2 (and possibly 2.7), it is possible to backport this PEP. However, in Python 3.2 (and possibly 2.7), this behavior will be turned on by default, and in fact, it will replace the old behavior. Backports will need to support the old layout by default. We suggest supporting PEP 3147 through the use of an environment variable called $PYTHONENABLECACHEDIR or the command line switch -Xenablecachedir to enable the feature.

Makefiles and other dependency tools

Makefiles and other tools which calculate dependencies on .pyc files (e.g. to byte-compile the source if the .pyc is missing) will have to be updated to check the new paths.

Alternatives

This section describes some alternative approaches or details that were considered and rejected during the PEP's development.

Hexadecimal magic tags

pyc files inside of the __pycache__ directories contain a magic tag in their file names. These are mnemonic tags for the actual magic numbers used by the importer. We could have used the hexadecimal representation [10] of the binary magic number as a unique identifier. For example, in Python 3.2:

>>> from binascii import hexlify
>>> from imp import get_magic
>>> 'foo.{}.pyc'.format(hexlify(get_magic()).decode('ascii'))
'foo.580c0d0a.pyc'

This isn't particularly human friendly though, thus the magic tag proposed in this PEP.

PEP 304

There is some overlap between the goals of this PEP and PEP 304 [19], which has been withdrawn. However, PEP 304 would allow a user to create a shadow file system hierarchy in which to store pyc files, and this concept of a shadow hierarchy could be used to satisfy the aims of this PEP. Although PEP 304 does not indicate why it was withdrawn, shadow directories have a number of problems. The location of the shadow pyc files would not be easily discovered and would depend on the proper and consistent use of the $PYTHONBYTECODE environment variable both by the system and by end users. There are also global implications: while the system might want to shadow pyc files, individual users might not, yet the PEP defines only an all-or-nothing approach.

As an example of the problem, a common (though fragile) Python idiom for locating data files is to do something like this:

from os import dirname, join
import foo.bar
data_file = join(dirname(foo.bar.__file__), 'my.dat')

This would be problematic since foo.bar.__file__ will give the location of the pyc file in the shadow directory, and it may not be possible to find the my.dat file relative to the source directory from there.

Fat byte compilation files

An earlier version of this PEP described "fat" Python byte code files. These files would contain the equivalent of multiple pyc files in a single pyf file, with a lookup table keyed off the appropriate magic number. This was an extensible file format so that the first 5 parallel Python implementations could be supported fairly efficiently, but with extension lookup tables available to scale pyf byte code objects as large as necessary.

The fat byte compilation files were fairly complex, and inherently introduced difficult race conditions, so the current simplification of using directories was suggested. The same problem applies to using zip files as the fat pyc file format.

Multiple file extensions

The PEP author also considered an approach where multiple thin byte compiled files lived in the same place, but used different file extensions to designate the Python version. E.g. foo.pyc25, foo.pyc26, foo.pyc31 etc. This was rejected because of the clutter involved in writing so many different files. The multiple extension approach makes it more difficult (and an ongoing task) to update any tools that are dependent on the file extension.

.pyc

A proposal was floated to call the __pycache__ directory .pyc or some other dot-file name. This would have the effect on *nix systems of hiding the directory. There are many reasons why this was rejected by the BDFL [20] including the fact that dot-files are only special on some platforms, and we actually do not want to hide these completely from users.

Reference implementation

Work on this code is tracked in a Bazaar branch on Launchpad [22] until it's ready for merge into Python 3.2. The work-in-progress diff can also be viewed [23] and is updated automatically as new changes are uploaded.

A Rietveld code review issue [24] has been opened as of 2010-04-01 (no, this is not an April Fools joke :).

ACKNOWLEDGMENTS

Barry Warsaw's original idea was for fat Python byte code files. Martin von Loewis reviewed an early draft of the PEP and suggested the simplification to store traditional pyc and pyo files in a directory. Many other people reviewed early versions of this PEP and provided useful feedback including but not limited to:

  • David Malcolm
  • Josselin Mouette
  • Matthias Klose
  • Michael Hudson
  • Michael Vogt
  • Piotr Ożarowski
  • Scott Kitterman
  • Toshio Kuratomi

pep-3148 futures - execute computations asynchronously

PEP:3148
Title:futures - execute computations asynchronously
Version:$Revision$
Last-Modified:$Date$
Author:Brian Quinlan <brian at sweetapp.com>
Status:Final
Type:Standards Track
Content-Type:text/x-rst
Created:16-Oct-2009
Python-Version:3.2
Post-History:

Abstract

This PEP proposes a design for a package that facilitates the evaluation of callables using threads and processes.

Motivation

Python currently has powerful primitives to construct multi-threaded and multi-process applications, but parallelizing simple operations requires a lot of work, i.e. explicitly launching processes/threads, constructing a work/results queue, and waiting for completion or some other termination condition (e.g. failure, timeout). It is also difficult to design an application with a global process/thread limit when each component invents its own parallel execution strategy.

Specification

Naming

The proposed package would be called "futures" and would live in a new "concurrent" top-level package. The rationale for pushing the futures library into a "concurrent" namespace has several components. The first and simplest is to prevent any confusion with the long-standing "from __future__ import x" idiom. Additionally, the "concurrent" prefix fully denotes what the library relates to, namely concurrency, and should clear up any additional ambiguity; it has been noted that not everyone in the community is familiar with Java Futures, or with the term "futures" except as it relates to the US stock market.

Finally, we are carving out a new namespace for the standard library, obviously named "concurrent". We hope to either add, or move existing, concurrency-related libraries to it in the future. A prime example is the multiprocessing.Pool work, as well as other "addons" included in that module, which work across thread and process boundaries.

Interface

The proposed package provides two core classes: Executor and Future. An Executor receives asynchronous work requests (in terms of a callable and its arguments) and returns a Future to represent the execution of that work request.

Executor

Executor is an abstract class that provides methods to execute calls asynchronously.

submit(fn, *args, **kwargs)

Schedules the callable to be executed as fn(*args, **kwargs) and returns a Future instance representing the execution of the callable.

This is an abstract method and must be implemented by Executor subclasses.

map(func, *iterables, timeout=None)

Equivalent to map(func, *iterables) but func is executed asynchronously and several calls to func may be made concurrently. The returned iterator raises a TimeoutError if __next__() is called and the result isn't available after timeout seconds from the original call to map(). If timeout is not specified or None then there is no limit to the wait time. If a call raises an exception then that exception will be raised when its value is retrieved from the iterator.

shutdown(wait=True)

Signal the executor that it should free any resources that it is using when the currently pending futures are done executing. Calls to Executor.submit and Executor.map made after shutdown will raise RuntimeError.

If wait is True then this method will not return until all the pending futures are done executing and the resources associated with the executor have been freed. If wait is False then this method will return immediately and the resources associated with the executor will be freed when all pending futures are done executing. Regardless of the value of wait, the entire Python program will not exit until all pending futures are done executing.

__enter__()
__exit__(exc_type, exc_val, exc_tb)
When using an executor as a context manager, __exit__ will call Executor.shutdown(wait=True).
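The context-manager form therefore guarantees a clean shutdown on exit. A minimal sketch, using the package's import idiom:

```python
from concurrent import futures

with futures.ThreadPoolExecutor(max_workers=2) as executor:
    future = executor.submit(pow, 2, 10)
    print(future.result())  # 1024
# On leaving the with block, executor.shutdown(wait=True) has been
# called, so all pending futures are done and resources are freed.
```

The result remains available on the future after the executor has shut down.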

ProcessPoolExecutor

The ProcessPoolExecutor class is an Executor subclass that uses a pool of processes to execute calls asynchronously. The callable objects and arguments passed to ProcessPoolExecutor.submit must be pickleable according to the same limitations as the multiprocessing module.

Calling Executor or Future methods from within a callable submitted to a ProcessPoolExecutor will result in deadlock.

__init__(max_workers)

Executes calls asynchronously using a pool of at most max_workers processes. If max_workers is None or not given then as many worker processes will be created as the machine has processors.

ThreadPoolExecutor

The ThreadPoolExecutor class is an Executor subclass that uses a pool of threads to execute calls asynchronously.

Deadlock can occur when the callable associated with a Future waits on the results of another Future. For example:

import time
def wait_on_b():
    time.sleep(5)
    print(b.result())  # b will never complete because it is waiting on a.
    return 5

def wait_on_a():
    time.sleep(5)
    print(a.result())  # a will never complete because it is waiting on b.
    return 6


executor = ThreadPoolExecutor(max_workers=2)
a = executor.submit(wait_on_b)
b = executor.submit(wait_on_a)

And:

def wait_on_future():
    f = executor.submit(pow, 5, 2)
    # This will never complete because there is only one worker thread and
    # it is executing this function.
    print(f.result())

executor = ThreadPoolExecutor(max_workers=1)
executor.submit(wait_on_future)

__init__(max_workers)

Executes calls asynchronously using a pool of at most max_workers threads.

Future Objects

The Future class encapsulates the asynchronous execution of a callable. Future instances are returned by Executor.submit.

cancel()

Attempt to cancel the call. If the call is currently being executed then it cannot be cancelled and the method will return False, otherwise the call will be cancelled and the method will return True.

cancelled()

Return True if the call was successfully cancelled.

running()

Return True if the call is currently being executed and cannot be cancelled.

done()

Return True if the call was successfully cancelled or finished running.

result(timeout=None)

Return the value returned by the call. If the call hasn't yet completed then this method will wait up to timeout seconds. If the call hasn't completed in timeout seconds then a TimeoutError will be raised. If timeout is not specified or None then there is no limit to the wait time.

If the future is cancelled before completing then CancelledError will be raised.

If the call raised then this method will raise the same exception.

exception(timeout=None)

Return the exception raised by the call. If the call hasn't yet completed then this method will wait up to timeout seconds. If the call hasn't completed in timeout seconds then a TimeoutError will be raised. If timeout is not specified or None then there is no limit to the wait time.

If the future is cancelled before completing then CancelledError will be raised.

If the call completed without raising then None is returned.
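The difference between result() and exception() for a failing call can be sketched as follows (failing_call is a hypothetical workload):

```python
from concurrent import futures


def failing_call():
    raise ValueError('bad input')


with futures.ThreadPoolExecutor(max_workers=1) as executor:
    f = executor.submit(failing_call)

# exception() hands back the ValueError instead of raising it;
# f.result() would re-raise that same ValueError in the caller's thread.
exc = f.exception()
```

This makes exception() convenient for inspecting failures without try/except around every result retrieval.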

add_done_callback(fn)

Attaches a callable fn to the future that will be called when the future is cancelled or finishes running. fn will be called with the future as its only argument.

Added callables are called in the order that they were added and are always called in a thread belonging to the process that added them. If the callable raises an Exception then it will be logged and ignored. If the callable raises another BaseException then behavior is not defined.

If the future has already completed or been cancelled then fn will be called immediately.
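Because an already-finished future invokes a newly added callable immediately in the adding thread, a deterministic sketch of the callback mechanism is:

```python
from concurrent import futures

results = []


def on_done(future):
    # Called with the finished future as its only argument.
    results.append(future.result())


with futures.ThreadPoolExecutor(max_workers=1) as executor:
    f = executor.submit(lambda: 6 * 7)
    f.result()                    # wait until the call has finished
    f.add_done_callback(on_done)  # future is done, so on_done runs now
```

In the general case the callback may instead fire later, in a thread belonging to the process that added it, as described above.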

Internal Future Methods

The following Future methods are meant for use in unit tests and Executor implementations.

set_running_or_notify_cancel()

Should be called by Executor implementations before executing the work associated with the Future.

If the method returns False then the Future was cancelled, i.e. Future.cancel was called and returned True. Any threads waiting on the Future completing (i.e. through as_completed() or wait()) will be woken up.

If the method returns True then the Future was not cancelled and has been put in the running state, i.e. calls to Future.running() will return True.

This method can only be called once and cannot be called after Future.set_result() or Future.set_exception() have been called.

set_result(result)

Sets the result of the work associated with the Future.

set_exception(exception)

Sets the result of the work associated with the Future to the given Exception.

Module Functions

wait(fs, timeout=None, return_when=ALL_COMPLETED)

Wait for the Future instances (possibly created by different Executor instances) given by fs to complete. Returns a named 2-tuple of sets. The first set, named "done", contains the futures that completed (finished or were cancelled) before the wait completed. The second set, named "not_done", contains uncompleted futures.

timeout can be used to control the maximum number of seconds to wait before returning. If timeout is not specified or None then there is no limit to the wait time.

return_when indicates when the method should return. It must be one of the following constants:

Constant Description
FIRST_COMPLETED The method will return when any future finishes or is cancelled.
FIRST_EXCEPTION The method will return when any future finishes by raising an exception. If no future raises an exception then it is equivalent to ALL_COMPLETED.
ALL_COMPLETED The method will return when all calls finish.
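A minimal sketch of wait() with FIRST_COMPLETED, using a deliberately quick and a deliberately slow call (both workloads are hypothetical):

```python
import time
from concurrent import futures

with futures.ThreadPoolExecutor(max_workers=2) as executor:
    fast = executor.submit(lambda: 'fast')
    slow = executor.submit(time.sleep, 1)
    done, not_done = futures.wait([fast, slow],
                                  return_when=futures.FIRST_COMPLETED)
# "done" holds at least the quick future; the sleeping one is normally
# still in "not_done" at the moment wait() returned.
```

The two sets always partition the futures passed in, regardless of which return_when constant is used.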

as_completed(fs, timeout=None)

Returns an iterator over the Future instances given by fs that yields futures as they complete (finished or were cancelled). Any futures that completed before as_completed() was called will be yielded first. The returned iterator raises a TimeoutError if __next__() is called and the result isn't available after timeout seconds from the original call to as_completed(). If timeout is not specified or None then there is no limit to the wait time.

The Future instances can have been created by different Executor instances.

Check Prime Example

from concurrent import futures
import math

PRIMES = [
    112272535095293,
    112582705942171,
    112272535095293,
    115280095190773,
    115797848077099,
    1099726899285419]

def is_prime(n):
    if n < 2:
        return False
    if n == 2:
        return True
    if n % 2 == 0:
        return False

    sqrt_n = int(math.floor(math.sqrt(n)))
    for i in range(3, sqrt_n + 1, 2):
        if n % i == 0:
            return False
    return True

def main():
    with futures.ProcessPoolExecutor() as executor:
        for number, prime in zip(PRIMES, executor.map(is_prime,
                                                      PRIMES)):
            print('%d is prime: %s' % (number, prime))

if __name__ == '__main__':
    main()

Web Crawl Example

from concurrent import futures
import urllib.request

URLS = ['http://www.foxnews.com/',
        'http://www.cnn.com/',
        'http://europe.wsj.com/',
        'http://www.bbc.co.uk/',
        'http://some-made-up-domain.com/']

def load_url(url, timeout):
    return urllib.request.urlopen(url, timeout=timeout).read()

def main():
    with futures.ThreadPoolExecutor(max_workers=5) as executor:
        future_to_url = dict(
            (executor.submit(load_url, url, 60), url)
             for url in URLS)

        for future in futures.as_completed(future_to_url):
            url = future_to_url[future]
            try:
                print('%r page is %d bytes' % (
                          url, len(future.result())))
            except Exception as e:
                print('%r generated an exception: %s' % (
                          url, e))

if __name__ == '__main__':
    main()

Rationale

The proposed design of this module was heavily influenced by the Java java.util.concurrent package [1]. The conceptual basis of the module, as in Java, is the Future class, which represents the progress and result of an asynchronous computation. The Future class makes little commitment to the evaluation mode being used, e.g. it can be used to represent lazy or eager evaluation, for evaluation using threads, processes or remote procedure call.

Futures are created by concrete implementations of the Executor class (called ExecutorService in Java). The reference implementation provides classes that use either a process or a thread pool to eagerly evaluate computations.

Futures have already been seen in Python as part of a popular Python cookbook recipe [2] and have been discussed on the Python-3000 mailing list [3].

The proposed design is explicit, i.e. it requires that clients be aware that they are consuming Futures. It would be possible to design a module that would return proxy objects (in the style of weakref) that could be used transparently. It is possible to build a proxy implementation on top of the proposed explicit mechanism.

The proposed design does not introduce any changes to Python language syntax or semantics. Special syntax could be introduced [4] to mark function and method calls as asynchronous. A proxy result would be returned while the operation is eagerly evaluated asynchronously, and execution would only block if the proxy object were used before the operation completed.

Anh Hai Trinh proposed a simpler but more limited API concept [5] and the API has been discussed in some detail on stdlib-sig [6].

The proposed design was discussed on the Python-Dev mailing list [7]. Following those discussions, the following changes were made:

  • The Executor class was made into an abstract base class
  • The Future.remove_done_callback method was removed due to a lack of convincing use cases
  • The Future.add_done_callback method was modified to allow the same callable to be added many times
  • The Future class's mutation methods were better documented to indicate that they are private to the Executor that created them

Reference Implementation

The reference implementation [8] contains a complete implementation of the proposed design. It has been tested on Linux and Mac OS X.

References

[1]java.util.concurrent package documentation http://java.sun.com/j2se/1.5.0/docs/api/java/util/concurrent/package-summary.html
[2]Python Cookbook recipe 84317, "Easy threading with Futures" http://code.activestate.com/recipes/84317/
[3]Python-3000 thread, "mechanism for handling asynchronous concurrency" http://mail.python.org/pipermail/python-3000/2006-April/000960.html
[4]Python 3000 thread, "Futures in Python 3000 (was Re: mechanism for handling asynchronous concurrency)" http://mail.python.org/pipermail/python-3000/2006-April/000970.html
[5]A discussion of stream, a similar concept proposed by Anh Hai Trinh http://www.mail-archive.com/stdlib-sig@python.org/msg00480.html
[6]A discussion of the proposed API on stdlib-sig http://mail.python.org/pipermail/stdlib-sig/2009-November/000731.html
[7]A discussion of the PEP on python-dev http://mail.python.org/pipermail/python-dev/2010-March/098169.html
[8]Reference futures implementation http://code.google.com/p/pythonfutures/source/browse/#svn/branches/feedback

pep-3149 ABI version tagged .so files

PEP:3149
Title:ABI version tagged .so files
Version:$Revision$
Last-Modified:$Date$
Author:Barry Warsaw <barry at python.org>
Status:Final
Type:Standards Track
Content-Type:text/x-rst
Created:2010-07-09
Python-Version:3.2
Post-History:2010-07-14, 2010-07-22
Resolution:http://mail.python.org/pipermail/python-dev/2010-September/103408.html

Abstract

PEP 3147 [1] described an extension to Python's import machinery that improved the sharing of Python source code, by allowing more than one byte compilation file (.pyc) to be co-located with each source file.

This PEP defines an adjunct feature which allows the co-location of extension module files (.so) in a similar manner. This optional, build-time feature will enable downstream distributions of Python to more easily provide more than one Python major version at a time.

Background

PEP 3147 defined the file system layout for a pure-Python package, where multiple versions of Python are available on the system. For example, where the alpha package containing source modules one.py and two.py exist on a system with Python 3.2 and 3.3, the post-byte compilation file system layout would be:

alpha/
    __pycache__/
        __init__.cpython-32.pyc
        __init__.cpython-33.pyc
        one.cpython-32.pyc
        one.cpython-33.pyc
        two.cpython-32.pyc
        two.cpython-33.pyc
    __init__.py
    one.py
    two.py

For packages with extension modules, a similar differentiation is needed for the module's .so files. Extension modules compiled for different Python major versions are incompatible with each other due to changes in the ABI. Different configuration/compilation options for the same Python version can result in different ABIs (e.g. --with-wide-unicode).

While PEP 384 [2] defines a stable ABI, it will minimize, but not eliminate extension module incompatibilities between Python builds or major versions. Thus a mechanism for discriminating extension module file names is proposed.

Rationale

Linux distributions such as Ubuntu [3] and Debian [4] provide more than one Python version at the same time to their users. For example, Ubuntu 9.10 Karmic Koala users can install Python 2.5, 2.6, and 3.1, with Python 2.6 being the default.

In order to share as much as possible between the available Python versions, these distributions install third party package modules (.pyc and .so files) into /usr/share/pyshared and symlink to them from /usr/lib/pythonX.Y/dist-packages. The symlinks exist because in a pre-PEP 3147 world (i.e. < Python 3.2), the .pyc files resulting from byte compilation by the various installed Pythons would name-collide with each other. For Python versions >= 3.2, all pure-Python packages can be shared, because the .pyc files will no longer cause file system naming conflicts. Eliminating these symlinks makes for a simpler, more robust Python distribution.

A similar situation arises with shared library extensions. Because extension modules are typically named foo.so for a foo extension module, these would also name collide if foo was provided for more than one Python version.

In addition, different configuration/compilation options for the same Python version can cause different ABIs to be presented to extension modules. On POSIX systems, for example, the configure options --with-pydebug, --with-pymalloc, and --with-wide-unicode all change the ABI. This PEP proposes to encode build-time options in the file name of the .so extension module files.

PyPy [5] can also benefit from this PEP, allowing it to avoid name collisions in extension modules built for its API, but with a different .so tag.

Proposal

The configure/compilation options chosen at Python interpreter build-time will be encoded in the shared library file name for extension modules. This "tag" will appear between the module base name and the operating system's file name extension for shared libraries.

The following information MUST be included in the shared library file name:

  • The Python implementation (e.g. cpython, pypy, jython, etc.)
  • The interpreter's major and minor version numbers

These two fields are separated by a hyphen and no dots are to appear between the major and minor version numbers. E.g. cpython-32.

Python implementations MAY include additional flags in the file name tag as appropriate. For example, on POSIX systems these flags will also contribute to the file name:

  • --with-pydebug (flag: d)
  • --with-pymalloc (flag: m)
  • --with-wide-unicode (flag: u)

By default in Python 3.2, configure enables --with-pymalloc so shared library file names would appear as foo.cpython-32m.so. When the other two flags are also enabled, the file names would be foo.cpython-32dmu.so.

The shared library file name tag is used unconditionally; it cannot be changed. The tag and extension module suffix are available through the sysconfig module via the following variables:

>>> sysconfig.get_config_var('EXT_SUFFIX')
'.cpython-32mu.so'
>>> sysconfig.get_config_var('SOABI')
'cpython-32mu'

Note that $SOABI contains just the tag, while $EXT_SUFFIX includes the platform extension for shared library files, and is the exact suffix added to the extension module name.
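Given those variables, a build tool can derive the fully tagged file name for an extension module by simple concatenation (a sketch; the exact suffix varies by interpreter, version, build options, and platform):

```python
import sysconfig

# On the build described above this would be '.cpython-32mu.so';
# on other builds the tag and platform extension differ.
ext_suffix = sysconfig.get_config_var('EXT_SUFFIX')

# File name the dynamic loader will look for when importing 'foo'.
tagged_name = 'foo' + ext_suffix
```

Because $EXT_SUFFIX already includes both the tag and the platform extension, no further string assembly is needed.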

For an arbitrary package foo, you might see these files when the distribution package was installed:

/usr/lib/python/foo.cpython-32m.so
/usr/lib/python/foo.cpython-33m.so

(These paths are for example purposes only. Distributions are free to use whatever filesystem layout they choose, and nothing in this PEP changes the locations where from-source builds of Python are installed.)

Python's dynamic module loader will recognize and import shared library extension modules with a tag that matches its build-time options. For backward compatibility, Python will also continue to import untagged extension modules, e.g. foo.so.

This shared library tag would be used globally for all distutils-based extension modules, regardless of where on the file system they are built. Extension modules built by means other than distutils would either have to calculate the tag manually, or fall back to the non-tagged .so file name.

Proven approach

The approach described here is already proven, in a sense, on Debian and Ubuntu systems, where different extensions are used for debug builds of Python and extension modules. Debug builds on Windows also already use a different file extension for dynamic libraries, and in fact encode (in a different way than proposed in this PEP) the Python major and minor version in the .dll file name.

Windows

This PEP only addresses build issues on POSIX systems that use the configure script. While Windows or other platform support is not explicitly disallowed under this PEP, platform expertise is needed in order to evaluate, describe, and implement support on such platforms. It is not currently clear that the facilities in this PEP are even useful for Windows.

PEP 384

PEP 384 defines a stable ABI for extension modules. In theory, universal adoption of PEP 384 would eliminate the need for this PEP because all extension modules could be compatible with any Python version. In practice of course, it will be impossible to achieve universal adoption, and as described above, different built-time flags still affect the ABI. Thus even with a stable ABI, this PEP may still be necessary. While a complete specification is reserved for PEP 384, here is a discussion of the relevant issues.

PEP 384 describes a change to PyModule_Create() where 3 is passed as the API version if the extension was compiled with Py_LIMITED_API. This should be formalized into an official macro called PYTHON_ABI_VERSION to mirror PYTHON_API_VERSION. If and when the ABI changes in an incompatible way, this version number would be bumped. To facilitate sharing, Python would be extended to search for extension modules with the PYTHON_ABI_VERSION number in its name. The prefix abi is reserved for Python's use.

Thus, an initial implementation of PEP 384, when Python is configured with the default set of flags, would search for the following file names when extension module foo is imported (in this order):

foo.cpython-XYm.so
foo.abi3.so
foo.so

The distutils [6] build_ext command would also have to be extended to compile to shared library files with the abi3 tag, when the module author indicates that their extension supports that version of the ABI. This could be done in a backward compatible way by adding a keyword argument to the Extension class, such as:

Extension('foo', ['foo.c'], abi=3)

Martin v. Löwis describes his thoughts [7] about the applicability of this PEP to PEP 384. In summary:

  • --with-pydebug would not be supported by the stable ABI because this changes the layout of PyObject, which is an exposed structure.
  • --with-pymalloc has no bearing on the issue.
  • --with-wide-unicode is trickier, though Martin's inclination is to force the stable ABI to use a Py_UNICODE that matches the platform's wchar_t.

Alternatives

In the initial python-dev thread [8] where this idea was first introduced, several alternatives were suggested. For completeness they are listed here, along with the reasons for not adopting them.

Don't share packages with extension modules

It has been suggested that Python packages with extension modules not be shared among all supported Python versions on a distribution. Even with adoption of PEP 3149, extension modules will have to be compiled for every supported Python version, so perhaps sharing of such packages isn't useful anyway. Not sharing packages with extensions though is infeasible for several reasons.

If a pure-Python package is shared in one version, should it suddenly be not-shared if the next release adds an extension module for speed? Also, even though all extension shared libraries will be compiled and distributed once for every supported Python, there's a big difference between duplicating the .so files and duplicating all .py files. The extra size increases the download time for such packages, and more immediately, increases the space pressures on already constrained distribution CD-ROMs.

Reference implementation

Work on this code is tracked in a Bazaar branch on Launchpad [9] until it's ready for merge into Python 3.2. The work-in-progress diff can also be viewed [10] and is updated automatically as new changes are uploaded.

pep-3150 Statement local namespaces (aka "given" clause)

PEP:3150
Title:Statement local namespaces (aka "given" clause)
Version:$Revision$
Last-Modified:$Date$
Author:Nick Coghlan <ncoghlan at gmail.com>
Status:Deferred
Type:Standards Track
Content-Type:text/x-rst
Created:2010-07-09
Python-Version:3.4
Post-History:2010-07-14, 2011-04-21, 2011-06-13
Resolution:TBD

Abstract

This PEP proposes the addition of an optional given clause to several Python statements that do not currently have an associated code suite. This clause will create a statement local namespace for additional names that are accessible in the associated statement, but do not become part of the containing namespace.

Adoption of a new symbol, ?, is proposed to denote a forward reference to the namespace created by running the associated code suite. It will be a reference to a types.SimpleNamespace object.

The primary motivation is to enable a more declarative style of programming, where the operation to be performed is presented to the reader first, and the details of the necessary subcalculations are presented in the following indented suite. As a key example, this would elevate ordinary assignment statements to be on par with class and def statements where the name of the item to be defined is presented to the reader in advance of the details of how the value of that item is calculated. It also allows named functions to be used in a "multi-line lambda" fashion, where the name is used solely as a placeholder in the current expression and then defined in the following suite.

A secondary motivation is to simplify interim calculations in module and class level code without polluting the resulting namespaces.

The intent is that the relationship between a given clause and a separate function definition that performs the specified operation will be similar to the existing relationship between an explicit while loop and a generator that produces the same sequence of operations as that while loop.

The specific proposal in this PEP has been informed by various explorations of this and related concepts over the years (e.g. [1], [2], [3], [6], [8]), and is inspired to some degree by the where and let clauses in Haskell. It avoids some problems that have been identified in past proposals, but has not yet itself been subject to the test of implementation.

Proposal

This PEP proposes the addition of an optional given clause to the syntax for simple statements which may contain an expression, or may substitute for such a statement for purely syntactic purposes. The current list of simple statements that would be affected by this addition is as follows:

  • expression statement
  • assignment statement
  • augmented assignment statement
  • del statement
  • return statement
  • yield statement
  • raise statement
  • assert statement
  • pass statement

The given clause would allow subexpressions to be referenced by name in the header line, with the actual definitions following in the indented clause. As a simple example:

sorted_data = sorted(data, key=?.sort_key) given:
    def sort_key(item):
        return item.attr1, item.attr2

The new symbol ? is used to refer to the given namespace. It would be a types.SimpleNamespace instance, so ?.sort_key functions as a forward reference to a name defined in the given clause.

A docstring would be permitted in the given clause, and would be attached to the result namespace as its __doc__ attribute.
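
Since the given clause is proposed syntax, its namespace semantics can be approximated in current Python with an immediately invoked helper returning a types.SimpleNamespace (a hand-written emulation; _given and _ns are illustrative names, not part of the proposal):

```python
import types

def _given():
    # Body of the hypothetical given clause
    def sort_key(item):
        return item.attr1, item.attr2
    return types.SimpleNamespace(**locals())

_ns = _given()
data = [types.SimpleNamespace(attr1=2, attr2=0),
        types.SimpleNamespace(attr1=1, attr2=3)]
sorted_data = sorted(data, key=_ns.sort_key)
# sort_key is reachable only through the namespace object; it never
# becomes a bare name in the surrounding scope.
```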

The pass statement is included to provide a consistent way to skip inclusion of a meaningful expression in the header line. While this is not an intended use case, it isn't one that can be prevented as multiple alternatives (such as ... and ()) remain available even if pass itself is disallowed.

The body of the given clause will execute in a new scope, using normal function closure semantics. To support early binding of loop variables and global references, as well as to allow access to other names defined at class scope, the given clause will also allow explicit binding operations in the header line:

# Explicit early binding via given clause
seq = []
for i in range(10):
    seq.append(?.f) given i=i in:
        def f():
            return i
assert [f() for f in seq] == list(range(10))
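
The early binding shown above is what today's default-argument idiom achieves: capturing the loop variable's current value at definition time, which is exactly the workaround the explicit `given i=i in:` binding is meant to replace:

```python
# Current-Python equivalent of the explicit early binding example
seq = []
for i in range(10):
    def f(i=i):  # default value captures i at definition time
        return i
    seq.append(f)

assert [f() for f in seq] == list(range(10))
```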

Semantics

The following statement:

op(?.f, ?.g) given bound_a=a, bound_b=b in:
    def f():
        return bound_a + bound_b
    def g():
        return bound_a - bound_b

Would be roughly equivalent to the following code (__var denotes a hidden compiler variable or simply an entry on the interpreter stack):

__arg1 = a
__arg2 = b
def __scope(bound_a, bound_b):
    def f():
        return bound_a + bound_b
    def g():
        return bound_a - bound_b
    return types.SimpleNamespace(**locals())
__ref = __scope(__arg1, __arg2)
__ref.__doc__ = __scope.__doc__
op(__ref.f, __ref.g)

A given clause is essentially a nested function which is created and then immediately executed. Unless explicitly passed in, names are looked up using normal scoping rules, and thus names defined at class scope will not be visible. Names declared as forward references are returned and used in the header statement, without being bound locally in the surrounding namespace.

Syntax Change

Current:

expr_stmt: testlist_star_expr (augassign (yield_expr|testlist) |
             ('=' (yield_expr|testlist_star_expr))*)
del_stmt: 'del' exprlist
pass_stmt: 'pass'
return_stmt: 'return' [testlist]
yield_stmt: yield_expr
raise_stmt: 'raise' [test ['from' test]]
assert_stmt: 'assert' test [',' test]

New:

expr_stmt: testlist_star_expr (augassign (yield_expr|testlist) |
             ('=' (yield_expr|testlist_star_expr))*) [given_clause]
del_stmt: 'del' exprlist [given_clause]
pass_stmt: 'pass' [given_clause]
return_stmt: 'return' [testlist] [given_clause]
yield_stmt: yield_expr [given_clause]
raise_stmt: 'raise' [test ['from' test]] [given_clause]
assert_stmt: 'assert' test [',' test] [given_clause]
given_clause: "given" [(NAME '=' test)+ "in"] ":" suite

(Note that expr_stmt in the grammar is a slight misnomer, as it covers assignment and augmented assignment in addition to simple expression statements)

Note

These proposed grammar changes don't yet cover the forward reference expression syntax for accessing names defined in the statement local namespace.

The new clause is added as an optional element of the existing statements rather than as a new kind of compound statement in order to avoid creating an ambiguity in the grammar. It is applied only to the specific elements listed so that nonsense like the following is disallowed:

break given:
    a = b = 1

import sys given:
    a = b = 1

However, the precise Grammar change described above is inadequate, as it creates problems for the definition of simple_stmt (which allows chaining of multiple single line statements with ";" rather than "\n").

So the above syntax change should instead be taken as a statement of intent. Any actual proposal would need to resolve the simple_stmt parsing problem before it could be seriously considered. This would likely require a non-trivial restructuring of the grammar, breaking up small_stmt and flow_stmt to separate the statements that potentially contain arbitrary subexpressions and then allowing a single one of those statements with a given clause at the simple_stmt level. Something along the lines of:

stmt: simple_stmt | given_stmt | compound_stmt
simple_stmt: small_stmt (';' (small_stmt | subexpr_stmt))* [';'] NEWLINE
small_stmt: (pass_stmt | flow_stmt | import_stmt |
             global_stmt | nonlocal_stmt)
flow_stmt: break_stmt | continue_stmt
given_stmt: subexpr_stmt (given_clause |
              (';' (small_stmt | subexpr_stmt))* [';']) NEWLINE
subexpr_stmt: expr_stmt | del_stmt | flow_subexpr_stmt | assert_stmt
flow_subexpr_stmt: return_stmt | raise_stmt | yield_stmt
given_clause: "given" (NAME '=' test)* ":" suite

For reference, here are the current definitions at that level:

stmt: simple_stmt | compound_stmt
simple_stmt: small_stmt (';' small_stmt)* [';'] NEWLINE
small_stmt: (expr_stmt | del_stmt | pass_stmt | flow_stmt |
             import_stmt | global_stmt | nonlocal_stmt | assert_stmt)
flow_stmt: break_stmt | continue_stmt | return_stmt | raise_stmt | yield_stmt

In addition to the above changes, the definition of atom would be changed to also allow ?. The restriction of this usage to statements with an associated given clause would be handled by a later stage of the compilation process (likely AST construction, which already enforces other restrictions where the grammar is overly permissive in order to simplify the initial parsing step).

New PEP 8 Guidelines

As discussed on python-ideas ([7], [9]) new PEP 8 guidelines would also need to be developed to provide appropriate direction on when to use the given clause over ordinary variable assignments.

Based on the similar guidelines already present for try statements, this PEP proposes the following additions for given statements to the "Programming Conventions" section of PEP 8:

  • for code that could reasonably be factored out into a separate function, but is not currently reused anywhere, consider using a given clause. This clearly indicates which variables are being used only to define subcomponents of another statement rather than to hold algorithm or application state. This is an especially useful technique when passing multi-line functions to operations which take callable arguments.
  • keep given clauses concise. If they become unwieldy, either break them up into multiple steps or else move the details into a separate function.

Rationale

Function and class statements in Python have a unique property relative to ordinary assignment statements: to some degree, they are declarative. They present the reader of the code with some critical information about a name that is about to be defined, before proceeding on with the details of the actual definition in the function or class body.

The name of the object being declared is the first thing stated after the keyword. Other important information is also given the honour of preceding the implementation details:

  • decorators (which can greatly affect the behaviour of the created object, and were placed ahead of even the keyword and name as a matter of practicality more so than aesthetics)
  • the docstring (on the first line immediately following the header line)
  • parameters, default values and annotations for function definitions
  • parent classes, metaclass and optionally other details (depending on the metaclass) for class definitions

This PEP proposes to make a similar declarative style available for arbitrary assignment operations, by permitting the inclusion of a "given" suite following any simple assignment statement:

TARGET = [TARGET2 = ... TARGETN =] EXPR given:
    SUITE

By convention, code in the body of the suite should be oriented solely towards correctly defining the assignment operation carried out in the header line. The header line operation should also be adequately descriptive (e.g. through appropriate choices of variable names) to give a reader a reasonable idea of the purpose of the operation without reading the body of the suite.

However, while they are the initial motivating use case, limiting this feature solely to simple assignments would be overly restrictive. Once the feature is defined at all, it would be quite arbitrary to prevent its use for augmented assignments, return statements, yield expressions, comprehensions and arbitrary expressions that may modify the application state.

The given clause may also function as a more readable alternative to some uses of lambda expressions and similar constructs when passing one-off functions to operations like sorted() or in callback based event-driven programming.

In module and class level code, the given clause will serve as a clear and reliable replacement for usage of the del statement to keep interim working variables from polluting the resulting namespace.

One potentially useful way to think of the proposed clause is as a middle ground between conventional in-line code and separation of an operation out into a dedicated function, just as an inline while loop may eventually be factored out into a dedicated generator.

Design Discussion

Keyword Choice

This proposal initially used where based on the name of a similar construct in Haskell. However, it has been pointed out that there are existing Python libraries (such as Numpy [4]) that already use where in the SQL query condition sense, making that keyword choice potentially confusing.

While given may also be used as a variable name (and hence would need to be introduced via the usual __future__ dance for new keywords), it is associated much more strongly with the desired "here are some extra variables this expression may use" semantics for the new clause.

Reusing the with keyword has also been proposed. This has the advantage of avoiding the addition of a new keyword, but also has a high potential for confusion as the with clause and with statement would look similar but do completely different things. That way lies C++ and Perl :)

Relation to PEP 403

PEP 403 (General Purpose Decorator Clause) attempts to achieve the main goals of this PEP using a less radical language change inspired by the existing decorator syntax.

Despite having the same author, the two PEPs are in direct competition with each other. PEP 403 represents a minimalist approach that attempts to achieve useful functionality with a minimum of change from the status quo. This PEP instead aims for a more flexible standalone statement design, which requires a larger degree of change to the language.

Note that where PEP 403 is better suited to explaining the behaviour of generator expressions correctly, this PEP is better able to explain the behaviour of decorator clauses in general. Both PEPs support adequate explanations for the semantics of container comprehensions.

Explaining Container Comprehensions and Generator Expressions

One interesting feature of the proposed construct is that it can be used as a primitive to explain the scoping and execution order semantics of container comprehensions:

seq2 = [x for x in y if q(x) for y in seq if p(y)]

# would be equivalent to

seq2 = ?.result given seq=seq:
    result = []
    for y in seq:
        if p(y):
            for x in y:
                if q(x):
                    result.append(x)

The important point in this expansion is that it explains why comprehensions appear to misbehave at class scope: only the outermost iterator is evaluated at class scope, while all predicates, nested iterators and value expressions are evaluated inside a nested scope.
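
This class-scope behaviour is directly observable in current Python: the outermost iterable is evaluated where the class body runs, while the rest of the comprehension executes in a nested scope that cannot see class-level names:

```python
class C:
    values = [1, 2, 3]
    # OK: "values" is the outermost iterable, evaluated at class scope
    doubled = [v * 2 for v in values]
    try:
        # Fails: the comprehension body runs in a nested scope, where
        # the class-level name "values" is not visible.
        bad = [values[i] for i in range(3)]
    except NameError:
        bad = "NameError"

assert C.doubled == [2, 4, 6]
assert C.bad == "NameError"
```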

Note that, unlike PEP 403, the current version of this PEP cannot provide a precisely equivalent expansion for a generator expression. The closest it can get is to define an additional level of scoping:

seq2 = ?.g(seq) given:
    def g(seq):
        for y in seq:
            if p(y):
                for x in y:
                    if q(x):
                        yield x

This limitation could be remedied by permitting the given clause to be a generator function, in which case ? would refer to a generator-iterator object rather than a simple namespace:

seq2 = ? given seq=seq in:
    for y in seq:
        if p(y):
            for x in y:
                if q(x):
                    yield x

However, this would make the meaning of "?" quite ambiguous, even more so than is already the case for the meaning of def statements (which will usually have a docstring indicating whether or not a function definition is actually a generator).

Explaining Decorator Clause Evaluation and Application

The standard explanation of decorator clause evaluation and application has to deal with the idea of hidden compiler variables in order to show steps in their order of execution. The given statement allows a decorated function definition like:

@classmethod
def classname(cls):
    return cls.__name__

To instead be explained as roughly equivalent to:

classname = ?.d1(classname) given:
    d1 = classmethod
    def classname(cls):
        return cls.__name__
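
The expansion can be checked in current Python by writing the hidden scope as an ordinary helper function that is called and then deleted (a rough emulation; _scope is an illustrative name for the compiler's hidden machinery):

```python
class Example:
    def _scope():
        d1 = classmethod  # the "decorator expression" step
        def classname(cls):
            return cls.__name__
        return d1(classname)  # decorator applied after the def

    classname = _scope()
    del _scope  # keep the helper out of the class namespace

assert Example.classname() == "Example"
```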

Anticipated Objections

Two Ways To Do It

A lot of code may now be written with values defined either before the expression where they are used or afterwards in a given clause, creating two ways to do it, perhaps without an obvious way of choosing between them.

On reflection, I feel this is a misapplication of the "one obvious way" aphorism. Python already offers lots of ways to write code. We can use a for loop or a while loop, a functional style or an imperative style or an object oriented style. The language, in general, is designed to let people write code that matches the way they think. Since different people think differently, the way they write their code will change accordingly.

Such stylistic questions in a code base are rightly left to the development group responsible for that code. When does an expression get so complicated that the subexpressions should be taken out and assigned to variables, even though those variables are only going to be used once? When should an inline while loop be replaced with a generator that implements the same logic? Opinions differ, and that's OK.

However, explicit PEP 8 guidance will be needed for CPython and the standard library, and that is discussed in the proposal above.

Out of Order Execution

The given clause makes execution jump around a little strangely, as the body of the given clause is executed before the simple statement in the clause header. The closest any other part of Python comes to this is the out of order evaluation in list comprehensions, generator expressions and conditional expressions and the delayed application of decorator functions to the function they decorate (the decorator expressions themselves are executed in the order they are written).

While this is true, the syntax is intended for cases where people are themselves thinking about a problem out of sequence (at least as far as the language is concerned). As an example of this, consider the following thought in the mind of a Python user:

I want to sort the items in this sequence according to the values of attr1 and attr2 on each item.

If they're comfortable with Python's lambda expressions, then they might choose to write it like this:

sorted_list = sorted(original, key=lambda v: (v.attr1, v.attr2))

That gets the job done, but it hardly reaches the standard of executable pseudocode that fits Python's reputation.

If they don't like lambda specifically, the operator module offers an alternative that still allows the key function to be defined inline:

sorted_list = sorted(original,
                     key=operator.attrgetter('attr1', 'attr2'))

Again, it gets the job done, but even the most generous of readers would not consider that to be "executable pseudocode".

If they think both of the above options are ugly and confusing, or they need logic in their key function that can't be expressed as an expression (such as catching an exception), then Python currently forces them to reverse the order of their original thought and define the sorting criteria first:

def sort_key(item):
    return item.attr1, item.attr2

sorted_list = sorted(original, key=sort_key)

"Just define a function" has been the rote response to requests for multi-line lambda support for years. As with the above options, it gets the job done, but it really does represent a break between what the user is thinking and what the language allows them to express.

I believe the proposal in this PEP would finally let Python get close to the "executable pseudocode" bar for the kind of thought expressed above:

sorted_list = sorted(original, key=?.key) given:
    def key(item):
        return item.attr1, item.attr2

Everything is in the same order as it was in the user's original thought, and they don't even need to come up with a name for the sorting criteria: it is possible to reuse the keyword argument name directly.

A possible enhancement to this proposal would be to provide a convenient shorthand syntax to say "use the given clause contents as keyword arguments". Even without dedicated syntax, that can be written simply as **vars(?).
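
The **vars(...) spelling already works on a types.SimpleNamespace in current Python, since vars() returns the namespace's attributes as a dict:

```python
import types

# A namespace standing in for the result of a hypothetical given clause
ns = types.SimpleNamespace(key=len, reverse=True)
words = ["bb", "a", "ccc"]

# Forward the namespace contents as keyword arguments
assert sorted(words, **vars(ns)) == ["ccc", "bb", "a"]
```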

Harmful to Introspection

Poking around in module and class internals is an invaluable tool for white-box testing and interactive debugging. The given clause will be quite effective at preventing access to temporary state used during calculations (although no more so than current usage of del statements in that regard).

While this is a valid concern, design for testability is an issue that cuts across many aspects of programming. If a component needs to be tested independently, then a given statement should be refactored into separate statements so that information is exposed to the test suite. This isn't significantly different from refactoring an operation hidden inside a function or generator out into its own function purely to allow it to be tested in isolation.

Lack of Real World Impact Assessment

The examples in the current PEP are almost all relatively small "toy" examples. The proposal in this PEP needs to be subjected to the test of application to a large code base (such as the standard library or a large Twisted application) in a search for examples where the readability of real world code is genuinely enhanced.

This is more of a deficiency in the PEP than in the idea, though. If it wasn't a real world problem, we wouldn't get so many complaints about the lack of multi-line lambda support and Ruby's block construct probably wouldn't be quite so popular.

Open Questions

Syntax for Forward References

The ? symbol is proposed for forward references to the given namespace as it is short, currently unused and suggests "there's something missing here that will be filled in later".

The proposal in the PEP doesn't neatly parallel any existing Python feature, so reusing an already used symbol has been deliberately avoided.

Handling of nonlocal and global

nonlocal and global are explicitly disallowed in the given clause suite and will be syntax errors if they occur. They will work normally if they appear within a def statement within that suite.

Alternatively, they could be defined as operating as if the anonymous functions were defined as in the expansion above.

Handling of break and continue

break and continue will operate as if the anonymous functions were defined as in the expansion above. They will be syntax errors if they occur in the given clause suite but will work normally if they appear within a for or while loop as part of that suite.

Handling of return and yield

return and yield are explicitly disallowed in the given clause suite and will be syntax errors if they occur. They will work normally if they appear within a def statement within that suite.

Examples

Defining callbacks for event driven programming:

# Current Python (definition before use)
def cb(sock):
    # Do something with socket
def eb(exc):
    logging.exception(
        "Failed connecting to %s:%s", host, port)
loop.create_connection((host, port), cb, eb)

# Becomes:
loop.create_connection((host, port), ?.cb, ?.eb) given:
    def cb(sock):
        # Do something with socket
    def eb(exc):
        logging.exception(
            "Failed connecting to %s:%s", host, port)

Defining "one-off" classes which typically only have a single instance:

# Current Python (instantiation after definition)
class public_name():
  ... # However many lines
public_name = public_name(*params)

# Current Python (custom decorator)
def singleton(*args, **kwds):
    def decorator(cls):
        return cls(*args, **kwds)
    return decorator

@singleton(*params)
class public_name():
  ... # However many lines

# Becomes:
public_name = ?.MeaningfulClassName(*params) given:
  class MeaningfulClassName():
    ... # Should trawl the stdlib for an example of doing this

Calculating attributes without polluting the local namespace (from os.py):

# Current Python (manual namespace cleanup)
def _createenviron():
  ... # 27 line function

environ = _createenviron()
del _createenviron

# Becomes:
environ = ?._createenviron() given:
    def _createenviron():
      ... # 27 line function

Replacing default argument hack (from functools.lru_cache):

# Current Python (default argument hack)
def decorating_function(user_function,
               tuple=tuple, sorted=sorted, len=len, KeyError=KeyError):
  ... # 60 line function
return decorating_function

# Becomes:
return ?.decorating_function given:
  # Cell variables rather than locals, but should give similar speedup
  tuple, sorted, len, KeyError = tuple, sorted, len, KeyError
  def decorating_function(user_function):
    ... # 60 line function

# This example also nicely makes it clear that there is nothing in the
# function after the nested function definition. Due to additional
# nested functions, that isn't entirely clear in the current code.

Possible Additions

  • The current proposal allows the addition of a given clause only for simple statements. Extending the idea to allow the use of compound statements would be quite possible (by appending the given clause as an independent suite at the end), but doing so raises serious readability concerns (as values defined in the given clause may be used well before they are defined, exactly the kind of readability trap that other features like decorators and with statements are designed to eliminate)
  • The "explicit early binding" variant may be applicable to the discussions on python-ideas on how to eliminate the default argument hack. A given clause in the header line for functions (after the return type annotation) may be the answer to that question.

Rejected Alternatives

  • An earlier version of this PEP allowed implicit forward references to the names in the trailing suite, and also used implicit early binding semantics. Both of these ideas substantially complicated the proposal without providing a sufficient increase in expressive power. The current proposal, with explicit forward references and early binding, brings the new construct into line with existing scoping semantics, greatly improving the chances the idea can actually be implemented.
  • In addition to the proposals made here, there have also been suggestions of two suite "in-order" variants which provide the limited scoping of names without supporting out-of-order execution. I believe these suggestions largely miss the point of what people are complaining about when they ask for multi-line lambda support - it isn't that coming up with a name for the subexpression is especially difficult, it's that naming the function before the statement that uses it means the code no longer matches the way the developer thinks about the problem at hand.
  • I've made some unpublished attempts to allow direct references to the closure implicitly created by the given clause, while still retaining the general structure of the syntax as defined in this PEP (For example, allowing a subexpression like ?given or :given to be used in expressions to indicate a direct reference to the implied closure, thus preventing it from being called automatically to create the local namespace). All such attempts have appeared unattractive and confusing compared to the simpler decorator-inspired proposal in PEP 403.

Reference Implementation

None as yet. If you want a crash course in Python namespace semantics and code compilation, feel free to try ;)

TO-DO

  • Mention PEP 359 and possible uses for locals() in the given clause
  • Figure out if this can be used internally to make the implementation of zero-argument super() calls less awful

pep-3151 Reworking the OS and IO exception hierarchy

PEP:3151
Title:Reworking the OS and IO exception hierarchy
Version:$Revision$
Last-Modified:$Date$
Author:Antoine Pitrou <solipsis at pitrou.net>
BDFL-Delegate:Barry Warsaw
Status:Final
Type:Standards Track
Content-Type:text/x-rst
Created:2010-07-21
Python-Version:3.3
Post-History:
Resolution:http://mail.python.org/pipermail/python-dev/2011-October/114033.html

Abstract

The standard exception hierarchy is an important part of the Python language. It has two defining qualities: it is both generic and selective. Generic in that the same exception type can be raised - and handled - regardless of the context (for example, whether you are trying to add something to an integer, to call a string method, or to write an object on a socket, a TypeError will be raised for bad argument types). Selective in that it allows the user to easily handle (silence, examine, process, store or encapsulate...) specific kinds of error conditions while letting other errors bubble up to higher calling contexts. For example, you can choose to catch ZeroDivisionErrors without affecting the default handling of other ArithmeticErrors (such as OverflowErrors).

This PEP proposes changes to a part of the exception hierarchy in order to better embody the qualities mentioned above: the errors related to operating system calls (OSError, IOError, mmap.error, select.error, and all their subclasses).

Rationale

Lack of fine-grained exceptions

The current variety of OS-related exceptions doesn't allow the user to filter easily for the desired kinds of failures. As an example, consider the task of deleting a file if it exists. The Look Before You Leap (LBYL) idiom suffers from an obvious race condition:

if os.path.exists(filename):
    os.remove(filename)

If a file named as filename is created by another thread or process between the calls to os.path.exists and os.remove, it won't be deleted. This can produce bugs in the application, or even security issues.

Therefore, the solution is to try to remove the file, and ignore the error if the file doesn't exist (an idiom known as Easier to Ask Forgiveness than Permission, or EAFP). Careful code will read like the following (which works under both POSIX and Windows systems):

try:
    os.remove(filename)
except OSError as e:
    if e.errno != errno.ENOENT:
        raise

or even:

try:
    os.remove(filename)
except EnvironmentError as e:
    if e.errno != errno.ENOENT:
        raise

This is a lot more to type, and also forces the user to remember the various cryptic mnemonics from the errno module. It imposes an additional cognitive burden and gets tiresome rather quickly. Consequently, many programmers will instead write the following code, which silences exceptions too broadly:

try:
    os.remove(filename)
except OSError:
    pass

os.remove can raise an OSError not only when the file doesn't exist, but in other possible situations (for example, the filename points to a directory, or the current process doesn't have permission to remove the file), which all indicate bugs in the application logic and therefore shouldn't be silenced. What the programmer would like to write instead is something such as:

try:
    os.remove(filename)
except FileNotFoundError:
    pass
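On interpreters where this PEP is implemented (Python 3.3 and later), the desired snippet above runs as-is. A minimal self-contained demonstration, using a path inside a fresh temporary directory so the file is guaranteed not to exist:

```python
import os
import tempfile

# A path that does not exist inside a newly created temporary directory.
path = os.path.join(tempfile.mkdtemp(), "missing.txt")

try:
    os.remove(path)
except FileNotFoundError:
    # The file wasn't there; nothing to do, and no errno check is needed.
    pass
```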

Compatibility strategy

Reworking the exception hierarchy will obviously change the exact semantics of at least some existing code. While it is not possible to improve on the current situation without changing exact semantics, it is possible to define a narrower type of compatibility, which we will call useful compatibility.

For this we first must explain what we will call careful and careless exception handling. Careless (or "naïve") code is defined as code which blindly catches any of OSError, IOError, socket.error, mmap.error, WindowsError, select.error without checking the errno attribute. This is because such exception types are much too broad to signify anything. Any of them can be raised for error conditions as diverse as: a bad file descriptor (which will usually indicate a programming error), an unconnected socket (ditto), a socket timeout, a file type mismatch, an invalid argument, a transmission failure, insufficient permissions, a non-existent directory, a full filesystem, etc.

(Moreover, the use of some of these exceptions is inconsistent; Appendix B details the case of the select module, which raises different exceptions depending on the implementation.)

Careful code is defined as code which, when catching any of the above exceptions, examines the errno attribute to determine the actual error condition and takes action depending on it.

Then we can define useful compatibility as follows:

  • useful compatibility doesn't make exception catching any narrower, but it can be broader for careless exception-catching code. Given the following kind of snippet, all exceptions caught before this PEP will also be caught after this PEP, but the reverse may be false (because the coalescing of OSError, IOError and others means the except clause casts a slightly wider net):

    try:
        ...
        os.remove(filename)
        ...
    except OSError:
        pass
    
  • useful compatibility doesn't alter the behaviour of careful exception-catching code. Given the following kind of snippet, the same errors should be silenced or re-raised, regardless of whether this PEP has been implemented or not:

    try:
        os.remove(filename)
    except OSError as e:
        if e.errno != errno.ENOENT:
            raise
    

The rationale for this compromise is that careless code can't really be helped, but at least code which "works" won't suddenly raise errors and crash. This is important since such code is likely to be present in scripts used as cron tasks or automated system administration programs.

Careful code, on the other hand, should not be penalized. Actually, one purpose of this PEP is to ease writing careful code.

Step 1: coalesce exception types

The first step of the resolution is to coalesce existing exception types. The following changes are proposed:

  • alias both socket.error and select.error to OSError
  • alias mmap.error to OSError
  • alias both WindowsError and VMSError to OSError
  • alias IOError to OSError
  • coalesce EnvironmentError into OSError

None of these changes preserves exact compatibility, but each of them preserves useful compatibility (see the "Compatibility strategy" section above).

Each of these changes can be accepted or refused individually, but the greatest benefit is achieved if this first step is accepted in full. In that case, the IO exception sub-hierarchy would become:

+-- OSError   (replacing IOError, WindowsError, EnvironmentError, etc.)
    +-- io.BlockingIOError
    +-- io.UnsupportedOperation (also inherits from ValueError)
    +-- socket.gaierror
    +-- socket.herror
    +-- socket.timeout
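On Python 3.3 and later, where this coalescing was implemented, the aliases can be verified directly:

```python
import mmap
import select
import socket

# All the former OS/IO exception names are now aliases of OSError.
# (WindowsError remains an alias too, but only exists on Windows builds.)
assert IOError is OSError
assert EnvironmentError is OSError
assert socket.error is OSError
assert select.error is OSError
assert mmap.error is OSError
```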

Justification

Not only does this first step present the user with a simpler landscape, as explained in the rationale section, but it also allows for a better and more complete resolution of Step 2 (see Prerequisite).

The rationale for keeping OSError as the official name for generic OS-related exceptions is precisely that it is more generic than IOError, while EnvironmentError is more tedious to type and far less well-known.

The survey in Appendix B shows that IOError is the dominant error today in the standard library. As for third-party Python code, Google Code Search shows IOError being ten times more popular than EnvironmentError in user code, and three times more popular than OSError [3]. However, with no intention to deprecate IOError in the medium term, the lesser popularity of OSError is not a problem.

Exception attributes

Since WindowsError is coalesced into OSError, the latter gains a winerror attribute under Windows. It is set to None under situations where it is not meaningful, as is already the case with the errno, filename and strerror attributes (for example when OSError is raised directly by Python code).
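These attributes can be observed on a directly-constructed instance (EBADF is used here because it is not mapped to any proposed subclass); since winerror only exists on Windows builds, it is read defensively below:

```python
import errno

e = OSError(errno.EBADF, "Bad file descriptor")
assert e.errno == errno.EBADF
assert e.strerror == "Bad file descriptor"
assert e.filename is None  # no filename argument was passed
# winerror is only a real attribute on Windows builds of Python;
# on other platforms the lookup falls back to the default.
assert getattr(e, "winerror", None) is None
```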

Deprecation of names

The following paragraphs outline a possible deprecation strategy for old exception names. However, it has been decided to keep them as aliases for the time being. This decision could be revised in time for Python 4.0.

built-in exceptions

Deprecating the old built-in exceptions cannot be done in a straightforward fashion by intercepting all lookups in the builtins namespace, since these are performance-critical. We also cannot work at the object level, since the deprecated names will be aliased to non-deprecated objects.

A solution is to recognize these names at compilation time, and then emit a separate LOAD_OLD_GLOBAL opcode instead of the regular LOAD_GLOBAL. This specialized opcode will handle the output of a DeprecationWarning (or PendingDeprecationWarning, depending on the policy decided upon) when the name doesn't exist in the globals namespace, but only in the builtins one. This will be enough to avoid false positives (for example if someone defines their own OSError in a module), and false negatives will be rare (for example when someone accesses OSError through the builtins module rather than directly).

module-level exceptions

The above approach cannot be used easily, since it would require special-casing some modules when compiling code objects. However, these names are by construction much less visible (they don't appear in the builtins namespace), and lesser-known too, so we might decide to let them live in their own namespaces.

Step 2: define additional subclasses

The second step of the resolution is to extend the hierarchy by defining subclasses which will be raised, rather than their parent, for specific errno values. Exactly which errno values is subject to discussion, but a survey of existing exception matching practices (see Appendix A) helps us propose a reasonable subset of all values. Trying to map all errno mnemonics, indeed, seems foolish and pointless, and would pollute the root namespace.

Furthermore, in a couple of cases, different errno values could raise the same exception subclass. For example, EAGAIN, EALREADY, EWOULDBLOCK and EINPROGRESS are all used to signal that an operation on a non-blocking socket would block (and therefore needs trying again later). They could therefore all raise an identical subclass and let the user examine the errno attribute if (s)he so desires (see below "exception attributes").
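This coalescing can be observed with a non-blocking socket that has no data to read; the sketch below assumes a platform with socket.socketpair():

```python
import errno
import socket

a, b = socket.socketpair()
a.setblocking(False)
try:
    a.recv(1024)  # no data pending, so the call would block
except BlockingIOError as exc:
    # One exception class to catch, but the precise errno is still available.
    caught = exc.errno
finally:
    a.close()
    b.close()

assert caught in (errno.EAGAIN, errno.EWOULDBLOCK)
```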

Prerequisite

Step 1 is a loose prerequisite for this.

Prerequisite, because some errnos can currently be attached to different exception classes: for example, ENOENT can be attached to both OSError and IOError, depending on the context. If we don't want to break useful compatibility, we can't make an except OSError (or IOError) fail to match an exception where it would succeed today.

Loose, because we could decide for a partial resolution of step 2 if existing exception classes are not coalesced: for example, ENOENT could raise a hypothetical FileNotFoundError where an IOError was previously raised, but continue to raise OSError otherwise.

The dependency on step 1 could be totally removed if the new subclasses used multiple inheritance to match with all of the existing superclasses (or, at least, OSError and IOError, which are arguably the most prevalent ones). It would, however, make the hierarchy more complicated and therefore harder to grasp for the user.

New exception classes

The following tentative list of subclasses, along with a description and the list of errnos mapped to them, is submitted to discussion:

  • FileExistsError: trying to create a file or directory which already exists (EEXIST)
  • FileNotFoundError: for all circumstances where a file or directory is requested but doesn't exist (ENOENT)
  • IsADirectoryError: file-level operation (open(), os.remove()...) requested on a directory (EISDIR)
  • NotADirectoryError: directory-level operation requested on something else (ENOTDIR)
  • PermissionError: trying to run an operation without the adequate access rights - for example filesystem permissions (EACCES, EPERM)
  • BlockingIOError: an operation would block on an object (e.g. socket) set for non-blocking operation (EAGAIN, EALREADY, EWOULDBLOCK, EINPROGRESS); this is the existing io.BlockingIOError with an extended role
  • BrokenPipeError: trying to write on a pipe while the other end has been closed, or trying to write on a socket which has been shutdown for writing (EPIPE, ESHUTDOWN)
  • InterruptedError: a system call was interrupted by an incoming signal (EINTR)
  • ConnectionAbortedError: connection attempt aborted by peer (ECONNABORTED)
  • ConnectionRefusedError: connection refused by peer (ECONNREFUSED)
  • ConnectionResetError: connection reset by peer (ECONNRESET)
  • TimeoutError: connection timed out (ETIMEDOUT); this can be re-cast as a generic timeout exception, replacing socket.timeout and also useful for other types of timeout (for example in Lock.acquire())
  • ChildProcessError: operation on a child process failed (ECHILD); this is raised mainly by the wait() family of functions.
  • ProcessLookupError: the given process (as identified by, e.g., its process id) doesn't exist (ESRCH).

In addition, the following exception class is proposed for inclusion:

  • ConnectionError: a base class for ConnectionAbortedError, ConnectionRefusedError and ConnectionResetError

The following drawing tries to sum up the proposed additions, along with the corresponding errno values (where applicable). The root of the sub-hierarchy (OSError, assuming Step 1 is accepted in full) is not shown:

+-- BlockingIOError        EAGAIN, EALREADY, EWOULDBLOCK, EINPROGRESS
+-- ChildProcessError                                          ECHILD
+-- ConnectionError
    +-- BrokenPipeError                              EPIPE, ESHUTDOWN
    +-- ConnectionAbortedError                           ECONNABORTED
    +-- ConnectionRefusedError                           ECONNREFUSED
    +-- ConnectionResetError                               ECONNRESET
+-- FileExistsError                                            EEXIST
+-- FileNotFoundError                                          ENOENT
+-- InterruptedError                                            EINTR
+-- IsADirectoryError                                          EISDIR
+-- NotADirectoryError                                        ENOTDIR
+-- PermissionError                                     EACCES, EPERM
+-- ProcessLookupError                                          ESRCH
+-- TimeoutError                                            ETIMEDOUT
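On Python 3.3 and later, the drawing above can be checked against the implemented hierarchy:

```python
# Every new subclass derives from OSError (the root shown in the drawing).
for exc in (BlockingIOError, ChildProcessError, ConnectionError,
            FileExistsError, FileNotFoundError, InterruptedError,
            IsADirectoryError, NotADirectoryError, PermissionError,
            ProcessLookupError, TimeoutError):
    assert issubclass(exc, OSError)

# The connection-related errors share the intermediate ConnectionError base.
for exc in (BrokenPipeError, ConnectionAbortedError,
            ConnectionRefusedError, ConnectionResetError):
    assert issubclass(exc, ConnectionError)
```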

Naming

Various naming controversies can arise. One of them is whether all exception class names should end in "Error". In favour is consistency with the rest of the exception hierarchy; against is concision (especially with long names such as ConnectionAbortedError).

Exception attributes

In order to preserve useful compatibility, these subclasses should still set adequate values for the various exception attributes defined on the superclass (for example errno, filename, and optionally winerror).

Implementation

Since it is proposed that the subclasses be raised based purely on the value of errno, few or no changes should be required in extension modules (either standard or third-party).

The first possibility is to adapt the PyErr_SetFromErrno() family of functions (PyErr_SetFromWindowsErr() under Windows) to raise the appropriate OSError subclass. This wouldn't cover, however, Python code raising OSError directly, using the following idiom (seen in Lib/tempfile.py):

raise IOError(_errno.EEXIST, "No usable temporary file name found")

A second possibility, suggested by Marc-Andre Lemburg, is to adapt OSError.__new__ to instantiate the appropriate subclass. This has the benefit of also covering Python code such as the above.
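This second approach is the one that ended up in Python 3.3: OSError.__new__ inspects the errno argument and instantiates the matching subclass, so even old-style raising code produces the new exception types:

```python
import errno

# Constructing OSError with a mapped errno yields the matching subclass...
assert type(OSError(errno.ENOENT, "No such file or directory")) is FileNotFoundError
assert type(OSError(errno.EEXIST, "File exists")) is FileExistsError
# ...while an unmapped errno yields a plain OSError.
assert type(OSError(errno.EBADF, "Bad file descriptor")) is OSError
```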

Possible objections

Namespace pollution

Making the exception hierarchy finer-grained makes the root (or builtins) namespace larger. This concern is mitigated, however, by the following:

  • only a handful of additional classes are proposed;
  • while standard exception types live in the root namespace, they are visually distinguished by the fact that they use the CamelCase convention, while almost all other builtins use lowercase naming (except True, False, None, Ellipsis and NotImplemented)

An alternative would be to provide a separate module containing the finer-grained exceptions, but that would defeat the purpose of encouraging careful code over careless code, since the user would first have to import the new module instead of using names already accessible.

Earlier discussion

While this is the first time such a formal proposal has been made, the idea has received informal support in the past [1], both for the introduction of finer-grained exception classes and for the coalescing of OSError and IOError.

The removal of WindowsError alone has been discussed and rejected as part of another PEP [2], but there seemed to be a consensus that the distinction from OSError wasn't meaningful. This supports at least aliasing it to OSError.

Implementation

The reference implementation has been integrated into Python 3.3. It was formerly developed in http://hg.python.org/features/pep-3151/ in branch pep-3151, and also tracked on the bug tracker at http://bugs.python.org/issue12555. It has been successfully tested on a variety of systems: Linux, Windows, OpenIndiana and FreeBSD buildbots.

One source of trouble has been with the respective constructors of OSError and WindowsError, which were incompatible. The way it is solved is by keeping the OSError signature and adding a fourth optional argument to allow passing the Windows error code (which is different from the POSIX errno). The fourth argument is stored as winerror and its POSIX translation as errno. The PyErr_SetFromWindowsErr* functions have been adapted to use the right constructor call.

A slight complication is when the PyErr_SetExcFromWindowsErr* functions are called with OSError rather than WindowsError: the errno attribute of the exception object would store the Windows error code (such as 109 for ERROR_BROKEN_PIPE) rather than its POSIX translation (such as 32 for EPIPE), which it does now. For non-socket error codes, this only occurs in the private _multiprocessing module for which there is no compatibility concern.

Note

For socket errors, the "POSIX errno" as reflected by the errno module is numerically equal to the Windows Socket error code returned by the WSAGetLastError system call:

>>> errno.EWOULDBLOCK
10035
>>> errno.WSAEWOULDBLOCK
10035

Possible alternative

Pattern matching

Another possibility would be to introduce an advanced pattern matching syntax when catching exceptions. For example:

try:
    os.remove(filename)
except OSError as e if e.errno == errno.ENOENT:
    pass

Several problems with this proposal:

  • it introduces new syntax, which is perceived by the author to be a heavier change compared to reworking the exception hierarchy
  • it doesn't decrease typing effort significantly
  • it doesn't relieve the programmer from the burden of having to remember errno mnemonics

Exceptions ignored by this PEP

This PEP ignores EOFError, which signals a truncated input stream in various protocol and file format implementations (for example GzipFile). EOFError is not OS- or IO-related; it is a logical error raised at a higher level.

This PEP also ignores SSLError, which is raised by the ssl module in order to propagate errors signalled by the OpenSSL library. Ideally, SSLError would benefit from a similar but separate treatment since it defines its own constants for error types (ssl.SSL_ERROR_WANT_READ, etc.). In Python 3.2, SSLError is already replaced with socket.timeout when it signals a socket timeout (see issue 10272).

Finally, the fate of socket.gaierror and socket.herror is not settled. While they would deserve less cryptic names, this can be handled separately from the exception hierarchy reorganization effort.

Appendix A: Survey of common errnos

This is a quick inventory of the various errno mnemonics checked for in the standard library and its tests, as part of except clauses.

Common errnos with OSError

  • EBADF: bad file descriptor (usually means the file descriptor was closed)
  • EEXIST: file or directory exists
  • EINTR: interrupted function call
  • EISDIR: is a directory
  • ENOTDIR: not a directory
  • ENOENT: no such file or directory
  • EOPNOTSUPP: operation not supported on socket (possible confusion with the existing io.UnsupportedOperation)
  • EPERM: operation not permitted (when using e.g. os.setuid())

Common errnos with IOError

  • EACCES: permission denied (for filesystem operations)
  • EBADF: bad file descriptor (with select.epoll); read operation on a write-only GzipFile, or vice-versa
  • EBUSY: device or resource busy
  • EISDIR: is a directory (when trying to open())
  • ENODEV: no such device
  • ENOENT: no such file or directory (when trying to open())
  • ETIMEDOUT: connection timed out

Common errnos with socket.error

All these errors may also be associated with a plain IOError, for example when calling read() on a socket's file descriptor.

  • EAGAIN: resource temporarily unavailable (during a non-blocking socket call except connect())
  • EALREADY: connection already in progress (during a non-blocking connect())
  • EINPROGRESS: operation in progress (during a non-blocking connect())
  • EINTR: interrupted function call
  • EISCONN: the socket is connected
  • ECONNABORTED: connection aborted by peer (during an accept() call)
  • ECONNREFUSED: connection refused by peer
  • ECONNRESET: connection reset by peer
  • ENOTCONN: socket not connected
  • ESHUTDOWN: cannot send after transport endpoint shutdown
  • EWOULDBLOCK: same reasons as EAGAIN

Common errnos with select.error

  • EINTR: interrupted function call

Appendix B: Survey of raised OS and IO errors

About VMSError

VMSError is completely unused by the interpreter core and the standard library. It was added as part of the OpenVMS patches submitted in 2002 by Jean-François Piéronne [4]; the motivation for including VMSError was that it could be raised by third-party packages.

Interpreter core

Handling of PYTHONSTARTUP raises IOError (but the error gets discarded):

$ PYTHONSTARTUP=foox ./python
Python 3.2a0 (py3k:82920M, Jul 16 2010, 22:53:23)
[GCC 4.4.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Could not open PYTHONSTARTUP
IOError: [Errno 2] No such file or directory: 'foox'

PyObject_Print() raises IOError when ferror() signals an error on the FILE * parameter (which, in the source tree, is always either stdout or stderr).

Unicode encoding and decoding using the mbcs encoding can raise WindowsError for some error conditions.

Standard library

bz2

Raises IOError throughout (OSError is unused):

>>> bz2.BZ2File("foox", "rb")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IOError: [Errno 2] No such file or directory
>>> bz2.BZ2File("LICENSE", "rb").read()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IOError: invalid data stream
>>> bz2.BZ2File("/tmp/zzz.bz2", "wb").read()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IOError: file is not ready for reading

curses

Not examined.

dbm.gnu, dbm.ndbm

_dbm.error and _gdbm.error inherit from IOError:

>>> dbm.gnu.open("foox")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
_gdbm.error: [Errno 2] No such file or directory

fcntl

Raises IOError throughout (OSError is unused).

imp module

Raises IOError for bad file descriptors:

>>> imp.load_source("foo", "foo", 123)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IOError: [Errno 9] Bad file descriptor

io module

Raises IOError when trying to open a directory under Unix:

>>> open("Python/", "r")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IOError: [Errno 21] Is a directory: 'Python/'

Raises IOError or io.UnsupportedOperation (which inherits from the former) for unsupported operations:

>>> open("LICENSE").write("bar")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IOError: not writable
>>> io.StringIO().fileno()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
io.UnsupportedOperation: fileno
>>> open("LICENSE").seek(1, 1)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IOError: can't do nonzero cur-relative seeks

Raises either IOError or TypeError when the inferior I/O layer misbehaves (i.e. violates the API it is expected to implement).

Raises IOError when the underlying OS resource becomes invalid:

>>> f = open("LICENSE")
>>> os.close(f.fileno())
>>> f.read()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IOError: [Errno 9] Bad file descriptor

...or for implementation-specific optimizations:

>>> f = open("LICENSE")
>>> next(f)
'A. HISTORY OF THE SOFTWARE\n'
>>> f.tell()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IOError: telling position disabled by next() call

Raises BlockingIOError (inheriting from IOError) when a call on a non-blocking object would block.

mmap

Under Unix, raises its own mmap.error (inheriting from EnvironmentError) throughout:

>>> mmap.mmap(123, 10)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
mmap.error: [Errno 9] Bad file descriptor
>>> mmap.mmap(os.open("/tmp", os.O_RDONLY), 10)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
mmap.error: [Errno 13] Permission denied

Under Windows, however, it mostly raises WindowsError (the source code also shows a few occurrences of mmap.error):

>>> fd = os.open("LICENSE", os.O_RDONLY)
>>> m = mmap.mmap(fd, 16384)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
WindowsError: [Error 5] Accès refusé
>>> sys.last_value.errno
13
>>> errno.errorcode[13]
'EACCES'

>>> m = mmap.mmap(-1, 4096)
>>> m.resize(16384)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
WindowsError: [Error 87] Paramètre incorrect
>>> sys.last_value.errno
22
>>> errno.errorcode[22]
'EINVAL'

multiprocessing

Not examined.

os / posix

The os (or posix) module raises OSError throughout, except under Windows where WindowsError can be raised instead.

ossaudiodev

Raises IOError throughout (OSError is unused):

>>> ossaudiodev.open("foo", "r")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IOError: [Errno 2] No such file or directory: 'foo'

readline

Raises IOError in various file-handling functions:

>>> readline.read_history_file("foo")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IOError: [Errno 2] No such file or directory
>>> readline.read_init_file("foo")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IOError: [Errno 2] No such file or directory
>>> readline.write_history_file("/dev/nonexistent")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IOError: [Errno 13] Permission denied

select

  • select() and poll objects raise select.error, which doesn't inherit from anything (but poll.modify() raises IOError);
  • epoll objects raise IOError;
  • kqueue objects raise both OSError and IOError.

As a side-note, not deriving from EnvironmentError means select.error does not get the useful errno attribute. User code must check args[0] instead:

>>> signal.alarm(1); select.select([], [], [])
0
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
select.error: (4, 'Interrupted system call')
>>> e = sys.last_value
>>> e
error(4, 'Interrupted system call')
>>> e.errno == errno.EINTR
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'error' object has no attribute 'errno'
>>> e.args[0] == errno.EINTR
True

signal

signal.ItimerError inherits from IOError.

socket

socket.error inherits from IOError.

sys

sys.getwindowsversion() raises WindowsError with a bogus error number if the GetVersionEx() call fails.

time

Raises IOError for internal errors in time.time() and time.sleep().

zipimport

zipimporter.get_data() can raise IOError.

Acknowledgments

Significant input has been received from Nick Coghlan.

References

[1]"IO module precisions and exception hierarchy": http://mail.python.org/pipermail/python-dev/2009-September/092130.html
[2]Discussion of "Removing WindowsError" in PEP 348: http://www.python.org/dev/peps/pep-0348/#removing-windowserror
[3]Google Code Search of IOError in Python code: around 40000 results; OSError: around 15200 results; EnvironmentError: around 3000 results
[4]http://bugs.python.org/issue614055

pep-3152 Cofunctions

PEP:3152
Title:Cofunctions
Version:$Revision$
Last-Modified:$Date$
Author:Gregory Ewing <greg.ewing at canterbury.ac.nz>
Status:Rejected
Type:Standards Track
Content-Type:text/x-rst
Created:13-Feb-2009
Python-Version:3.3
Post-History:

Abstract

A syntax is proposed for defining and calling a special type of generator called a 'cofunction'. It is designed to provide a streamlined way of writing generator-based coroutines, and allow the early detection of certain kinds of error that are easily made when writing such code, which otherwise tend to cause hard-to-diagnose symptoms.

This proposal builds on the 'yield from' mechanism described in PEP 380, and describes some of the semantics of cofunctions in terms of it. However, it would be possible to define and implement cofunctions independently of PEP 380 if so desired.

Specification

Cofunction definitions

A new keyword codef is introduced which is used in place of def to define a cofunction. A cofunction is a special kind of generator having the following characteristics:

  1. A cofunction is always a generator, even if it does not contain any yield or yield from expressions.
  2. A cofunction cannot be called the same way as an ordinary function. An exception is raised if an ordinary call to a cofunction is attempted.

Cocalls

Calls from one cofunction to another are made by marking the call with a new keyword cocall. The expression

cocall f(*args, **kwds)

is semantically equivalent to

yield from f.__cocall__(*args, **kwds)

except that the object returned by __cocall__ is expected to be an iterator, so the step of calling iter() on it is skipped.
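The PEP 380 machinery that cocall builds on can be illustrated with plain generators; in the sketch below, the yield from line plays the role a cocall expression would (the names are illustrative only):

```python
def inner():
    # Suspend, wait for a value from the driver, and return twice that value.
    received = yield "ping"
    return received * 2

def outer():
    # Under this proposal, 'cocall inner()' would expand to roughly
    # 'yield from inner.__cocall__()'; plain 'yield from' shows the mechanics.
    result = yield from inner()
    return result

g = outer()
assert next(g) == "ping"       # inner's yield surfaces through outer
try:
    g.send(21)
except StopIteration as stop:
    assert stop.value == 42    # inner's return value propagates outward
```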

The full syntax of a cocall expression is described by the following grammar lines:

atom: cocall | <existing alternatives for atom>
cocall: 'cocall' atom cotrailer* '(' [arglist] ')'
cotrailer: '[' subscriptlist ']' | '.' NAME

The cocall keyword is syntactically valid only inside a cofunction. A SyntaxError will result if it is used in any other context.

Objects which implement __cocall__ are expected to return an object obeying the iterator protocol. Cofunctions respond to __cocall__ the same way as ordinary generator functions respond to __call__, i.e. by returning a generator-iterator.

Certain objects that wrap other callable objects, notably bound methods, will be given __cocall__ implementations that delegate to the underlying object.

New builtins, attributes and C API functions

To facilitate interfacing cofunctions with non-coroutine code, there will be a built-in function costart whose definition is equivalent to

def costart(obj, *args, **kwds):
    return obj.__cocall__(*args, **kwds)

There will also be a corresponding C API function

PyObject *PyObject_CoCall(PyObject *obj, PyObject *args, PyObject *kwds)

It is left unspecified for now whether a cofunction is a distinct type of object or, like a generator function, is simply a specially-marked function instance. If the latter, a read-only boolean attribute __iscofunction__ should be provided to allow testing whether a given function object is a cofunction.

Motivation and Rationale

The yield from syntax is reasonably self-explanatory when used for the purpose of delegating part of the work of a generator to another function. It can also be used to good effect in the implementation of generator-based coroutines, but it reads somewhat awkwardly when used for that purpose, and tends to obscure the true intent of the code.

Furthermore, using generators as coroutines is somewhat error-prone. If one forgets to use yield from when it should have been used, or uses it when it shouldn't have, the symptoms that result can be obscure and confusing.

Finally, sometimes there is a need for a function to be a coroutine even though it does not yield anything, and in these cases it is necessary to resort to kludges such as if 0: yield to force it to be a generator.
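The kludge referred to here makes a function a generator without ever yielding at run time; a minimal illustration:

```python
import inspect

def do_nothing_coroutine():
    # The unreachable yield forces generator status at compile time.
    if 0:
        yield

assert inspect.isgeneratorfunction(do_nothing_coroutine)
assert list(do_nothing_coroutine()) == []  # it yields nothing when run
```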

The codef and cocall constructs address the first issue by making the syntax directly reflect the intent, that is, that the function forms part of a coroutine.

The second issue is addressed by making it impossible to mix coroutine and non-coroutine code in ways that don't make sense. If the rules are violated, an exception is raised that points out exactly what and where the problem is.

Lastly, the need for dummy yields is eliminated by making the form of definition determine whether the function is a coroutine, rather than what it contains.

Prototype Implementation

An implementation in the form of patches to Python 3.1.2 can be found here:

http://www.cosc.canterbury.ac.nz/greg.ewing/python/generators/cofunctions.html

pep-3153 Asynchronous IO support

PEP:3153
Title:Asynchronous IO support
Version:$Revision$
Last-Modified:$Date$
Author:Laurens Van Houtven <_ at lvh.cc>
Status:Superseded
Type:Standards Track
Content-Type:text/x-rst
Created:29-May-2011
Post-History:TBD
Superseded-By:3156

Abstract

This PEP describes an abstraction of asynchronous IO for the Python standard library.

The goal is to reach an abstraction that can be implemented by many different asynchronous IO backends and that provides a target for library developers to write code portable between those different backends.

Rationale

People who want to write asynchronous code in Python right now have a few options:

  • asyncore and asynchat
  • something bespoke, most likely based on the select module
  • using a third party library, such as Twisted [2] or gevent [3]

Unfortunately, each of these options has its downsides, which this PEP tries to address.

Despite having been part of the Python standard library for a long time, the asyncore module suffers from fundamental flaws following from an inflexible API that does not stand up to the expectations of a modern asynchronous networking module.

Moreover, its approach is too simplistic to provide developers with all the tools they need in order to fully exploit the potential of asynchronous networking.

The most popular solution right now used in production involves the use of third party libraries. These often provide satisfactory solutions, but there is a lack of compatibility between these libraries, which tends to make codebases very tightly coupled to the library they use.

This current lack of portability between different asynchronous IO libraries causes a lot of duplicated effort for third party library developers. A sufficiently powerful abstraction could mean that asynchronous code gets written once, but used everywhere.

An eventual added goal would be for standard library implementations of wire and network protocols to evolve towards being real protocol implementations, as opposed to standalone libraries that do everything themselves, including making blocking calls to recv(). This would allow them to be easily reused for both synchronous and asynchronous code.

Communication abstractions

Transports

Transports provide a uniform API for reading bytes from and writing bytes to different kinds of connections. Transports in this PEP are always ordered, reliable, bidirectional, stream-oriented two-endpoint connections. This might be a TCP socket, an SSL connection, a pipe (named or otherwise), a serial port... It may abstract a file descriptor on POSIX platforms or a Handle on Windows or some other data structure appropriate to a particular platform. It encapsulates all of the particular implementation details of using that platform data structure and presents a uniform interface for application developers.

Transports talk to two things: the other side of the connection on one hand, and a protocol on the other. It's a bridge between the specific underlying transfer mechanism and the protocol. Its job can be described as allowing the protocol to just send and receive bytes, taking care of all of the magic that needs to happen to those bytes to be eventually sent across the wire.

The primary feature of a transport is passing received bytes to a protocol and writing bytes from the protocol to the underlying connection. Writing to the transport is done using the write and write_sequence methods. The latter method is a performance optimization, to allow software to take advantage of specific capabilities in some transport mechanisms. Specifically, this allows transports to use writev [4] instead of write [5] or send [6], also known as scatter/gather IO.

A transport can be paused and resumed. This will cause it to buffer data coming from protocols and stop sending received data to the protocol.

A transport can also be closed, half-closed and aborted. A closed transport will finish writing all of the data queued in it to the underlying mechanism, and will then stop reading or writing data. Aborting a transport stops it, closing the connection without sending any data that is still queued.

Further writes will result in exceptions being raised. A half-closed transport may not be written to anymore, but will still accept incoming data.

Protocols

Protocols are probably more familiar to new users. The terminology is consistent with what you would expect from something called a protocol: the protocols most people think of first, like HTTP, IRC, SMTP... are all examples of something that would be implemented in a protocol.

The shortest useful definition of a protocol is a (usually two-way) bridge between the transport and the rest of the application logic. A protocol receives bytes from a transport and translates that information into some behavior, typically resulting in some method calls on an object. Similarly, application logic calls some methods on the protocol, which the protocol translates into bytes and communicates to the transport.

One of the simplest protocols is a line-based protocol, where data is delimited by \r\n. The protocol will receive bytes from the transport and buffer them until there is at least one complete line. Once that's done, it will pass this line along to some object. Ideally that would be accomplished using a callable or even a completely separate object composed by the protocol, but it could also be implemented by subclassing (as is the case with Twisted's LineReceiver). For the other direction, the protocol could have a write_line method, which adds the required \r\n and passes the new bytes buffer on to the transport.

This PEP suggests a generalized LineReceiver called ChunkProtocol, where a "chunk" is a message in a stream, delimited by the specified delimiter. Instances take a delimiter and a callable that will be called with a chunk of data once it's received (as opposed to Twisted's subclassing behavior). ChunkProtocol also has a write_chunk method analogous to the write_line method described above.
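The chunk-buffering logic described above is straightforward to sketch. The following is an illustration only, not the PEP's actual API: ChunkProtocol was never implemented, and the data_received method name here is an assumption modeled on the description above.

```python
class ChunkProtocol:
    """Sketch of the delimiter-framed protocol described above."""

    def __init__(self, delimiter, callback):
        self.delimiter = delimiter
        self.callback = callback     # called once per complete chunk
        self.transport = None
        self._buffer = b''

    def data_received(self, data):
        # Buffer bytes until at least one complete chunk is available.
        self._buffer += data
        *chunks, self._buffer = self._buffer.split(self.delimiter)
        for chunk in chunks:
            self.callback(chunk)

    def write_chunk(self, chunk):
        # Add the delimiter and hand the bytes on to the transport.
        self.transport.write(chunk + self.delimiter)


received = []
proto = ChunkProtocol(b'\r\n', received.append)
proto.data_received(b'first line\r\nsec')   # one complete line, one partial
proto.data_received(b'ond line\r\n')        # completes the second line
```

After both calls, received holds the two complete lines, while partial data stays buffered until its delimiter arrives.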

Why separate protocols and transports?

This separation between protocol and transport often confuses people who first come across it. In fact, the standard library itself does not make this distinction in many cases, particularly not in the API it provides to users.

It is nonetheless a very useful distinction. In the worst case, it simplifies the implementation by clear separation of concerns. However, it often serves the far more useful purpose of being able to reuse protocols across different transports.

Consider a simple RPC protocol. The same bytes may be transferred across many different transports, for example pipes or sockets. To help with this, we separate the protocol out from the transport. The protocol just reads and writes bytes, and doesn't really care what mechanism is used to eventually transfer those bytes.

This also allows for protocols to be stacked or nested easily, allowing for even more code reuse. A common example of this is JSON-RPC: according to the specification, it can be used across both sockets and HTTP [1]. In practice, it tends to be primarily encapsulated in HTTP. The protocol-transport abstraction allows us to build a stack of protocols and transports in which HTTP is used as if it were a transport. For JSON-RPC, that might get you a stack somewhat like this:

  1. TCP socket transport
  2. HTTP protocol
  3. HTTP-based transport
  4. JSON-RPC protocol
  5. Application code

Flow control

Consumers

Consumers consume bytes produced by producers. Together with producers, they make flow control possible.

Consumers primarily play a passive role in flow control. They get called whenever a producer has some data available. They then process that data, and typically yield control back to the producer.

Consumers typically implement buffers of some sort. They make flow control possible by telling their producer about the current status of those buffers. A consumer can instruct a producer to stop producing entirely, stop producing temporarily, or resume producing if it has been told to pause previously.

Producers are registered to the consumer using the register method.

Producers

Where consumers consume bytes, producers produce them.

Producers are modeled after the IPushProducer [7] interface found in Twisted. Although there is an IPullProducer [8] as well, it is on the whole far less interesting and therefore probably out of the scope of this PEP.

Although producers can be told to stop producing entirely, the two most interesting methods they have are pause and resume. These are usually called by the consumer, to signify whether it is ready to process ("consume") more data or not. Consumers and producers cooperate to make flow control possible.

In addition to the Twisted IPushProducer [7] interface, producers have a half_register method which is called with the consumer when the consumer tries to register that producer. In most cases, this will just be a case of setting self.consumer = consumer, but some producers may require more complex preconditions or behavior when a consumer is registered. End-users are not supposed to call this method directly.
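The register/half_register handshake and the pause/resume flow-control cycle described above can be sketched as follows. The class names and the buffer high-water mark are hypothetical; the PEP does not specify concrete implementations.

```python
class Consumer:
    """Consumes bytes and applies backpressure via pause/resume."""

    def __init__(self, high_water=3):
        self.buffer = []
        self.high_water = high_water
        self.producer = None

    def register(self, producer):
        self.producer = producer
        producer.half_register(self)   # the producer learns its consumer

    def write(self, data):
        self.buffer.append(data)
        if len(self.buffer) >= self.high_water:
            self.producer.pause()      # buffer full: stop producing for now

    def drain(self):
        self.buffer.clear()
        self.producer.resume()         # room again: resume producing


class Producer:
    """Push producer modeled on Twisted's IPushProducer."""

    def __init__(self, items):
        self.items = list(items)
        self.consumer = None
        self.paused = False

    def half_register(self, consumer):
        # In most cases just this assignment, as the text above notes.
        self.consumer = consumer

    def pause(self):
        self.paused = True

    def resume(self):
        self.paused = False
        self.produce()

    def produce(self):
        while self.items and not self.paused:
            self.consumer.write(self.items.pop(0))


consumer = Consumer()
producer = Producer([b'a', b'b', b'c', b'd', b'e'])
consumer.register(producer)
producer.produce()   # pauses once the buffer hits its high-water mark
consumer.drain()     # resumes; the remaining items flow through
```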

Considered API alternatives

Generators as producers

Generators have been suggested as a way to implement producers. However, there appear to be a few problems with this.

First of all, there is a conceptual problem. A generator, in a sense, is "passive". It needs to be told, through a method call, to take action. A producer is "active": it initiates those method calls. A real producer has a symmetric relationship with its consumer. In the case of a generator-turned-producer, only the consumer would hold a reference, and the producer is blissfully unaware of the consumer's existence.

This conceptual problem translates into a few technical issues as well. After a successful write method call on its consumer, a (push) producer is free to take action once more. In the case of a generator, it would need to be told, either by asking for the next object through the iteration protocol (a process which could block indefinitely), or perhaps by throwing some kind of signal exception into it.

This signaling setup may provide a technically feasible solution, but it is still unsatisfactory. For one, this introduces unwarranted complexity in the consumer, which now not only needs to understand how to receive and process data, but also how to ask for new data and deal with the case of no new data being available.

This latter edge case is particularly problematic. It needs to be taken care of, since the entire operation is not allowed to block. However, generators cannot raise an exception on iteration without terminating, thereby losing the state of the generator. As a result, signaling a lack of available data would have to be done using a sentinel value, instead of using the exception mechanism.

Last but not least, nobody has produced working code demonstrating how generators could actually be used this way.

pep-3154 Pickle protocol version 4

PEP:3154
Title:Pickle protocol version 4
Version:$Revision$
Last-Modified:$Date$
Author:Antoine Pitrou <solipsis at pitrou.net>
Status:Final
Type:Standards Track
Content-Type:text/x-rst
Created:2011-08-11
Python-Version:3.4
Post-History:http://mail.python.org/pipermail/python-dev/2011-August/112821.html
Resolution:https://mail.python.org/pipermail/python-dev/2013-November/130439.html

Abstract

Data serialized using the pickle module must be portable across Python versions. It should also support the latest language features as well as implementation-specific features. For this reason, the pickle module knows about several protocols (currently numbered from 0 to 3), each of which appeared in a different Python version. Using a low-numbered protocol version makes it possible to exchange data with old Python versions, while using a high-numbered protocol allows access to newer features and sometimes more efficient resource use (both CPU time required for (de)serializing, and disk size / network bandwidth required for data transfer).

Rationale

The latest current protocol, coincidentally named protocol 3, appeared with Python 3.0 and supports the new incompatible features in the language (mainly, unicode strings by default and the new bytes object). The opportunity was not taken at the time to improve the protocol in other ways.

This PEP is an attempt to foster a number of incremental improvements in a new pickle protocol version. The PEP process is used in order to gather as many improvements as possible, because the introduction of a new pickle protocol should be a rare occurrence.

Proposed changes

Framing

Traditionally, when unpickling an object from a stream (by calling load() rather than loads()), many small read() calls can be issued on the file-like object, with a potentially huge performance impact.

Protocol 4, by contrast, features binary framing. The general structure of a pickle is thus the following:

+------+------+
| 0x80 | 0x04 |              protocol header (2 bytes)
+------+------+
|  OP  |                     FRAME opcode (1 byte)
+------+------+-----------+
| MM MM MM MM MM MM MM MM |  frame size (8 bytes, little-endian)
+------+------------------+
| .... |                     first frame contents (M bytes)
+------+
|  OP  |                     FRAME opcode (1 byte)
+------+------+-----------+
| NN NN NN NN NN NN NN NN |  frame size (8 bytes, little-endian)
+------+------------------+
| .... |                     second frame contents (N bytes)
+------+
  etc.

To keep the implementation simple, it is forbidden for a pickle opcode to straddle frame boundaries. The pickler takes care not to produce such pickles, and the unpickler refuses them. Also, there is no "last frame" marker. The last frame is simply the one which ends with a STOP opcode.

A well-written C implementation doesn't need additional memory copies for the framing layer, preserving general (un)pickling efficiency.

Note

How the pickler decides to partition the pickle stream into frames is an implementation detail. For example, "closing" a frame as soon as it reaches ~64 KiB is a reasonable choice for both performance and pickle size overhead.
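The frame layout above can be walked with a few lines of code. This sketch assumes CPython's eventual protocol-4 behavior: a two-byte \x80\x04 header followed by FRAME-prefixed chunks, with the FRAME opcode encoded as 0x95 (a detail of the implementation, not stated in this PEP).

```python
import io
import pickle
import struct

FRAME = b'\x95'  # the FRAME opcode as implemented in CPython


def iter_frames(data):
    """Yield (frame_size, frame_contents) pairs from a protocol-4 pickle."""
    buf = io.BytesIO(data)
    if buf.read(2) != b'\x80\x04':
        raise ValueError('not a protocol-4 pickle')
    while buf.read(1) == FRAME:
        # Frame size: 8 bytes, little-endian, as in the diagram above.
        (size,) = struct.unpack('<Q', buf.read(8))
        yield size, buf.read(size)


payload = pickle.dumps(list(range(1000)), protocol=4)
frames = list(iter_frames(payload))
```

With CPython's ~64 KiB framing target, a pickle this small typically occupies a single frame, and the last frame simply ends with the STOP opcode (b'.').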

Binary encoding for all opcodes

The GLOBAL opcode, which is still used in protocol 3, uses the so-called "text" mode of the pickle protocol, which involves looking for newlines in the pickle stream. It also complicates the implementation of binary framing.

Protocol 4 forbids use of the GLOBAL opcode and replaces it with STACK_GLOBAL, a new opcode which takes its operand from the stack.

Serializing more "lookupable" objects

By default, pickle is only able to serialize module-global functions and classes. Supporting other kinds of objects, such as unbound methods [4], is a common request. Actually, third-party support for some of them, such as bound methods, is implemented in the multiprocessing module [5].

The __qualname__ attribute from PEP 3155 makes it possible to look up many more objects by name. Making the STACK_GLOBAL opcode accept dot-separated names would allow the standard pickle implementation to support all those kinds of objects.

64-bit opcodes for large objects

Current protocol versions export object sizes for various built-in types (str, bytes) as 32-bit ints. This forbids serialization of large data [1]. New opcodes are required to support very large bytes and str objects.

Native opcodes for sets and frozensets

Many common built-in types (such as str, bytes, dict, list, tuple) have dedicated opcodes to improve resource consumption when serializing and deserializing them; however, sets and frozensets don't. Adding such opcodes would be an obvious improvement. Also, dedicated set support could help remove the current impossibility of pickling self-referential sets [2].

Calling __new__ with keyword arguments

Currently, classes whose __new__ mandates the use of keyword-only arguments cannot be pickled (or, rather, unpickled) [3]. Both a new special method (__getnewargs_ex__) and a new opcode (NEWOBJ_EX) are needed. The __getnewargs_ex__ method, if it exists, must return a two-tuple (args, kwargs) where the first item is the tuple of positional arguments and the second item is the dict of keyword arguments for the class's __new__ method.
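As eventually shipped in Python 3.4, the mechanism works as follows; the class name and attributes here are illustrative.

```python
import pickle


class Point:
    def __new__(cls, *, x=0, y=0):      # keyword-only arguments
        self = super().__new__(cls)
        self.x, self.y = x, y
        return self

    def __getnewargs_ex__(self):
        # (positional args, keyword args) passed to __new__ at unpickling.
        return ((), {'x': self.x, 'y': self.y})


# Protocol 4 emits NEWOBJ_EX, which supports the keyword arguments.
p = pickle.loads(pickle.dumps(Point(x=1, y=2), protocol=4))
```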

Better string encoding

Short str objects currently have their length coded as a 4-byte integer, which is wasteful. A specific opcode with a 1-byte length would make many pickles smaller.

Smaller memoization

The PUT opcodes all require an explicit index to select the entry of the memo dictionary in which the top of the stack is memoized. However, in practice those numbers are allocated in sequential order. A new opcode, MEMOIZE, will instead store the top of the stack at the index equal to the current size of the memo dictionary. This allows for shorter pickles, since PUT opcodes are emitted for all non-atomic datatypes.
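The difference is easy to observe with the pickletools module as shipped alongside protocol 4 in Python 3.4; this disassembles a pickle containing a shared sub-object, which forces memoization.

```python
import pickle
import pickletools

x = [1, 2]
data = pickle.dumps([x, x], protocol=4)   # x must be memoized to be shared
opnames = [op.name for op, arg, pos in pickletools.genops(data)]
```

The protocol-4 stream uses MEMOIZE (with its implicit sequential index) where protocols 0 to 3 would emit PUT/BINPUT opcodes carrying an explicit index.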

Summary of new opcodes

These reflect the state of the proposed implementation (thanks mostly to Alexandre Vassalotti's work):

  • FRAME: introduce a new frame (followed by the 8-byte frame size and the frame contents).
  • SHORT_BINUNICODE: push a utf8-encoded str object with a one-byte size prefix (therefore less than 256 bytes long).
  • BINUNICODE8: push a utf8-encoded str object with an eight-byte size prefix (for strings longer than 2**32 bytes, which therefore cannot be serialized using BINUNICODE).
  • BINBYTES8: push a bytes object with an eight-byte size prefix (for bytes objects longer than 2**32 bytes, which therefore cannot be serialized using BINBYTES).
  • EMPTY_SET: push a new empty set object on the stack.
  • ADDITEMS: add the topmost stack items to the set (to be used with EMPTY_SET).
  • FROZENSET: create a frozenset object from the topmost stack items, and push it on the stack.
  • NEWOBJ_EX: take the three topmost stack items cls, args and kwargs, and push the result of calling cls.__new__(*args, **kwargs).
  • STACK_GLOBAL: take the two topmost stack items module_name and qualname, and push the result of looking up the dotted qualname in the module named module_name.
  • MEMOIZE: store the top-of-stack object in the memo dictionary with an index equal to the current size of the memo dictionary.

Alternative ideas

Prefetching

Serhiy Storchaka suggested replacing framing with a special PREFETCH opcode (with a 2- or 4-byte argument) to declare known pickle chunks explicitly. Large data may be pickled outside such chunks. A naïve unpickler should be able to skip the PREFETCH opcode and still decode pickles properly, but good error handling would require checking that the PREFETCH length falls on an opcode boundary.

Acknowledgments

In alphabetic order:

  • Alexandre Vassalotti, for starting the second PEP 3154 implementation [6]
  • Serhiy Storchaka, for discussing the framing proposal [6]
  • Stefan Mihaila, for starting the first PEP 3154 implementation as a Google Summer of Code project mentored by Alexandre Vassalotti [7].

References

[1]"pickle not 64-bit ready": http://bugs.python.org/issue11564
[2]"Cannot pickle self-referencing sets": http://bugs.python.org/issue9269
[3]"pickle/copyreg doesn't support keyword only arguments in __new__": http://bugs.python.org/issue4727
[4]"pickle should support methods": http://bugs.python.org/issue9276
[5]Lib/multiprocessing/forking.py: http://hg.python.org/cpython/file/baea9f5f973c/Lib/multiprocessing/forking.py#l54
[6](1, 2) Implement PEP 3154, by Alexandre Vassalotti http://bugs.python.org/issue17810
[7]Implement PEP 3154, by Stefan Mihaila http://bugs.python.org/issue15642

pep-3155 Qualified name for classes and functions

PEP:3155
Title:Qualified name for classes and functions
Version:$Revision$
Last-Modified:$Date$
Author:Antoine Pitrou <solipsis at pitrou.net>
Status:Final
Type:Standards Track
Content-Type:text/x-rst
Created:2011-10-29
Python-Version:3.3
Post-History:
Resolution:http://mail.python.org/pipermail/python-dev/2011-November/114545.html

Rationale

Python's introspection facilities have long had poor support for nested classes. Given a class object, it is impossible to know whether it was defined inside another class or at module top-level; and, if the former, it is also impossible to know in which class it was defined. While use of nested classes is often considered poor style, the only reason for them to have second-class introspection support is a lousy pun.

Python 3 adds insult to injury by dropping what was formerly known as unbound methods. In Python 2, given the following definition:

class C:
    def f():
        pass

you can then walk up from the C.f object to its defining class:

>>> C.f.im_class
<class '__main__.C'>

This possibility is gone in Python 3:

>>> C.f.im_class
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'function' object has no attribute 'im_class'
>>> dir(C.f)
['__annotations__', '__call__', '__class__', '__closure__', '__code__',
'__defaults__', '__delattr__', '__dict__', '__dir__', '__doc__',
'__eq__', '__format__', '__ge__', '__get__', '__getattribute__',
'__globals__', '__gt__', '__hash__', '__init__', '__kwdefaults__',
'__le__', '__lt__', '__module__', '__name__', '__ne__', '__new__',
'__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__',
'__str__', '__subclasshook__']

This limits once again the introspection capabilities available to the user. It can cause real issues when porting software to Python 3, for example in Twisted Core, where the issue of introspecting method objects came up several times. It also limits pickling support [1].

Proposal

This PEP proposes the addition of a __qualname__ attribute to functions and classes. For top-level functions and classes, the __qualname__ attribute is equal to the __name__ attribute. For nested classes, methods, and nested functions, the __qualname__ attribute contains a dotted path leading to the object from the module top-level. A function's local namespace is represented in that dotted path by a component named <locals>.

The repr() and str() of functions and classes is modified to use __qualname__ rather than __name__.

Example with nested classes

>>> class C:
...   def f(): pass
...   class D:
...     def g(): pass
...
>>> C.__qualname__
'C'
>>> C.f.__qualname__
'C.f'
>>> C.D.__qualname__
'C.D'
>>> C.D.g.__qualname__
'C.D.g'

Example with nested functions

>>> def f():
...   def g(): pass
...   return g
...
>>> f.__qualname__
'f'
>>> f().__qualname__
'f.<locals>.g'

Limitations

With nested functions (and classes defined inside functions), the dotted path will not be walkable programmatically as a function's namespace is not available from the outside. It will still be more helpful to the human reader than the bare __name__.

Like the __name__ attribute, the __qualname__ attribute is computed statically, and it will not automatically follow rebinding.
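Both limitations are easy to demonstrate with the nested-function example from above:

```python
def f():
    def g():
        pass
    return g


h = f()   # rebind the inner function to a new name outside f

# The qualified name was computed statically and does not follow the
# rebinding; nor can 'f.<locals>.g' be resolved programmatically, since
# f's local namespace is gone once the call returns.
assert h.__name__ == 'g'
assert h.__qualname__ == 'f.<locals>.g'
```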

Discussion

Excluding the module name

Like __name__, __qualname__ doesn't include the module name. This makes it independent of module aliasing and rebinding, and also allows it to be computed at compile time.

Reviving unbound methods

Reviving unbound methods would only solve a fraction of the problems this PEP solves, at a higher price (an additional object type and an additional indirection, rather than an additional attribute).

Naming choice

"Qualified name" is the best approximation, as a short phrase, of what the additional attribute is about. It is not a "full name" or "fully qualified name" since it (deliberately) does not include the module name. Calling it a "path" would risk confusion with filesystem paths and the __file__ attribute.

The first proposal for the attribute name was to call it __qname__ but many people (who are not aware of previous use of such jargon in e.g. the XML specification [2]) found it obscure and non-obvious, which is why the slightly less short and more explicit __qualname__ was finally chosen.

References

[1]"pickle should support methods": http://bugs.python.org/issue9276
[2]"QName" entry in Wikipedia: http://en.wikipedia.org/wiki/QName

pep-3156 Asynchronous IO Support Rebooted: the "asyncio" Module

PEP:3156
Title:Asynchronous IO Support Rebooted: the "asyncio" Module
Version:$Revision$
Last-Modified:$Date$
Author:Guido van Rossum <guido at python.org>
BDFL-Delegate:Antoine Pitrou <antoine@python.org>
Discussions-To:<python-tulip at googlegroups.com>
Status:Final
Type:Standards Track
Content-Type:text/x-rst
Created:12-Dec-2012
Post-History:21-Dec-2012
Resolution:https://mail.python.org/pipermail/python-dev/2013-November/130419.html

Abstract

This is a proposal for asynchronous I/O in Python 3, starting at Python 3.3. Consider this the concrete proposal that is missing from PEP 3153. The proposal includes a pluggable event loop, transport and protocol abstractions similar to those in Twisted, and a higher-level scheduler based on yield from (PEP 380). The proposed package name is asyncio.

Introduction

Status

A reference implementation exists under the code name Tulip. The Tulip repo is linked from the References section at the end. Packages based on this repo will be provided on PyPI (see References) to enable using the asyncio package with Python 3.3 installations.

As of October 20th 2013, the asyncio package has been checked into the Python 3.4 repository and released with Python 3.4-alpha-4, with "provisional" API status. This is an expression of confidence and intended to increase early feedback on the API, and not intended to force acceptance of the PEP. The expectation is that the package will keep provisional status in Python 3.4 and progress to final status in Python 3.5. Development continues to occur primarily in the Tulip repo, with changes occasionally merged into the CPython repo.

Dependencies

Python 3.3 is required for many of the proposed features. The reference implementation (Tulip) requires no new language or standard library features beyond Python 3.3, no third-party modules or packages, and no C code, except for the (optional) IOCP support on Windows.

Module Namespace

The specification here lives in a new top-level package, asyncio. Different components live in separate submodules of the package. The package will import common APIs from their respective submodules and make them available as package attributes (similar to the way the email package works). For such common APIs, the name of the submodule that actually defines them is not part of the specification. Less common APIs may have to explicitly be imported from their respective submodule, and in this case the submodule name is part of the specification.

Classes and functions defined without a submodule name are assumed to live in the namespace of the top-level package. (But do not confuse these with methods of various classes, which for brevity are also used without a namespace prefix in certain contexts.)

Interoperability

The event loop is the place where most interoperability occurs. It should be easy for (Python 3.3 ports of) frameworks like Twisted, Tornado, or even gevent to either adapt the default event loop implementation to their needs using a lightweight adapter or proxy, or to replace the default event loop implementation with an adaptation of their own event loop implementation. (Some frameworks, like Twisted, have multiple event loop implementations. This should not be a problem since these all have the same interface.)

In most cases it should be possible for two different third-party frameworks to interoperate, either by sharing the default event loop implementation (each using its own adapter), or by sharing the event loop implementation of either framework. In the latter case two levels of adaptation would occur (from framework A's event loop to the standard event loop interface, and from there to framework B's event loop). Which event loop implementation is used should be under control of the main program (though a default policy for event loop selection is provided).

For this interoperability to be effective, the preferred direction of adaptation in third party frameworks is to keep the default event loop and adapt it to the framework's API. Ideally all third party frameworks would give up their own event loop implementation in favor of the standard implementation. But not all frameworks may be satisfied with the functionality provided by the standard implementation.

In order to support both directions of adaptation, two separate APIs are specified:

  • An interface for managing the current event loop
  • The interface of a conforming event loop

An event loop implementation may provide additional methods and guarantees, as long as these are called out in the documentation as non-standard. An event loop implementation may also leave certain methods unimplemented if they cannot be implemented in the given environment; however, such deviations from the standard API should be considered only as a last resort, and only if the platform or environment forces the issue. (An example would be a platform where there is a system event loop that cannot be started or stopped; see "Embedded Event Loops" below.)

The event loop API does not depend on yield from. Rather, it uses a combination of callbacks, additional interfaces (transports and protocols), and Futures. The latter are similar to those defined in PEP 3148, but have a different implementation and are not tied to threads. In particular, the result() method raises an exception instead of blocking when a result is not yet ready; the user is expected to use callbacks (or yield from) to wait for the result.

All event loop methods specified as returning a coroutine are allowed to return either a Future or a coroutine, at the implementation's choice (the standard implementation always returns coroutines). All event loop methods documented as accepting coroutine arguments must accept both Futures and coroutines for such arguments. (A convenience function, async(), exists to convert an argument that is either a coroutine or a Future into a Future.)
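In the asyncio package as it eventually shipped, this convenience function is available as asyncio.ensure_future() (the async() name was later deprecated once async became a keyword):

```python
import asyncio


async def compute():
    return 42


async def main():
    fut = asyncio.ensure_future(compute())   # coroutine -> Future (a Task)
    same = asyncio.ensure_future(fut)        # a Future passes through as-is
    assert same is fut
    return await fut


result = asyncio.run(main())
```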

For users (like myself) who don't like using callbacks, a scheduler is provided for writing asynchronous I/O code as coroutines using the PEP 380 yield from expressions. The scheduler is not pluggable; pluggability occurs at the event loop level, and the standard scheduler implementation should work with any conforming event loop implementation. (In fact this is an important litmus test for conforming implementations.)

For interoperability between code written using coroutines and other async frameworks, the scheduler defines a Task class that behaves like a Future. A framework that interoperates at the event loop level can wait for a Future to complete by adding a callback to the Future. Likewise, the scheduler offers an operation to suspend a coroutine until a callback is called.

The event loop API provides limited interoperability with threads: there is an API to submit a function to an executor (see PEP 3148) which returns a Future that is compatible with the event loop, and there is a method to schedule a callback with an event loop from another thread in a thread-safe manner.

Transports and Protocols

For those not familiar with Twisted, a quick explanation of the relationship between transports and protocols is in order. At the highest level, the transport is concerned with how bytes are transmitted, while the protocol determines which bytes to transmit (and to some extent when).

A different way of saying the same thing: a transport is an abstraction for a socket (or similar I/O endpoint) while a protocol is an abstraction for an application, from the transport's point of view.

Yet another view is simply that the transport and protocol interfaces together define an abstract interface for using network I/O and interprocess I/O.

There is almost always a 1:1 relationship between transport and protocol objects: the protocol calls transport methods to send data, while the transport calls protocol methods to pass it data that has been received. Neither transport nor protocol methods "block" -- they set events into motion and then return.

The most common type of transport is a bidirectional stream transport. It represents a pair of buffered streams (one in each direction) that each transmit a sequence of bytes. The most common example of a bidirectional stream transport is probably a TCP connection. Another common example is an SSL/TLS connection. But there are some other things that can be viewed this way, for example an SSH session or a pair of UNIX pipes. Typically there aren't many different transport implementations, and most of them come with the event loop implementation. However, there is no requirement that all transports must be created by calling an event loop method: a third party module may well implement a new transport and provide a constructor or factory function for it that simply takes an event loop as an argument or calls get_event_loop().

Note that transports don't need to use sockets, not even if they use TCP -- sockets are a platform-specific implementation detail.

A bidirectional stream transport has two "ends": one end talks to the network (or another process, or whatever low-level interface it wraps), and the other end talks to the protocol. The former uses whatever API is necessary to implement the transport; but the interface between transport and protocol is standardized by this PEP.

A protocol can represent some kind of "application-level" protocol such as HTTP or SMTP; it can also implement an abstraction shared by multiple protocols, or a whole application. A protocol's primary interface is with the transport. While some popular protocols (and other abstractions) may have standard implementations, often applications implement custom protocols. It also makes sense to have libraries of useful third party protocol implementations that can be downloaded and installed from PyPI.
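The transport/protocol calling convention described above can be sketched with the asyncio reference implementation. The `MemoryTransport` helper here is a hypothetical stand-in used only to show the calling convention without an actual network connection:

```python
import asyncio

class EchoProtocol(asyncio.Protocol):
    """The protocol decides *which* bytes to send; the transport
    decides *how* they are sent."""

    def connection_made(self, transport):
        # The event loop calls this once the transport is ready.
        self.transport = transport

    def data_received(self, data):
        # The transport hands us received bytes; we answer by
        # calling a transport method -- neither call blocks.
        self.transport.write(data)

    def connection_lost(self, exc):
        self.transport = None

# Hypothetical in-memory transport, to exercise the protocol directly.
class MemoryTransport:
    def __init__(self):
        self.sent = b""
    def write(self, data):
        self.sent += data

proto = EchoProtocol()
transport = MemoryTransport()
proto.connection_made(transport)
proto.data_received(b"ping")
assert transport.sent == b"ping"
```

In real use the transport is created by an event loop method such as create_connection(), which also calls connection_made() for you.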

This general notion of transport and protocol includes other interfaces, where the transport wraps some other communication abstraction. Examples include interfaces for sending and receiving datagrams (e.g. UDP), or a subprocess manager. The separation of concerns is the same as for bidirectional stream transports and protocols, but the specific interface between transport and protocol is different in each case.

Details of the interfaces defined by the various standard types of transports and protocols are given later.

Event Loop Interface Specification

Event Loop Policy: Getting and Setting the Current Event Loop

Event loop management is controlled by an event loop policy, which is a global (per-process) object. There is a default policy, and an API to change the policy. A policy defines the notion of context; a policy manages a separate event loop per context. The default policy's notion of context is defined as the current thread.

Certain platforms or programming frameworks may change the default policy to something more suitable to the expectations of the users of that platform or framework. Such platforms or frameworks must document their policy and at what point during their initialization sequence the policy is set, in order to avoid undefined behavior when multiple active frameworks want to override the default policy. (See also "Embedded Event Loops" below.)

To get the event loop for the current context, use get_event_loop(). This returns an event loop object implementing the interface specified below, or raises an exception in case no event loop has been set for the current context and the current policy does not specify to create one. It should never return None.

To set the event loop for the current context, use set_event_loop(event_loop), where event_loop is an event loop object, i.e. an instance of AbstractEventLoop, or None. It is okay to set the current event loop to None, in which case subsequent calls to get_event_loop() will raise an exception. This is useful for testing code that should not depend on the existence of a default event loop.

It is expected that get_event_loop() returns a different event loop object depending on the context (in fact, this is the definition of context). It may create a new event loop object if none is set and creation is allowed by the policy. The default policy will create a new event loop only in the main thread (as defined by threading.py, which uses a special subclass for the main thread), and only if get_event_loop() is called before set_event_loop() is ever called. (To reset this state, reset the policy.) In other threads an event loop must be explicitly set. Other policies may behave differently. Event loop creation by the default policy is lazy; i.e. the first call to get_event_loop() creates an event loop instance if one is necessary and the current policy allows its creation.

For the benefit of unit tests and other special cases there's a third policy function: new_event_loop(), which creates and returns a new event loop object according to the policy's default rules. To make this the current event loop, you must call set_event_loop() with it.
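A sketch of the three policy functions working together, using the asyncio reference implementation:

```python
import asyncio

policy = asyncio.get_event_loop_policy()

# new_event_loop() creates a loop but does NOT install it as current.
loop = policy.new_event_loop()
assert not loop.is_running()

# set_event_loop() makes it the loop returned by get_event_loop()
# in this thread (the default policy's notion of "context").
policy.set_event_loop(loop)
assert policy.get_event_loop() is loop

loop.close()
```

Closing the loop when done releases its resources; closing does not un-set it as the current loop.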

To change the event loop policy, call set_event_loop_policy(policy), where policy is an event loop policy object or None. If not None, the policy object must be an instance of AbstractEventLoopPolicy that defines methods get_event_loop(), set_event_loop(loop) and new_event_loop(), all behaving like the functions described above.

Passing a policy value of None restores the default event loop policy (overriding the alternate default set by the platform or framework). The default event loop policy is an instance of the class DefaultEventLoopPolicy. The current event loop policy object can be retrieved by calling get_event_loop_policy().
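A custom policy can be installed by subclassing the default policy and overriding its methods. `LoggingPolicy` below is a hypothetical example that merely counts loop creations; passing None afterwards restores the default policy as described above:

```python
import asyncio

class LoggingPolicy(asyncio.DefaultEventLoopPolicy):
    """Hypothetical policy that counts how many loops it creates."""
    def __init__(self):
        super().__init__()
        self.created = 0

    def new_event_loop(self):
        self.created += 1
        return super().new_event_loop()

asyncio.set_event_loop_policy(LoggingPolicy())
policy = asyncio.get_event_loop_policy()
loop = policy.new_event_loop()
assert policy.created == 1
loop.close()

# Passing None restores the default event loop policy.
asyncio.set_event_loop_policy(None)
assert isinstance(asyncio.get_event_loop_policy(),
                  asyncio.DefaultEventLoopPolicy)
```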

Passing an Event Loop Around Explicitly

It is possible to write code that uses an event loop without relying on a global or per-thread default event loop. For this purpose, all APIs that need access to the current event loop (and aren't methods on an event loop class) take an optional keyword argument named loop. If this argument is None or unspecified, such APIs will call get_event_loop() to get the default event loop, but if the loop keyword argument is set to an event loop object, they will use that event loop, and pass it along to any other such APIs they call. For example, Future(loop=my_loop) will create a Future tied to the event loop my_loop. When the current default event loop is None, the loop keyword argument is effectively mandatory.

Note that an explicitly passed event loop must still belong to the current thread; the loop keyword argument does not magically change the constraints on how an event loop can be used.
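For example, the Future(loop=my_loop) pattern mentioned above can be sketched as follows (a sketch against the asyncio reference implementation; no default event loop is ever consulted):

```python
import asyncio

my_loop = asyncio.new_event_loop()

# The loop= keyword ties the Future to my_loop rather than to the
# thread's default event loop (which need not even exist).
fut = asyncio.Future(loop=my_loop)
assert fut.get_loop() is my_loop

my_loop.call_soon(fut.set_result, 42)
assert my_loop.run_until_complete(fut) == 42
my_loop.close()
```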

Specifying Times

As usual in Python, all timeouts, intervals and delays are measured in seconds, and may be ints or floats. However, absolute times are not specified as POSIX timestamps. The accuracy, precision and epoch of the clock are up to the implementation.

The default implementation uses time.monotonic(). Books could be written about the implications of this choice. Better read the docs for the standard library time module.

Embedded Event Loops

On some platforms an event loop is provided by the system. Such a loop may already be running when the user code starts, and there may be no way to stop or close it without exiting from the program. In this case, the methods for starting, stopping and closing the event loop may not be implementable, and is_running() may always return True.

Event Loop Classes

There is no actual class named EventLoop. There is an AbstractEventLoop class which defines all the methods without implementations, and serves primarily as documentation. The following concrete classes are defined:

  • SelectorEventLoop is a concrete implementation of the full API based on the selectors module (new in Python 3.4). The constructor takes one optional argument, a selectors.Selector object. By default an instance of selectors.DefaultSelector is created and used.
  • ProactorEventLoop is a concrete implementation of the API except for the I/O event handling and signal handling methods. It is only defined on Windows (or on other platforms which support a similar API for "overlapped I/O"). The constructor takes one optional argument, a Proactor object. By default an instance of IocpProactor is created and used. (The IocpProactor class is not specified by this PEP; it is just an implementation detail of the ProactorEventLoop class.)

Event Loop Methods Overview

The methods of a conforming event loop are grouped into several categories. The first set of categories must be supported by all conforming event loop implementations, with the exception that embedded event loops may not implement the methods for starting, stopping and closing. (However, a partially-conforming event loop is still better than nothing. :-)

  • Starting, stopping and closing: run_forever(), run_until_complete(), stop(), is_running(), close().
  • Basic and timed callbacks: call_soon(), call_later(), call_at(), time().
  • Thread interaction: call_soon_threadsafe(), run_in_executor(), set_default_executor().
  • Internet name lookups: getaddrinfo(), getnameinfo().
  • Internet connections: create_connection(), create_server(), create_datagram_endpoint().
  • Wrapped socket methods: sock_recv(), sock_sendall(), sock_connect(), sock_accept().

The second set of categories may be supported by conforming event loop implementations. If not supported, they will raise NotImplementedError. (In the default implementation, SelectorEventLoop on UNIX systems supports all of these; SelectorEventLoop on Windows supports the I/O event handling category; ProactorEventLoop on Windows supports the pipes and subprocess category.)

  • I/O callbacks: add_reader(), remove_reader(), add_writer(), remove_writer().
  • Pipes and subprocesses: connect_read_pipe(), connect_write_pipe(), subprocess_shell(), subprocess_exec().
  • Signal callbacks: add_signal_handler(), remove_signal_handler().

Event Loop Methods

Starting, Stopping and Closing

An (unclosed) event loop can be in one of two states: running or stopped. These methods deal with starting and stopping an event loop:

  • run_forever(). Runs the event loop until stop() is called. This cannot be called when the event loop is already running. (This has a long name in part to avoid confusion with earlier versions of this PEP, where run() had different behavior, in part because there are already too many APIs that have a method named run(), and in part because there shouldn't be many places where this is called anyway.)
  • run_until_complete(future). Runs the event loop until the Future is done. If the Future is done, its result is returned, or its exception is raised. This cannot be called when the event loop is already running.
  • stop(). Stops the event loop as soon as it is convenient. It is fine to restart the loop with run_forever() or run_until_complete() subsequently; no scheduled callbacks will be lost if this is done. Note: stop() returns normally and the current callback is allowed to continue. How soon after this point the event loop stops is up to the implementation, but the intention is to stop short of polling for I/O, and not to run any callbacks scheduled in the future; the major freedom an implementation has is how much of the "ready queue" (callbacks already scheduled with call_soon()) it processes before stopping.
  • is_running(). Returns True if the event loop is currently running, False if it is stopped.
  • close(). Closes the event loop, releasing any resources it may hold, such as the file descriptor used by epoll() or kqueue(), and the default executor. This should not be called while the event loop is running. After it has been called the event loop should not be used again. It may be called multiple times; subsequent calls are no-ops.
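The lifecycle methods above can be sketched together (a sketch against the asyncio reference implementation):

```python
import asyncio

loop = asyncio.new_event_loop()

# run_until_complete() drives the loop until the future is done
# and returns its result.
fut = asyncio.Future(loop=loop)
loop.call_soon(fut.set_result, "done")
assert loop.run_until_complete(fut) == "done"
assert not loop.is_running()

# stop() ends run_forever() without closing the loop; the loop
# may be restarted afterwards.
loop.call_soon(loop.stop)
loop.run_forever()  # returns once stop() takes effect
assert not loop.is_running()

loop.close()  # subsequent calls are no-ops
loop.close()
```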

Basic Callbacks

Callbacks associated with the same event loop are strictly serialized: one callback must finish before the next one will be called. This is an important guarantee: when two or more callbacks use or modify shared state, each callback is guaranteed that while it is running, the shared state isn't changed by another callback.

  • call_soon(callback, *args). This schedules a callback to be called as soon as possible. Returns a Handle (see below) representing the callback, whose cancel() method can be used to cancel the callback. It guarantees that callbacks are called in the order in which they were scheduled.
  • call_later(delay, callback, *args). Arrange for callback(*args) to be called approximately delay seconds in the future, once, unless cancelled. Returns a Handle representing the callback, whose cancel() method can be used to cancel the callback. Callbacks scheduled in the past or at exactly the same time will be called in an undefined order.
  • call_at(when, callback, *args). This is like call_later(), but the time is expressed as an absolute time. Returns a similar Handle. There is a simple equivalency: loop.call_later(delay, callback, *args) is the same as loop.call_at(loop.time() + delay, callback, *args).
  • time(). Returns the current time according to the event loop's clock. This may be time.time() or time.monotonic() or some other system-specific clock, but it must return a float expressing the time in units of approximately one second since some epoch. (No clock is perfect -- see PEP 418.)

Note: A previous version of this PEP defined a method named call_repeatedly(), which promised to call a callback at regular intervals. This has been withdrawn because the design of such a function is overspecified. On the one hand, a simple timer loop can easily be emulated using a callback that reschedules itself using call_later(); it is also easy to write a coroutine containing a loop and a sleep() call (a toplevel function in the module, see below). On the other hand, due to the complexities of accurate timekeeping there are many traps and pitfalls here for the unaware (see PEP 418), and different use cases require different behavior in edge cases. It is impossible to offer an API for this purpose that is bullet-proof in all cases, so it is deemed better to let application designers decide for themselves what kind of timer loop to implement.
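The self-rescheduling pattern just described can be sketched as follows (the 0.01-second interval and three-tick limit are arbitrary choices for illustration):

```python
import asyncio

loop = asyncio.new_event_loop()
ticks = []

def tick():
    # Emulate the withdrawn call_repeatedly(): record a tick, then
    # reschedule ourselves with call_later() until three have run.
    ticks.append(loop.time())
    if len(ticks) < 3:
        loop.call_later(0.01, tick)
    else:
        loop.stop()

loop.call_soon(tick)
loop.run_forever()
loop.close()
assert len(ticks) == 3
```

Note that this naive version accumulates drift; applications needing precise periods should schedule with call_at() against a computed deadline instead.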

Thread interaction

  • call_soon_threadsafe(callback, *args). Like call_soon(callback, *args), but when called from another thread while the event loop is blocked waiting for I/O, unblocks the event loop. Returns a Handle. This is the only method that is safe to call from another thread. (To schedule a callback for a later time in a threadsafe manner, you can use loop.call_soon_threadsafe(loop.call_later, when, callback, *args).) Note: this is not safe to call from a signal handler (since it may use locks). In fact, no API is signal-safe; if you want to handle signals, use add_signal_handler() described below.
  • run_in_executor(executor, callback, *args). Arrange to call callback(*args) in an executor (see PEP 3148). Returns an asyncio.Future instance whose result on success is the return value of that call. This is equivalent to wrap_future(executor.submit(callback, *args)). If executor is None, the default executor set by set_default_executor() is used. If no default executor has been set yet, a ThreadPoolExecutor with a default number of threads is created and set as the default executor. (The default implementation uses 5 threads in this case.)
  • set_default_executor(executor). Set the default executor used by run_in_executor(). The argument must be a PEP 3148 Executor instance or None, in order to reset the default executor.

See also the wrap_future() function described in the section about Futures.
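A sketch of run_in_executor() and set_default_executor() with the asyncio reference implementation; the two-worker pool size and the squaring function are arbitrary:

```python
import asyncio
import concurrent.futures

def blocking_work(x):
    # Stands in for something blocking that must not run on the loop.
    return x * x

loop = asyncio.new_event_loop()
executor = concurrent.futures.ThreadPoolExecutor(max_workers=2)
loop.set_default_executor(executor)

# Passing None as the executor uses the default set above; the
# returned Future is compatible with the event loop.
fut = loop.run_in_executor(None, blocking_work, 7)
assert loop.run_until_complete(fut) == 49

executor.shutdown()
loop.close()
```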

Internet name lookups

These methods are useful if you want to connect or bind a socket to an address without the risk of blocking for the name lookup. They are usually called implicitly by create_connection(), create_server() or create_datagram_endpoint().

  • getaddrinfo(host, port, family=0, type=0, proto=0, flags=0). Similar to the socket.getaddrinfo() function but returns a Future. The Future's result on success will be a list of the same format as returned by socket.getaddrinfo(), i.e. a list of (address_family, socket_type, socket_protocol, canonical_name, address) where address is a 2-tuple (ipv4_address, port) for IPv4 addresses and a 4-tuple (ipv4_address, port, flow_info, scope_id) for IPv6 addresses. If the family argument is zero or unspecified, the list returned may contain a mixture of IPv4 and IPv6 addresses; otherwise the addresses returned are constrained by the family value (similar for proto and flags). The default implementation calls socket.getaddrinfo() using run_in_executor(), but other implementations may choose to implement their own DNS lookup. The optional arguments must be specified as keyword arguments.

    Note: implementations are allowed to implement a subset of the full socket.getaddrinfo() interface; e.g. they may not support symbolic port names, or they may ignore or incompletely implement the type, proto and flags arguments. However, if type and proto are ignored, the argument values passed in should be copied unchanged into the return tuples' socket_type and socket_protocol elements. (You can't ignore family, since IPv4 and IPv6 addresses must be looked up differently. The only permissible values for family are socket.AF_UNSPEC (0), socket.AF_INET and socket.AF_INET6, and the latter only if it is defined by the platform.)

  • getnameinfo(sockaddr, flags=0). Similar to socket.getnameinfo() but returns a Future. The Future's result on success will be a tuple (host, port). Same implementation remarks as for getaddrinfo().
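A sketch of the getaddrinfo() coroutine, resolving "localhost" so no external DNS is needed (assuming the host resolves locally):

```python
import asyncio
import socket

loop = asyncio.new_event_loop()

# Resolve without blocking the loop; the result has the same shape
# as socket.getaddrinfo(): a list of 5-tuples.
infos = loop.run_until_complete(
    loop.getaddrinfo("localhost", 80,
                     family=socket.AF_INET, type=socket.SOCK_STREAM))

family, socktype, proto, canonname, address = infos[0]
assert family == socket.AF_INET   # constrained by the family argument
assert address[1] == 80           # the (ip, port) 2-tuple for IPv4
loop.close()
```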

Internet connections

These are the high-level interfaces for managing internet connections. Their use is recommended over the corresponding lower-level interfaces because they abstract away the differences between selector-based and proactor-based event loops.

Note that the client and server side of stream connections use the same transport and protocol interface. However, datagram endpoints use a different transport and protocol interface.

  • create_connection(protocol_factory, host, port, <options>). Creates a stream connection to a given internet host and port. This is a task that is typically called from the client side of the connection. It creates an implementation-dependent bidirectional stream Transport to represent the connection, then calls protocol_factory() to instantiate (or retrieve) the user's Protocol implementation, and finally ties the two together. (See below for the definitions of Transport and Protocol.) The user's Protocol implementation is created or retrieved by calling protocol_factory() without arguments(*). The coroutine's result on success is the (transport, protocol) pair; if a failure prevents the creation of a successful connection, an appropriate exception will be raised. Note that when the coroutine completes, the protocol's connection_made() method has not yet been called; that will happen when the connection handshake is complete.

    (*) There is no requirement that protocol_factory is a class. If your protocol class needs to have specific arguments passed to its constructor, you can use lambda. You can also pass a trivial lambda that returns a previously constructed Protocol instance.

    The <options> are all specified using optional keyword arguments:

    • ssl: Pass True to create an SSL/TLS transport (by default a plain TCP transport is created). Or pass an ssl.SSLContext object to override the default SSL context object to be used. If a default context is created it is up to the implementation to configure reasonable defaults. The reference implementation currently uses PROTOCOL_SSLv23 and sets the OP_NO_SSLv2 option, calls set_default_verify_paths() and sets verify_mode to CERT_REQUIRED. In addition, whenever the context (default or otherwise) specifies a verify_mode of CERT_REQUIRED or CERT_OPTIONAL, if a hostname is given, immediately after a successful handshake ssl.match_hostname(peercert, hostname) is called, and if this raises an exception the connection is closed. (To avoid this behavior, pass in an SSL context that has verify_mode set to CERT_NONE. But this means you are not secure, and vulnerable to, for example, man-in-the-middle attacks.)
    • family, proto, flags: Address family, protocol and flags to be passed through to getaddrinfo(). These all default to 0, which means "not specified". (The socket type is always SOCK_STREAM.) If any of these values are not specified, the getaddrinfo() method will choose appropriate values. Note: proto has nothing to do with the high-level Protocol concept or the protocol_factory argument.
    • sock: An optional socket to be used instead of using the host, port, family, proto and flags arguments. If this is given, host and port must be explicitly set to None.
    • local_addr: If given, a (host, port) tuple used to bind the socket locally. This is rarely needed, but on multi-homed servers you occasionally need to force a connection to come from a specific address. This is how you would do that. The host and port are looked up using getaddrinfo().
    • server_hostname: This is only relevant when using SSL/TLS; it should not be used when ssl is not set. When ssl is set, this sets or overrides the hostname that will be verified. By default the value of the host argument is used. If host is empty, there is no default and you must pass a value for server_hostname. To disable hostname verification (which is a serious security risk) you must pass an empty string here and pass an ssl.SSLContext object whose verify_mode is set to ssl.CERT_NONE as the ssl argument.
  • create_server(protocol_factory, host, port, <options>). Enters a serving loop that accepts connections. This is a coroutine that completes once the serving loop is set up to serve. The return value is a Server object which can be used to stop the serving loop in a controlled fashion (see below). Multiple sockets may be bound if the specified address allows both IPv4 and IPv6 connections.

    Each time a connection is accepted, protocol_factory is called without arguments(**) to create a Protocol, a bidirectional stream Transport is created to represent the network side of the connection, and the two are tied together by calling protocol.connection_made(transport).

    (**) See previous footnote for create_connection(). However, since protocol_factory() is called once for each new incoming connection, it should return a new Protocol object each time it is called.

    The <options> are all specified using optional keyword arguments:

    • ssl: Pass an ssl.SSLContext object (or an object with the same interface) to override the default SSL context object to be used. (Unlike for create_connection(), passing True does not make sense here -- the SSLContext object is needed to specify the certificate and key.)

    • backlog: Backlog value to be passed to the listen() call. The default is implementation-dependent; in the default implementation the default value is 100.

    • reuse_address: Whether to set the SO_REUSEADDR option on the socket. The default is True on UNIX, False on Windows.

    • family, flags: Address family and flags to be passed through to getaddrinfo(). The family defaults to AF_UNSPEC; the flags default to AI_PASSIVE. (The socket type is always SOCK_STREAM; the socket protocol is always set to 0, to let getaddrinfo() choose.)

    • sock: An optional socket to be used instead of using the host, port, family and flags arguments. If this is given, host and port must be explicitly set to None.

  • create_datagram_endpoint(protocol_factory, local_addr=None, remote_addr=None, <options>). Creates an endpoint for sending and receiving datagrams (typically UDP packets). Because of the nature of datagram traffic, there are no separate calls to set up client and server side, since usually a single endpoint acts as both client and server. This is a coroutine that returns a (transport, protocol) pair on success, or raises an exception on failure. If the coroutine returns successfully, the transport will call callbacks on the protocol whenever a datagram is received or the socket is closed; it is up to the protocol to call methods on the transport to send datagrams. The transport returned is a DatagramTransport. The protocol returned is a DatagramProtocol. These are described later.

    Mandatory positional argument:

    • protocol_factory: A class or factory function that will be called exactly once, without arguments, to construct the protocol object to be returned. The interface between datagram transport and protocol is described below.

    Optional arguments that may be specified positionally or as keyword arguments:

    • local_addr: An optional tuple indicating the address to which the socket will be bound. If given this must be a (host, port) pair. It will be passed to getaddrinfo() to be resolved and the result will be passed to the bind() method of the socket created. If getaddrinfo() returns more than one address, they will be tried in turn. If omitted, no bind() call will be made.
    • remote_addr: An optional tuple indicating the address to which the socket will be "connected". (Since there is no such thing as a datagram connection, this just specifies a default value for the destination address of outgoing datagrams.) If given this must be a (host, port) pair. It will be passed to getaddrinfo() to be resolved and the result will be passed to sock_connect() together with the socket created. If getaddrinfo() returns more than one address, they will be tried in turn. If omitted, no sock_connect() call will be made.

    The <options> are all specified using optional keyword arguments:

    • family, proto, flags: Address family, protocol and flags to be passed through to getaddrinfo(). These all default to 0, which means "not specified". (The socket type is always SOCK_DGRAM.) If any of these values are not specified, the getaddrinfo() method will choose appropriate values.

    Note that if both local_addr and remote_addr are present, all combinations of local and remote addresses with matching address family will be tried.
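The stream-side halves of this section, create_server() and create_connection(), can be sketched together against the asyncio reference implementation. Binding to port 0 asks the OS for an ephemeral port; the echo protocols are minimal illustrations:

```python
import asyncio

class EchoServer(asyncio.Protocol):
    def connection_made(self, transport):
        self.transport = transport
    def data_received(self, data):
        self.transport.write(data)  # echo back

class EchoClient(asyncio.Protocol):
    def __init__(self, done):
        self.done = done
    def connection_made(self, transport):
        transport.write(b"hello")
    def data_received(self, data):
        self.done.set_result(data)

loop = asyncio.new_event_loop()

# create_server() binds and starts listening, returning a Server.
server = loop.run_until_complete(
    loop.create_server(EchoServer, "127.0.0.1", 0))
port = server.sockets[0].getsockname()[1]

# create_connection() returns the (transport, protocol) pair.
done = asyncio.Future(loop=loop)
transport, protocol = loop.run_until_complete(
    loop.create_connection(lambda: EchoClient(done), "127.0.0.1", port))
assert loop.run_until_complete(done) == b"hello"

transport.close()
server.close()
loop.run_until_complete(server.wait_closed())
loop.close()
```

Note the lambda passed as protocol_factory for the client, as suggested in the footnote above, so the protocol constructor can receive the done future.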

Wrapped Socket Methods

The following methods for doing async I/O on sockets are not for general use. They are primarily meant for transport implementations working with IOCP through the ProactorEventLoop class. However, they are easily implementable for other event loop types, so there is no reason not to require them. The socket argument has to be a non-blocking socket.

  • sock_recv(sock, n). Receive up to n bytes from socket sock. Returns a Future whose result on success will be a bytes object.
  • sock_sendall(sock, data). Send bytes data to socket sock. Returns a Future whose result on success will be None. Note: the name uses sendall instead of send, to reflect that the semantics and signature of this method echo those of the standard library socket method sendall() rather than send().
  • sock_connect(sock, address). Connect to the given address. Returns a Future whose result on success will be None.
  • sock_accept(sock). Accept a connection from a socket. The socket must be in listening mode and bound to an address. Returns a Future whose result on success will be a tuple (conn, peer) where conn is a connected non-blocking socket and peer is the peer address.
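A sketch of sock_sendall() and sock_recv() on a connected pair of non-blocking sockets (socket.socketpair() stands in for a real network peer):

```python
import asyncio
import socket

loop = asyncio.new_event_loop()

# A connected, non-blocking socket pair stands in for a network peer.
a, b = socket.socketpair()
a.setblocking(False)
b.setblocking(False)

# Both methods return loop-compatible futures/coroutines.
loop.run_until_complete(loop.sock_sendall(a, b"abc"))
data = loop.run_until_complete(loop.sock_recv(b, 100))
assert data == b"abc"

a.close()
b.close()
loop.close()
```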

I/O Callbacks

These methods are primarily meant for transport implementations working with a selector. They are implemented by SelectorEventLoop but not by ProactorEventLoop. Custom event loop implementations may or may not implement them.

The fd arguments below may be integer file descriptors, or "file-like" objects with a fileno() method that wrap integer file descriptors. Not all file-like objects or file descriptors are acceptable. Sockets (and socket file descriptors) are always accepted. On Windows no other types are supported. On UNIX, pipes and possibly tty devices are also supported, but disk files are not. Exactly which special file types are supported may vary by platform and per selector implementation. (Experimentally, there is at least one kind of pseudo-tty on OS X that is supported by select and poll but not by kqueue: it is used by Emacs shell windows.)

  • add_reader(fd, callback, *args). Arrange for callback(*args) to be called whenever file descriptor fd is deemed ready for reading. Calling add_reader() again for the same file descriptor implies a call to remove_reader() for the same file descriptor.
  • add_writer(fd, callback, *args). Like add_reader(), but registers the callback for writing instead of for reading.
  • remove_reader(fd). Cancels the current read callback for file descriptor fd, if one is set. If no callback is currently set for the file descriptor, this is a no-op and returns False. Otherwise, it removes the callback arrangement and returns True.
  • remove_writer(fd). This is to add_writer() as remove_reader() is to add_reader().
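A sketch of add_reader() on a selector-based loop (it raises NotImplementedError on a proactor loop, per the overview above); again a socket pair stands in for the network:

```python
import asyncio
import socket

loop = asyncio.new_event_loop()
a, b = socket.socketpair()
a.setblocking(False)
b.setblocking(False)
received = []

def on_readable():
    # Called by the loop whenever `b` is deemed ready for reading.
    received.append(b.recv(100))
    loop.remove_reader(b.fileno())
    loop.stop()

loop.add_reader(b.fileno(), on_readable)
a.sendall(b"ready")
loop.run_forever()
assert received == [b"ready"]

a.close()
b.close()
loop.close()
```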

Pipes and Subprocesses

These methods are supported by SelectorEventLoop on UNIX and ProactorEventLoop on Windows.

The transports and protocols used with pipes and subprocesses differ from those used with regular stream connections. These are described later.

Each of the methods below has a protocol_factory argument, similar to create_connection(); this will be called exactly once, without arguments, to construct the protocol object to be returned.

Each method is a coroutine that returns a (transport, protocol) pair on success, or raises an exception on failure.

  • connect_read_pipe(protocol_factory, pipe): Create a unidirectional stream connection from a file-like object wrapping the read end of a UNIX pipe, which must be in non-blocking mode. The transport returned is a ReadTransport.
  • connect_write_pipe(protocol_factory, pipe): Create a unidirectional stream connection from a file-like object wrapping the write end of a UNIX pipe, which must be in non-blocking mode. The transport returned is a WriteTransport; it does not have any read-related methods. The protocol returned is a BaseProtocol.
  • subprocess_shell(protocol_factory, cmd, <options>): Create a subprocess from cmd, which is a string using the platform's "shell" syntax. This is similar to the standard library subprocess.Popen() class called with shell=True. The remaining arguments and return value are described below.
  • subprocess_exec(protocol_factory, *args, <options>): Create a subprocess from one or more string arguments, where the first string specifies the program to execute, and the remaining strings specify the program's arguments. (Thus, together the string arguments form the sys.argv value of the program, assuming it is a Python script.) This is similar to the standard library subprocess.Popen() class called with shell=False and the list of strings passed as the first argument; however, where Popen() takes a single argument which is a list of strings, subprocess_exec() takes multiple string arguments. The remaining arguments and return value are described below.

Apart from the way the program to execute is specified, the two subprocess_*() methods behave the same. The transport returned is a SubprocessTransport which has a different interface than the common bidirectional stream transport. The protocol returned is a SubprocessProtocol which also has a custom interface.

The <options> are all specified using optional keyword arguments:

  • stdin: Either a file-like object representing the pipe to be connected to the subprocess's standard input stream using connect_write_pipe(), or the constant subprocess.PIPE (the default). By default a new pipe will be created and connected.
  • stdout: Either a file-like object representing the pipe to be connected to the subprocess's standard output stream using connect_read_pipe(), or the constant subprocess.PIPE (the default). By default a new pipe will be created and connected.
  • stderr: Either a file-like object representing the pipe to be connected to the subprocess's standard error stream using connect_read_pipe(), or one of the constants subprocess.PIPE (the default) or subprocess.STDOUT. By default a new pipe will be created and connected. When subprocess.STDOUT is specified, the subprocess's standard error stream will be connected to the same pipe as the standard output stream.
  • bufsize: The buffer size to be used when creating a pipe; this is passed to subprocess.Popen(). In the default implementation this defaults to zero, and on Windows it must be zero; these defaults deviate from subprocess.Popen().
  • executable, preexec_fn, close_fds, cwd, env, startupinfo, creationflags, restore_signals, start_new_session, pass_fds: These optional arguments are passed to subprocess.Popen() without interpretation.
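A sketch of subprocess_exec() with a SubprocessProtocol that collects stdout. The `CollectOutput` protocol is a hypothetical illustration; it waits for both the stdout pipe to close and the process to exit before declaring the run finished:

```python
import asyncio
import sys

class CollectOutput(asyncio.SubprocessProtocol):
    """Hypothetical protocol collecting the child's stdout (fd 1)."""
    def __init__(self, done):
        self.done = done
        self.output = bytearray()
        self.pending = 2  # stdout pipe closed + process exited

    def pipe_data_received(self, fd, data):
        if fd == 1:  # fd 1 is the subprocess's stdout pipe
            self.output.extend(data)

    def pipe_connection_lost(self, fd, exc):
        if fd == 1:
            self._step()

    def process_exited(self):
        self._step()

    def _step(self):
        self.pending -= 1
        if self.pending == 0:
            self.done.set_result(True)

loop = asyncio.new_event_loop()
asyncio.set_event_loop(loop)
done = asyncio.Future(loop=loop)

# The argv strings are passed directly, unlike subprocess.Popen()
# which takes a single list of strings.
transport, protocol = loop.run_until_complete(
    loop.subprocess_exec(lambda: CollectOutput(done),
                         sys.executable, "-c", "print('hi')"))
loop.run_until_complete(done)
transport.close()
loop.close()
assert bytes(protocol.output).strip() == b"hi"
```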

Signal callbacks

These methods are only supported on UNIX.

  • add_signal_handler(sig, callback, *args). Whenever signal sig is received, arrange for callback(*args) to be called. Specifying another callback for the same signal replaces the previous handler (only one handler can be active per signal). The sig must be a valid signal number defined in the signal module. If the signal cannot be handled this raises an exception: ValueError if it is not a valid signal or if it is an uncatchable signal (e.g. SIGKILL), RuntimeError if this particular event loop instance cannot handle signals (since signals are global per process, only an event loop associated with the main thread can handle signals).
  • remove_signal_handler(sig). Removes the handler for signal sig, if one is set. Raises the same exceptions as add_signal_handler() (except that it may return False instead of raising RuntimeError for uncatchable signals). Returns True if a handler was removed successfully, False if no handler was set.

Note: If these methods are statically known to be unsupported, they may raise NotImplementedError instead of RuntimeError.
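As an illustration (a minimal sketch, not part of the specification; it assumes a UNIX platform and an event loop running in the main thread, and the callback name on_usr1 is invented for the example):

```python
import asyncio
import os
import signal

received = []

def on_usr1():
    received.append("SIGUSR1")
    loop.stop()

loop = asyncio.new_event_loop()
loop.add_signal_handler(signal.SIGUSR1, on_usr1)
# Deliver the signal to ourselves once the loop is running.
loop.call_soon(os.kill, os.getpid(), signal.SIGUSR1)
loop.run_forever()

assert loop.remove_signal_handler(signal.SIGUSR1) is True   # handler removed
assert loop.remove_signal_handler(signal.SIGUSR1) is False  # none was set
loop.close()
```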

Mutual Exclusion of Callbacks

An event loop should enforce mutual exclusion of callbacks, i.e. it should never start a callback while a previous callback is still running. This should apply across all types of callbacks, regardless of whether they are scheduled using call_soon(), call_later(), call_at(), call_soon_threadsafe(), add_reader(), add_writer(), or add_signal_handler().

Exceptions

There are two categories of exceptions in Python: those that derive from the Exception class and those that derive from BaseException. Exceptions deriving from Exception will generally be caught and handled appropriately; for example, they will be passed through by Futures, and they will be logged and ignored when they occur in a callback.

However, exceptions deriving only from BaseException are typically not caught, and will usually cause the program to terminate with a traceback. In some cases they are caught and re-raised. (Examples of this category include KeyboardInterrupt and SystemExit; it is usually unwise to treat these the same as most other exceptions.)

Handles

The various methods for registering one-off callbacks (call_soon(), call_later(), call_at() and call_soon_threadsafe()) all return an object representing the registration that can be used to cancel the callback. This object is called a Handle. Handles are opaque and have only one public method:

  • cancel(): Cancel the callback.

Note that add_reader(), add_writer() and add_signal_handler() do not return Handles.
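For example (a minimal sketch of cancelling a registration via its Handle):

```python
import asyncio

loop = asyncio.new_event_loop()
calls = []

# call_later() returns a Handle; cancel() revokes the registration,
# so the first callback never runs.
handle = loop.call_later(0.01, calls.append, "cancelled")
handle.cancel()

loop.call_later(0.01, calls.append, "kept")
loop.call_later(0.05, loop.stop)
loop.run_forever()
loop.close()

assert calls == ["kept"]
```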

Servers

The create_server() method returns a Server instance, which wraps the sockets (or other network objects) used to accept requests. This class has two public methods:

  • close(): Close the service. This stops accepting new requests but does not cancel requests that have already been accepted and are currently being handled.
  • wait_closed(): A coroutine that blocks until the service is closed and all accepted requests have been handled.
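A minimal sketch of the Server lifecycle (using asyncio.Protocol as a do-nothing protocol factory, and port 0 so the operating system picks a free port):

```python
import asyncio

loop = asyncio.new_event_loop()
# create_server() is a coroutine; drive it to completion to get the Server.
server = loop.run_until_complete(
    loop.create_server(asyncio.Protocol, "127.0.0.1", 0))
addr = server.sockets[0].getsockname()

server.close()                                 # stop accepting new requests
loop.run_until_complete(server.wait_closed())  # wait for full shutdown
loop.close()
```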

Futures

The asyncio.Future class here is intentionally similar to the concurrent.futures.Future class specified by PEP 3148, but there are slight differences. Whenever this PEP talks about Futures or futures this should be understood to refer to asyncio.Future unless concurrent.futures.Future is explicitly mentioned. The supported public API is as follows, indicating the differences with PEP 3148:

  • cancel(). If the Future is already done (or cancelled), do nothing and return False. Otherwise, this attempts to cancel the Future and returns True. If the cancellation attempt is successful, eventually the Future's state will change to cancelled (so that cancelled() will return True) and the callbacks will be scheduled. For regular Futures, cancellation will always succeed immediately; but for Tasks (see below) the task may ignore or delay the cancellation attempt.
  • cancelled(). Returns True if the Future was successfully cancelled.
  • done(). Returns True if the Future is done. Note that a cancelled Future is considered done too (here and everywhere).
  • result(). Returns the result set with set_result(), or raises the exception set with set_exception(). Raises CancelledError if cancelled. Difference with PEP 3148: This has no timeout argument and does not wait; if the future is not yet done, it raises an exception.
  • exception(). Returns the exception if set with set_exception(), or None if a result was set with set_result(). Raises CancelledError if cancelled. Difference with PEP 3148: This has no timeout argument and does not wait; if the future is not yet done, it raises an exception.
  • add_done_callback(fn). Add a callback to be run when the Future becomes done (or is cancelled). If the Future is already done (or cancelled), schedules the callback using call_soon(). Difference with PEP 3148: The callback is never called immediately, and always in the context of the caller -- typically this is a thread. You can think of this as calling the callback through call_soon(). Note that in order to match PEP 3148, the callback (unlike all other callbacks defined in this PEP, and ignoring the convention from the section "Callback Style" below) is always called with a single argument, the Future object. (The motivation for strictly serializing callbacks scheduled with call_soon() applies here too.)
  • remove_done_callback(fn). Remove the argument from the list of callbacks. This method is not defined by PEP 3148. The argument must be equal (using ==) to the argument passed to add_done_callback(). Returns the number of times the callback was removed.
  • set_result(result). The Future must not be done (nor cancelled) already. This makes the Future done and schedules the callbacks. Difference with PEP 3148: This is a public API.
  • set_exception(exception). The Future must not be done (nor cancelled) already. This makes the Future done and schedules the callbacks. Difference with PEP 3148: This is a public API.

The internal method set_running_or_notify_cancel() is not supported; there is no way to set the running state. Likewise, the method running() is not supported.

The following exceptions are defined:

  • InvalidStateError. Raised whenever the Future is not in a state acceptable to the method being called (e.g. calling set_result() on a Future that is already done, or calling result() on a Future that is not yet done).
  • InvalidTimeoutError. Raised by result() and exception() when a nonzero timeout argument is given.
  • CancelledError. An alias for concurrent.futures.CancelledError. Raised when result() or exception() is called on a Future that is cancelled.
  • TimeoutError. An alias for concurrent.futures.TimeoutError. May be raised by run_until_complete().
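A minimal sketch of the Future API (create_future() is the later-added convenience constructor for a Future tied to a given loop; the variable names are invented for the example):

```python
import asyncio

loop = asyncio.new_event_loop()
fut = loop.create_future()   # a Future associated with this event loop
events = []
fut.add_done_callback(lambda f: events.append(f.result()))

fut.set_result(42)
assert fut.done() and not fut.cancelled()
assert events == []          # the callback was only scheduled via call_soon()

try:
    fut.set_result(43)       # the Future is already done
except asyncio.InvalidStateError:
    events.append("invalid")

loop.run_until_complete(asyncio.sleep(0))  # let scheduled callbacks run
loop.close()
assert events == ["invalid", 42]
```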

A Future is associated with an event loop when it is created.

An asyncio.Future object is not acceptable to the wait() and as_completed() functions in the concurrent.futures package. However, there are similar APIs asyncio.wait() and asyncio.as_completed(), described below.

An asyncio.Future object is acceptable to a yield from expression when used in a coroutine. This is implemented through the __iter__() interface on the Future. See the section "Coroutines and the Scheduler" below.

When a Future is garbage-collected, if it has an associated exception but neither result() nor exception() has ever been called, the exception is logged. (When a coroutine uses yield from to wait for a Future, that Future's result() method is called once the coroutine is resumed.)

In the future (pun intended) we may unify asyncio.Future and concurrent.futures.Future, e.g. by adding an __iter__() method to the latter that works with yield from. To prevent accidentally blocking the event loop by calling e.g. result() on a Future that's not done yet, the blocking operation may detect that an event loop is active in the current thread and raise an exception instead. However the current PEP strives to have no dependencies beyond Python 3.3, so changes to concurrent.futures.Future are off the table for now.

There are some public functions related to Futures:

  • asyncio.async(arg). This takes an argument that is either a coroutine object or a Future (i.e., anything you can use with yield from) and returns a Future. If the argument is a Future, it is returned unchanged; if it is a coroutine object, it wraps it in a Task (remember that Task is a subclass of Future).
  • asyncio.wrap_future(future). This takes a PEP 3148 Future (i.e., an instance of concurrent.futures.Future) and returns a Future compatible with the event loop (i.e., an asyncio.Future instance).
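A minimal sketch of wrap_future() bridging a thread pool's PEP 3148 Future into the event loop (set_event_loop() is used here so wrap_future() can find the loop):

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

loop = asyncio.new_event_loop()
asyncio.set_event_loop(loop)

pool = ThreadPoolExecutor(max_workers=1)
cf = pool.submit(pow, 2, 10)        # a PEP 3148 Future from a thread pool
f = asyncio.wrap_future(cf)         # now usable with the event loop
result = loop.run_until_complete(f)

pool.shutdown()
loop.close()
assert result == 1024
```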

Transports

Transports and protocols are strongly influenced by Twisted and PEP 3153. Users rarely implement or instantiate transports -- rather, event loops offer utility methods to set up transports.

Transports work in conjunction with protocols. Protocols are typically written without knowing or caring about the exact type of transport used, and transports can be used with a wide variety of protocols. For example, an HTTP client protocol implementation may be used with either a plain socket transport or an SSL/TLS transport. The plain socket transport can be used with many different protocols besides HTTP (e.g. SMTP, IMAP, POP, FTP, IRC, SPDY).

The most common type of transport is a bidirectional stream transport. There are also unidirectional stream transports (used for pipes) and datagram transports (used by the create_datagram_endpoint() method).

Methods For All Transports

  • get_extra_info(name, default=None). This is a catch-all method that returns implementation-specific information about a transport. The first argument is the name of the extra field to be retrieved. The optional second argument is a default value to be returned. Consult the implementation documentation to find out the supported extra field names. For an unsupported name, the default is always returned.

Bidirectional Stream Transports

A bidirectional stream transport is an abstraction on top of a socket or something similar (for example, a pair of UNIX pipes or an SSL/TLS connection).

Most connections have an asymmetric nature: the client and server usually have very different roles and behaviors. Hence, the interface between transport and protocol is also asymmetric. From the protocol's point of view, writing data is done by calling the write() method on the transport object; this buffers the data and returns immediately. However, the transport takes a more active role in reading data: whenever some data is read from the socket (or other data source), the transport calls the protocol's data_received() method.

Nevertheless, the interface between transport and protocol used by bidirectional streams is the same for clients as it is for servers, since the connection between a client and a server is essentially a pair of streams, one in each direction.

Bidirectional stream transports have the following public methods:

  • write(data). Write some bytes. The argument must be a bytes object. Returns None. The transport is free to buffer the bytes, but it must eventually cause the bytes to be transferred to the entity at the other end, and it must maintain stream behavior. That is, t.write(b'abc'); t.write(b'def') is equivalent to t.write(b'abcdef'), as well as to:

    t.write(b'a')
    t.write(b'b')
    t.write(b'c')
    t.write(b'd')
    t.write(b'e')
    t.write(b'f')
    
  • writelines(iterable). Equivalent to:

    for data in iterable:
        self.write(data)
    
  • write_eof(). Close the writing end of the connection. Subsequent calls to write() are not allowed. Once all buffered data is transferred, the transport signals to the other end that no more data will be received. Some protocols don't support this operation; in that case, calling write_eof() will raise an exception. (Note: This used to be called half_close(), but unless you already know what it is for, that name doesn't indicate which end is closed.)

  • can_write_eof(). Return True if the transport supports write_eof(), False if it does not. (This method typically returns a fixed value that depends only on the specific Transport class, not on the state of the Transport object. It is needed because some protocols need to change their behavior when write_eof() is unavailable. For example, in HTTP, to send data whose size is not known ahead of time, the end of the data is typically indicated using write_eof(); however, SSL/TLS does not support this, and an HTTP protocol implementation would have to use the "chunked" transfer encoding in this case. But if the data size is known ahead of time, the best approach in both cases is to use the Content-Length header.)

  • get_write_buffer_size(). Return the current size of the transport's write buffer in bytes. This only knows about the write buffer managed explicitly by the transport; buffering in other layers of the network stack or elsewhere is not reported.

  • set_write_buffer_limits(high=None, low=None). Set the high- and low-water limits for flow control.

    These two values control when to call the protocol's pause_writing() and resume_writing() methods. If specified, the low-water limit must be less than or equal to the high-water limit. Neither value can be negative.

    The defaults are implementation-specific. If only the high-water limit is given, the low-water limit defaults to an implementation-specific value less than or equal to the high-water limit. Setting high to zero forces low to zero as well, and causes pause_writing() to be called whenever the buffer becomes non-empty. Setting low to zero causes resume_writing() to be called only once the buffer is empty. Use of zero for either limit is generally sub-optimal as it reduces opportunities for doing I/O and computation concurrently.

  • pause_reading(). Suspend delivery of data to the protocol until a subsequent resume_reading() call. Between pause_reading() and resume_reading(), the protocol's data_received() method will not be called.

  • resume_reading(). Restart delivery of data to the protocol via data_received(). Note that "paused" is a binary state -- pause_reading() should only be called when the transport is not paused, while resume_reading() should only be called when the transport is paused.

  • close(). Sever the connection with the entity at the other end. Any data buffered by write() will (eventually) be transferred before the connection is actually closed. The protocol's data_received() method will not be called again. Once all buffered data has been flushed, the protocol's connection_lost() method will be called with None as the argument. Note that this method does not wait for all that to happen.

  • abort(). Immediately sever the connection. Any data still buffered by the transport is thrown away. Soon, the protocol's connection_lost() method will be called with None as argument.

Unidirectional Stream Transports

A writing stream transport supports the write(), writelines(), write_eof(), can_write_eof(), close() and abort() methods described for bidirectional stream transports.

A reading stream transport supports the pause_reading(), resume_reading() and close() methods described for bidirectional stream transports.

A writing stream transport calls only connection_made() and connection_lost() on its associated protocol.

A reading stream transport can call all protocol methods specified in the Protocols section below (i.e., the previous two plus data_received() and eof_received()).

Datagram Transports

Datagram transports have these methods:

  • sendto(data, addr=None). Sends a datagram (a bytes object). The optional second argument is the destination address. If omitted, remote_addr must have been specified in the create_datagram_endpoint() call that created this transport. If present, and remote_addr was specified, they must match. The (data, addr) pair may be sent immediately or buffered. The return value is None.
  • abort(). Immediately close the transport. Buffered data will be discarded.
  • close(). Close the transport. Buffered data will be transmitted asynchronously.

Datagram transports call the following methods on the associated protocol object: connection_made(), connection_lost(), error_received() and datagram_received(). ("Connection" in these method names is a slight misnomer, but the concepts still exist: connection_made() means the transport representing the endpoint has been created, and connection_lost() means the transport is closed.)

Subprocess Transports

Subprocess transports have the following methods:

  • get_pid(). Return the process ID of the subprocess.
  • get_returncode(). Return the process return code, if the process has exited; otherwise None.
  • get_pipe_transport(fd). Return the pipe transport (a unidirectional stream transport) corresponding to the argument, which should be 0, 1 or 2 representing stdin, stdout or stderr (of the subprocess). If there is no such pipe transport, return None. For stdin, this is a writing transport; for stdout and stderr this is a reading transport. You must use this method to get a transport you can use to write to the subprocess's stdin.
  • send_signal(signal). Send a signal to the subprocess.
  • terminate(). Terminate the subprocess.
  • kill(). Kill the subprocess. On Windows this is an alias for terminate().
  • close(). This is an alias for terminate().

Note that send_signal(), terminate() and kill() wrap the corresponding methods in the standard library subprocess module.

Protocols

Protocols are always used in conjunction with transports. While a few common protocols are provided (e.g. decent though not necessarily excellent HTTP client and server implementations), most protocols will be implemented by user code or third-party libraries.

Like for transports, we distinguish between stream protocols, datagram protocols, and perhaps other custom protocols. The most common type of protocol is a bidirectional stream protocol. (There are no unidirectional protocols.)

Stream Protocols

A (bidirectional) stream protocol must implement the following methods, which will be called by the transport. Think of these as callbacks that are always called by the event loop in the right context. (See the "Context" section way above.)

  • connection_made(transport). Indicates that the transport is ready and connected to the entity at the other end. The protocol should probably save the transport reference as an instance variable (so it can call its write() and other methods later), and may write an initial greeting or request at this point.

  • data_received(data). The transport has read some bytes from the connection. The argument is always a non-empty bytes object. There are no guarantees about the minimum or maximum size of the data passed along this way. p.data_received(b'abcdef') should be treated as exactly equivalent to:

    p.data_received(b'abc')
    p.data_received(b'def')
    
  • eof_received(). This is called when the other end called write_eof() (or something equivalent). If this returns a false value (including None), the transport will close itself. If it returns a true value, closing the transport is up to the protocol. However, for SSL/TLS connections this is ignored, because the TLS standard requires that no more data is sent and the connection is closed as soon as a "closure alert" is received.

    The default implementation returns None.

  • pause_writing(). Asks that the protocol temporarily stop writing data to the transport. Heeding the request is optional, but the transport's buffer may grow without bounds if you keep writing. The buffer size at which this is called can be controlled through the transport's set_write_buffer_limits() method.

  • resume_writing(). Tells the protocol that it is safe to start writing data to the transport again. Note that this may be called directly by the transport's write() method (as opposed to being called indirectly using call_soon()), so that the protocol may be aware of its paused state immediately after write() returns.

  • connection_lost(exc). The transport has been closed or aborted, has detected that the other end has closed the connection cleanly, or has encountered an unexpected error. In the first three cases the argument is None; for an unexpected error, the argument is the exception that caused the transport to give up.

Here is a table indicating the order and multiplicity of the basic calls:

  1. connection_made() -- exactly once
  2. data_received() -- zero or more times
  3. eof_received() -- at most once
  4. connection_lost() -- exactly once

Calls to pause_writing() and resume_writing() occur in pairs and only between #1 and #4. These pairs will not be nested. The final resume_writing() call may be omitted; i.e. a paused connection may be lost and never be resumed.
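The calling pattern above can be sketched with a small echo exchange (a minimal example, not part of the specification; the class names EchoServer and Client are invented, and port 0 lets the operating system pick a free port):

```python
import asyncio

class EchoServer(asyncio.Protocol):
    def connection_made(self, transport):
        self.transport = transport     # keep the transport for later writes
    def data_received(self, data):
        self.transport.write(data)     # echo the bytes back
    # eof_received() is inherited; returning None closes the transport

class Client(asyncio.Protocol):
    def __init__(self, done):
        self.buf = bytearray()
        self.done = done
    def connection_made(self, transport):
        transport.write(b"ping")
        transport.write_eof()          # half-close: we will send nothing more
    def data_received(self, data):
        self.buf.extend(data)
    def connection_lost(self, exc):    # exc is None on a clean close
        self.done.set_result(bytes(self.buf))

loop = asyncio.new_event_loop()
server = loop.run_until_complete(
    loop.create_server(EchoServer, "127.0.0.1", 0))
port = server.sockets[0].getsockname()[1]
done = loop.create_future()
loop.run_until_complete(
    loop.create_connection(lambda: Client(done), "127.0.0.1", port))
echoed = loop.run_until_complete(done)
server.close()
loop.run_until_complete(server.wait_closed())
loop.close()
assert echoed == b"ping"
```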

Datagram Protocols

Datagram protocols have connection_made() and connection_lost() methods with the same signatures as stream protocols. (As explained in the section about datagram transports, we prefer the slightly odd nomenclature over defining different method names to indicate the opening and closing of the socket.)

In addition, they have the following methods:

  • datagram_received(data, addr). Indicates that a datagram data (a bytes object) was received from remote address addr (an IPv4 2-tuple or an IPv6 4-tuple).
  • error_received(exc). Indicates that a send or receive operation raised an OSError exception. Since datagram errors may be transient, it is up to the protocol to call the transport's close() method if it wants to close the endpoint.

Here is a chart indicating the order and multiplicity of calls:

  1. connection_made() -- exactly once
  2. datagram_received(), error_received() -- zero or more times
  3. connection_lost() -- exactly once
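A minimal UDP echo sketch showing these calls (the class names are invented for the example; create_datagram_endpoint() returns a (transport, protocol) pair):

```python
import asyncio

class UdpEcho(asyncio.DatagramProtocol):
    def connection_made(self, transport):    # the endpoint is ready
        self.transport = transport
    def datagram_received(self, data, addr):
        self.transport.sendto(data, addr)    # echo back to the sender

class UdpClient(asyncio.DatagramProtocol):
    def __init__(self, done):
        self.done = done
    def connection_made(self, transport):
        transport.sendto(b"hello")   # remote_addr was fixed at creation time
    def datagram_received(self, data, addr):
        self.done.set_result(data)

loop = asyncio.new_event_loop()
server_tr, _ = loop.run_until_complete(
    loop.create_datagram_endpoint(UdpEcho, local_addr=("127.0.0.1", 0)))
port = server_tr.get_extra_info("sockname")[1]
done = loop.create_future()
client_tr, _ = loop.run_until_complete(
    loop.create_datagram_endpoint(lambda: UdpClient(done),
                                  remote_addr=("127.0.0.1", port)))
reply = loop.run_until_complete(done)
client_tr.close()
server_tr.close()
loop.close()
assert reply == b"hello"
```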

Subprocess Protocol

Subprocess protocols have connection_made(), connection_lost(), pause_writing() and resume_writing() methods with the same signatures as stream protocols. In addition, they have the following methods:

  • pipe_data_received(fd, data). Called when the subprocess writes data to its stdout or stderr. fd is the file descriptor (1 for stdout, 2 for stderr). data is a bytes object. (TBD: No pipe_eof_received()?)
  • pipe_connection_lost(fd, exc). Called when the subprocess closes its stdin, stdout or stderr. fd is the file descriptor. exc is an exception or None.
  • process_exited(). Called when the subprocess has exited. To retrieve the exit status, use the transport's get_returncode() method.

Note that depending on the behavior of the subprocess it is possible that process_exited() is called either before or after pipe_connection_lost(). For example, if the subprocess creates a sub-subprocess that shares its stdin/stdout/stderr and then itself exits, process_exited() may be called while all the pipes are still open. On the other hand when the subprocess closes its stdin/stdout/stderr but does not exit, pipe_connection_lost() may be called for all three pipes without process_exited() being called. If (as is the more common case) the subprocess exits and thereby implicitly closes all pipes, the calling order is undefined.

Callback Style

Most interfaces taking a callback also take positional arguments. For instance, to arrange for foo("abc", 42) to be called soon, you call loop.call_soon(foo, "abc", 42). To schedule the call foo(), use loop.call_soon(foo). This convention greatly reduces the number of small lambdas required in typical callback programming.

This convention specifically does not support keyword arguments. Keyword arguments are used to pass optional extra information about the callback. This allows graceful evolution of the API without having to worry about whether a keyword might be significant to a callee somewhere. If you have a callback that must be called with a keyword argument, you can use a lambda. For example:

loop.call_soon(lambda: foo('abc', repeat=42))

Coroutines and the Scheduler

This is a separate top-level section because its status is different from the event loop interface. Usage of coroutines is optional, and it is perfectly fine to write code using callbacks only. On the other hand, there is only one implementation of the scheduler/coroutine API, and if you're using coroutines, that's the one you're using.

Coroutines

A coroutine is a generator that follows certain conventions. For documentation purposes, all coroutines should be decorated with @asyncio.coroutine, but this cannot be strictly enforced.

Coroutines use the yield from syntax introduced in PEP 380, instead of the original yield syntax.

The word "coroutine", like the word "generator", is used for two different (though related) concepts:

  • The function that defines a coroutine (a function definition decorated with asyncio.coroutine). If disambiguation is needed we will call this a coroutine function.
  • The object obtained by calling a coroutine function. This object represents a computation or an I/O operation (usually a combination) that will complete eventually. If disambiguation is needed we will call it a coroutine object.

Things a coroutine can do:

  • result = yield from future -- suspends the coroutine until the future is done, then returns the future's result, or raises an exception, which will be propagated. (If the future is cancelled, it will raise a CancelledError exception.) Note that tasks are futures, and everything said about futures also applies to tasks.
  • result = yield from coroutine -- wait for another coroutine to produce a result (or raise an exception, which will be propagated). The coroutine expression must be a call to another coroutine.
  • return expression -- produce a result to the coroutine that is waiting for this one using yield from.
  • raise exception -- raise an exception in the coroutine that is waiting for this one using yield from.

Calling a coroutine does not start its code running -- it is just a generator, and the coroutine object returned by the call is really a generator object, which doesn't do anything until you iterate over it. In the case of a coroutine object, there are two basic ways to start it running: call yield from coroutine from another coroutine (assuming the other coroutine is already running!), or convert it to a Task (see below).

Coroutines (and tasks) can only run when the event loop is running.
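The underlying mechanics can be shown with plain generators, since coroutines here are generators following PEP 380's yield from semantics; no event loop is involved in this sketch, and the function names are invented for the example:

```python
def add(a, b):
    yield              # suspend once, as a stand-in for waiting on a Future
    return a + b       # becomes the result of "yield from add(...)"

def main():
    total = yield from add(1, 2)   # wait for the inner coroutine's result
    return total * 10

g = main()     # calling a coroutine function does NOT run its body...
next(g)        # ...iterating does; this runs up to the bare yield in add()
try:
    g.send(None)               # resume; main() finishes
except StopIteration as e:
    value = e.value            # the return value travels via StopIteration
assert value == 30
```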

Waiting for Multiple Coroutines

To wait for multiple coroutines or Futures, two APIs similar to the wait() and as_completed() APIs in the concurrent.futures package are provided:

  • asyncio.wait(fs, timeout=None, return_when=ALL_COMPLETED). This is a coroutine that waits for the Futures or coroutines given by fs to complete. Coroutine arguments will be wrapped in Tasks (see below). This returns a Future whose result on success is a tuple of two sets of Futures, (done, pending), where done is the set of original Futures (or wrapped coroutines) that are done (or cancelled), and pending is the rest, i.e. those that are still not done (nor cancelled). Note that with the defaults for timeout and return_when, pending will always be an empty set. Optional arguments timeout and return_when have the same meaning and defaults as for concurrent.futures.wait(): timeout, if not None, specifies a timeout for the overall operation; return_when specifies when to stop. The constants FIRST_COMPLETED, FIRST_EXCEPTION, ALL_COMPLETED are defined with the same values and the same meanings as in PEP 3148:

    • ALL_COMPLETED (default): Wait until all Futures are done (or until the timeout occurs).
    • FIRST_COMPLETED: Wait until at least one Future is done (or until the timeout occurs).
    • FIRST_EXCEPTION: Wait until at least one Future is done (but not cancelled) with an exception set. (The exclusion of cancelled Futures from the condition is surprising, but PEP 3148 does it this way.)
  • asyncio.as_completed(fs, timeout=None). Returns an iterator whose values are Futures or coroutines; waiting for successive values waits until the next Future or coroutine from the set fs completes, and returns its result (or raises its exception). The optional argument timeout has the same meaning and default as it does for concurrent.futures.wait(): when the timeout occurs, the next Future returned by the iterator will raise TimeoutError when waited for. Example of use:

    for f in as_completed(fs):
        result = yield from f  # May raise an exception.
        # Use result.
    

    Note: if you do not wait for the values produced by the iterator, your for loop may not make progress (since you are not allowing other tasks to run).

  • asyncio.wait_for(f, timeout). This is a convenience to wait for a single coroutine or Future with a timeout. When a timeout occurs, it cancels the task and raises TimeoutError. To avoid the task cancellation, wrap it in shield().

  • asyncio.gather(f1, f2, ...). Returns a Future which waits until all arguments (Futures or coroutines) are done, and returns a list of their corresponding results. If one or more of the arguments is cancelled or raises an exception, the returned Future is cancelled or has its exception set (matching what happened to the first argument), and the remaining arguments are left running in the background. Cancelling the returned Future does not affect the arguments. Note that coroutine arguments are converted to Futures using asyncio.async().

  • asyncio.shield(f). Wait for a Future, shielding it from cancellation. This returns a Future whose result or exception is exactly the same as the argument; however, if the returned Future is cancelled, the argument Future is unaffected.

    A use case for this function would be a coroutine that caches a query result for a coroutine that handles a request in an HTTP server. When the request is cancelled by the client, we could (arguably) want the query-caching coroutine to continue to run, so that when the client reconnects, the query result is (hopefully) cached. This could be written e.g. as follows:

    @asyncio.coroutine
    def handle_request(self, request):
        ...
        cached_query = self.get_cache(...)
        if cached_query is None:
            cached_query = yield from asyncio.shield(self.fill_cache(...))
        ...
    
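A minimal sketch of wait() and gather() together (async/await is the modern spelling of the yield from coroutines described in this PEP, and ensure_future() is the later name of the async() function, since async became a keyword; the function names are invented for the example):

```python
import asyncio

async def work(name, delay):
    await asyncio.sleep(delay)
    return name

async def main():
    tasks = [asyncio.ensure_future(work("a", 0.01)),
             asyncio.ensure_future(work("b", 0.02))]
    done, pending = await asyncio.wait(tasks)   # default: ALL_COMPLETED
    assert pending == set()     # with the defaults nothing is left pending
    return await asyncio.gather(*tasks)         # results in argument order

loop = asyncio.new_event_loop()
results = loop.run_until_complete(main())
loop.close()
assert results == ["a", "b"]
```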

Sleeping

The coroutine asyncio.sleep(delay) returns after a given time delay.

Tasks

A Task is an object that manages an independently running coroutine. The Task interface is the same as the Future interface, and in fact Task is a subclass of Future. The task becomes done when its coroutine returns or raises an exception; if it returns a result, that becomes the task's result, and if it raises an exception, that becomes the task's exception.

Cancelling a task that's not done yet throws an asyncio.CancelledError exception into the coroutine. If the coroutine doesn't catch this (or if it re-raises it) the task will be marked as cancelled (i.e., cancelled() will return True); but if the coroutine somehow catches and ignores the exception it may continue to execute (and cancelled() will return False).

Tasks are also useful for interoperating between coroutines and callback-based frameworks like Twisted. After converting a coroutine into a Task, callbacks can be added to the Task.

To convert a coroutine into a task, call the coroutine function and pass the resulting coroutine object to the asyncio.Task() constructor. You may also use asyncio.async() for this purpose.

You may ask, why not automatically convert all coroutines to Tasks? The @asyncio.coroutine decorator could do this. However, this would slow things down considerably in the case where one coroutine calls another (and so on), as switching to a "bare" coroutine has much less overhead than switching to a Task.
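A minimal sketch of task creation and cancellation (using the modern async/await spelling of a coroutine; the function name worker is invented for the example):

```python
import asyncio

async def worker():
    # If this coroutine caught and ignored CancelledError, cancelled()
    # would remain False and the task would keep running.
    await asyncio.sleep(10)     # stand-in for real work

loop = asyncio.new_event_loop()
task = loop.create_task(worker())   # schedule the coroutine as a Task
loop.call_later(0.01, task.cancel)  # throws CancelledError into it
try:
    loop.run_until_complete(task)
except asyncio.CancelledError:
    pass
loop.close()
assert task.cancelled() and task.done()
```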

The Scheduler

The scheduler has no public interface. You interact with it by using yield from future and yield from task. In fact, there is no single object representing the scheduler -- its behavior is implemented by the Task and Future classes using only the public interface of the event loop, so it will work with third-party event loop implementations, too.

Convenience Utilities

A few functions and classes are provided to simplify the writing of basic stream-based clients and servers, such as FTP or HTTP. These are:

  • asyncio.open_connection(host, port): A wrapper for EventLoop.create_connection() that does not require you to provide a Protocol factory or class. This is a coroutine that returns a (reader, writer) pair, where reader is an instance of StreamReader and writer is an instance of StreamWriter (both described below).

  • asyncio.start_server(client_connected_cb, host, port): A wrapper for EventLoop.create_server() that takes a simple callback function rather than a Protocol factory or class. This is a coroutine that returns a Server object just as create_server() does. Each time a client connection is accepted, client_connected_cb(reader, writer) is called, where reader is an instance of StreamReader and writer is an instance of StreamWriter (both described below). If the result returned by client_connected_cb() is a coroutine, it is automatically wrapped in a Task.

  • StreamReader: A class offering an interface not unlike that of a read-only binary stream, except that the various reading methods are coroutines. It is normally driven by a StreamReaderProtocol instance. Note that there should be only one reader. The interface for the reader is:

    • readline(): A coroutine that reads a string of bytes representing a line of text ending in '\n', or until the end of the stream, whichever comes first.
    • read(n): A coroutine that reads up to n bytes. If n is omitted or negative, it reads until the end of the stream.
    • readexactly(n): A coroutine that reads exactly n bytes, or until the end of the stream, whichever comes first.
    • exception(): Return the exception that has been set on the stream using set_exception(), or None if no exception is set.

    The interface for the driver is:

    • feed_data(data): Append data (a bytes object) to the internal buffer. This unblocks a blocked reading coroutine if it provides sufficient data to fulfill the reader's contract.
    • feed_eof(): Signal the end of the buffer. This unblocks a blocked reading coroutine. No more data should be fed to the reader after this call.
    • set_exception(exc): Set an exception on the stream. All subsequent reading methods will raise this exception. No more data should be fed to the reader after this call.
  • StreamWriter: A class offering an interface not unlike that of a write-only binary stream. It wraps a transport. The interface is an extended subset of the transport interface: the following methods behave the same as the corresponding transport methods: write(), writelines(), write_eof(), can_write_eof(), get_extra_info(), close(). Note that the writing methods are _not_ coroutines (this is the same as for transports, but different from the StreamReader class). The following method is in addition to the transport interface:

    • drain(): This should be called with yield from after writing significant data, for the purpose of flow control. The intended use is like this:

      writer.write(data)
      yield from writer.drain()
      

      Note that this is not technically a coroutine: it returns either a Future or an empty tuple (both can be passed to yield from). Use of this method is optional. However, when it is not used, the internal buffer of the transport underlying the StreamWriter may fill up with all data that was ever written to the writer. If an app does not have a strict limit on how much data it writes, it _should_ call yield from drain() occasionally to avoid filling up the transport buffer.

  • StreamReaderProtocol: A protocol implementation used as an adapter between the bidirectional stream transport/protocol interface and the StreamReader and StreamWriter classes. It acts as a driver for a specific StreamReader instance, calling its methods feed_data(), feed_eof(), and set_exception() in response to various protocol callbacks. It also controls the behavior of the drain() method of the StreamWriter instance.
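The split between the reading coroutines and the driver methods can be seen without any network involved. A minimal sketch, feeding the reader by hand as a StreamReaderProtocol normally would (shown with the modern async/await spelling):

```python
import asyncio

async def main():
    reader = asyncio.StreamReader()

    # Driver side: append bytes to the internal buffer, then signal EOF.
    reader.feed_data(b'first line\nrest')
    reader.feed_eof()

    # Reader side: the reading methods are coroutines.
    line = await reader.readline()   # up to and including b'\n'
    rest = await reader.read()       # until the end of the stream
    return line, rest
```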

Synchronization

Locks, events, conditions and semaphores modeled after those in the threading module are implemented and can be accessed by importing the asyncio.locks submodule. Queues modeled after those in the queue module are implemented and can be accessed by importing the asyncio.queues submodule.

In general these correspond closely to their threaded counterparts; however, blocking methods (e.g. acquire() on locks, put() and get() on queues) are coroutines, and timeout parameters are not provided (you can use asyncio.wait_for() to add a timeout to a blocking call, however).

The docstrings in the modules provide more complete documentation.

Locks

The following classes are provided by asyncio.locks. For all these except Event, the with statement may be used in combination with yield from to acquire the lock and ensure that the lock is released regardless of how the with block is left, as follows:

with (yield from my_lock):
    ...

  • Lock: a basic mutex, with methods acquire() (a coroutine), locked(), and release().
  • Event: an event variable, with methods wait() (a coroutine), set(), clear(), and is_set().
  • Condition: a condition variable, with methods acquire(), wait(), wait_for(predicate) (all three coroutines), locked(), release(), notify(), and notify_all().
  • Semaphore: a semaphore, with methods acquire() (a coroutine), locked(), and release(). The constructor argument is the initial value (default 1).
  • BoundedSemaphore: a bounded semaphore; this is similar to Semaphore but the initial value is also the maximum value.
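A minimal sketch of mutual exclusion with Lock, shown with the modern async with/await spelling rather than the with (yield from ...) form used in this PEP:

```python
import asyncio

async def worker(name, lock, log):
    # acquire() is a coroutine; `async with` acquires the lock and
    # guarantees release no matter how the block is left.
    async with lock:
        log.append(('enter', name))
        await asyncio.sleep(0)       # yield to the loop while holding the lock
        log.append(('exit', name))

async def main():
    lock = asyncio.Lock()
    log = []
    await asyncio.gather(worker('a', lock, log), worker('b', lock, log))
    return log
```

Because the lock serializes the critical section, each worker's 'enter' entry is immediately followed by its matching 'exit' entry even though both workers yield to the event loop while holding the lock.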

Queues

The following classes and exceptions are provided by asyncio.queues.

  • Queue: a standard queue, with methods get(), put() (both coroutines), get_nowait(), put_nowait(), empty(), full(), qsize(), and maxsize().
  • PriorityQueue: a subclass of Queue that retrieves entries in priority order (lowest first).
  • LifoQueue: a subclass of Queue that retrieves the most recently added entries first.
  • JoinableQueue: a subclass of Queue with task_done() and join() methods (the latter a coroutine).
  • Empty, Full: exceptions raised when get_nowait() or put_nowait() is called on a queue that is empty or full, respectively.
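A producer/consumer sketch with Queue, again in the modern await spelling; the sentinel-based shutdown is an illustrative convention, not part of the API:

```python
import asyncio

async def producer(q):
    for i in range(3):
        await q.put(i)          # put() is a coroutine
    await q.put(None)           # sentinel: tell the consumer to stop

async def consumer(q):
    items = []
    while True:
        item = await q.get()    # get() is a coroutine
        if item is None:
            break
        items.append(item)
    return items

async def main():
    q = asyncio.Queue(maxsize=2)    # put() blocks when the queue is full
    _, items = await asyncio.gather(producer(q), consumer(q))
    return items
```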

Miscellaneous

Logging

All logging performed by the asyncio package uses a single logging.Logger object, asyncio.logger. To customize logging you can use the standard Logger API on this object. (Do not replace the object though.)
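For example, to silence the package's informational messages (in released versions of Python this logger is registered under the name 'asyncio'):

```python
import logging

# Adjust the package's single logger via the standard Logger API;
# do not replace the logger object itself.
logging.getLogger('asyncio').setLevel(logging.WARNING)
```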

SIGCHLD handling on UNIX

Efficient implementation of the process_exited() method on subprocess protocols requires a SIGCHLD signal handler. However, signal handlers can only be set on the event loop associated with the main thread. In order to support spawning subprocesses from event loops running in other threads, a mechanism exists to allow sharing a SIGCHLD handler between multiple event loops. There are two additional functions, asyncio.get_child_watcher() and asyncio.set_child_watcher(), and corresponding methods on the event loop policy.

There are two child watcher implementation classes, FastChildWatcher and SafeChildWatcher. Both use SIGCHLD. The SafeChildWatcher class is used by default; it is inefficient when many subprocesses exist simultaneously. The FastChildWatcher class is efficient, but it may interfere with other code (either C code or Python code) that spawns subprocesses without using an asyncio event loop. If you are sure no such other code runs in your process, you can select the fast implementation by running the following in your main thread:

watcher = asyncio.FastChildWatcher()
asyncio.set_child_watcher(watcher)

Wish List

(There is agreement that these features are desirable, but no implementation was available when Python 3.4 beta 1 was released, and the feature freeze for the rest of the Python 3.4 release cycle prohibits adding them in this late stage. However, they will hopefully be added in Python 3.5, and perhaps earlier in the PyPI distribution.)

  • Support a "start TLS" operation to upgrade a TCP socket to SSL/TLS.

Former wish list items that have since been implemented (but aren't specified by the PEP):

  • UNIX domain sockets.
  • A per-loop error handling callback.

Open Issues

(Note that these have been resolved de facto in favor of the status quo by the acceptance of the PEP. However, the PEP's provisional status allows revising these decisions for Python 3.5.)

  • Why do create_connection() and create_datagram_endpoint() have a proto argument but not create_server()? And why are the family, flags, proto arguments for getaddrinfo() sometimes zero and sometimes named constants (whose value is also zero)?
  • Do we need another inquiry method to tell whether the loop is in the process of stopping?
  • A fuller public API for Handle? What's the use case?
  • A debugging API? E.g. something that logs a lot of stuff, or logs unusual conditions (like queues filling up faster than they drain) or even callbacks taking too much time...
  • Do we need introspection APIs? E.g. asking for the read callback given a file descriptor. Or when the next scheduled call is. Or the list of file descriptors registered with callbacks. Right now these all require using internals.
  • Do we need more socket I/O methods, e.g. sock_sendto() and sock_recvfrom(), and perhaps others like pipe_read()? I guess users can write their own (it's not rocket science).
  • We may need APIs to control various timeouts. E.g. we may want to limit the time spent in DNS resolution, connecting, ssl/tls handshake, idle connection, close/shutdown, even per session. Possibly it's sufficient to add timeout keyword arguments to some methods, and other timeouts can probably be implemented by clever use of call_later() and Task.cancel(). But it's possible that some operations need default timeouts, and we may want to change the default for a specific operation globally (i.e., per event loop).

References

Acknowledgments

Apart from PEP 3153, influences include PEP 380 and Greg Ewing's tutorial for yield from, Twisted, Tornado, ZeroMQ, pyftpdlib, and wattle (Steve Dower's counter-proposal). My previous work on asynchronous support in the NDB library for Google App Engine provided an important starting point.

I am grateful for the numerous discussions on python-ideas from September through December 2012, and many more on python-tulip since then; a Skype session with Steve Dower and Dino Viehland; email exchanges with and a visit by Ben Darnell; an audience with Niels Provos (original author of libevent); and in-person meetings (as well as frequent email exchanges) with several Twisted developers, including Glyph, Brian Warner, David Reid, and Duncan McGreggor.

Contributors to the implementation include Eli Bendersky, Gustavo Carneiro (Gambit Research), Saúl Ibarra Corretgé, Geert Jansen, A. Jesse Jiryu Davis, Nikolay Kim, Charles-François Natali, Richard Oudkerk, Antoine Pitrou, Giampaolo Rodolá, Andrew Svetlov, and many others who submitted bugs and/or fixes.

I thank Antoine Pitrou for his feedback in his role of official PEP BDFL.

pep-3333 Python Web Server Gateway Interface v1.0.1

PEP:3333
Title:Python Web Server Gateway Interface v1.0.1
Version:$Revision$
Last-Modified:$Date$
Author:P.J. Eby <pje at telecommunity.com>
Discussions-To:Python Web-SIG <web-sig at python.org>
Status:Final
Type:Informational
Content-Type:text/x-rst
Created:26-Sep-2010
Post-History:26-Sep-2010, 04-Oct-2010
Replaces:333

Preface for Readers of PEP 333

This is an updated version of PEP 333, modified slightly to improve usability under Python 3, and to incorporate several long-standing de-facto amendments to the WSGI protocol. (Its code samples have also been ported to Python 3.)

While for procedural reasons [6], this must be a distinct PEP, no changes were made that invalidate previously-compliant servers or applications under Python 2.x. If your 2.x application or server is compliant to PEP 333, it is also compliant with this PEP.

Under Python 3, however, your app or server must also follow the rules outlined in the sections below titled A Note On String Types and Unicode Issues.

For detailed, line-by-line diffs between this document and PEP 333, you may view its SVN revision history [7], from revision 84854 forward.

Abstract

This document specifies a proposed standard interface between web servers and Python web applications or frameworks, to promote web application portability across a variety of web servers.

Original Rationale and Goals (from PEP 333)

Python currently boasts a wide variety of web application frameworks, such as Zope, Quixote, Webware, SkunkWeb, PSO, and Twisted Web -- to name just a few [1]. This wide variety of choices can be a problem for new Python users, because generally speaking, their choice of web framework will limit their choice of usable web servers, and vice versa.

By contrast, although Java has just as many web application frameworks available, Java's "servlet" API makes it possible for applications written with any Java web application framework to run in any web server that supports the servlet API.

The availability and widespread use of such an API in web servers for Python -- whether those servers are written in Python (e.g. Medusa), embed Python (e.g. mod_python), or invoke Python via a gateway protocol (e.g. CGI, FastCGI, etc.) -- would separate choice of framework from choice of web server, freeing users to choose a pairing that suits them, while freeing framework and server developers to focus on their preferred area of specialization.

This PEP, therefore, proposes a simple and universal interface between web servers and web applications or frameworks: the Python Web Server Gateway Interface (WSGI).

But the mere existence of a WSGI spec does nothing to address the existing state of servers and frameworks for Python web applications. Server and framework authors and maintainers must actually implement WSGI for there to be any effect.

However, since no existing servers or frameworks support WSGI, there is little immediate reward for an author who implements WSGI support. Thus, WSGI must be easy to implement, so that an author's initial investment in the interface can be reasonably low.

Thus, simplicity of implementation on both the server and framework sides of the interface is absolutely critical to the utility of the WSGI interface, and is therefore the principal criterion for any design decisions.

Note, however, that simplicity of implementation for a framework author is not the same thing as ease of use for a web application author. WSGI presents an absolutely "no frills" interface to the framework author, because bells and whistles like response objects and cookie handling would just get in the way of existing frameworks' handling of these issues. Again, the goal of WSGI is to facilitate easy interconnection of existing servers and applications or frameworks, not to create a new web framework.

Note also that this goal precludes WSGI from requiring anything that is not already available in deployed versions of Python. Therefore, new standard library modules are not proposed or required by this specification, and nothing in WSGI requires a Python version greater than 2.2.2. (It would be a good idea, however, for future versions of Python to include support for this interface in web servers provided by the standard library.)

In addition to ease of implementation for existing and future frameworks and servers, it should also be easy to create request preprocessors, response postprocessors, and other WSGI-based "middleware" components that look like an application to their containing server, while acting as a server for their contained applications.

If middleware can be both simple and robust, and WSGI is widely available in servers and frameworks, it allows for the possibility of an entirely new kind of Python web application framework: one consisting of loosely-coupled WSGI middleware components. Indeed, existing framework authors may even choose to refactor their frameworks' existing services to be provided in this way, becoming more like libraries used with WSGI, and less like monolithic frameworks. This would then allow application developers to choose "best-of-breed" components for specific functionality, rather than having to commit to all the pros and cons of a single framework.

Of course, as of this writing, that day is doubtless quite far off. In the meantime, it is a sufficient short-term goal for WSGI to enable the use of any framework with any server.

Finally, it should be mentioned that the current version of WSGI does not prescribe any particular mechanism for "deploying" an application for use with a web server or server gateway. At the present time, this is necessarily implementation-defined by the server or gateway. After a sufficient number of servers and frameworks have implemented WSGI to provide field experience with varying deployment requirements, it may make sense to create another PEP, describing a deployment standard for WSGI servers and application frameworks.

Specification Overview

The WSGI interface has two sides: the "server" or "gateway" side, and the "application" or "framework" side. The server side invokes a callable object that is provided by the application side. The specifics of how that object is provided are up to the server or gateway. It is assumed that some servers or gateways will require an application's deployer to write a short script to create an instance of the server or gateway, and supply it with the application object. Other servers and gateways may use configuration files or other mechanisms to specify where an application object should be imported from, or otherwise obtained.

In addition to "pure" servers/gateways and applications/frameworks, it is also possible to create "middleware" components that implement both sides of this specification. Such components act as an application to their containing server, and as a server to a contained application, and can be used to provide extended APIs, content transformation, navigation, and other useful functions.

Throughout this specification, we will use the term "a callable" to mean "a function, method, class, or an instance with a __call__ method". It is up to the server, gateway, or application implementing the callable to choose the appropriate implementation technique for their needs. Conversely, a server, gateway, or application that is invoking a callable must not have any dependency on what kind of callable was provided to it. Callables are only to be called, not introspected upon.

A Note On String Types

In general, HTTP deals with bytes, which means that this specification is mostly about handling bytes.

However, the content of those bytes often has some kind of textual interpretation, and in Python, strings are the most convenient way to handle text.

But in many Python versions and implementations, strings are Unicode, rather than bytes. This requires a careful balance between a usable API and correct translations between bytes and text in the context of HTTP... especially to support porting code between Python implementations with different str types.

WSGI therefore defines two kinds of "string":

  • "Native" strings (which are always implemented using the type named str) that are used for request/response headers and metadata
  • "Bytestrings" (which are implemented using the bytes type in Python 3, and str elsewhere), that are used for the bodies of requests and responses (e.g. POST/PUT input data and HTML page outputs).

Do not be confused however: even if Python's str type is actually Unicode "under the hood", the content of native strings must still be translatable to bytes via the Latin-1 encoding! (See the section on Unicode Issues later in this document for more details.)

In short: where you see the word "string" in this document, it refers to a "native" string, i.e., an object of type str, whether it is internally implemented as bytes or unicode. Where you see references to "bytestring", this should be read as "an object of type bytes under Python 3, or type str under Python 2".

And so, even though HTTP is in some sense "really just bytes", there are many API conveniences to be had by using whatever Python's default str type is.
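A small illustration of the two kinds of "string" under Python 3 (the header value and body text here are arbitrary examples):

```python
# A "native" string (type str) as used for headers and metadata:
header_value = 'text/html; charset=utf-8'

# Its content must be translatable to bytes via the Latin-1 encoding:
raw_header = header_value.encode('iso-8859-1')

# A "bytestring" (type bytes under Python 3) as used for request
# and response bodies:
body = '<p>caf\u00e9</p>'.encode('utf-8')
```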

The Application/Framework Side

The application object is simply a callable object that accepts two arguments. The term "object" should not be misconstrued as requiring an actual object instance: a function, method, class, or instance with a __call__ method are all acceptable for use as an application object. Application objects must be able to be invoked more than once, as virtually all servers/gateways (other than CGI) will make such repeated requests.

(Note: although we refer to it as an "application" object, this should not be construed to mean that application developers will use WSGI as a web programming API! It is assumed that application developers will continue to use existing, high-level framework services to develop their applications. WSGI is a tool for framework and server developers, and is not intended to directly support application developers.)

Here are two example application objects; one is a function, and the other is a class:

HELLO_WORLD = b"Hello world!\n"

def simple_app(environ, start_response):
    """Simplest possible application object"""
    status = '200 OK'
    response_headers = [('Content-type', 'text/plain')]
    start_response(status, response_headers)
    return [HELLO_WORLD]

class AppClass:
    """Produce the same output, but using a class

    (Note: 'AppClass' is the "application" here, so calling it
    returns an instance of 'AppClass', which is then the iterable
    return value of the "application callable" as required by
    the spec.

    If we wanted to use *instances* of 'AppClass' as application
    objects instead, we would have to implement a '__call__'
    method, which would be invoked to execute the application,
    and we would need to create an instance for use by the
    server or gateway.
    """

    def __init__(self, environ, start_response):
        self.environ = environ
        self.start = start_response

    def __iter__(self):
        status = '200 OK'
        response_headers = [('Content-type', 'text/plain')]
        self.start(status, response_headers)
        yield HELLO_WORLD

The Server/Gateway Side

The server or gateway invokes the application callable once for each request it receives from an HTTP client, that is directed at the application. To illustrate, here is a simple CGI gateway, implemented as a function taking an application object. Note that this simple example has limited error handling, because by default an uncaught exception will be dumped to sys.stderr and logged by the web server.

import os, sys

enc, esc = sys.getfilesystemencoding(), 'surrogateescape'

def unicode_to_wsgi(u):
    # Convert an environment variable to a WSGI "bytes-as-unicode" string
    return u.encode(enc, esc).decode('iso-8859-1')

def wsgi_to_bytes(s):
    return s.encode('iso-8859-1')

def run_with_cgi(application):
    environ = {k: unicode_to_wsgi(v) for k,v in os.environ.items()}
    environ['wsgi.input']        = sys.stdin.buffer
    environ['wsgi.errors']       = sys.stderr
    environ['wsgi.version']      = (1, 0)
    environ['wsgi.multithread']  = False
    environ['wsgi.multiprocess'] = True
    environ['wsgi.run_once']     = True

    if environ.get('HTTPS', 'off') in ('on', '1'):
        environ['wsgi.url_scheme'] = 'https'
    else:
        environ['wsgi.url_scheme'] = 'http'

    headers_set = []
    headers_sent = []

    def write(data):
        out = sys.stdout.buffer

        if not headers_set:
             raise AssertionError("write() before start_response()")

        elif not headers_sent:
             # Before the first output, send the stored headers
             status, response_headers = headers_sent[:] = headers_set
             out.write(wsgi_to_bytes('Status: %s\r\n' % status))
             for header in response_headers:
                 out.write(wsgi_to_bytes('%s: %s\r\n' % header))
             out.write(wsgi_to_bytes('\r\n'))

        out.write(data)
        out.flush()

    def start_response(status, response_headers, exc_info=None):
        if exc_info:
            try:
                if headers_sent:
                    # Re-raise original exception if headers sent
                    raise exc_info[1].with_traceback(exc_info[2])
            finally:
                exc_info = None     # avoid dangling circular ref
        elif headers_set:
            raise AssertionError("Headers already set!")

        headers_set[:] = [status, response_headers]

        # Note: error checking on the headers should happen here,
        # *after* the headers are set.  That way, if an error
        # occurs, start_response can only be re-called with
        # exc_info set.

        return write

    result = application(environ, start_response)
    try:
        for data in result:
            if data:    # don't send headers until body appears
                write(data)
        if not headers_sent:
            write(b'')  # send headers now if body was empty
    finally:
        if hasattr(result, 'close'):
            result.close()

Middleware: Components that Play Both Sides

Note that a single object may play the role of a server with respect to some application(s), while also acting as an application with respect to some server(s). Such "middleware" components can perform such functions as:

  • Routing a request to different application objects based on the target URL, after rewriting the environ accordingly.
  • Allowing multiple applications or frameworks to run side-by-side in the same process
  • Load balancing and remote processing, by forwarding requests and responses over a network
  • Performing content postprocessing, such as applying XSL stylesheets

The presence of middleware in general is transparent to both the "server/gateway" and the "application/framework" sides of the interface, and should require no special support. A user who desires to incorporate middleware into an application simply provides the middleware component to the server, as if it were an application, and configures the middleware component to invoke the application, as if the middleware component were a server. Of course, the "application" that the middleware wraps may in fact be another middleware component wrapping another application, and so on, creating what is referred to as a "middleware stack".

For the most part, middleware must conform to the restrictions and requirements of both the server and application sides of WSGI. In some cases, however, requirements for middleware are more stringent than for a "pure" server or application, and these points will be noted in the specification.

Here is a (tongue-in-cheek) example of a middleware component that converts text/plain responses to pig latin, using Joe Strout's piglatin.py. (Note: a "real" middleware component would probably use a more robust way of checking the content type, and should also check for a content encoding. Also, this simple example ignores the possibility that a word might be split across a block boundary.)

from piglatin import piglatin

class LatinIter:

    """Transform iterated output to piglatin, if it's okay to do so

    Note that the "okayness" can change until the application yields
    its first non-empty bytestring, so 'transform_ok' has to be a mutable
    truth value.
    """

    def __init__(self, result, transform_ok):
        if hasattr(result, 'close'):
            self.close = result.close
        self._next = iter(result).__next__
        self.transform_ok = transform_ok

    def __iter__(self):
        return self

    def __next__(self):
        if self.transform_ok:
            return piglatin(self._next())   # call must be byte-safe on Py3
        else:
            return self._next()

class Latinator:

    # by default, don't transform output
    transform = False

    def __init__(self, application):
        self.application = application

    def __call__(self, environ, start_response):

        transform_ok = []

        def start_latin(status, response_headers, exc_info=None):

            # Reset ok flag, in case this is a repeat call
            del transform_ok[:]

            for name, value in response_headers:
                if name.lower() == 'content-type' and value == 'text/plain':
                    transform_ok.append(True)
                    # Strip content-length if present, else it'll be wrong
                    response_headers = [(name, value)
                        for name, value in response_headers
                            if name.lower() != 'content-length'
                    ]
                    break

            write = start_response(status, response_headers, exc_info)

            if transform_ok:
                def write_latin(data):
                    write(piglatin(data))   # call must be byte-safe on Py3
                return write_latin
            else:
                return write

        return LatinIter(self.application(environ, start_latin), transform_ok)


# Run foo_app under a Latinator's control, using the example CGI gateway
from foo_app import foo_app
run_with_cgi(Latinator(foo_app))

Specification Details

The application object must accept two positional arguments. For the sake of illustration, we have named them environ and start_response, but they are not required to have these names. A server or gateway must invoke the application object using positional (not keyword) arguments. (E.g. by calling result = application(environ, start_response) as shown above.)

The environ parameter is a dictionary object, containing CGI-style environment variables. This object must be a builtin Python dictionary (not a subclass, UserDict or other dictionary emulation), and the application is allowed to modify the dictionary in any way it desires. The dictionary must also include certain WSGI-required variables (described in a later section), and may also include server-specific extension variables, named according to a convention that will be described below.

The start_response parameter is a callable accepting two required positional arguments, and one optional argument. For the sake of illustration, we have named these arguments status, response_headers, and exc_info, but they are not required to have these names, and the application must invoke the start_response callable using positional arguments (e.g. start_response(status, response_headers)).

The status parameter is a status string of the form "999 Message here", and response_headers is a list of (header_name, header_value) tuples describing the HTTP response header. The optional exc_info parameter is described below in the sections on The start_response() Callable and Error Handling. It is used only when the application has trapped an error and is attempting to display an error message to the browser.

The start_response callable must return a write(body_data) callable that takes one positional parameter: a bytestring to be written as part of the HTTP response body. (Note: the write() callable is provided only to support certain existing frameworks' imperative output APIs; it should not be used by new applications or frameworks if it can be avoided. See the Buffering and Streaming section for more details.)

When called by the server, the application object must return an iterable yielding zero or more bytestrings. This can be accomplished in a variety of ways, such as by returning a list of bytestrings, or by the application being a generator function that yields bytestrings, or by the application being a class whose instances are iterable. Regardless of how it is accomplished, the application object must always return an iterable yielding zero or more bytestrings.
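The calling convention described above can be exercised with a stub server-side harness; demo_app and fake_start_response below are hypothetical names used only for illustration:

```python
def demo_app(environ, start_response):
    # The application: two positional arguments, invoked positionally.
    start_response('200 OK', [('Content-Type', 'text/plain')])
    return [b'Hi']           # an iterable yielding bytestrings

captured = {}

def fake_start_response(status, response_headers, exc_info=None):
    # A stub start_response; records what the application sent and
    # returns a write(body_data) callable, as the spec requires.
    captured['status'] = status
    captured['headers'] = response_headers
    def write(body_data):
        captured['body'] = captured.get('body', b'') + body_data
    return write

body = b''.join(demo_app({}, fake_start_response))
```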

The server or gateway must transmit the yielded bytestrings to the client in an unbuffered fashion, completing the transmission of each bytestring before requesting another one. (In other words, applications should perform their own buffering. See the Buffering and Streaming section below for more on how application output must be handled.)

The server or gateway should treat the yielded bytestrings as binary byte sequences: in particular, it should ensure that line endings are not altered. The application is responsible for ensuring that the bytestring(s) to be written are in a format suitable for the client. (The server or gateway may apply HTTP transfer encodings, or perform other transformations for the purpose of implementing HTTP features such as byte-range transmission. See Other HTTP Features, below, for more details.)

If a call to len(iterable) succeeds, the server must be able to rely on the result being accurate. That is, if the iterable returned by the application provides a working __len__() method, it must return an accurate result. (See the Handling the Content-Length Header section for information on how this would normally be used.)

If the iterable returned by the application has a close() method, the server or gateway must call that method upon completion of the current request, whether the request was completed normally, or terminated early due to an application error during iteration or an early disconnect of the browser. (The close() method requirement is to support resource release by the application. This protocol is intended to complement PEP 342's generator support, and other common iterables with close() methods.)
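
A server's iteration loop might honor this requirement along the following lines (a sketch; `send` stands in for whatever mechanism actually transmits a bytestring to the client):

```python
def run_application(app, environ, start_response, send):
    # `send` is a hypothetical callable that transmits one bytestring.
    result = app(environ, start_response)
    try:
        for data in result:
            if data:          # don't transmit empty bytestrings
                send(data)
    finally:
        # close() must run whether iteration finished normally, raised
        # an error, or was cut short by a client disconnect.
        if hasattr(result, "close"):
            result.close()
```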

Applications returning a generator or other custom iterator should not assume the entire iterator will be consumed, as it may be closed early by the server.

(Note: the application must invoke the start_response() callable before the iterable yields its first body bytestring, so that the server can send the headers before any body content. However, this invocation may be performed by the iterable's first iteration, so servers must not assume that start_response() has been called before they begin iterating over the iterable.)

Finally, servers and gateways must not directly use any other attributes of the iterable returned by the application, unless it is an instance of a type specific to that server or gateway, such as a "file wrapper" returned by wsgi.file_wrapper (see Optional Platform-Specific File Handling). In the general case, only attributes specified here, or accessed via e.g. the PEP 234 iteration APIs are acceptable.

environ Variables

The environ dictionary is required to contain these CGI environment variables, as defined by the Common Gateway Interface specification [2]. The following variables must be present, unless their value would be an empty string, in which case they may be omitted, except as otherwise noted below.

REQUEST_METHOD
The HTTP request method, such as "GET" or "POST". This cannot ever be an empty string, and so is always required.
SCRIPT_NAME
The initial portion of the request URL's "path" that corresponds to the application object, so that the application knows its virtual "location". This may be an empty string, if the application corresponds to the "root" of the server.
PATH_INFO
The remainder of the request URL's "path", designating the virtual "location" of the request's target within the application. This may be an empty string, if the request URL targets the application root and does not have a trailing slash.
QUERY_STRING
The portion of the request URL that follows the "?", if any. May be empty or absent.
CONTENT_TYPE
The contents of any Content-Type fields in the HTTP request. May be empty or absent.
CONTENT_LENGTH
The contents of any Content-Length fields in the HTTP request. May be empty or absent.
SERVER_NAME, SERVER_PORT
When combined with SCRIPT_NAME and PATH_INFO, these two strings can be used to complete the URL. Note, however, that HTTP_HOST, if present, should be used in preference to SERVER_NAME for reconstructing the request URL. See the URL Reconstruction section below for more detail. SERVER_NAME and SERVER_PORT can never be empty strings, and so are always required.
SERVER_PROTOCOL
The version of the protocol the client used to send the request. Typically this will be something like "HTTP/1.0" or "HTTP/1.1" and may be used by the application to determine how to treat any HTTP request headers. (This variable should probably be called REQUEST_PROTOCOL, since it denotes the protocol used in the request, and is not necessarily the protocol that will be used in the server's response. However, for compatibility with CGI we have to keep the existing name.)
HTTP_ Variables
Variables corresponding to the client-supplied HTTP request headers (i.e., variables whose names begin with "HTTP_"). The presence or absence of these variables should correspond with the presence or absence of the appropriate HTTP header in the request.
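
As a concrete illustration, a request for `http://example.com/myapp/widgets?id=3`, with the application mounted at `/myapp`, might produce CGI variables along these lines (the host and values are hypothetical):

```python
environ_sample = {
    "REQUEST_METHOD": "GET",
    "SCRIPT_NAME": "/myapp",        # where the application is mounted
    "PATH_INFO": "/widgets",        # remainder of the URL's path
    "QUERY_STRING": "id=3",         # text after the "?"
    "SERVER_NAME": "example.com",
    "SERVER_PORT": "80",
    "SERVER_PROTOCOL": "HTTP/1.1",
    "HTTP_HOST": "example.com",     # from the Host: request header
}
```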

A server or gateway should attempt to provide as many other CGI variables as are applicable. In addition, if SSL is in use, the server or gateway should also provide as many of the Apache SSL environment variables [5] as are applicable, such as HTTPS=on and SSL_PROTOCOL. Note, however, that an application that uses any CGI variables other than the ones listed above is necessarily non-portable to web servers that do not support the relevant extensions. (For example, web servers that do not publish files will not be able to provide a meaningful DOCUMENT_ROOT or PATH_TRANSLATED.)

A WSGI-compliant server or gateway should document what variables it provides, along with their definitions as appropriate. Applications should check for the presence of any variables they require, and have a fallback plan in the event such a variable is absent.

Note: missing variables (such as REMOTE_USER when no authentication has occurred) should be left out of the environ dictionary. Also note that CGI-defined variables must be native strings, if they are present at all. It is a violation of this specification for any CGI variable's value to be of any type other than str.

In addition to the CGI-defined variables, the environ dictionary may also contain arbitrary operating-system "environment variables", and must contain the following WSGI-defined variables:

wsgi.version
The tuple (1, 0), representing WSGI version 1.0.

wsgi.url_scheme
A string representing the "scheme" portion of the URL at which the application is being invoked. Normally, this will have the value "http" or "https", as appropriate.

wsgi.input
An input stream (file-like object) from which the HTTP request body bytes can be read. (The server or gateway may perform reads on-demand as requested by the application, or it may pre-read the client's request body and buffer it in-memory or on disk, or use any other technique for providing such an input stream, according to its preference.)
wsgi.errors

An output stream (file-like object) to which error output can be written, for the purpose of recording program or other errors in a standardized and possibly centralized location. This should be a "text mode" stream; i.e., applications should use "\n" as a line ending, and assume that it will be converted to the correct line ending by the server/gateway.

(On platforms where the str type is unicode, the error stream should accept and log arbitrary unicode without raising an error; it is allowed, however, to substitute characters that cannot be rendered in the stream's encoding.)

For many servers, wsgi.errors will be the server's main error log. Alternatively, this may be sys.stderr, or a log file of some sort. The server's documentation should include an explanation of how to configure this or where to find the recorded output. A server or gateway may supply different error streams to different applications, if this is desired.

wsgi.multithread
This value should evaluate true if the application object may be simultaneously invoked by another thread in the same process, and should evaluate false otherwise.

wsgi.multiprocess
This value should evaluate true if an equivalent application object may be simultaneously invoked by another process, and should evaluate false otherwise.

wsgi.run_once
This value should evaluate true if the server or gateway expects (but does not guarantee!) that the application will only be invoked this one time during the life of its containing process. Normally, this will only be true for a gateway based on CGI (or something similar).

Finally, the environ dictionary may also contain server-defined variables. These variables should be named using only lower-case letters, numbers, dots, and underscores, and should be prefixed with a name that is unique to the defining server or gateway. For example, mod_python might define variables with names like mod_python.some_variable.
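
Putting the pieces together, a gateway might populate environ roughly as follows (a minimal sketch; the host, port, and defaults are invented for the example):

```python
import io
import sys

def make_environ(method="GET", path="/", body=b""):
    # Illustrative gateway helper: CGI variables plus the required
    # wsgi.* keys. CONTENT_LENGTH is omitted when it would be empty.
    environ = {
        "REQUEST_METHOD": method,
        "SCRIPT_NAME": "",
        "PATH_INFO": path,
        "QUERY_STRING": "",
        "SERVER_NAME": "localhost",
        "SERVER_PORT": "80",
        "SERVER_PROTOCOL": "HTTP/1.1",
        "wsgi.version": (1, 0),
        "wsgi.url_scheme": "http",
        "wsgi.input": io.BytesIO(body),
        "wsgi.errors": sys.stderr,
        "wsgi.multithread": False,
        "wsgi.multiprocess": False,
        "wsgi.run_once": False,
    }
    if body:
        environ["CONTENT_LENGTH"] = str(len(body))
    return environ
```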

Input and Error Streams

The input and error streams provided by the server must support the following methods:

Method            Stream    Notes
read(size)        input     1
readline()        input     1, 2
readlines(hint)   input     1, 3
__iter__()        input
flush()           errors    4
write(str)        errors
writelines(seq)   errors

The semantics of each method are as documented in the Python Library Reference, except for these notes as listed in the table above:

  1. The server is not required to read past the client's specified Content-Length, and should simulate an end-of-file condition if the application attempts to read past that point. The application should not attempt to read more data than is specified by the CONTENT_LENGTH variable.

    A server should allow read() to be called without an argument, and return the remainder of the client's input stream.

    A server should return empty bytestrings from any attempt to read from an empty or exhausted input stream.

  2. Servers should support the optional "size" argument to readline(), but as in WSGI 1.0, they are allowed to omit support for it.

    (In WSGI 1.0, the size argument was not supported, on the grounds that it might have been complex to implement, and was not often used in practice... but then the cgi module started using it, and so practical servers had to start supporting it anyway!)

  3. Note that the hint argument to readlines() is optional for both caller and implementer. The application is free not to supply it, and the server or gateway is free to ignore it.

  4. Since the errors stream may not be rewound, servers and gateways are free to forward write operations immediately, without buffering. In this case, the flush() method may be a no-op. Portable applications, however, cannot assume that output is unbuffered or that flush() is a no-op. They must call flush() if they need to ensure that output has in fact been written. (For example, to minimize intermingling of data from multiple processes writing to the same error log.)

The methods listed in the table above must be supported by all servers conforming to this specification. Applications conforming to this specification must not use any other methods or attributes of the input or errors objects. In particular, applications must not attempt to close these streams, even if they possess close() methods.

The start_response() Callable

The second parameter passed to the application object is a callable of the form start_response(status, response_headers, exc_info=None). (As with all WSGI callables, the arguments must be supplied positionally, not by keyword.) The start_response callable is used to begin the HTTP response, and it must return a write(body_data) callable (see the Buffering and Streaming section, below).

The status argument is an HTTP "status" string like "200 OK" or "404 Not Found". That is, it is a string consisting of a Status-Code and a Reason-Phrase, in that order and separated by a single space, with no surrounding whitespace or other characters. (See RFC 2616, Section 6.1.1 for more information.) The string must not contain control characters, and must not be terminated with a carriage return, linefeed, or combination thereof.

The response_headers argument is a list of (header_name, header_value) tuples. It must be a Python list; i.e. type(response_headers) is ListType, and the server may change its contents in any way it desires. Each header_name must be a valid HTTP header field-name (as defined by RFC 2616, Section 4.2), without a trailing colon or other punctuation.

Each header_value must not include any control characters, including carriage returns or linefeeds, either embedded or at the end. (These requirements are to minimize the complexity of any parsing that must be performed by servers, gateways, and intermediate response processors that need to inspect or modify response headers.)

In general, the server or gateway is responsible for ensuring that correct headers are sent to the client: if the application omits a header required by HTTP (or other relevant specifications that are in effect), the server or gateway must add it. For example, the HTTP Date: and Server: headers would normally be supplied by the server or gateway.

(A reminder for server/gateway authors: HTTP header names are case-insensitive, so be sure to take that into consideration when examining application-supplied headers!)
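
For instance, a server examining application-supplied headers might look them up case-insensitively like this (an illustrative helper, not part of the specification):

```python
def get_header(response_headers, name, default=None):
    # response_headers is the list of (name, value) tuples supplied by
    # the application; HTTP field names compare case-insensitively.
    name = name.lower()
    for k, v in response_headers:
        if k.lower() == name:
            return v
    return default
```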

Applications and middleware are forbidden from using HTTP/1.1 "hop-by-hop" features or headers, any equivalent features in HTTP/1.0, or any headers that would affect the persistence of the client's connection to the web server. These features are the exclusive province of the actual web server, and a server or gateway should consider it a fatal error for an application to attempt sending them, and raise an error if they are supplied to start_response(). (For more specifics on "hop-by-hop" features and headers, please see the Other HTTP Features section below.)

Servers should check for errors in the headers at the time start_response is called, so that an error can be raised while the application is still running.

However, the start_response callable must not actually transmit the response headers. Instead, it must store them for the server or gateway to transmit only after the first iteration of the application return value that yields a non-empty bytestring, or upon the application's first invocation of the write() callable. In other words, response headers must not be sent until there is actual body data available, or until the application's returned iterable is exhausted. (The only possible exception to this rule is if the response headers explicitly include a Content-Length of zero.)

This delaying of response header transmission is to ensure that buffered and asynchronous applications can replace their originally intended output with error output, up until the last possible moment. For example, the application may need to change the response status from "200 OK" to "500 Internal Error", if an error occurs while the body is being generated within an application buffer.

The exc_info argument, if supplied, must be a Python sys.exc_info() tuple. This argument should be supplied by the application only if start_response is being called by an error handler. If exc_info is supplied, and no HTTP headers have been output yet, start_response should replace the currently-stored HTTP response headers with the newly-supplied ones, thus allowing the application to "change its mind" about the output when an error has occurred.

However, if exc_info is provided, and the HTTP headers have already been sent, start_response must raise an error, and should re-raise using the exc_info tuple. That is:

raise exc_info[1].with_traceback(exc_info[2])

This will re-raise the exception trapped by the application, and in principle should abort the application. (It is not safe for the application to attempt error output to the browser once the HTTP headers have already been sent.) The application must not trap any exceptions raised by start_response, if it called start_response with exc_info. Instead, it should allow such exceptions to propagate back to the server or gateway. See Error Handling below, for more details.

The application may call start_response more than once, if and only if the exc_info argument is provided. More precisely, it is a fatal error to call start_response without the exc_info argument if start_response has already been called within the current invocation of the application. This includes the case where the first call to start_response raised an error. (See the example CGI gateway above for an illustration of the correct logic.)

Note: servers, gateways, or middleware implementing start_response should ensure that no reference is held to the exc_info parameter beyond the duration of the function's execution, to avoid creating a circular reference through the traceback and frames involved. The simplest way to do this is something like:

def start_response(status, response_headers, exc_info=None):
    if exc_info:
        try:
            ...  # do stuff w/exc_info here
        finally:
            exc_info = None    # Avoid circular ref.

The example CGI gateway provides another illustration of this technique.

Handling the Content-Length Header

If the application supplies a Content-Length header, the server should not transmit more bytes to the client than the header allows, and should stop iterating over the response when enough data has been sent, or raise an error if the application tries to write() past that point. (Of course, if the application does not provide enough data to meet its stated Content-Length, the server should close the connection and log or otherwise report the error.)

If the application does not supply a Content-Length header, a server or gateway may choose one of several approaches to handling it. The simplest of these is to close the client connection when the response is completed.

Under some circumstances, however, the server or gateway may be able to either generate a Content-Length header, or at least avoid the need to close the client connection. If the application does not call the write() callable, and returns an iterable whose len() is 1, then the server can automatically determine Content-Length by taking the length of the first bytestring yielded by the iterable.
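
That inference might look roughly like this (a sketch; the helper name and the (headers, body) return shape are invented for the example):

```python
def add_content_length(headers, result):
    # If the application already set Content-Length, or the iterable has
    # no working len() (or more than one element), leave things alone;
    # otherwise the sole bytestring's length is the body length.
    if any(k.lower() == "content-length" for k, v in headers):
        return headers, result
    try:
        if len(result) != 1:
            return headers, result
    except TypeError:
        return headers, result   # no len(); use another strategy
    chunks = list(result)
    return headers + [("Content-Length", str(len(chunks[0])))], chunks
```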

And, if the server and client both support HTTP/1.1 "chunked encoding" [3], then the server may use chunked encoding to send a chunk for each write() call or bytestring yielded by the iterable, thus generating a Content-Length header for each chunk. This allows the server to keep the client connection alive, if it wishes to do so. Note that the server must comply fully with RFC 2616 when doing this, or else fall back to one of the other strategies for dealing with the absence of Content-Length.

(Note: applications and middleware must not apply any kind of Transfer-Encoding to their output, such as chunking or gzipping; as "hop-by-hop" operations, these encodings are the province of the actual web server/gateway. See Other HTTP Features below, for more details.)

Buffering and Streaming

Generally speaking, applications will achieve the best throughput by buffering their (modestly-sized) output and sending it all at once. This is a common approach in existing frameworks such as Zope: the output is buffered in a StringIO or similar object, then transmitted all at once, along with the response headers.

The corresponding approach in WSGI is for the application to simply return a single-element iterable (such as a list) containing the response body as a single bytestring. This is the recommended approach for the vast majority of application functions that render HTML pages whose text easily fits in memory.

For large files, however, or for specialized uses of HTTP streaming (such as multipart "server push"), an application may need to provide output in smaller blocks (e.g. to avoid loading a large file into memory). It's also sometimes the case that part of a response may be time-consuming to produce, but it would be useful to send ahead the portion of the response that precedes it.

In these cases, applications will usually return an iterator (often a generator-iterator) that produces the output in a block-by-block fashion. These blocks may be broken to coincide with multipart boundaries (for "server push"), or just before time-consuming tasks (such as reading another block of an on-disk file).

WSGI servers, gateways, and middleware must not delay the transmission of any block; they must either fully transmit the block to the client, or guarantee that they will continue transmission even while the application is producing its next block. A server/gateway or middleware may provide this guarantee in one of three ways:

  1. Send the entire block to the operating system (and request that any O/S buffers be flushed) before returning control to the application, OR
  2. Use a different thread to ensure that the block continues to be transmitted while the application produces the next block, OR
  3. (Middleware only) send the entire block to its parent gateway/server.

By providing this guarantee, WSGI allows applications to ensure that transmission will not become stalled at an arbitrary point in their output data. This is critical for proper functioning of e.g. multipart "server push" streaming, where data between multipart boundaries should be transmitted in full to the client.

Middleware Handling of Block Boundaries

In order to better support asynchronous applications and servers, middleware components must not block iteration waiting for multiple values from an application iterable. If the middleware needs to accumulate more data from the application before it can produce any output, it must yield an empty bytestring.

To put this requirement another way, a middleware component must yield at least one value each time its underlying application yields a value. If the middleware cannot yield any other value, it must yield an empty bytestring.

This requirement ensures that asynchronous applications and servers can conspire to reduce the number of threads that are required to run a given number of application instances simultaneously.

Note also that this requirement means that middleware must return an iterable as soon as its underlying application returns an iterable. It is also forbidden for middleware to use the write() callable to transmit data that is yielded by an underlying application. Middleware may only use their parent server's write() callable to transmit data that the underlying application sent using a middleware-provided write() callable.
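
The rules above can be seen in a middleware sketch that coalesces small blocks into larger ones without ever withholding a yield (the names and minimum block size are invented for the example):

```python
def coalescing_middleware(app, min_block=1024):
    # Accumulate small blocks into larger ones, but never block iteration:
    # each time the underlying app yields, we yield exactly one value,
    # using b"" when nothing complete is available yet.
    def wrapped(environ, start_response):
        result = app(environ, start_response)
        def gen():
            buf = b""
            try:
                for data in result:
                    buf += data
                    if len(buf) >= min_block:
                        yield buf
                        buf = b""
                    else:
                        yield b""   # keep the server's loop moving
                if buf:
                    yield buf
            finally:
                if hasattr(result, "close"):
                    result.close()
        return gen()
    return wrapped
```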

The write() Callable

Some existing application framework APIs support unbuffered output in a different manner than WSGI. Specifically, they provide a "write" function or method of some kind to write an unbuffered block of data, or else they provide a buffered "write" function and a "flush" mechanism to flush the buffer.

Unfortunately, such APIs cannot be implemented in terms of WSGI's "iterable" application return value, unless threads or other special mechanisms are used.

Therefore, to allow these frameworks to continue using an imperative API, WSGI includes a special write() callable, returned by the start_response callable.

New WSGI applications and frameworks should not use the write() callable if it is possible to avoid doing so. The write() callable is strictly a hack to support imperative streaming APIs. In general, applications should produce their output via their returned iterable, as this makes it possible for web servers to interleave other tasks in the same Python thread, potentially providing better throughput for the server as a whole.

The write() callable is returned by the start_response() callable, and it accepts a single parameter: a bytestring to be written as part of the HTTP response body, that is treated exactly as though it had been yielded by the output iterable. In other words, before write() returns, it must guarantee that the passed-in bytestring was either completely sent to the client, or that it is buffered for transmission while the application proceeds onward.

An application must return an iterable object, even if it uses write() to produce all or part of its response body. The returned iterable may be empty (i.e. yield no non-empty bytestrings), but if it does yield non-empty bytestrings, that output must be treated normally by the server or gateway (i.e., it must be sent or queued immediately). Applications must not invoke write() from within their return iterable, and therefore any bytestrings yielded by the iterable are transmitted after all bytestrings passed to write() have been sent to the client.
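
A minimal sketch of this discouraged imperative style, showing that write() comes from start_response() and that an iterable must still be returned:

```python
def legacy_style_app(environ, start_response):
    # write() is the return value of start_response(); the application
    # must still return an iterable (an empty one here).
    write = start_response("200 OK", [("Content-Type", "text/plain")])
    write(b"Hello ")
    write(b"world!\n")
    return []
```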

Unicode Issues

HTTP does not directly support Unicode, and neither does this interface. All encoding/decoding must be handled by the application; all strings passed to or from the server must be of type str or bytes, never unicode. The result of using a unicode object where a string object is required, is undefined.

Note also that strings passed to start_response() as a status or as response headers must follow RFC 2616 with respect to encoding. That is, they must either be ISO-8859-1 characters, or use RFC 2047 MIME encoding.

On Python platforms where the str or StringType type is in fact Unicode-based (e.g. Jython, IronPython, Python 3, etc.), all "strings" referred to in this specification must contain only code points representable in ISO-8859-1 encoding (\u0000 through \u00FF, inclusive). It is a fatal error for an application to supply strings containing any other Unicode character or code point. Similarly, servers and gateways must not supply strings to an application containing any other Unicode characters.

Again, all objects referred to in this specification as "strings" must be of type str or StringType, and must not be of type unicode or UnicodeType. And, even if a given platform allows for more than 8 bits per character in str/StringType objects, only the lower 8 bits may be used, for any value referred to in this specification as a "string".

For values referred to in this specification as "bytestrings" (i.e., values read from wsgi.input, passed to write() or yielded by the application), the value must be of type bytes under Python 3, and str in earlier versions of Python.
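
A short worked example of the \u0000 through \u00FF restriction: native strings so restricted round-trip losslessly through ISO-8859-1, which is what lets servers put them on the wire unambiguously:

```python
header_value = "na\xefve"                       # every code point <= U+00FF
wire_bytes = header_value.encode("iso-8859-1")  # always succeeds for such strings
assert wire_bytes.decode("iso-8859-1") == header_value
```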

Error Handling

In general, applications should try to trap their own, internal errors, and display a helpful message in the browser. (It is up to the application to decide what "helpful" means in this context.)

However, to display such a message, the application must not have actually sent any data to the browser yet, or else it risks corrupting the response. WSGI therefore provides a mechanism to either allow the application to send its error message, or be automatically aborted: the exc_info argument to start_response. Here is an example of its use:

try:
    # regular application code here
    status = "200 Froody"
    response_headers = [("content-type", "text/plain")]
    start_response(status, response_headers)
    return [b"normal body goes here"]
except:
    # XXX should trap runtime issues like MemoryError, KeyboardInterrupt
    #     in a separate handler before this bare 'except:'...
    status = "500 Oops"
    response_headers = [("content-type", "text/plain")]
    start_response(status, response_headers, sys.exc_info())
    return [b"error body goes here"]

If no output has been written when an exception occurs, the call to start_response will return normally, and the application will return an error body to be sent to the browser. However, if any output has already been sent to the browser, start_response will reraise the provided exception. This exception should not be trapped by the application, and so the application will abort. The server or gateway can then trap this (fatal) exception and abort the response.

Servers should trap and log any exception that aborts an application or the iteration of its return value. If a partial response has already been written to the browser when an application error occurs, the server or gateway may attempt to add an error message to the output, if the already-sent headers indicate a text/* content type that the server knows how to modify cleanly.

Some middleware may wish to provide additional exception handling services, or intercept and replace application error messages. In such cases, middleware may choose to not re-raise the exc_info supplied to start_response, but instead raise a middleware-specific exception, or simply return without an exception after storing the supplied arguments. This will then cause the application to return its error body iterable (or invoke write()), allowing the middleware to capture and modify the error output. These techniques will work as long as application authors:

  1. Always provide exc_info when beginning an error response
  2. Never trap errors raised by start_response when exc_info is being provided

HTTP 1.1 Expect/Continue

Servers and gateways that implement HTTP 1.1 must provide transparent support for HTTP 1.1's "expect/continue" mechanism. This may be done in any of several ways:

  1. Respond to requests containing an Expect: 100-continue request with an immediate "100 Continue" response, and proceed normally.
  2. Proceed with the request normally, but provide the application with a wsgi.input stream that will send the "100 Continue" response if/when the application first attempts to read from the input stream. The read request must then remain blocked until the client responds.
  3. Wait until the client decides that the server does not support expect/continue, and sends the request body on its own. (This is suboptimal, and is not recommended.)

Note that these behavior restrictions do not apply for HTTP 1.0 requests, or for requests that are not directed to an application object. For more information on HTTP 1.1 Expect/Continue, see RFC 2616, sections 8.2.3 and 10.1.1.

Other HTTP Features

In general, servers and gateways should "play dumb" and allow the application complete control over its output. They should only make changes that do not alter the effective semantics of the application's response. It is always possible for the application developer to add middleware components to supply additional features, so server/gateway developers should be conservative in their implementation. In a sense, a server should consider itself to be like an HTTP "gateway server", with the application being an HTTP "origin server". (See RFC 2616, section 1.3, for the definition of these terms.)

However, because WSGI servers and applications do not communicate via HTTP, what RFC 2616 calls "hop-by-hop" headers do not apply to WSGI internal communications. WSGI applications must not generate any "hop-by-hop" headers [4], attempt to use HTTP features that would require them to generate such headers, or rely on the content of any incoming "hop-by-hop" headers in the environ dictionary. WSGI servers must handle any supported inbound "hop-by-hop" headers on their own, such as by decoding any inbound Transfer-Encoding, including chunked encoding if applicable.

Applying these principles to a variety of HTTP features, it should be clear that a server may handle cache validation via the If-None-Match and If-Modified-Since request headers and the Last-Modified and ETag response headers. However, it is not required to do this, and the application should perform its own cache validation if it wants to support that feature, since the server/gateway is not required to do such validation.

Similarly, a server may re-encode or transport-encode an application's response, but the application should use a suitable content encoding on its own, and must not apply a transport encoding. A server may transmit byte ranges of the application's response if requested by the client, and the application doesn't natively support byte ranges. Again, however, the application should perform this function on its own if desired.

Note that these restrictions on applications do not necessarily mean that every application must reimplement every HTTP feature; many HTTP features can be partially or fully implemented by middleware components, thus freeing both server and application authors from implementing the same features over and over again.

Thread Support

Thread support, or lack thereof, is also server-dependent. Servers that can run multiple requests in parallel should also provide the option of running an application in a single-threaded fashion, so that applications or frameworks that are not thread-safe may still be used with that server.

Implementation/Application Notes

Server Extension APIs

Some server authors may wish to expose more advanced APIs, that application or framework authors can use for specialized purposes. For example, a gateway based on mod_python might wish to expose part of the Apache API as a WSGI extension.

In the simplest case, this requires nothing more than defining an environ variable, such as mod_python.some_api. But, in many cases, the possible presence of middleware can make this difficult. For example, an API that offers access to the same HTTP headers that are found in environ variables might return different data if environ has been modified by middleware.

In general, any extension API that duplicates, supplants, or bypasses some portion of WSGI functionality runs the risk of being incompatible with middleware components. Server/gateway developers should not assume that nobody will use middleware, because some framework developers specifically intend to organize or reorganize their frameworks to function almost entirely as middleware of various kinds.

So, to provide maximum compatibility, servers and gateways that provide extension APIs that replace some WSGI functionality, must design those APIs so that they are invoked using the portion of the API that they replace. For example, an extension API to access HTTP request headers must require the application to pass in its current environ, so that the server/gateway may verify that HTTP headers accessible via the API have not been altered by middleware. If the extension API cannot guarantee that it will always agree with environ about the contents of HTTP headers, it must refuse service to the application, e.g. by raising an error, returning None instead of a header collection, or whatever is appropriate to the API.

Similarly, if an extension API provides an alternate means of writing response data or headers, it should require the start_response callable to be passed in, before the application can obtain the extended service. If the object passed in is not the same one that the server/gateway originally supplied to the application, it cannot guarantee correct operation and must refuse to provide the extended service to the application.
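
The identity checks described above might be sketched as follows. This is a hypothetical extension API, not part of WSGI; the class and method names are invented for illustration:

```python
class GatewayExtension:
    """Sketch of a hypothetical 'safe' server extension API.

    The gateway records the exact environ and start_response objects
    it handed to the application; extended services are refused if the
    application presents different objects, since that indicates
    middleware is mediating (and possibly rewriting) the request.
    """
    def __init__(self, environ, start_response):
        self._environ = environ
        self._start_response = start_response

    def raw_headers(self, environ):
        if environ is not self._environ:
            return None    # middleware replaced environ; refuse service
        return [(k, v) for k, v in environ.items()
                if k.startswith('HTTP_')]

    def fast_writer(self, start_response):
        if start_response is not self._start_response:
            return None    # start_response was wrapped; refuse service
        return lambda data: None   # placeholder for an accelerated writer
```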

These guidelines also apply to middleware that adds information such as parsed cookies, form variables, sessions, and the like to environ. Specifically, such middleware should provide these features as functions which operate on environ, rather than simply stuffing values into environ. This helps ensure that information is calculated from environ after any middleware has done any URL rewrites or other environ modifications.
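
A sketch of such a function-based service, here for parsed cookies (the `mywsgi.cookies` cache key is hypothetical; the cache is keyed on the raw header so that middleware rewrites of HTTP_COOKIE are not served stale):

```python
def get_cookies(environ):
    """Compute parsed cookies from environ on demand, caching the result.

    Because the cache records the raw header it was computed from, a
    later middleware modification of HTTP_COOKIE forces a re-parse.
    """
    header = environ.get('HTTP_COOKIE', '')
    cached = environ.get('mywsgi.cookies')   # hypothetical cache key
    if cached is not None and cached[0] == header:
        return cached[1]
    cookies = {}
    for part in header.split(';'):
        if '=' in part:
            name, value = part.split('=', 1)
            cookies[name.strip()] = value.strip()
    environ['mywsgi.cookies'] = (header, cookies)
    return cookies
```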

It is very important that these "safe extension" rules be followed by both server/gateway and middleware developers, in order to avoid a future in which middleware developers are forced to delete any and all extension APIs from environ to ensure that their mediation isn't being bypassed by applications using those extensions!

Application Configuration

This specification does not define how a server selects or obtains an application to invoke. These and other configuration options are highly server-specific matters. It is expected that server/gateway authors will document how to configure the server to execute a particular application object, and with what options (such as threading options).

Framework authors, on the other hand, should document how to create an application object that wraps their framework's functionality. The user, who has chosen both the server and the application framework, must connect the two together. However, since both the framework and the server now have a common interface, this should be merely a mechanical matter, rather than a significant engineering effort for each new server/framework pair.

Finally, some applications, frameworks, and middleware may wish to use the environ dictionary to receive simple string configuration options. Servers and gateways should support this by allowing an application's deployer to specify name-value pairs to be placed in environ. In the simplest case, this support can consist merely of copying all operating system-supplied environment variables from os.environ into the environ dictionary, since the deployer in principle can configure these externally to the server, or in the CGI case they may be able to be set via the server's configuration files.
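
In sketch form (where `deploy_vars` stands in for whatever name-value pairs the deployer has configured), a server might seed each request's environ like this:

```python
import os

def build_environ(deploy_vars):
    """Seed a request environ with deployment configuration.

    `deploy_vars` is a hypothetical dict of name-value pairs supplied
    by the application's deployer (e.g. via a server config file).
    """
    environ = {}
    environ.update(os.environ)    # OS-supplied environment variables
    environ.update(deploy_vars)   # deployer-specified values win
    return environ
```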

Applications should try to keep such required variables to a minimum, since not all servers will support easy configuration of them. Of course, even in the worst case, persons deploying an application can create a script to supply the necessary configuration values:

from the_app import application

def new_app(environ, start_response):
    environ['the_app.configval1'] = 'something'
    return application(environ, start_response)

But, most existing applications and frameworks will probably only need a single configuration value from environ, to indicate the location of their application or framework-specific configuration file(s). (Of course, applications should cache such configuration, to avoid having to re-read it upon each invocation.)

URL Reconstruction

If an application wishes to reconstruct a request's complete URL, it may do so using the following algorithm, contributed by Ian Bicking:

from urllib import quote
url = environ['wsgi.url_scheme']+'://'

if environ.get('HTTP_HOST'):
    url += environ['HTTP_HOST']
else:
    url += environ['SERVER_NAME']

    if environ['wsgi.url_scheme'] == 'https':
        if environ['SERVER_PORT'] != '443':
            url += ':' + environ['SERVER_PORT']
    else:
        if environ['SERVER_PORT'] != '80':
            url += ':' + environ['SERVER_PORT']

url += quote(environ.get('SCRIPT_NAME', ''))
url += quote(environ.get('PATH_INFO', ''))
if environ.get('QUERY_STRING'):
    url += '?' + environ['QUERY_STRING']

Note that such a reconstructed URL may not be precisely the same URI as requested by the client. Server rewrite rules, for example, may have modified the client's originally requested URL to place it in a canonical form.

Supporting Older (<2.2) Versions of Python

Some servers, gateways, or applications may wish to support older (<2.2) versions of Python. This is especially important if Jython is a target platform, since as of this writing a production-ready version of Jython 2.2 is not yet available.

For servers and gateways, this is relatively straightforward: servers and gateways targeting pre-2.2 versions of Python must simply restrict themselves to using only a standard "for" loop to iterate over any iterable returned by an application. This is the only way to ensure source-level compatibility with both the pre-2.2 iterator protocol (discussed further below) and "today's" iterator protocol (see PEP 234).

(Note that this technique necessarily applies only to servers, gateways, or middleware that are written in Python. Discussion of how to use iterator protocol(s) correctly from other languages is outside the scope of this PEP.)

For applications, supporting pre-2.2 versions of Python is slightly more complex:

  • You may not return a file object and expect it to work as an iterable, since before Python 2.2, files were not iterable. (In general, you shouldn't do this anyway, because it will perform quite poorly most of the time!) Use wsgi.file_wrapper or an application-specific file wrapper class. (See Optional Platform-Specific File Handling for more on wsgi.file_wrapper, and an example class you can use to wrap a file as an iterable.)
  • If you return a custom iterable, it must implement the pre-2.2 iterator protocol. That is, provide a __getitem__ method that accepts an integer key, and raises IndexError when exhausted. (Note that built-in sequence types are also acceptable, since they also implement this protocol.)
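
A minimal custom iterable following the pre-2.2 protocol might look like this sketch (the class name is invented; note that the "new" iterator machinery still falls back to __getitem__ when no __iter__ is defined, so the same class works under both protocols):

```python
class ChunkIterable:
    """Pre-2.2-style iterable: __getitem__ takes an integer key and
    raises IndexError when exhausted.  A "for" loop retrieves items
    0, 1, 2, ... until IndexError is raised."""

    def __init__(self, chunks):
        self.chunks = list(chunks)

    def __getitem__(self, key):
        return self.chunks[key]   # raises IndexError past the end
```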

Finally, middleware that wishes to support pre-2.2 versions of Python, and iterates over application return values or itself returns an iterable (or both), must follow the appropriate recommendations above.

(Note: It should go without saying that to support pre-2.2 versions of Python, any server, gateway, application, or middleware must also use only language features available in the target version, use 1 and 0 instead of True and False, etc.)

Optional Platform-Specific File Handling

Some operating environments provide special high-performance file-transmission facilities, such as the Unix sendfile() call. Servers and gateways may expose this functionality via an optional wsgi.file_wrapper key in the environ. An application may use this "file wrapper" to convert a file or file-like object into an iterable that it then returns, e.g.:

if 'wsgi.file_wrapper' in environ:
    return environ['wsgi.file_wrapper'](filelike, block_size)
else:
    return iter(lambda: filelike.read(block_size), '')

If the server or gateway supplies wsgi.file_wrapper, it must be a callable that accepts one required positional parameter, and one optional positional parameter. The first parameter is the file-like object to be sent, and the second parameter is an optional block size "suggestion" (which the server/gateway need not use). The callable must return an iterable object, and must not perform any data transmission until and unless the server/gateway actually receives the iterable as a return value from the application. (To do otherwise would prevent middleware from being able to interpret or override the response data.)

To be considered "file-like", the object supplied by the application must have a read() method that takes an optional size argument. It may have a close() method, and if so, the iterable returned by wsgi.file_wrapper must have a close() method that invokes the original file-like object's close() method. If the "file-like" object has any other methods or attributes with names matching those of Python built-in file objects (e.g. fileno()), the wsgi.file_wrapper may assume that these methods or attributes have the same semantics as those of a built-in file object.

The actual implementation of any platform-specific file handling must occur after the application returns, and the server or gateway checks to see if a wrapper object was returned. (Again, because of the presence of middleware, error handlers, and the like, it is not guaranteed that any wrapper created will actually be used.)

Apart from the handling of close(), the semantics of returning a file wrapper from the application should be the same as if the application had returned iter(filelike.read, ''). In other words, transmission should begin at the current position within the "file" at the time that transmission begins, and continue until the end is reached, or until Content-Length bytes have been written. (If the application doesn't supply a Content-Length, the server may generate one from the file using its knowledge of the underlying file implementation.)

Of course, platform-specific file transmission APIs don't usually accept arbitrary "file-like" objects. Therefore, a wsgi.file_wrapper has to introspect the supplied object for things such as a fileno() (Unix-like OSes) or a java.nio.FileChannel (under Jython) in order to determine if the file-like object is suitable for use with the platform-specific API it supports.

Note that even if the object is not suitable for the platform API, the wsgi.file_wrapper must still return an iterable that wraps read() and close(), so that applications using file wrappers are portable across platforms. Here's a simple platform-agnostic file wrapper class, suitable for old (pre-2.2) and new Pythons alike:

class FileWrapper:

    def __init__(self, filelike, blksize=8192):
        self.filelike = filelike
        self.blksize = blksize
        if hasattr(filelike, 'close'):
            self.close = filelike.close

    def __getitem__(self, key):
        data = self.filelike.read(self.blksize)
        if data:
            return data
        raise IndexError

and here is a snippet from a server/gateway that uses it to provide access to a platform-specific API:

environ['wsgi.file_wrapper'] = FileWrapper
result = application(environ, start_response)

try:
    if isinstance(result, FileWrapper):
        # Check whether result.filelike is usable with the
        # platform-specific API, and if so, use that API to
        # transmit the result.  If not, fall through to the
        # normal iterable handling loop below.
        pass

    for data in result:
        # etc.
        pass

finally:
    if hasattr(result, 'close'):
        result.close()

Questions and Answers

  1. Why must environ be a dictionary? What's wrong with using a subclass?

    The rationale for requiring a dictionary is to maximize portability between servers. The alternative would be to define some subset of a dictionary's methods as being the standard and portable interface. In practice, however, most servers will probably find a dictionary adequate to their needs, and thus framework authors will come to expect the full set of dictionary features to be available, since they will be there more often than not. But, if some server chooses not to use a dictionary, then there will be interoperability problems despite that server's "conformance" to spec. Therefore, making a dictionary mandatory simplifies the specification and guarantees interoperability.

    Note that this does not prevent server or framework developers from offering specialized services as custom variables inside the environ dictionary. This is the recommended approach for offering any such value-added services.

  2. Why can you call write() and yield bytestrings/return an iterable? Shouldn't we pick just one way?

    If we supported only the iteration approach, then current frameworks that assume the availability of "push" suffer. But, if we only support pushing via write(), then server performance suffers for transmission of e.g. large files (if a worker thread can't begin work on a new request until all of the output has been sent). Thus, this compromise allows an application framework to support both approaches, as appropriate, but with only a little more burden to the server implementor than a push-only approach would require.
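
The two styles can be sketched side by side (trivial applications, invented for illustration):

```python
def push_app(environ, start_response):
    # "Push" style: obtain write() from start_response and call it.
    write = start_response('200 OK', [('Content-Type', 'text/plain')])
    write(b'Hello, ')
    write(b'world!\n')
    return []          # must still return an (empty) iterable

def pull_app(environ, start_response):
    # "Pull" style: return an iterable and let the server drive output.
    start_response('200 OK', [('Content-Type', 'text/plain')])
    return [b'Hello, ', b'world!\n']
```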

  3. What's the close() for?

    When writes are done during the execution of an application object, the application can ensure that resources are released using a try/finally block. But, if the application returns an iterable, any resources used will not be released until the iterable is garbage collected. The close() idiom allows an application to release critical resources at the end of a request, and it's forward-compatible with the support for try/finally in generators that's proposed by PEP 325.
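
A sketch of the idiom (the `resource` object with a release() method is hypothetical, standing in for a database connection, open file, or similar per-request resource):

```python
class AppIterable:
    """Iterable whose close() releases a per-request resource, even
    if the server abandons iteration partway through the response."""

    def __init__(self, chunks, resource):
        self.chunks = iter(chunks)
        self.resource = resource

    def __iter__(self):
        return self

    def __next__(self):
        return next(self.chunks)

    next = __next__    # Python 2 spelling of the iterator protocol

    def close(self):
        # Invoked by the server at the end of the request,
        # whether or not iteration ran to completion.
        self.resource.release()
```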

  4. Why is this interface so low-level? I want feature X! (e.g. cookies, sessions, persistence, ...)

    This isn't Yet Another Python Web Framework. It's just a way for frameworks to talk to web servers, and vice versa. If you want these features, you need to pick a web framework that provides the features you want. And if that framework lets you create a WSGI application, you should be able to run it in most WSGI-supporting servers. Also, some WSGI servers may offer additional services via objects provided in their environ dictionary; see the applicable server documentation for details. (Of course, applications that use such extensions will not be portable to other WSGI-based servers.)

  5. Why use CGI variables instead of good old HTTP headers? And why mix them in with WSGI-defined variables?

    Many existing web frameworks are built heavily upon the CGI spec, and existing web servers know how to generate CGI variables. In contrast, alternative ways of representing inbound HTTP information are fragmented and lack market share. Thus, using the CGI "standard" seems like a good way to leverage existing implementations. As for mixing them with WSGI variables, separating them would just require two dictionary arguments to be passed around, while providing no real benefits.

  6. What about the status string? Can't we just use the number, passing in 200 instead of "200 OK"?

    Doing this would complicate the server or gateway, by requiring them to have a table of numeric statuses and corresponding messages. By contrast, it is easy for an application or framework author to type the extra text to go with the specific response code they are using, and existing frameworks often already have a table containing the needed messages. So, on balance it seems better to make the application/framework responsible, rather than the server or gateway.

  7. Why is wsgi.run_once not guaranteed to run the app only once?

    Because it's merely a suggestion to the application that it should "rig for infrequent running". This is intended for application frameworks that have multiple modes of operation for caching, sessions, and so forth. In a "multiple run" mode, such frameworks may preload caches, and may not write e.g. logs or session data to disk after each request. In "single run" mode, such frameworks avoid preloading and flush all necessary writes after each request.

    However, in order to test an application or framework to verify correct operation in the latter mode, it may be necessary (or at least expedient) to invoke it more than once. Therefore, an application should not assume that it will definitely not be run again, just because it is called with wsgi.run_once set to True.

  8. Feature X (dictionaries, callables, etc.) is ugly for use in application code; why don't we use objects instead?

    All of these implementation choices of WSGI are specifically intended to decouple features from one another; recombining these features into encapsulated objects makes it somewhat harder to write servers or gateways, and an order of magnitude harder to write middleware that replaces or modifies only small portions of the overall functionality.

    In essence, middleware wants to have a "Chain of Responsibility" pattern, whereby it can act as a "handler" for some functions, while allowing others to remain unchanged. This is difficult to do with ordinary Python objects, if the interface is to remain extensible. For example, one must use __getattr__ or __getattribute__ overrides, to ensure that extensions (such as attributes defined by future WSGI versions) are passed through.
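
The pass-through override pattern being described can be sketched as follows (all names invented for illustration; this is the kind of code the next paragraph warns is hard to get right):

```python
class RequestProxy:
    """Wrap a request object, 'handling' one attribute while passing
    every other attribute -- including extensions defined by future
    WSGI versions -- through to the wrapped object."""

    def __init__(self, wrapped):
        self._wrapped = wrapped

    @property
    def script_name(self):
        return '/rewritten'        # the one attribute this proxy handles

    def __getattr__(self, name):
        # Called only for attributes not found on the proxy itself;
        # forwards everything unknown to the wrapped object.
        return getattr(self._wrapped, name)
```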

    This type of code is notoriously difficult to get 100% correct, and few people will want to write it themselves. They will therefore copy other people's implementations, but fail to update them when the person they copied from corrects yet another corner case.

    Further, this necessary boilerplate would be pure excise, a developer tax paid by middleware developers to support a slightly prettier API for application framework developers. But, application framework developers will typically only be updating one framework to support WSGI, and in a very limited part of their framework as a whole. It will likely be their first (and maybe their only) WSGI implementation, and thus they will likely implement with this specification ready to hand. Thus, the effort of making the API "prettier" with object attributes and suchlike would likely be wasted for this audience.

    We encourage those who want a prettier (or otherwise improved) WSGI interface for use in direct web application programming (as opposed to web framework development) to develop APIs or frameworks that wrap WSGI for convenient use by application developers. In this way, WSGI can remain conveniently low-level for server and middleware authors, while not being "ugly" for application developers.

Proposed/Under Discussion

These items are currently being discussed on the Web-SIG and elsewhere, or are on the PEP author's "to-do" list:

  • Should wsgi.input be an iterator instead of a file? This would help for asynchronous applications and chunked-encoding input streams.
  • Optional extensions are being discussed for pausing iteration of an application's output until input is available or until a callback occurs.
  • Add a section about synchronous vs. asynchronous apps and servers, the relevant threading models, and issues/design goals in these areas.

Acknowledgements

Thanks go to the many folks on the Web-SIG mailing list whose thoughtful feedback made this revised draft possible. Especially:

  • Gregory "Grisha" Trubetskoy, author of mod_python, who beat up on the first draft as not offering any advantages over "plain old CGI", thus encouraging me to look for a better approach.
  • Ian Bicking, who helped nag me into properly specifying the multithreading and multiprocess options, as well as badgering me to provide a mechanism for servers to supply custom extension data to an application.
  • Tony Lownds, who came up with the concept of a start_response function that took the status and headers, returning a write function. His input also guided the design of the exception handling facilities, especially in the area of allowing for middleware that overrides application error messages.
  • Alan Kennedy, whose courageous attempts to implement WSGI-on-Jython (well before the spec was finalized) helped to shape the "supporting older versions of Python" section, as well as the optional wsgi.file_wrapper facility, and some of the early bytes/unicode decisions.
  • Mark Nottingham, who reviewed the spec extensively for issues with HTTP RFC compliance, especially with regard to HTTP/1.1 features that I didn't even know existed until he pointed them out.
  • Graham Dumpleton, who worked tirelessly (even in the face of my laziness and stupidity) to get some sort of Python 3 version of WSGI out, who proposed the "native strings" vs. "byte strings" concept, and thoughtfully wrestled through a great many HTTP, wsgi.input, and other amendments. Most, if not all, of the credit for this new PEP belongs to him.

References

[1] The Python Wiki "Web Programming" topic (http://www.python.org/cgi-bin/moinmoin/WebProgramming)
[2] The Common Gateway Interface Specification, v 1.1, 3rd Draft (http://ken.coar.org/cgi/draft-coar-cgi-v11-03.txt)
[3] "Chunked Transfer Coding" -- HTTP/1.1, Section 3.6.1 (http://www.w3.org/Protocols/rfc2616/rfc2616-sec3.html#sec3.6.1)
[4] "End-to-end and Hop-by-hop Headers" -- HTTP/1.1, Section 13.5.1 (http://www.w3.org/Protocols/rfc2616/rfc2616-sec13.html#sec13.5.1)
[5] mod_ssl Reference, "Environment Variables" (http://www.modssl.org/docs/2.8/ssl_reference.html#ToC25)
[6] Procedural issues regarding modifications to PEP 333 (http://mail.python.org/pipermail/python-dev/2010-September/104114.html)
[7] SVN revision history for PEP 3333, showing differences from PEP 333 (http://svn.python.org/view/peps/trunk/pep-3333.txt?r1=84854&r2=HEAD)